Help with Biopython for Beginner

Hello! I am an absolute beginner to python, attempting to learn it for genomics purposes, and I’ve been self-teaching through an online course. From the course, and many, many other examples of the internet I have made the below mess of a code. It works perfectly until the has_start_codon part. I have been working through Visual Studio Code’s python extension.
I am attempting to answer these questions as these are the last ones I haven’t been able to solve with my program:
“Identify all ORFs present in each sequence of the FASTA. What is the length of the longest ORF? What is the identifier of the longest ORF. For a given identifier, what is the longest ORF contained in that sequence? What is the starting position of the longest ORF in that identified sequence? Idenfity all repeats in a sequence for all sequences in the FASTA, along with how many times each repeat occurs and which is the most frequent repeat.”

The primary problem I think I have, is that I don’t know to reference the sequences inside a FASTA file beyond what I have already, so my has_codon section of code isn’t working like I think it should be, and the last section (findlongestrepeat) I understand even less. Similar to my first section where I use “dna” to reference a line of code inputting into the program, I assumed “sequence” would direct to the lines of sequence within the FASTA file, but clearly I’m wrong about that.
I’ve been stuck on this section for a week, and have tried many, MANY different iterations and different methods to answer the questions that I know but none have worked for me and the course I have been learning from is vague on working from a FASTA file on these points. Thank you so much in advance!

#!/usr/bin/python
from Bio.Seq import Seq
from Bio import SeqIO
import re

dna=Seq(input('Enter DNA sequence:'))
print(dna)
print(dna.complement())
print(dna.reverse_complement())
print(dna.translate())
def gc(dna):
    nbases=dna.count('n')+dna.count('N')
    gcpercent=float(dna.count('c')+dna.count('C')+dna.count('g')+dna.count('G'))*100/(len(dna)-nbases)
    return gcpercent
print(gc(dna))
pos=dna.find('GT',0)
while pos>-1:
    print("Donor splice site candidate at position %d"%pos)
    pos=dna.find('GT',pos+1)

for sequence in SeqIO.parse(input('Enter FASTA File here:'), "fasta"):
    from Bio.Seq import Seq
    print(sequence.id)
    print(repr(sequence.seq))
    print(len(sequence))
def has_start_codon(sequence,frame=0):
    start_codon_found=False
    start_codon=['ATG','atg']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in start_codon:
            start_codon_found=True
            print(start_codon_found)
            break
    return start_codon_found
def has_stop_codon(sequence,frame=0):
    stop_codon_found=False
    stop_codons=['tga','tag','taa','TGA','TAG','TAA']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in stop_codons:
            stop_codon_found=True
            print(stop_codon_found)
            break
        return stop_codon_found

import string
import sys

def findLongestRepeat(text):
    max = 1
    maxPos = -1
    maxDup = -1
    for pos in range(len(text)):
        dup = text.find(text[pos:pos+max], pos+1, len(text))
    while (dup > 0):
        maxPos = pos
        maxDup = dup
        max = max + 1
        dup = text.find(text[pos:pos+max], dup, len(text))
    return [maxPos, maxDup, max-1]
if (len(sys.argv) != 2):
    print("Usage: python", sys.argv[0], "<filename>")
else:
    text = sequence.readFastaFile(sys.argv[1])
    [pos, dup, ln] = findLongestRepeat(text)
    print("Found duplicate of length", ln)
    print(pos, text[pos:pos+ln])
    print(dup, text[dup:dup+ln])
1 Like