# Help with Biopython for a Beginner

Hello! I am an absolute beginner at Python, learning it for genomics, and I’ve been teaching myself through an online course. From the course and many, many examples from the internet, I have put together the mess of code below. It works perfectly until the has_start_codon part. I have been working in Visual Studio Code with the Python extension.
I am attempting to answer these questions, as they are the last ones I haven’t been able to solve with my program:
“Identify all ORFs present in each sequence of the FASTA. What is the length of the longest ORF? What is the identifier of the longest ORF? For a given identifier, what is the longest ORF contained in that sequence? What is the starting position of the longest ORF in that identified sequence? Identify all repeats in a sequence for all sequences in the FASTA, along with how many times each repeat occurs and which is the most frequent repeat.”

The primary problem, I think, is that I don’t know how to reference the sequences inside a FASTA file beyond what I have already, so my has_start_codon and has_stop_codon code isn’t working the way I think it should, and I understand the last section (findLongestRepeat) even less. In my first section I use “dna” to refer to the sequence typed into the program, so I assumed “sequence” would refer to the sequence lines within the FASTA file, but clearly I’m wrong about that.
I’ve been stuck on this section for a week. I have tried many, MANY different iterations and every method I know to answer the questions, but none have worked, and the course I have been learning from is vague about working from a FASTA file on these points. Thank you so much in advance!

#!/usr/bin/python
from Bio.Seq import Seq
from Bio import SeqIO
import re

dna=Seq(input('Enter DNA sequence:'))
print(dna)
print(dna.complement())
print(dna.reverse_complement())
print(dna.translate())
def gc(dna):
    nbases=dna.count('n')+dna.count('N')
    gcpercent=float(dna.count('c')+dna.count('C')+dna.count('g')+dna.count('G'))*100/(len(dna)-nbases)
    return gcpercent
print(gc(dna))
pos=dna.find('GT',0)
while pos>-1:
    print("Donor splice site candidate at position %d"%pos)
    pos=dna.find('GT',pos+1)

for sequence in SeqIO.parse(input('Enter FASTA File here:'), "fasta"):
    from Bio.Seq import Seq
    print(sequence.id)
    print(repr(sequence.seq))
    print(len(sequence))
def has_start_codon(sequence,frame=0):
    start_codon_found=False
    start_codon=['ATG','atg']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in start_codon:
            start_codon_found=True
            print(start_codon_found)
            break
    return start_codon_found
def has_stop_codon(sequence,frame=0):
    stop_codon_found=False
    stop_codons=['tga','tag','taa','TGA','TAG','TAA']
    for i in range(frame,len(sequence),3):
        codon=sequence[i:i+3].lower()
        if codon in stop_codons:
            stop_codon_found=True
            print(stop_codon_found)
            break
    return stop_codon_found
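
One possible way to approach the ORF questions, offered as a minimal sketch rather than the course's intended solution. Each item yielded by `SeqIO.parse` is a `SeqRecord`, whose bases live in `record.seq`; converting with `str(record.seq)` gives a plain string that ordinary slicing and comparison work on. The sketch assumes an ORF runs from the first `ATG` to the next in-frame stop codon (inclusive), looks only at the three forward reading frames, and reports 0-based start positions. The names `find_orfs` and `longest_orf`, the file name `my.fasta`, and the two sample sequences are hypothetical, chosen only for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, frame=0):
    """Return (start, length) for each ORF in one forward reading frame.

    `seq` is a plain string; `start` is a 0-based index into it.
    """
    seq = seq.upper()
    orfs = []
    start = None  # index of the first ATG seen since the last stop codon
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start, i + 3 - start))  # length includes the stop codon
            start = None
    return orfs


def longest_orf(records):
    """Return (record_id, frame, start, length) of the longest ORF, or None.

    `records` is any iterable of (id, sequence_string) pairs.
    """
    best = None
    for rec_id, seq in records:
        for frame in range(3):
            for start, length in find_orfs(seq, frame):
                if best is None or length > best[3]:
                    best = (rec_id, frame, start, length)
    return best


# With Biopython, the pairs could come straight from a FASTA file
# ("my.fasta" is a placeholder name):
#   from Bio import SeqIO
#   records = ((rec.id, str(rec.seq)) for rec in SeqIO.parse("my.fasta", "fasta"))
records = [
    ("seq1", "CCATGAAATAGGG"),
    ("seq2", "ATGAAACCCGGGTTTTAA"),
]
print(longest_orf(records))  # ('seq2', 0, 0, 18)
```

To answer the "for a given identifier" questions, the same `find_orfs` function could be called on just the one record whose `id` matches. If the course counts positions from 1, add 1 to the reported start.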

import string
import sys

def findLongestRepeat(text):
    max = 1
    maxPos = -1
    maxDup = -1
    for pos in range(len(text)):
        dup = text.find(text[pos:pos+max], pos+1, len(text))
        while (dup > 0):
            maxPos = pos
            maxDup = dup
            max = max + 1
            dup = text.find(text[pos:pos+max], dup, len(text))
    return [maxPos, maxDup, max-1]
if (len(sys.argv) != 2):
    print("Usage: python", sys.argv[0], "<filename>")
else:
    text = sequence.readFastaFile(sys.argv[1])
    [pos, dup, ln] = findLongestRepeat(text)
    print("Found duplicate of length", ln)
    print(pos, text[pos:pos+ln])
    print(dup, text[dup:dup+ln])
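
For the repeats question, a simpler route than searching for the single longest duplicate might be to count every substring of a fixed length. The sketch below assumes a "repeat" means a substring of length `n` that occurs more than once, with overlapping occurrences counted, and that counts are summed over all sequences in the file. The function names and sample sequences are hypothetical; with Biopython, the input strings could be `str(rec.seq)` for each record from `SeqIO.parse`.

```python
from collections import Counter


def count_repeats(seq, n):
    """Count every substring of length n in seq, overlaps included."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def most_frequent_repeat(sequences, n):
    """Return (repeat, count) for the most common length-n substring
    across all sequences, or None if nothing occurs more than once."""
    totals = Counter()
    for seq in sequences:
        totals.update(count_repeats(seq, n))
    # A substring seen only once is not a repeat under this definition.
    repeats = {sub: c for sub, c in totals.items() if c > 1}
    if not repeats:
        return None
    return max(repeats.items(), key=lambda item: item[1])


sequences = ["ATATAT", "GATAC"]
print(count_repeats(sequences[0], 3))
print(most_frequent_repeat(sequences, 3))  # ('ATA', 3)
```

If two repeats tie for the highest count, this sketch returns whichever was encountered first, so ties may need separate handling.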