Consecutive repeats with Regex

Hello
I am working on a project in which I have to find consecutive repeats of a DNA STR to match it with its owner. For that I want to use regex (regular expressions).

What I did is:

    import re
    string = "AATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGTTAAAGCCAAGTGGAAGTTGACGAGCTACGGCACAGGTACCCTATACATACGGTAAATGAGTCGGAGGTTGTGGGTTTAAAGTAAGTCCCCGCTCAACATTCAGCAGACCCTCGAAGTGGGCCCTAAAATCGTGTTGCTAACGCTCCGGACCTGACCCCGAGCTTGGCTCCTAATTGTGTACTCTCTCCAACCAAGCAGCGTACCAACGCGGCAACCAGAGCGAAGCTGTACACGTCGATCATCGTTACGCCTCTACTCGATAGTCGTAGAAACTTGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTGCGGTTGGTAGCTCTAACTGTCATCGTATTCGCGAATACCTCAGATATAAGCTCCAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAGTGAATGCACGAGAGTGTTATAGCAGATATCCCCGCTGATCCGGCTGCCGAGGAGGTGGGCATGTGACGTTATGCACTACACAGCTACTACCAAGGTCTTCTGCGGGAAAGGATAGACAAACCGGCAACTCCGCGAGGTCGCGGACTTAGTATTGCGACGGCGTCCTAATCGGCTGGATTTGCGGTTTGTTGGCGTTAGTCCAAAGGTGCCGCTAATGTGGCCATATTTACGATCCACCCTATAGGGCTCCAGGTCGTTTTAAGTCGAGTCGTGTCTAGGGGCCATTCCTGGCCTTGAACGAAAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCGGCCTTAATGCTCAGATTCATATGCTGTGAGGCCGAGGGTGGCGTCATATCTTCGATGATGTTGAACATACGGTCCGGTATTTCGACTTGCCACCTGGTACTGCTTTAAAAGATGATACCATCAACAAAAGGGCACGGCGTGCCTCATGCAGGACGGGACGTTGCCTGCCTACAGCGCTCTACGTAGCAATGTCCGTCTTTCTTCATACACGTATGCTCCTAAAGAAATTGTAGTCTAACAGCTTCCAAACTGTAATCGCCGTTAGGTTCGTCTAAAGTAAAAATGATTGCAAGACGCAATCGAAGGAGCCATCGTTTCGAGGTGACTTCTAATATAACTACCTATGGTCATAGCATGCCCCAACATTGAACGAGGTAAGATCACGGGATCGACTGTCCTGGCGAGGGCCTTACGTTAGTCGTGTAATGCTCCGCGCGTCCCAAATATATGAAAGGCACGACACTCCCCACAATTTAACCCTCCCGCCAAATAAGTACCTAGCGGAGATAAGAATCTGGTCGGTCAGAAAAGGGTCTATGTCCTACAGAGTAGGGCGAAGTCCGCATACCGCAACAGTGCGGTGGCAAACGCTTTAATGACCAGGATCGTGCTAGGCAGTGGAATTTCATGTGGATTGGCCCGCGAATGGACAGGGAGCTATGTCTGAACTCTGTTGACGCTG
AACTGTATCCGGATCGTCATGTGAATCGTAGCTATGGGAGTGGTGGTACTGTAAGTCAGGGCTACTTACTGCGGGGTATCTATCTATCTATCTATCTATCCTCACAGTTCATGATTATACGGATGTAATTTGCCGCTGGCTCACGATACGGCTATACAGCGTTGGCTCCTAACGTTGCCACCTACAGTCTGCACTTGGGCACTCGGTATGGTATAAAATATATGACGGCAGACGTTGCGATAAGTAAAAGATCGAACAATCTCGCAGCAAATCTTAAAGCGCATCTAACATCGGGCGTGCGAATGGACCGTTCCGAGGGACACTAGTCGAGCCCCTCTTACAGCTCACAGGTAAATCGATTATCGTACGTAAGTCAAGTCGGCACTGCTTTACGGCAGGTAGTAATGGCTGCGTGCTGCGCAGACCTTCTGCCCCTCAGTTAGTCACGGCCACTAGCCCGGGAAAATATAGTTCGGACAGAAAAATCAGTACCCAGCACCCAACTAAAACAAGTTCTATTCCGAGACGCCTGCGGAGAGCCTCACTCGTTATAACTATGTACGGCGGATGGGGGTAGGGTATAAAGGGCATGCGTCTACACCGATTTCCTGGTTAATGATAATCTAGTTCTTAAAGCACTACTAGGCGCTGCGAATAGGGGTATTGGGCAATAGGCCCTGAATTAACCTTGTTTAGGGTTAGCCTATGCAGCGACCGTAGTACAATAATATCTATAAACGGGTACTCTCCAGACGTATTCATTAACTTCTCAATGAGGAACTATCTACAAAATCAATGAGTGATAACAGCGCATATGAAAAGTATGCAGTTGTTTCAAGCTGTTAACGGCCATTTCCACGAACGTGTTCACAGAGTAGAAGAAACGTAAAGCGTTACTCATCTCCGATACGGTGCGTGCGATGGGGCGTATTGCTTGTAATGTCGAGGGACGGGCATTGAAAAGAGTGCCACAGCATATCGGAGCAATTCACTAGTGAGCGTACCTTGATAAAGCAAAAGGATTACCTATTTTGCACACGTGTGCTAACCCCCAAGACCTGTTGAAACCGCCGAGCATCCGCCAATTTCTAGCACAACATTTCCATCTGCAACTAGCCGTAGAGCACTCAGGAATTTGATCTTAACATGATCGTGGAGGCAAGAAAAAAGGATGCAACAGCACCTTAGAGCACGAGATCATTCCTGGTTAATATTATGCTGTACGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGATCGCCATCATTAAGTACTTATATCTGCATAGAACATTAAGCGAGACGTTTGTGAGATATTCCCCTCTGGGCCCTTAGCTTCGCAGTTCCTCAGCGCCCTAAGATAAACGGGTGTAGCAGAAGAATCGGCGTGCTTTTTACAAGTCCTGCCGGCGATTCAGCATCAATTATAAACGGCCCCTAATAGAAATAGGGCGGCAGGAGTCAATTGGTATCGTTTTGGAGCCATTCACCGCCAAGGGTCAGATAACCCGGCATTCACTGCTGTATTCCCGGATTAACGGATCTCGGATCCAATGGCCCTCTGTGCCGATCTAATACTGCACGCTTAGTGGGCGGGATCAGATGAATGGCACCTCAGCCCCCCGAATTCAGTTGCTGGCCAGACGAGGGCGGGGACTGTTTGGAATTATTTGCTCAGTCCTTTTATCATCCCGATGCTATGACTCAATCCTCTAGATCCTTGGATGTCTCAGGAAATCTCACACATCATAGTCAACAAGAAACGAGACAAACTCGACTTGAGACTTCATCGCCTACAGTGTTTTATTGTAACGGGCACCTCTATATGTCGTCTTGATGGCATCAACAGCGCATGGTGATACATCGCTAGCGGCATTAGGCTTGATTGGTGC
TTGCCGGGCGGGAGGCCATTTGGAGAGAGGCAGACTATCGTGGCATGCCGTAGCGCTTTGCATGCAGGTGGCGCGACCGTAAGGAGTGCAAGATGTAGATTGTCACGCTAAAGTTTATCACGTGATACTAGCTGACGTGTCCATAAGGCACGCAACAGCCTGCTCTAGGTTACTGTAGGGCTTGGCGATAGCATAGATAGGCCTGAGGGAGTTCTGGCGTAATAGTTGTTAGATAAAGCTGCCCAAATCCAACAGCTGGATTTCATGTGTGTTTGATAGCGCAATGCACTCATACTCAGTCCTTGCCAGCATGCTGTCACACGATGTACATCGTTAGCCCTAAGAGCCCCGTCGAGTAGCTAGTAAGCCTCATGAATGATACTCGGGGCCTCCCGACATAGACGCAGCTTGAGTGTCGGACGAGTATAAGCCATCCCAATGATTTGCCACTTAGAGAGTAGCGCCGTTTGGGATTGAGTCGAAGAGCGTGGCCTTAGACCACATATGATTTGCTTGCGCCTCCGTATCGCTTGCATTTGAGATGGAGCCTCATTTCTCTACCATCGCCGACTAGCAAGTTACCGATGGACAAGCCTAGCTTGTGTACTTTGAGAGTGGCTTCGTCACCAAAGGGTAGCCATAACCTCAATGGCTGTGATCTCTTACCCCCGGGGTCGGGCGAGATCTGGGCGAGAAGACTGCACGAGCCCTAGAAACTGCAAGTGGCACGGCTTCTTGTCCCATAGGCTATTGAGGGCATTGTTGAGTCGAAGTTTCTCCTAAAAATGTGAACATAGTTTCCCGCTCAGAGATACTCGCTTAAAACTCATACCATGGATGGCTGGAATGGACAAGCGGTATTCGTGCTGTGTAGGGATCCGCGTTGGTCTATTAACCACTGAGCGGATGCGGATTAAAGGGACAGACGATTACACGCCACGGAAGTCCTCGTCTGTGACGGGTCCCTCGCGTCTCCCCCAGAGGACCTTCATTCCCCGGTGGAGCGTCCATACGGTCTAGCTTGTACGCTTCGGGGTCGGGTATCGGACTGACCTATACGACAGACATATCCTAGAGAGGCCTAGATGGACCGGGAGCACGCGAGGGCAAACTCCCTCGCTATCCCACTTCGATTTCCCGGGGAGGGCGGCGTTTTAACACGTAAGGCACGTCTATTAGATGAGCTTATATATATGCGAACTTTGATCCAATTGGCACAGAACGTCAATTAAGAAAAATAATACGGAGATAGTGCCGCAATTGTCCATTTATACGCACCCTCTTTCTAGTATCTAACGTTCTTGGTACGCGGTCCACTAGACCCGACTCATAGCGTTATAATTTCCTGGTATCTATTAAATCGTCGGCCGTCTTTTCCACTAGTAACCTGCTCTTAGGCCGCAGGCACGGGCGTACGATACCCCCCGTACGGTGTAACATCAGTGCGAAGTAAATACGGGGCCAGCGTGTAGACGATAGTCATGTTAGCTGGAAGGGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATTCTGAGTATGCCGATCCAGGTTTGGCAGCAACGGAAAATATCTTCTACTTGGGCCCCTATAACGAAATGTCTGCCTAACCACCTTTTTTCTGGACCCTCAACATGCCAGTTAACCCCGCGCGGGAAAAGCGTCTGGCGCGGGCGTCGGGATATACTGACCAGTAGAGCACTGATTAAAGTATTTGTGGTTAAAAATTCACAACGTATTCCATGCGGGACACCGACACGCACGTCAGTTGCTCGCAGGTGATGGTAGAGGGGTGGATCGACCGAGGTCGGGTTGGTGGGTAAAGGTTAGCCTGCACCACGCGAATGTGCTCCATTCAATTTTGGGGGTGCGATTCTCCGTTGCGGGATCCAAGAGGAGTTAAGATGGCCTTGTCCAGTTGAAACTTGGCTGTGGCATGGGCGACAAGATAAAAGGGTTATTACTGA
TCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGCACTGAGGTCTAGTACGTACGATGAGTGAGCATCGTTATTGGAAAAAGTCATGAACCGG "

    x = re.findall(r"(AATG)+",  string)

What I am trying to do is get all consecutive repeats of AATG.
The problem is in my regex: it should return 13, but it gives me 27 back!

Here is a picture of the correct values that I should get for that DNA sequence.

The DNA sequence here is for Ron, which is highlighted in blue. So I should get 13 back instead of 27.

The regex and the code are in Python!

I would really appreciate it if anyone can help. Thanks,

Hello there,

You need to remember that the letters come in pairs. So you will need to match the string only when it is not split across two pairs. That is, if an occurrence of AATG does not start at an even index, then it is not a match. I am sure there are other ways to solve this, but I hope this leads you onto the right track.

Just thought I would add this in here as well: https://www.freecodecamp.org/learn/javascript-algorithms-and-data-structures/intermediate-algorithm-scripting/dna-pairing

I realise you may not know JavaScript, but the solutions are near-identical, and it may help you to head over to the Guide Solution Post, where you can treat the code like pseudocode.

Also, if you can solve the above, then yours might be even easier.

Hope this helps


Well, thanks a lot. I do know some JS; not perfectly yet, but I know the basics. I will take a look and see if they help.

You say there are other ways to solve the problem. Can you give me some suggestions? Regex was the best solution that came to my mind!

Sure. As I mentioned, the index at which your regex finds a match is important. To elaborate:

    import re

    # I have separated this into pairs for convenience
    a = "AA, TG, CC, MA, AT, GA"
    stripped = "".join(a.split(", "))  # drop the separators: "AATGCCMAATGA"
    print(stripped)
    myMatch = re.findall(r"(AATG)+", stripped)

    # Search from index 1 to skip the first match at index 0
    print(stripped.index(myMatch[1], 1))
    print(myMatch)

As you can see, there should be only one match, because the last occurrence does not line up with the pairs. The second match is found at index 7, which is an odd number, so it starts halfway through a pair.
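If it helps, here is a minimal sketch of that index check on the same toy sequence. It assumes the pair-boundary rule described above, and the names `starts` and `aligned` are just illustrative. `re.finditer` returns match objects, so the start index of each match can be read directly instead of searching for it again with `index`.

```python
import re

# Same toy sequence as above, with the pair separators removed
stripped = "AATGCCMAATGA"

# finditer yields match objects, so each match's start index is available
starts = [m.start() for m in re.finditer(r"(?:AATG)+", stripped)]
print(starts)  # [0, 7]

# Keep only the matches that begin on an even index
# (assumption: the sequence is read in two-letter pairs)
aligned = [s for s in starts if s % 2 == 0]
print(aligned)  # [0]
```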

So, you are absolutely correct - regex is an excellent place to start. However, without some severely complicated patterns (I cannot think of them), you will have to combine it with some other methods to find the correct matches.
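As one possible way of combining regex with other methods, here is a rough sketch. It is not necessarily what your project expects, and `longest_run` and `sample` are made-up names for illustration. Note that when the pattern contains a capturing group, `re.findall` returns only the group text, once per match, so `len()` of that list counts how many separate runs exist rather than how long the longest run is. With a non-capturing group `(?:...)`, each whole run comes back and its length can be divided by the length of the repeat unit.

```python
import re

def longest_run(sequence, unit):
    """Return the largest number of back-to-back copies of unit in sequence."""
    # (?:...) is non-capturing, so findall returns each whole run of repeats
    pattern = r"(?:" + re.escape(unit) + r")+"
    runs = re.findall(pattern, sequence)
    # Run length in characters divided by unit length gives the repeat count
    return max((len(run) // len(unit) for run in runs), default=0)

# Short made-up sequence with runs of 3, 1, 2 and 1 copies of AATG
sample = "AATGAATGAATGCCAATGTTAATGAATGCAATG"

# With a capturing group, findall gives one entry per run
print(len(re.findall(r"(AATG)+", sample)))  # 4
print(longest_run(sample, "AATG"))  # 3
```

If the real sequence is read from a file, it is probably worth stripping any line breaks first, since a newline in the middle of a run would split it in two.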

Hope this helps


It may not solve the problem completely, but it is a really great hint that will help in finding a good way to solve it. Thanks a lot.