Using re module

Hi all,
I'm new to regex and trying to solve a problem with it.
I have a file with several columns, and I need to check that some of them are valid.
The first column should start with “chr” and end with a number between 1 and 99 or one of the letters “M,X,Y”.
The second column must contain only integers greater than 0.
The 4th and 5th columns must each be exactly one of the letters “ATCG”.
If any of these checks fails, even in a single row, the function should return False.
Here's my attempt:

def isVCF(file):
    with open(file, "r+") as my_file:
        lines = my_file.readlines()
        for line in lines:
            columns = line.split("\t")
            if re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]):
                if re.match(r"^[1-9][0-9]*$", columns[1]):
                    if re.match(r"^[ATGC]$", columns[1:3]):
                        return True
                    else:
                        return False

What I get when I try to check the file is:

return _compile(pattern, flags).match(string)

TypeError: expected string or bytes-like object

I'm pretty stuck and don't really know what else to try since I'm new to regex. I'd appreciate any kind of help!

Well, it won’t, because you have nested the return statements behind two ifs, so the function only returns something if the first two conditions are true.

Also, if you have 3 conditions that all have to be true at the same time, chain them with and in one if instead of nesting ifs.

Though for testing, you could always check one condition after the other and print something to see what you got.

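Then you need to change the slice in the third check: columns[1:3] covers the 2nd and 3rd columns, not the 4th and 5th.
Also, my assumption is that you cannot re.match() on a list, which would explain the TypeError, so you have to turn this into two tests instead of one on columns[3:5].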
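
To illustrate the point about the slice, here is a minimal sketch. The sample row is made up for illustration and is not from the original file. A slice of a list is itself a list, which re.match rejects with the same TypeError, while the individual elements are plain strings and can be checked one by one.

```python
import re

# Hypothetical tab-separated row, made up for illustration.
columns = "chr1\t12345\trs1\tA\tG".split("\t")

# A slice is a list, even when it only holds strings.
print(type(columns[3:5]))

# re.match needs a single string, so passing the slice raises TypeError.
try:
    re.match(r"^[ATGC]$", columns[3:5])
    slice_error = None
except TypeError as exc:
    slice_error = exc
print(repr(slice_error))

# Checking each element on its own works.
both_ok = all(re.match(r"^[ATGC]$", col) for col in columns[3:5])
print(both_ok)
```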


Hmm, I think I get it. Here's my new attempt:

def isVCF(file):
    with open(file, "r+") as my_file:
        lines = my_file.readlines()
        for line in lines:
            columns = line.split("\t")
            if re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]) and re.match(r"^[1-9][0-9]*$", columns[1]) and re.match(r"^[ATGC]$", columns[4])  and re.match(r"^[ATGC]$", columns[5]): 
                return True
            else:
                return False

I'm checking it on a file that should return True, but it returns False…
I tried to print columns[0], but it only prints the string from the first row and not the whole column… maybe this is the problem?

Looks better ^^
Also, you can have line breaks within the chained conditions, like

if (a > b and
    b > c and
    c < a):
    return "a is the biggest"

If it is not properly generating the strings you want to check, that’s certainly a problem.
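
A quick way to see which strings are actually being checked is to print every column together with its index. This is only a sketch with a made-up row. Since Python lists are indexed from 0, the 4th and 5th columns (counted from 1) would be columns[3] and columns[4], not columns[4] and columns[5].

```python
# Hypothetical row, made up for illustration; real rows come from the file.
line = "chr1\t12345\trs1\tA\tG\n"

# Drop the trailing newline so it does not end up in the last column.
columns = line.rstrip("\n").split("\t")

# repr() makes hidden characters such as "\n" or spaces visible.
for index, value in enumerate(columns):
    print(index, repr(value))

# Counting from 1, the 4th and 5th columns sit at indices 3 and 4.
fourth, fifth = columns[3], columns[4]
```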


Doesn't this code generate the columns from the text?

with open(file, "r+") as my_file:
    lines = my_file.readlines()
    for line in lines:
        columns = line.split("\t")

Or am I missing something?

The easiest way to test it would be to print it out.

One thing I just noticed: because return exits the function, the current code only checks the first line of the file, since that line always hits one of the two returns.

What you actually want is to return False as soon as one .match fails, and to return True only once the loop has finished without hitting that return.
So this might also be a reason it only printed the first line.
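
Putting the suggestions from this thread together, one possible sketch looks like the following. The name is_vcf and the three pattern constants are hypothetical, and the sketch assumes the file has no header lines and at least five tab-separated columns per row. It uses fullmatch instead of the ^...$ anchors, which is a swapped-in but equivalent way to require that the whole string matches.

```python
import re

# Patterns taken from the attempts above; fullmatch replaces the ^...$ anchors.
CHROM = re.compile(r"chr(?:[1-9][0-9]?|[XYM])")
POS = re.compile(r"[1-9][0-9]*")
BASE = re.compile(r"[ATGC]")


def is_vcf(path):
    """Return True only if every row passes all four column checks."""
    with open(path, "r") as my_file:
        for line in my_file:
            columns = line.rstrip("\n").split("\t")
            # Assumption: a row with fewer than 5 columns counts as invalid.
            if len(columns) < 5:
                return False
            if not (CHROM.fullmatch(columns[0])
                    and POS.fullmatch(columns[1])
                    and BASE.fullmatch(columns[3])
                    and BASE.fullmatch(columns[4])):
                # One bad row is enough to reject the file.
                return False
    # The loop finished without finding a bad row.
    return True
```

The only early exit is return False, so every row gets checked before the function can return True.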
