Using re module

Ranbarrr · May 21, 2021, 1:24pm

Hi all,
I'm new to regex and trying to solve a problem using it.
I have a file with several columns, and I need to check that some of them are valid.
The first column should start with “chr” and end with a number between 1 and 99 or one of the letters “M”, “X”, “Y”.
The second column needs to contain only integers greater than 0.
The 4th and 5th columns each need to be exactly one of the letters “A”, “T”, “C”, “G”.
If any of these checks fails, even in a single row, the function should return False.
Here's my attempt:

def isVCF(file):
    with open(file, "r+") as my_file:
        lines = my_file.readlines()
        for line in lines:
            columns = line.split("\t")
            if re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]):
                if re.match(r"^[1-9][0-9]*$", columns[1]):
                    if re.match(r"^[ATGC]$", columns[1:3]):
                        return True
                    else:
                        return False

What I get when I try to check the file is:

return _compile(pattern, flags).match(string)

TypeError: expected string or bytes-like object

I'm pretty stuck and don't really know what else to try since I'm new to regex, so I'd appreciate any kind of help!

Jagaya · May 21, 2021, 2:12pm

Well, it won't, because you have nested the return statements behind two ifs, meaning the function only returns something if the first two conditions are true.

Also, if you have several conditions that must all be true at the same time, chain them in one if with and, instead of nesting ifs.

Though for testing, you could always check one condition after the other and print something to see what you got.

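Then you need to change the slice in the third check: columns[1:3] selects the 2nd and 3rd columns, while the 4th and 5th columns are at indices 3 and 4.
Also, as far as I know, you cannot call re.match() on a list (a slice is a list, which is where the TypeError comes from), so instead of one test on columns[3:5] you need two tests, one per column.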
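
To illustrate the difference between a slice and a single index, here is a minimal sketch. The sample row is made up for illustration and is not taken from the original file:

```python
import re

# Hypothetical tab-separated row, made up for illustration.
line = "chr1\t12345\trs1\tA\tG\t50\tPASS\n"
columns = line.rstrip("\n").split("\t")

base = re.compile(r"^[ATGC]$")

# A slice is a list, so passing it to a regex method raises TypeError.
try:
    base.match(columns[3:5])
    slice_ok = True
except TypeError:
    slice_ok = False

# Checking each element of the slice works, because each element is a string.
# Index 3 is the 4th column and index 4 is the 5th column.
both_bases = all(base.match(value) for value in columns[3:5])
```

Here slice_ok ends up False because of the TypeError, while both_bases is True for this sample row.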

Ranbarrr · May 21, 2021, 2:21pm

Umm, I think I got you. Here's my new try:

def isVCF(file):
    with open(file, "r+") as my_file:
        lines = my_file.readlines()
        for line in lines:
            columns = line.split("\t")
            if re.match(r"^chr(?:[1-9][0-9]?|[XYM])$", columns[0]) and re.match(r"^[1-9][0-9]*$", columns[1]) and re.match(r"^[ATGC]$", columns[4])  and re.match(r"^[ATGC]$", columns[5]): 
                return True
            else:
                return False

I'm checking it on a file that should return True, but it returns False…
I tried to print columns[0], but it only returns the string from the first row and not the whole column… maybe this is the problem?

Jagaya · May 21, 2021, 2:45pm

Looks better ^^
Also, you can have line breaks within the chained conditions, like:

if (a > b and
        b > c and
        c < a):
    return "a is the biggest"

If it is not properly generating the strings you want to check, that’s certainly a problem.
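
One way to check what the split actually produces is to print each row's fields with repr(), which makes stray whitespace visible. This is only a sketch on two made-up rows (the text variable stands in for the file contents and is an assumption):

```python
# Two hypothetical rows, made up for illustration.
text = "chr1\t100\tid1\tA\tG\nchr2\t200\tid2\tC\tT\n"

# One list of fields per row.
rows = [line.split("\t") for line in text.splitlines()]

for number, columns in enumerate(rows, start=1):
    # repr() shows hidden characters such as spaces or newlines.
    print(number, [repr(value) for value in columns])

# columns[0] is the first field of a single row; the whole first
# column has to be collected across all rows.
first_column = [columns[0] for columns in rows]
```

Inside the loop, columns always holds the fields of one row only, so printing columns[0] shows a single value per iteration rather than the entire column.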

Ranbarrr · May 21, 2021, 2:49pm

Doesn't this code generate the columns from the text?

with open(file, "r+") as my_file:
    lines = my_file.readlines()
    for line in lines:
        columns = line.split("\t")

Or am I missing something?

Jagaya · May 21, 2021, 4:49pm

The easiest way to test it would be to print it out.

One thing I just noticed: because return exits the function, the current code only checks the first line of the file. After that first line it hits a return either way.

What you actually want is to return False as soon as at least one .match fails, and to return True only once the loop has finished without hitting that return.
So this might also be a reason it only printed the first line.
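
Putting this together, a possible sketch could look like the following. The name is_vcf, the precompiled pattern names, and the check for too few columns are my own assumptions, not something from the posts above. Note that the 4th and 5th columns are at indices 3 and 4, not 4 and 5:

```python
import re

# Patterns taken from the posts above; the constant names are assumptions.
CHROM = re.compile(r"^chr(?:[1-9][0-9]?|[XYM])$")
POSITION = re.compile(r"^[1-9][0-9]*$")
BASE = re.compile(r"^[ATGC]$")


def is_vcf(path):
    """Return True only if every row passes all four column checks."""
    with open(path, "r") as my_file:
        for line in my_file:
            columns = line.rstrip("\n").split("\t")
            # Assumption: a row with fewer than 5 columns counts as invalid.
            if len(columns) < 5:
                return False
            if not (CHROM.match(columns[0])
                    and POSITION.match(columns[1])
                    and BASE.match(columns[3])
                    and BASE.match(columns[4])):
                return False  # one bad row is enough to fail
    return True  # reached only when no row failed
```

With this shape, an empty file returns True, and any header or comment lines would fail the first check; whether that is acceptable depends on the file, which the thread does not specify.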

