Scientific Computing with Python: Regular Expressions

I’m working through the course but hit a spot of confusion with the question posed at the end of the step linked below:

What will the following program print?

import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\\S+@\\S+', s)
print(lst)

I don’t understand why the additional backslashes are included. Doesn’t that mean the pattern is looking for a literal backslash followed by one or more ‘S’ characters, and so on?

However, the output from print(lst) appears identical whether or not I include the extra backslashes…

Put it in here, and it should explain it: https://regex101.com/

This site is endlessly useful for learning regex

Unfortunately, this just confirmed what I thought. I’d already tried it on another site with the same result too. With the additional backslashes, the pattern being matched changes completely.

It is weird… if I test both versions on Google Colab, they both work:

import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)
['csev@umich.edu', 'cwen@iupui.edu']

import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\\S+@\\S+', s)
print(lst)
['csev@umich.edu', 'cwen@iupui.edu']

But on regex101, \\ matches a literal backslash, so the pattern never matches.
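
One way to check what each tool actually receives is to print the pattern string before handing it to re. This is only a small sketch: the variable names pattern and pasted are made up for illustration, and it assumes regex101 treats whatever is typed into its pattern box as the pattern itself, with no Python string processing applied first.

```python
import re

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# In a normal string literal, Python turns each '\\' into one backslash
# before the re module ever sees the pattern.
pattern = '\\S+@\\S+'
print(pattern)        # \S+@\S+
print(len(pattern))   # 7

# A raw string keeps both backslashes, which mimics pasting the doubled
# form straight into a regex tester.
pasted = r'\\S+@\\S+'
print(pasted)                  # \\S+@\\S+

print(re.findall(pattern, s))  # ['csev@umich.edu', 'cwen@iupui.edu']
print(re.findall(pasted, s))   # []
```

So the doubled backslashes in the course example reach the regex engine as single ones, while the pasted version really does ask for a literal backslash.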


Looks like this comes down to how Python handles backslashes in string literals before the pattern reaches the regex engine: https://docs.python.org/3/library/re.html

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Also, please note that any invalid escape sequences in Python’s usage of the backslash in string literals now generate a DeprecationWarning and in the future this will become a SyntaxError. This behaviour will happen even if it is a valid escape sequence for a regular expression.
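
The four-backslash case from that passage can be tried directly. A minimal sketch, assuming a made-up sample string (text) that contains one backslash; the names text and four are only for illustration.

```python
import re

# Sample string C:\temp (7 characters, one of them a backslash).
text = 'C:\\temp'
print(len(text))   # 7

# Four backslashes in the source become two in the pattern string,
# and the regex engine reads those two as one literal backslash.
four = '\\\\'
print(len(four))   # 2

matches = re.findall(four, text)
print(matches)           # ['\\']  (one backslash, shown in repr form)
print(len(matches[0]))   # 1
```

The list prints with a doubled backslash only because print shows the repr of each element; the match itself is a single character.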

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
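
To see the raw string notation in action, here is a short sketch reusing the string from the course question. The variable name raw is just for illustration.

```python
import re

# Raw vs regular literal: same characters in the source, different strings.
print(len(r"\n"))   # 2: a backslash and the letter n
print(len("\n"))    # 1: a single newline character

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# The raw form and the doubled-backslash form are the same pattern string.
raw = r'\S+@\S+'
print(raw == '\\S+@\\S+')   # True
print(re.findall(raw, s))   # ['csev@umich.edu', 'cwen@iupui.edu']
```

Writing the pattern as r'\S+@\S+' also avoids the warning about invalid escape sequences mentioned above, since no escape processing happens in a raw literal.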

Many thanks for your reply! I’ve been away for a couple of days so haven’t responded sooner, but this is really helpful, thanks.
