Matching and Extracting Data

Hello, thank you for taking the time!:heart:
Here in the curriculum I currently am:
I can’t figure out why there are double backslashes in the exercise.
The exercise test string:

'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

The exercise regex:

\\S+@\\S+

I thought it was a mistake :scream: and that it would search for a literal backslash rather than non-space characters.
I checked it in Python aaaaand… it worked:
[image: screenshot of the Python terminal]
So, to investigate the phenomenon, I opened the regex101 site, aaaaand…


it actually agreed with my first assumption and didn’t find any matches.
Can anyone shed light on why it doesn’t work in theory but works in practice?

Not sure why they are teaching it that way.

The proper way would be to use:

lst = re.findall(r'\S+@\S+', s)

The reason you see the extra \ is you need to escape the \ to have it interpret it as a literal \, so that together with the S after it, it is like \S as you would normally write.

Using the r in front of the string allows you to write normal regular expressions without having to escape the \.

There are times when you need to build a regular expression dynamically and pass it to re.findall, and then you would need to escape the expression, but in this case there is no need to do so if you use the r prefix.
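A minimal sketch of both situations, assuming the exercise string is stored in s. The raw-string pattern is the one from the reply; the "dynamic" part is only one possible reading of that remark, using re.escape on a hypothetical domain variable that is not part of the exercise.

```python
import re

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# Raw string: the backslashes reach the regex engine exactly as typed.
matches = re.findall(r'\S+@\S+', s)
print(matches)  # ['csev@umich.edu', 'cwen@iupui.edu']

# Hypothetical dynamic case: the domain comes from a variable, so its
# special characters (here the dot) are escaped with re.escape first.
domain = 'umich.edu'
pattern = r'\S+@' + re.escape(domain)
dynamic_matches = re.findall(pattern, s)
print(dynamic_matches)  # ['csev@umich.edu']
```

Note that "@2PM" is not matched because \S+ needs at least one non-space character before the @.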

Thank you! And that ‘r’ is a nice trick to know…
But notice that the regex actually matches “csev@umich.edu” and “cwen@iupui.edu”! (see the terminal screenshot)
It is as though “\\S” (computer-lang) is interpreted as “non-space-character” (human-lang), and not as “backslash followed by capital S”, as you said it should be.
The question in the challenge seems to expect this behaviour, unlike what I expected after watching the video lecture or using the regex101 tool.

Maybe I did not make myself clear: '\\S' is the same as r'\S'. Both of these mean a non-whitespace character.
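A quick way to check this claim is to compare the two string objects directly, before the re module is involved at all. This is just a sketch; the variable names are made up for the illustration.

```python
double = '\\S'   # ordinary literal: the escape \\ becomes one backslash
raw = r'\S'      # raw literal: the backslash is kept as typed

print(double == raw)   # True
print(len(double))     # 2  (one backslash plus the letter S)
print(double)          # \S
```

Both spellings produce the same two-character string, so the regex engine cannot tell them apart.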

Sorry, I edited that part of my response. It’s a double backslash. Do you get my question now?

I am not understanding your question. I thought I answered it already.

If I got you right, the regex

\\S+@\\S+

matches the string

"\S\S@\S"

and not the string:

"csev@umich.edu"

It turns out not to be the case, and that is what my question is about. :thinking:

If you are using regex101, then:

\\S+@\\S+

would match the string:

\S\S@\S

and not the string:

csev@umich.edu

in the string:

A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM

That is my point! Look above, at the black screenshot, to see what happened when I ran Python itself in the terminal: it matches “csev@umich.edu”! And that is also the right answer in the challenge!

Yes, I know. I already explained this in my first reply.

Using regex101.com is like using the following:

r'\S+@\S+'

Notice the r in front of the string?

If you do not include the r in front, then to accomplish the same thing, you must write it as:

'\\S+@\\S+'

to match the strings “csev@umich.edu” and “cwen@iupui.edu”
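The difference between the two tools can be sketched by printing what the regex engine actually receives. regex101 takes the pattern text as typed, with no Python string literal in between, which corresponds to the raw string here. The variable names are only for illustration.

```python
import re

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

python_literal = '\\S+@\\S+'   # what the exercise writes in Python source
pasted_text = r'\\S+@\\S+'     # the same characters pasted into regex101

print(python_literal)   # \S+@\S+    (Python already halved the backslashes)
print(pasted_text)      # \\S+@\\S+  (the engine sees literal-backslash + S)

in_python = re.findall(python_literal, s)
in_regex101 = re.findall(pasted_text, s)
print(in_python)     # ['csev@umich.edu', 'cwen@iupui.edu']
print(in_regex101)   # []
```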

Feel free to abandon me if I’m too tiresome :woozy_face:
Anyway,
the Python terminal will match the said email addresses whether I use

'\\S+@\\S+'

or

'\S+@\S+'

and that makes no sense to me.

Now, you said that there are 2 ways to make backslashes be treated literally:

  1. prefixing ‘r’
  2. writing double backslashes instead of single.

That assumes I want them to be treated literally.
But my goal is to match the fictional email addresses, which contain no backslashes. To do that, I want the backslashes to be treated non-literally, which means

  1. no prefixing ‘r’
  2. no double ‘backslashes’

So why will it match even with the use of double-backslashes?!
And why are you saying that

If you do not include the r in front, then to accomplish the same thing, you must write it as:

'\\S+@\\S+'

to match the strings “csev@umich.edu” and “cwen@iupui.edu”

But according to your first reply

The reason you see the extra \ is you need to escape the \ to have it interpret it as a literal \ , so that together with the S after it, it is like \S as you would normally write.

it will make the backslashes be treated literally and hence not match the said email addresses, which is the opposite of my goal?

That is not what I said.

Using the python terminal, you can use:

re.findall('\\S+@\\S+', s) or re.findall(r'\S+@\S+', s)

I said it because that is what is happening, not because that is what you said :smirk:

And I don’t get why you suggest

re.findall('\\S+@\\S+', s) or re.findall(r'\S+@\S+', s)

and not the opposite

re.findall('\S+@\S+', s)
if I want to match non-spaces rather than backslashes?

Unless you add the r in front of '\S+@\S+', it just will not work. That is just how Python regular expressions work. If you insist on not using the r in front of the string passed to re.findall, then you will have to use the double backslash in front of the S for Python to realize you want to match non-white space character.

So why does a single backslash with no ‘r’ work too?

I am not sure what you are asking.


Here I use both the single-backslash and the double-backslash regexes;
both of them match the said emails.
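Since the screenshot is not available, here is a sketch that reproduces the comparison. One assumption: the single-backslash literal is built through eval() with warnings silenced, because Python 3.6+ emits a DeprecationWarning (a SyntaxWarning from 3.12) for '\S' in a non-raw string, and that may become an error in some future version.

```python
import re
import warnings

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# Equivalent to typing '\S+@\S+' at the prompt, minus the warning.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    single = eval(r"'\S+@\S+'")

double = '\\S+@\\S+'

print(single == double)         # True: the very same 7-character string
print(re.findall(single, s))    # ['csev@umich.edu', 'cwen@iupui.edu']
print(re.findall(double, s))    # ['csev@umich.edu', 'cwen@iupui.edu']
```

The two literals compare equal because \S is not an escape sequence that Python string literals recognize, so Python leaves the backslash in the string unchanged.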

I apologize for my previous replies. I stand corrected. I guess it does not make any difference. I thought I had tested it before replying.

I guess that without the r prefix on the string, Python treats them equally. If you really want to match the string “\S\S@\S”, then you would need the r prefix:

r'\\S+@\\S+'

I will admit that seems a bit quirky.
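The quirk can be sketched as follows. The r prefix only matters for escapes that Python string literals recognize; \b (backspace) is used here purely as a contrasting example and is not part of the exercise. The last lines try the literal-backslash pattern on the sample text.

```python
import re

# \b IS a recognized string escape (backspace), so here the prefix matters.
print('\b' == r'\b')            # False
print(len('\b'), len(r'\b'))    # 1 2

# The raw double-backslash pattern means: a literal backslash, then one or
# more capital S, then @, then a literal backslash and one or more S.
literal_pattern = r'\\S+@\\S+'
sample = r'\S\S@\S'
literal_matches = re.findall(literal_pattern, sample)
print(literal_matches == [r'\S@\S'])   # True
```

Note that the match is the substring \S@\S rather than the whole sample, because in this pattern S+ can only match the letter S, not the second backslash.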


Well, regexes can really be a real headache!
The problem is probably that the general Python interpreter processes the escapes before the re library can see the regex, or something like that. I played with it a little bit and just got more confused:


Here I included several backslashes in a string.

  1. for a regex of four(!) backslashes, it will find two backslashes per one real backslash in the string,
  2. for a regex of three, Python raises an "unterminated string literal" error,
  3. for a regex of two (the ‘normal’ one), the re library prints a very long traceback and finally a "bad escape" error.
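The three cases above can be reproduced with a short sketch. The sample text is an assumption (the original string is only in the missing screenshot). The two backslashes displayed per match in case 1 come from how Python prints a list, since repr doubles each backslash; the matched string itself is one character long.

```python
import re

text = r'one\two'   # contains exactly one real backslash

# 1. Four backslashes in source -> two in the string -> the regex engine
#    reads that as "one literal backslash".
matches = re.findall('\\\\', text)
print(matches)                         # ['\\']
print(len(matches), len(matches[0]))   # 1 1

# 2. Three backslashes: the last one escapes the closing quote, so the
#    literal never ends. Compiled from text so this script itself still runs.
source = r"x = '\\\'"
try:
    compile(source, '<demo>', 'exec')
    three_ok = True
except SyntaxError:
    three_ok = False

# 3. Two backslashes: a one-backslash string, which is an unfinished
#    escape from the regex engine's point of view.
try:
    re.compile('\\')
    two_ok = True
except re.error as exc:
    two_ok = False
    print(exc)

print(three_ok, two_ok)   # False False
```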

I really appreciate your patience and help (I already admired your activity in the forum beforehand). Best regards! :smiling_face_with_three_hearts: