Help with exercise 11.1 from the book "Python for Everybody"

biancaec · August 7, 2020, 10:56pm

Hi!
I’m having problems with this exercise:

It says that I should find 4175 matches with ‘java$’, and that’s what I find using grep on my computer, but my program finds 4218 matches. Could someone give me some help?

My program:

import re
reg = input('Enter a regular expression: ')
fp = open('mbox.txt')
count = 0
for line in fp:
    line = line.strip()
    if re.search(reg, line):
        count = count + 1
        
print('mbox.txt had %d lines that matched %s' % (count, reg))

sgarz · August 7, 2020, 11:33pm

A basic troubleshooting step is to compare the result of your program with the real “grep” command in a Linux shell. If you don’t currently have access to one, freeCodeCamp uses the repl.it site (at least for the Data Analysis projects), which includes a Linux Bash virtual machine that runs in the cloud. It’s also possible that the file really does contain the number of matches your Python program reports, in which case the book’s answer could be wrong.

biancaec · August 8, 2020, 2:12am

Hey, thank you!
Yes, I tested grep on a Linux system and it gave the same result as the book.

sgarz · August 8, 2020, 7:06am

Sorry, I overlooked the part where you said you tested grep on your PC.
I did a little testing with your code and it works fine, just like ‘grep’ (tested on Bash 4.4.20 and Python 3.8.3 respectively). As a regular expression, the term “java$” (without the quotes) matches every line that ends in the word “java”. One hypothesis I have is that on your PC the term “java$” is somehow expanding to include the literal “java$” as a match. It would be great if you could share the text file you are using to test your program. Still, according to my own testing, your code works OK.

biancaec · August 8, 2020, 3:22pm

I was doing some tests with a shortened version of the file and found that grep skipped this line, for example:

MODIFY /site-manage/trunk/site-manage-tool/tool/src/java/org/sakaiproject/site/tool/SiteAction.java

The file is this one: https://www.py4e.com/code3/mbox.txt

My bash version: 4.3.11
Python version: 3.6.1

sgarz · August 9, 2020, 6:20am

Grep isn’t matching because “java$” implies that only a blank space will precede your expression. In this case “/site-manage/trunk/site-manage-tool/tool/src/java/org/sakaiproject/site/tool/SiteAction.” comes directly before the word “java”, with no space in between.
A small regular expression lesson here:
The regular expression “.*java$” will include the line you mention, because it means the following, step by step:
‘.’ The dot means any character.
‘*’ The asterisk means that the previous symbol can be repeated zero or more times (in this case the previous symbol is the dot, so any character can be repeated zero or more times).
‘java$’ The dollar sign at the end means that it will match only where the word java is at the end of the line.
Putting these rules together, “.*java$” will match all lines that end with the word java, preceded by zero or more characters, including a blank space.
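
For anyone who wants to test these patterns directly, here is a minimal sketch using re.search. The sample lines and the helper name matching are made up for illustration and are not taken from mbox.txt. On these samples, 'java$' and '.*java$' appear to select the same lines, and the line that ends in a space is the one left out:

```python
import re

# Hypothetical sample lines (assumption: not from mbox.txt).
# The third one ends with a trailing space.
samples = [
    "java",
    "MODIFY /tool/src/java/SiteAction.java",
    "MODIFY /tool/src/java/SiteAction.java ",
    "javascript",
]

def matching(pattern, lines):
    """Return the lines in which re.search finds the pattern."""
    return [line for line in lines if re.search(pattern, line)]

plain = matching(r"java$", samples)
dotted = matching(r".*java$", samples)

print(plain == dotted)  # do both patterns pick the same lines?
print(plain)
```

Because re.search scans the whole line for a match, a leading “.*” should not change which lines are selected; it only changes how much of the line the match covers.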

sgarz · August 9, 2020, 6:55am

I ran several tests with the mbox file and noticed the following interaction with your code:
The “line = line.strip()” line can be commented out or deleted, and the program will then work exactly like ‘grep’.

Note that this doesn’t mean the line is wrong. It was causing the program to read line by line, including completely blank lines, while ‘grep’ only iterates through lines with actual data, dropping a lot of blank lines and spaces, which affects searches with strict regular expressions.

One small thing to notice is that “your original code vs grep” shows the same difference as the UNIX commands “wc vs nl”.

The ‘nl’ command in bash numbers and outputs the lines of the whole file, resulting in 117530 lines (so it’s actually leaving out the blank lines),
while ‘wc -l’ reports that the file really has 132045 lines (so it’s including blank lines too).
[Screenshot: nl vs. wc output]

Now we use the regular expression “.*” to match all the lines of the given file (thus counting every line in the file), both in your code and with ‘grep’, and look at the results:
[Screenshot: Python program vs. grep output]
Notice how your program matches ‘wc -l’, just as ‘nl’ matches ‘grep’.
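
The two ways of counting can also be reproduced in Python. This is a minimal sketch, assuming nl’s default behaviour of numbering only non-empty lines; the function name count_lines and the in-memory sample are hypothetical stand-ins for the real file:

```python
import io

def count_lines(fp):
    """Return (total lines, non-empty lines) for an open text file."""
    total = 0
    non_empty = 0
    for line in fp:
        total += 1                    # every line, like 'wc -l'
        if line.rstrip("\n") != "":   # skip empty lines, like the numbers 'nl' assigns
            non_empty += 1
    return total, non_empty

# Tiny in-memory stand-in for a file (assumption: not the real mbox.txt)
sample = io.StringIO("first\n\nsecond\n\nthird\n")
total, non_empty = count_lines(sample)
print(total, non_empty)
```

To try it on the real file, the same function could be called as count_lines(open('mbox.txt')).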

biancaec · August 9, 2020, 6:08pm

I found out what it was. Some lines in the file have trailing whitespace, including line 325, which I gave as an example of the lines ‘grep’ wasn’t matching. So with the line “line = line.strip()” in my code, I was removing that trailing whitespace, which increased the number of matches. Now, when I ask ‘grep’ to find ‘java\s*$’, it finds exactly 4218 matches.
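
A small sketch of the effect, using a made-up line with a trailing space (the path is hypothetical, only similar in shape to the real one):

```python
import re

# Hypothetical line ending in a space before the newline
line = "MODIFY /tool/src/java/SiteAction.java \n"

# Roughly what grep sees: the line without its newline, trailing space kept
raw = line.rstrip("\n")

strict_on_raw = bool(re.search(r"java$", raw))                # trailing space blocks the match
strict_on_stripped = bool(re.search(r"java$", line.strip()))  # strip() removed the space
loose_on_raw = bool(re.search(r"java\s*$", raw))              # \s* tolerates the space

print(strict_on_raw, strict_on_stripped, loose_on_raw)
```

Replacing line.strip() with line.rstrip('\n') in the original program would be one way to keep the trailing spaces, so the counts should then line up with plain ‘java$’ in grep.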

Thank you for being so helpful!