Python for Everybody - ERROR - Pagerank Spider Exercise

davy.jones · June 21, 2023, 10:07pm

Hi everyone,

(Reposting this question, which was asked before but has not been answered yet.)

I’m trying to complete the Page Spider Exercise linked in the last lesson of the Python for Everybody course, so I can better understand the material.

However, when I try to run the program (spider.py) it comes up with an error, like this:

users:$ python3 spider.py
Enter web url or enter:
[This is where you enter dr-chuck dot com url]
How many pages:5
1 h…dr-chuck dot com Unable to retrieve or parse page
No unretrieved HTML pages found

I’ve watched the exercise video and Dr Chuck says that if this error comes up it’s likely an issue with BeautifulSoup or something, but I have version 4 installed which is the latest version.

Does anyone have any ideas why this won’t run for me?

The file I’m using is spider.py located on the PY4E website

I have been trying to troubleshoot and look up a solution, and as a last resort I’m posting this message again to see if anyone else has faced this issue.

Thanks!

D

pkdvalis · June 22, 2023, 1:51am

Can you paste the code here please or link to it?

davy.jones · June 22, 2023, 2:05am

Hello, I’m not able to post the link. Let’s see if this one works; it is from the last time this question was posted:

Python for Everybody - ERROR: Page Spider Exercise - Python - The freeCodeCamp Forum

pkdvalis · June 22, 2023, 11:31am

You need to type http://www.dr-chuck.com when it asks for a URL

Enter web url or enter: http://www.dr-chuck.com
['http://www.dr-chuck.com']
How many pages:5
1 http://www.dr-chuck.com (8386) 2
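
For anyone checking their own input, here is a minimal sketch (not taken from spider.py) that tries to open a URL with the standard library and reports the underlying exception instead of a generic message. The helper name `describe_fetch` and the 10-second timeout are assumptions for illustration. It shows that a URL typed without `http://` is rejected before any request is sent.

```python
import http.client
import urllib.request


def describe_fetch(url, timeout=10):
    """Try to open url and return a short description of what happened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: status {response.status}"
    except ValueError as exc:
        # Raised for malformed input, e.g. a URL with no scheme such as http://
        return f"ValueError: {exc}"
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # HTTPException covers protocol-level failures.
        return f"{type(exc).__name__}: {exc}"


# No scheme, so this is rejected locally and nothing is sent over the network.
print(describe_fetch("dr-chuck dot com"))
```

Calling `describe_fetch("http://www.dr-chuck.com")` performs a real request, so its result depends on the network, proxy, and firewall settings of the machine it runs on.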

davy.jones · June 22, 2023, 5:32pm

Hi, I tried that but to no avail. I also tried this link: python-data.dr-chuck dot net

But I get the same output: No unretrieved HTML pages found

pkdvalis · June 22, 2023, 5:42pm

Can you paste the error in here as you did before?

You can see my output above is successful. Maybe try a different URL?

pkdvalis · June 22, 2023, 5:43pm

Are you typing “dot net” or “.net” ?

davy.jones · June 22, 2023, 5:51pm

This forum isn’t letting me post external links, so I was writing it that way. When I run spider.py in PyCharm, I enter .net.

pkdvalis · June 22, 2023, 5:52pm

Can you screenshot the error and post it?

pkdvalis · June 22, 2023, 5:54pm

Might have better luck pasting the errors between backticks or with a > blockquote

like this

Try the blockquote or preformatted text options to paste an error.

pkdvalis · June 22, 2023, 6:47pm

Ok, this is a different error message; read it again. You need to go to the next step of the exercise.

pkdvalis · June 22, 2023, 6:50pm

The spider is going out and looking for pages, and comparing them to a local database. It’s already retrieved what it’s looking for.

davy.jones · June 22, 2023, 7:08pm

(Sorry, I’m not able to post the YouTube link here.) I found a YouTube walkthrough of this code by the author of the tutorial. It is under ‘Exercise: Page Spider’ on the freeCodeCamp page linked below.

Python for Everybody - Data Visualization: Mailing Lists | Learn | freeCodeCamp.org

At 11:37 mark you can see the output should be something else.

My error message is

No unretrieved HTML pages found

It couldn’t find any pages at that link but it should have.

pkdvalis · June 22, 2023, 7:15pm

Delete or rename spider.sqlite to spider.sqlite.old
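
A minimal sketch of that step in Python, assuming spider.sqlite sits in the current working directory and spider.py is not running at the time. The helper name `set_aside` is hypothetical.

```python
from pathlib import Path


def set_aside(db_path):
    """Rename db_path to db_path + '.old'; return the new path or None."""
    path = Path(db_path)
    if not path.exists():
        return None  # nothing to rename
    backup = path.with_name(path.name + ".old")
    path.replace(backup)  # overwrites an older backup if one exists
    return backup


# Prints the backup path, or None when there is no spider.sqlite here.
print(set_aside("spider.sqlite"))
```

Renaming rather than deleting keeps the old database around in case it is needed for comparison later.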

pkdvalis · June 22, 2023, 7:18pm

Is this your full error?

Enter web url or enter: http://www.dr-chuck.com
['dr-chuck dot com', 'http://www.dr-chuck.com']
How many pages:5
No unretrieved HTML pages found

pkdvalis · June 22, 2023, 7:21pm

Compare to the previous time I ran the program:

How many pages:15
23 http://www.dr-chuck.com/Sakai_ Building an Open Source Community - Charles R. Severance.epub Unable to retrieve or parse page
25 http://www.dr-chuck.com/html Unable to retrieve or parse page
26 http://www.dr-chuck.com/errata.txt Unable to retrieve or parse page
24 http://www.dr-chuck.com/Sakai_ Building an Open Source Community - Charles R. Severance.PDF Unable to retrieve or parse page
No unretrieved HTML pages found

See the message at the end? It means it has all the pages it can find. The next time I run it, I only get this message:
No unretrieved HTML pages found
Because I already ran it and the pages are stored in the database.
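
The behaviour described above can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the code from spider.py: the `Pages` table, its `html` and `error` columns, and the helper `next_unretrieved` are assumed names. The idea is that the crawler asks the database for a page that has neither stored HTML nor a recorded error, and prints the message when no such row exists.

```python
import sqlite3

# Assumed schema for illustration; the real spider.sqlite layout may differ.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
    "html TEXT, error INTEGER)"
)
cur.executemany(
    "INSERT INTO Pages (url, html, error) VALUES (?, ?, ?)",
    [
        ("http://www.dr-chuck.com", "<html>...</html>", None),  # fetched
        ("http://www.dr-chuck.com/errata.txt", None, -1),  # failed earlier
    ],
)


def next_unretrieved(cursor):
    """Return (id, url) of one page with no HTML and no error, else None."""
    cursor.execute(
        "SELECT id, url FROM Pages "
        "WHERE html IS NULL AND error IS NULL LIMIT 1"
    )
    return cursor.fetchone()


first = next_unretrieved(cur)
print(first if first else "No unretrieved HTML pages found")

# Queue a page that has not been fetched yet and ask again.
cur.execute("INSERT INTO Pages (url) VALUES (?)", ("http://example.com/new",))
second = next_unretrieved(cur)
print(second)
conn.close()
```

In this sketch, a database holding only fetched or failed pages yields the message on the first call, and a newly queued URL is what makes the second call return a row.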

davy.jones · June 22, 2023, 7:28pm

Yes, this is the full error. The code gives you the option to just hit Enter and it will load the default dr-chuck URL, but I still get the same error.

davy.jones · June 22, 2023, 7:30pm

I did the renaming like you suggested and ran it again. The program created a new spider.sqlite file and still doesn’t run past the error I’ve been getting. I checked the contents of the sqlite file and they’re the same as before.

davy.jones · June 22, 2023, 7:31pm

But when you open spider.sqlite in DB Browser, do you see 15 entries like you are supposed to? I can only see 1, which is the default page.

pkdvalis · June 22, 2023, 8:24pm

Can you show me the full output after deleting (or renaming) the sql database?
And show me the database contents.
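
One way to produce a pasteable summary of the database contents without a GUI is sketched below. It makes no assumption about the spider's schema: it lists whatever tables exist and counts their rows. The helper name `table_row_counts` and the demo table are hypothetical.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    names = [row[0] for row in cur.fetchall()]
    counts = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote the identifier.
        cur.execute(f'SELECT COUNT(*) FROM "{name}"')
        counts[name] = cur.fetchone()[0]
    return counts


# Demo on an in-memory database with one made-up table and one row.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT)")
demo.execute("INSERT INTO Pages (url) VALUES ('http://www.dr-chuck.com')")
print(table_row_counts(demo))
demo.close()
```

For the real file, pass `sqlite3.connect("spider.sqlite")` instead of the demo connection. Note that `sqlite3.connect` creates an empty file if the path does not exist, so run it from the folder that already contains the database.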

“No unretrieved HTML pages found” isn’t an error; that’s what it shows after it’s complete.

Also, can you check your firewall and just make sure PyCharm is allowed through?

What happens when you run spdump.py ?