Package.json / 'npm init' equivalent in Python?

I am working on a scraping project and have tried using cheerio, but I find it very challenging. Several people have told me that Python is much better for scraping.

I have managed to update my Python version and follow some tutorials for setting up a local environment, but it’s still very confusing to me… I’m not sure what’s going on when I’m in that local environment running commands.

With a JavaScript project, package.json (created with ‘npm init’) contains a list of all the packages that get installed for that project.

Is there anything similar for Python?


Python’s equivalents of npm and package.json are pip and requirements.txt. (here’s some docs on the topic)
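Concretely, a requirements.txt is just a plain text file listing package names, optionally pinned to versions. The entries below are only a sample; your project’s list will differ:

```
# requirements.txt -- the rough analogue of the "dependencies"
# section of package.json (sample entries only)
scrapy==1.5.1
beautifulsoup4==4.6.3
requests>=2.0
```

There’s no `npm init` step: you generate the file yourself with `pip freeze > requirements.txt`, and anyone else restores the same set of packages with `pip install -r requirements.txt`.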

They are almost exact synonyms, with the obvious caveat that there are technically two versions of Python (2.x and 3.x), which means there are two versions of most libraries, two versions of the docs, and two versions of “valid” syntax. That, combined with the language being rather “old” compared to JS, means some things are a little rough around the edges compared to the “fast moving” web development realm.
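One practical consequence of the 2.x/3.x split: always check which interpreter you’re actually invoking, because many systems ship 2.x as `python` and 3.x as `python3`:

```shell
# be explicit about which interpreter you call; on a lot of systems a
# bare `python` is still 2.x, while `python3` is the one you want
python3 --version
```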

Easily the most “annoying” thing is dealing with runtime environments; there are something like five different tools (here’s a list of what you can run across) for managing them. If your project will only be run by you, and you don’t want to worry about its “portability”, then I wouldn’t worry about it, but do keep it in mind when using Python.
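For the common case, though, the stdlib’s `venv` module is enough; this is a minimal sketch assuming a Unix-like shell (the `my_env` name is just a placeholder):

```shell
# create an isolated environment in ./my_env (venv is stdlib since Python 3.3)
python3 -m venv my_env

# activate it: `python` and `pip` now resolve inside my_env
source my_env/bin/activate

# anything you install now lands in my_env, not the system site-packages,
# e.g.:  pip install beautifulsoup4

# leave the environment and return to the system Python
deactivate
```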

Also, if you’re working on a web-scraping project you’d probably want to look into the BeautifulSoup library for Python. (2 and 3 versions are supported)
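To give you a feel for it, here’s a tiny BeautifulSoup sketch; the HTML snippet and selectors are made up, just to show the shape of the API:

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML fragment standing in for a fetched page
html = """
<ul class="posts">
  <li><a href="/t/1">First thread</a></li>
  <li><a href="/t/2">Second thread</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# grab every link inside the posts list and pull out its text and href
for a in soup.select("ul.posts a"):
    print(a.get_text(), a["href"])
# → First thread /t/1
# → Second thread /t/2
```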


Thanks, that’s awesome and so helpful. I did use BeautifulSoup for one simple project. I’m also trying Scrapy (did I mention that before?). One reason is that I found a pre-made script for recursively archiving a vBulletin forum.

If you have any additional guidance or links to other vBulletin projects, I’d love to know! So far I have not been able to get the detectorist’s project to work.

I also have to run a custom pip command to get Scrapy to work with Python 3.7, which is annoying:
pip install git+ --no-dependencies --upgrade

My flow so far is to:
-open a project window in WebStorm

-create a virtual environment (I’ve been substituting 3.7 for 3.6, not sure if that is causing any issues)
python3.6 -m venv my_env && source my_env/bin/activate

-install Scrapy with that long command
pip install git+ --no-dependencies --upgrade

And then begin using scrapy following the tutorials, etc.

I’m just kind of making this up as I go, figuring that if I could learn Node and get projects going, I can figure out Python, but I’d love to settle on a good workflow.