I am building a web scraper for two job-listing sites, following the Real Python tutorial "Beautiful Soup: Build a Web Scraper With Python" and adapting the URL and the location filter. For now I have two separate scrapers. The goal is a single scraper that covers both sites, sharing as much code as possible and isolating the site-specific lines where needed. The output is a DataFrame and a CSV file (I will look into the nicest and most usable output format later; for now this was good practice).
A question that might be academic (it does not seem to influence the functionality):
I noticed that in ICTergezocht the type of title_elem (str) differs from its type in Monsterboard (class 'bs4.element.Tag'), while the source the selection is made from, job_elems, has the same type in both scrapers (class 'bs4.element.ResultSet'), and each job_elem in the loop is a 'bs4.element.Tag' in both.
Can anybody tell me what causes the types of title_elem to be different?
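To make the observation reproducible without hitting the live sites, here is a minimal sketch using a made-up HTML snippet (the markup and the variable names tag and text are my own assumptions, not the real page). As far as I can tell, the only difference between the two scrapers on this point is whether .text.strip() is chained directly onto find():

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a single job card (not the real site markup)
html = '<div class="vacitem"><a class="title" href="/job/1">  Data Analist  </a></div>'
job_elem = BeautifulSoup(html, "html.parser").find("div", class_="vacitem")

# Monsterboard style: keep the result of find() as-is
tag = job_elem.find("a", class_="title")

# ICTergezocht style: chain .text.strip() onto the result of find()
text = job_elem.find("a", class_="title").text.strip()

print(type(tag))   # a bs4 Tag object
print(type(text))  # a plain Python string
```

So in this sketch find() returns a Tag in both cases, and the str only appears once .text.strip() is applied to it.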
Furthermore, in Monsterboard I could not apply .text.strip() to title_elem until after the None values were filtered out (if None in ...:). Without that check I got an AttributeError on NoneType, which makes sense, since 3 of the elements did not contain the data this scraper is looking for.
In ICTergezocht there are no None values, so I can select, take the text and strip() all in one line, which makes the code shorter. Am I just lucky with this site? Should I stick to the Monsterboard method to prevent errors later, for example when adding a 3rd website or when ICTergezocht gets messier and starts returning None (and also so that the same code works for both sites once they share one scraper)?
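A sketch of the defensive variant I have in mind, assuming a small helper (extract_title is a hypothetical name of mine, and the HTML is invented sample data) that checks for None before touching the text, so the same call could be used for either site:

```python
from bs4 import BeautifulSoup


def extract_title(job_elem, tag_name, class_name):
    """Return the stripped title text, or None when the element is missing."""
    title_elem = job_elem.find(tag_name, class_=class_name)
    if title_elem is None:
        return None
    return title_elem.get_text(strip=True)


# Invented sample: one real job card and one card without a title
html = """
<section class="card-content"><h2 class="title"> Software Developer </h2></section>
<section class="card-content"><p>advertisement without a title</p></section>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for job_elem in soup.find_all("section", class_="card-content"):
    job_title = extract_title(job_elem, "h2", "title")
    if job_title is None:
        continue  # skip cards that do not contain a title
    titles.append(job_title)

print(titles)
```

The one-line version is shorter, but it raises an AttributeError as soon as a single card lacks the title element, whereas this version just skips that card.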
The code is below and on GitHub: Brain150/web-scraper (Python web scraper for 2 websites).
Thanks very much for looking into this and helping me out!
ICTergezocht
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
import pandas as pd
# instantiate lists to append to for dataframe/csv
title = []
URL_add = []
URL = "https://www.ictergezocht.nl/ict-vacatures/?what=Data+Analist&where=Amsterdam&r=10&submit_homesearch="
page = requests.get(URL)
# create Beautiful Soup object that takes scraped HTML content as input and use appropriate parser
soup = BeautifulSoup(page.content, "html.parser")
# find specific element by id
results = soup.find(id="list_result")
# find elements by HTML class name
job_elems = results.find_all("div", class_="content_block vacitem")  # job_elems has type(class 'bs4.element.ResultSet')
# iterate over job_elems to look at specific details of each one
for job_elem in job_elems:
    # find() alone returns the whole HTML element; we only want the text content, stripped of leading and trailing whitespace
    title_elem = job_elem.find("a", class_="title").text.strip()  # title_elem = type(str); job_elem = type(class 'bs4.element.Tag')
    title.append(title_elem)
    # all in one line, =short, but readable?:
    # title.append(job_elem.find("a", class_="title").text.strip())
    URL_elem = job_elem.find_all("a", href=True)
    for el in URL_elem:
        URL_add.append(el["href"])
    '''
    alternative for URL (does work for ICTergezocht.nl not on monsterboard.com!?)
    URL_elem = job_elem.find("a").get("href") # find("a") = first hyperlink in the element, extracts only href (=URL), also when 'a' includes more information about the HTML element (class, src, etc.)
    URL_add.append(URL_elem)
    '''
    # solve "AttributeError: 'NoneType' object has no attribute 'text'": some items have value None; continue skips the iteration for this item and continues with the next item
    if None in (title_elem, URL_elem):
        continue
# output in dataframe
output = pd.DataFrame({"Title": title, "URL": URL_add})
print(output)
# output in CSV
# CSV: adjust separator, so comma in title does not cause problem when processing in CSV
output.to_csv("icter.csv", sep="*")
Monsterboard
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
import pandas as pd
# instantiate lists to append to for dataframe/csv
title = []
URL_add = []
URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Netherlands"
page = requests.get(URL)
# create Beautiful Soup object that takes scraped HTML content as input and use appropriate parser
soup = BeautifulSoup(page.content, "html.parser")
# find specific element by id
results = soup.find(id="ResultsContainer")
# find elements by HTML class name
job_elems = results.find_all("section", class_="card-content")  # job_elems has type(class 'bs4.element.ResultSet')
# iterate over job_elems to look at specific details of each one
for job_elem in job_elems:
    title_elem = job_elem.find("h2", class_="title")  # title_elem = type(class 'bs4.element.Tag'); job_elem = type(class 'bs4.element.Tag')
    # title.append(title_elem) # including NoneType items -> results here in AttributeError when stripping text!
    URL_elem = job_elem.find_all("a", href=True)
    for el in URL_elem:
        URL_add.append(el["href"])  # or (el.get("href"))
    # solve "AttributeError: 'NoneType' object has no attribute 'text'": some items have value None; continue skips the iteration for this item and continues with the next item
    if None in (title_elem, URL_elem):
        continue
    # make list of titles, use just the text part of the HTML data
    title.append(title_elem.text.strip())
# output in dataframe
output = pd.DataFrame({"Title": title, "URL": URL_add})
print(output)
# output in CSV
# CSV: adjust separator, so comma in title does not cause problem when processing in CSV
output.to_csv("scrapeMonster.csv", sep="*")
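For the combined scraper mentioned at the top, this is roughly the direction I am thinking of: a rough sketch, not tested against the live sites. The SITES dictionary and the parse_jobs function are hypothetical names of mine; the selectors are copied from the two scrapers above, and the sample HTML is invented. It takes only the first link per job card, which is an assumption that may not hold for every site:

```python
from bs4 import BeautifulSoup

# Hypothetical per-site settings; the selectors come from the two scrapers above
SITES = {
    "ictergezocht": {
        "container_id": "list_result",
        "card": ("div", "content_block vacitem"),
        "title": ("a", "title"),
    },
    "monsterboard": {
        "container_id": "ResultsContainer",
        "card": ("section", "card-content"),
        "title": ("h2", "title"),
    },
}


def parse_jobs(html, config):
    """Return a list of {"Title", "URL"} dicts for one site's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find(id=config["container_id"])
    if results is None:
        return []
    card_tag, card_class = config["card"]
    title_tag, title_class = config["title"]
    rows = []
    for job_elem in results.find_all(card_tag, class_=card_class):
        title_elem = job_elem.find(title_tag, class_=title_class)
        link_elem = job_elem.find("a", href=True)  # first link in the card only
        if title_elem is None or link_elem is None:
            continue  # skip cards that miss a title or a link
        rows.append({"Title": title_elem.get_text(strip=True), "URL": link_elem["href"]})
    return rows


# Invented sample in the shape of the Monsterboard page
sample = (
    '<div id="ResultsContainer">'
    '<section class="card-content"><h2 class="title">'
    '<a href="https://example.com/job/1"> Software Developer </a></h2></section>'
    '<section class="card-content"></section>'
    "</div>"
)
rows = parse_jobs(sample, SITES["monsterboard"])
print(rows)
```

Because each row holds its own title and URL, the two columns cannot drift out of step, and the list could be passed to pd.DataFrame(rows) for the same DataFrame and CSV output as before.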