Hi, I'm a beginner Python user and I'm trying to write an automation script to locate addresses on a specific site and create a log.
I've tried a few guides already using Selenium and Beautiful Soup.
If you have ideas for different drivers or libraries, feel free to share; I got a bit lost in here haha.
Thank you!
I think it comes down to asking yourself what kind of framework you'd like to take the time to invest in. There are a lot of libraries that can probably manage this sort of task, but each one has a slightly different set of skills and way of doing things.
Depending on how the site of interest is arranged, this may be possible with urllib from the Python standard library, or with the Requests module; both focus on sending HTTP requests and returning page response objects.
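For instance, a minimal sketch with Requests (the URL here is just a placeholder):
import requests

# Send an HTTP GET request and inspect the response object
response = requests.get("https://example.com")
print(response.status_code)   # e.g. 200 on success
print(response.text[:200])    # the first chunk of the raw HTML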
The next step up is probably the Beautiful Soup library; its focus is on web traversal & HTML page dissection.
Selenium’s focus is browser automation. It excels at interacting with web pages by doing things like filling out fields, simulating clicks and other user inputs (think web page/application testing).
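A minimal Selenium sketch, assuming a matching browser driver (e.g. geckodriver for Firefox) is installed; the element name "q" is just an illustrative selector:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Open a browser, load a page, type into a search box, and submit
driver = webdriver.Firefox()
driver.get("https://example.com")
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("free code camp", Keys.RETURN)
driver.quit()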
There’s also Scrapy, which has a focus on data scraping. It has a strong emphasis on crawling through web resources and document dissection/data extraction.
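And a minimal Scrapy sketch: a spider that crawls a start URL and yields every link it finds (the class and field names here are just illustrative):
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one item per link found on the page
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}
You can run a standalone spider file like this with the scrapy runspider command.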
Thank you for the detailed explanation.
If you have any guide for code similar to this that can basically just copy and paste data using bs4, I would appreciate it.
Thanks for the help!
Well, the best guide to using bs4 is without a doubt the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
So the first step in any web scraping or data extraction process is to fully understand the structure of the document or database of interest. In this case, it's HTML documents. Go to the web page of interest in your browser and open up the developer tools to inspect the document structure. In this example I'm going to use a Google search results page because it has lots of URL addresses to extract: https://www.google.com/search?q=free+code+camp
Step 2 is cooking up a system to locate all the data in the page to extract. I want to extract all the search result URLs (links) from the page. In the developer tools page inspector, I followed the document tree of nested DIV elements down to where one of the links of interest was contained. It takes the form:
<a href="https://www.freecodecamp.org/" rel="noreferrer noopener">
<h3 class="LC20lb">freeCodeCamp: Learn to Code and Help Nonprofits</h3>
<br>
<div class="TbwUpd">
<cite class="iUh30">https://www.freecodecamp.org/</cite>
</div>
</a>
What I need to extract is the href portion of the main <a> tag. Fortunately, Google has organized all the search results in a very consistent pattern. Each result is contained in a nest of DIV elements that takes the shape:
<div class='g'>
  <div class='rc'>
    <div class='r'>
      <a href='https://...'></a>
    </div>
  </div>
</div>
We could extract every <a> that appears in the document, but that would also include other links on the page, and we only want the search result links. If the <a> element had a class, we could just look for all elements in the page with that class name. Unfortunately, there aren't any identifying attributes attached to the anchor elements, so we need to look for unique identifiers in the parent containers instead. A good system might be: extract the href text from the <a> element contained in each <div class='r'> element in the page.
It's important to keep in mind that what BeautifulSoup essentially does is transform an HTML document tree into Python objects. Document navigation is often done with search functions and dictionary-like attribute access.
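As a tiny illustration of that idea (the HTML fragment here is made up to mirror the <div class='r'> pattern above):
from bs4 import BeautifulSoup

fragment = "<div class='r'><a href='https://example.com'>result</a></div>"
soup = BeautifulSoup(fragment, "html.parser")
for div in soup.find_all('div', class_='r'):  # search function
    print(div.a['href'])                      # dictionary-like attribute access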
Here’s the code for how I would use BeautifulSoup to go about this task:
# Imports
# Python Standard Library re (regular expressions) used to clean up addresses
# Requests library used to send HTTP requests & retrieve the HTML document
# BeautifulSoup library used to parse the HTML document
import re

import requests
from bs4 import BeautifulSoup, SoupStrainer


def google_search_addresses(url):
    """Return a list of search result URL addresses for a
    given search page URL."""
    # Retrieve the web page document text
    doc = requests.get(url).text
    # Create a Strainer object used for filtering
    # out unneeded elements of the document.
    # The strainer only keeps elements with class="r"
    strainer = SoupStrainer(class_='r')
    # Build the BeautifulSoup document object from the HTML document
    # using the filter (strainer) created earlier
    soup = BeautifulSoup(doc, "html.parser", parse_only=strainer)
    # soup.find_all('a') returns an iterable of all <a> elements
    # in our filtered document
    links = soup.find_all('a')
    # link['href'] retrieves the string URL from the href attribute
    # of each <a> element (link)
    links = [link['href'] for link in links]
    # The regular expression might not be needed for your situation:
    # Google appends tracking info to each URL, which isn't needed here.
    # The search is guarded so an href without a match doesn't raise
    # an AttributeError from calling .group() on None.
    matches = (re.search(r'https?.+?(?=&sa)', link) for link in links)
    links = [match.group() for match in matches if match]
    return links
To use:
>>> URL = 'https://www.google.com/search?q=free+code+camp'
>>> google_search_addresses(URL)
['https://www.freecodecamp.org/',
'https://learn.freecodecamp.org/',
'https://learn.freecodecamp.org/responsive-web-design/basic-html-and-html5/say-hello-to-html-elements/',
'https://medium.freecodecamp.org/',
'http://forum.freecodecamp.org/',
'https://www.youtube.com/channel/',
'https://twitter.com/',
'https://github.com/freeCodeCamp/',
'https://www.linkedin.com/school/free-code-camp/',
'https://en.wikipedia.org/wiki/',
'https://freecodecamp.libsyn.com/',
'https://www.forbes.com/sites/quora/2016/12/29/yes-free-code-camp-has-low-completion-rates-and-thats-actually-a-good-thing/']
A more concise function to return a list of search result URL addresses from a Google search URL:
def google_search_links(url):
    """Retrieve Google search result URLs from
    a given search URL."""
    try:
        doc = BeautifulSoup(requests.get(url).text,
                            "html.parser",
                            parse_only=SoupStrainer(class_='r'))
        return [re.search(r'https?.+?(?=&sa)', link['href']).group()
                for link in doc.find_all('a')
                if re.search(r'https?.+?(?=&sa)', link['href'])]
    except Exception as e:  # lazy, not recommended error handling, for demonstration only
        print(e)
Thank you so much, I appreciate the very detailed explanation. This helped a lot!
Good to hear. Good luck!