Web Scraping Dilemma part 2!

ajj71310 · March 4, 2020, 12:19am

Okay, so I thought I had a grasp on web scraping, but apparently not. I was given a URL to scrape, and I can get it to work, but not like it should. I need it to pull the title, summary, and URL of each article on the main webpage. The results I am getting are a weird combination of what I want (unfortunately the uncleaned version; I need the cleaned-up version, which is also stumping me) and a bunch of other, mostly empty lines with the website's URL. Here is my exact code:

# Imports required for retrieving files and webpages and creating a usable
# format for them.
import bs4
import requests
import csv

# Sets the URL to scrape
url = (" https://cybersins.com/")
# Grabs the website you want and turns it into a text file
res = requests.get(url).text
# Parses the webpage for the soup
soup = bs4.BeautifulSoup(res , 'lxml')
# Sets writer list expectations for csv
writer = ('headline', 'summary', 'url')
# Pulls the attributes wanted from the file 
for article in soup.find_all('article'):
    headline = article.find('h4', 'a', 'href', class_='title')
    print(headline)
    summary = article.find_all('div', class_='post-excerpt')
    print(summary)   
# Grabbing an anchor tag using dictionary formatting
    articleUrl = article.find('a')('href')
    print(url)
# Ensures spacing in the final product
    print()
# Placing writer inside the loop and using 'a' (append) makes it grab every article
# and adds it to the text file
    with open('TestScrape.txt', 'a') as file:
        csv_writer = csv.writer(file)
        csv_writer.writerow([headline])
        csv_writer.writerow([summary])
        csv_writer.writerow([url])

Sky020 · March 4, 2020, 2:20pm

Hello, ajj.

Have you tried the following for a cleaner output?

- Add your headers information (type into Google: "What is my user agent"). Store this as a dictionary with the key "User-Agent" and your user agent string as the value, then pass it to the request:
    res = requests.get(url, headers=headers)
- Get the content, not the text (this will help with parsing). Note that res has to be the response object here, so drop the .text from your requests.get(url).text line:
    soup = bs4.BeautifulSoup(res.content, 'html.parser')
- Use CSS selectors. I found these a much easier way to grab info. Example:

for a in soup.select('article > header > h4 > a'):
  href = a['href']
  print(href)
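
Putting the three suggestions together, here is a rough sketch of how the whole thing could look. It is only a sketch under some assumptions: the selectors (article, header > h4 > a, div.post-excerpt) are taken from the code in this thread and have not been checked against the live site, the User-Agent value is a placeholder, and the function names and the output filename TestScrape.csv are made up for illustration. It uses get_text() to get the cleaned-up text instead of printing the whole tag, and writes one CSV row per article.

```python
import csv

import bs4
import requests

# Placeholder value: replace with your own browser's user agent string.
HEADERS = {"User-Agent": "Mozilla/5.0 (example user agent)"}


def fetch_html(url):
    """Download a page and return its raw bytes (res.content, not res.text)."""
    res = requests.get(url.strip(), headers=HEADERS, timeout=10)
    res.raise_for_status()
    return res.content


def parse_articles(html):
    """Return one dict per <article> with cleaned headline, summary and url.

    The selectors are assumptions taken from this thread, not verified
    against the real page layout.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("article"):
        link = article.select_one("header > h4 > a")
        excerpt = article.select_one("div.post-excerpt")
        if link is None:
            continue  # skip articles that do not match the assumed layout
        rows.append({
            # get_text(strip=True) drops the tags and surrounding whitespace
            "headline": link.get_text(strip=True),
            "summary": excerpt.get_text(" ", strip=True) if excerpt else "",
            "url": link.get("href", ""),
        })
    return rows


def write_csv(rows, file):
    """Write a header line, then one CSV row per article."""
    writer = csv.DictWriter(file, fieldnames=["headline", "summary", "url"])
    writer.writeheader()
    writer.writerows(rows)


def main():
    """Fetch the page from the question and save the results."""
    rows = parse_articles(fetch_html("https://cybersins.com/"))
    # Open the file once, outside the loop, so every run starts clean.
    with open("TestScrape.csv", "w", newline="", encoding="utf-8") as f:
        write_csv(rows, f)
```

Calling main() runs the whole pipeline. Keeping the parsing in its own function means you can try it on a saved copy of the page without hitting the site every time.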

Hope this helps some.
