Create seperate dataframes for each url dataset

xBetina · October 28, 2020, 1:07pm

I am creating a web scraping program using python, BeautifulSoup, pandas and Google Sheets. Up until now I have managed to scrape data tables from a couple of webpages from a list in Google sheets. What I want to achieve is, that for each table in every url, I want to create a dataframe. Right now data from a the last url in the list is being imported to Google Sheets, but it seems like the data from the first url is being imported, but then gets overwritten by the data from the next url - maybe it has something to do with the indexes?

My code so far looks like this:

from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from df2gspread import df2gspread as d2g
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from bs4 import BeautifulSoup
import requests



credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')

scoped_credentials = credentials.with_scopes(
        ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
        )

gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
spreadsheet_key = gc.open_by_key('api_key')

worksheet = spreadsheet_key.sheet1



# List of url's from Google Sheets
link_list = worksheet.col_values(5)


def get_info(page_url) :

    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')

    try :

        tbl = soup.find('table')

        labels = []
        results = []


        for tr in tbl.findAll('tr'):
            headers = [th.text.strip() for th in tr.findAll('th')]
            data = [td.text.strip() for td in tr.findAll('td')]
            labels.append(headers)
            results.append(data)
        

        final_results = []

        for final_labels, final_data in zip(labels, results):
            final_results.append({'Labels': final_labels, 'Data': final_data})


        df = pd.DataFrame(final_results)
        
        set_with_dataframe(worksheet, df, include_index=False)


    except Exception as e:
        print(e)



for link in link_list :
    get_info(link)

And the output:

                  Labels                                               Data
0      [Celebrated Name]                                         [2 Chainz]
1                  [Age]                                         [43 Years]
2            [Nick Name]                              [Tity Boi, Drenchgod]
3           [Birth Name]                                     [Tauheed Epps]
4           [Birth Date]                                       [1977-09-12]
5               [Gender]                                             [Male]
6           [Profession]                                           [Rapper]
7       [Place Of Birth]             [College Park, Georgia, United States]
8          [Nationality]                                         [American]
9            [Ethnicity]                                    [Afro-American]
10           [Horoscope]                                            [Virgo]
11         [High School]                        [North Clayton High School]
12          [University]  [Alabama State University and Virginia State U...
13      [Marital Status]                                          [Married]
14                [Wife]                                       [Kesha Ward]
15            [Children]                        [Heaven, Harmony, and Halo]
16     [Body Build/Type]                                         [Athletic]
17    [Body Measurement]                                  [43-15-34 inches]
18          [Chest Size]                                        [43 inches]
19          [Bicep Size]                                        [15 inches]
20          [Waist Size]                                        [34 inches]
21           [Shoe Size]                                           [14 (US]
22              [Height]                                  [6 feet 5 inches]
23              [Weight]                                            [88 kg]
24           [Net Worth]                                      [$ 6 Million]
25              [Salary]                                        [$ 100,000]
26  [Sexual Orientation]                                         [Straight]
27           [Eye Color]                                       [Dark Brown]
28          [Hair Color]                                            [Black]
29               [Links]             [Wikipedia,Instagram,Twitter,Facebook]
                    Labels                                     Data
0        [Celebrated Name]                              [Don Lemon]
1                    [Age]                               [54 Years]
2              [Nick Name]                              [Don Lemon]
3             [Birth Name]                              [Don Lemon]
4             [Birth Date]                             [1966-03-01]
5                 [Gender]                                   [Male]
6             [Profession]                             [Journalist]
7           [Birth Nation]                          [United States]
8         [Place Of Birth]  [Baton Rouge, Louisiana, United States]
9            [Nationality]                               [American]
10              [Siblings]                 [Leisa Lemon, Yma Lemon]
11             [Ethnicity]                                  [Mixed]
12             [Eye Color]                                  [Brown]
13            [Hair Color]                                  [Black]
14              [Religion]                              [Christian]
15                [Height]                          [5 Feet 6 Inch]
16                [Weight]                              [Not Known]
17           [Working For]                                    [CNN]
18        [Best Known For]                            [CNN Tonight]
19                [School]                      [Baker High School]
20  [College / University]                        [Brookyn College]
21            [University]             [Louisiana State University]
22             [Horoscope]                                 [Pisces]
23             [Net Worth]               [$ 3 million (As of 2018)]
24            [Famous For]  [For hosting the program ‘CNN Tonight’]
25      [Body Measurement]                               [40-32-35]
26                [Awards]                             [Emmy Award]
27                [Salary]                                [$125000]
28                 [Links]      [WikipediaFacebookTwitterInstagram]

So, the data from both urls is being printed, but how do I create a dataframe for each dataset and import it to Google Sheets?

Gerschel · October 28, 2020, 5:49pm

I found these docs, it says that writing occurs by default starting at the top left. So keep track of how many rows and columns if necessary so you can specify where each write occurs at.

gspread_dataframe. set_with_dataframe ( worksheet , dataframe , row=1 , col=1 , include_index=False , include_column_header=True , resize=False , allow_formulas=True )

Sets the values of a given DataFrame, anchoring its upper-left corner at (row, col). (Default is row 1, column 1.)

https://pythonhosted.org/gspread-dataframe/

Topic		Replies	Views
Need help on extract dynamic table data Python	17	4646	November 4, 2021
First attempt to read HTML into DF Python	1	255	July 28, 2022
Loop through all excel files and all worksheets Python	1	6437	June 1, 2023
Lesson-18-get-fifa-players-from-the-web Python	1	313	July 8, 2021
Extract data from table in Google doc	2	234	April 12, 2025

Create seperate dataframes for each url dataset

Related topics