Need help extracting dynamic table data

Hi all,

I am trying to extract the table data at the link below:

https://oir.centanet.com/en/transaction/…ntindex=-1

I have tried many methods to extract it, but they all fail. My code is below for your reference. I want to extract the property sales data and load it into a pandas DataFrame for easier handling.

#### Code #####
base_url = "https://oir.centanet.com/en/transaction/index?daterang=01%2F05%2F2019-29%2F04%2F2021&pageindex=0&pagesize=10&suggestlist=%5B%5D&posttype=0&districts=&sortby=0&usages=&pricetype=0&minprice=&maxprice=&renttype=0&minrent=&maxrent=&minarea=&maxarea=&centabldg=&sellindex=-1&rentindex=-1"

driver.get(base_url)

sleep(30)

tables=WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.XPATH,"//*[@id='__layout']/div/div[1]/div/div[3]/div[1]/div[2]/table/tbody")))

for t in range(len(tables)):
    print("element in list: ", tables[t].text)

##############


Thank you for your help.

Did you look for a Python module that provides web scraping and extracts the table into a nice, readable data structure, which you could then modify using pandas?

Hi, Jagaya,

Yes. I would like to extract the table into a nice, readable data structure and then modify it using pandas. I find that this website is not easy to scrape. I tried BeautifulSoup and the webdriver, but neither seems able to extract the data. There are also many pages on the website, and I don’t know how to read the next page’s data using Python.

Welcome there,

I’ve edited your post for readability. When you enter a code block into a forum post, please precede it with a separate line of three backticks and follow it with a separate line of three backticks to make it easier to read.

You can also use the “preformatted text” tool in the editor (</>) to add backticks around text.

See this post to find the backtick on your keyboard.
Note: Backticks (`) are not single quotes (’).

Hi Dr_Strange,

I tried it today, and it is indeed possible to scrape it by using the Selenium webdriver to capture all the information as static HTML.

Printing the result of html = driver.page_source raises a UnicodeEncodeError (I found it by using print(html)), which stops everything unless the text is decoded (thank you, oh Google :raised_hands: :wink: ):

Traceback (most recent call last):
  File "C:<map location>\dr_strange.py", line 24, in <module>
    print(html)
  File "C:<map location>\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uff5c' in position 2891: character maps to <undefined>

At [Stack Overflow](https://stackoverflow.com/questions/62656579/why-im-getting-unicodeencodeerror-charmap-codec-cant-encode-character-u2) I found the following decode workaround:
html_dec = html.encode('utf-8').decode('ascii', 'ignore')

After this, the data can be examined with beautifulsoup, and it runs like a charm.
I have not tried reading the next page yet (and the next… and so on), but I did it for another website, so it should be possible.
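To illustrate what that workaround actually does (the string below is a made-up sample, not text from the site): the encode/decode round-trip silently deletes every non-ASCII character, while encoding with errors="replace" keeps a visible placeholder instead:

```python
html = "Global Trade Centre \uff5c High Floor"  # made-up sample line

# The thread's workaround: round-trip through ASCII, dropping every byte
# that a cp1252 console would choke on -- the '\uff5c' bar simply vanishes.
stripped = html.encode("utf-8").decode("ascii", "ignore")

# A gentler variant keeps a '?' placeholder for each unencodable character.
replaced = html.encode("cp1252", errors="replace").decode("cp1252")

print(stripped)
print(replaced)
```

Both forms sidestep the console's cp1252 limitation; which one to use depends on whether losing characters entirely is acceptable for the later parsing.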

I have to go and make dinner now, but I will do the rest tomorrow morning (add the other data, go through the pages, write to a dataframe)! (But please feel free to work on it too and keep us updated on your progress :slight_smile: )

Here is the code so far to get the district of every transaction on the first page:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Instantiate lists to append data to for dataframe
District = []

# url of the page
url = "https://oir.centanet.com/en/transaction/index?daterang=01%2F05%2F2019-29%2F04%2F2021&pageindex=0&pagesize=10&suggestlist=%5B%5D&posttype=0&districts=&sortby=0&usages=&pricetype=0&minprice=&maxprice=&renttype=0&minrent=&maxrent=&minarea=&maxarea=&centabldg=&sellindex=-1&rentindex=-1"

# initiating the webdriver. 
driver = webdriver.Chrome(r'C:\Program Files\Chromedriver\chromedriver.exe')
driver.get(url)

# Ensure page is fully loaded
time.sleep(3)

# Get the source of the current page
html = driver.page_source

# decode to work around error
# https://stackoverflow.com/questions/62656579/why-im-getting-unicodeencodeerror-charmap-codec-cant-encode-character-u2
html_dec = html.encode('utf-8').decode('ascii', 'ignore')

# Apply beautifulsoup to html variable
soup = BeautifulSoup(html_dec, "html.parser")

# Make a list of all transactions
all_transactions = soup.find_all('div', class_='m-transaction-list-item')
# Iterate over each transaction and add data to list
for transaction in all_transactions:
    district_elem = transaction.find('p', class_="adress").text.strip()  # 'adress' is how the site spells this class
    District.append(district_elem)

# Close webdriver
driver.close()

Dear Brian,

Thank you for your reply; your code works nicely. I have revised it a bit to extract the table data. Attached is a screenshot of the table data I am focusing on.


I added the code below for your reference.

# Make a list of all transactions
all_transactions = soup.find_all('div', class_='table_main')

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

Is there a better way to extract the data into a pandas DataFrame, so that the output looks like the table format below?

Instrument Date Usage District Property Name Floor Unit Area Price Unit Price Source
29/04/2021 Industrial Kwai Chung Global Trade Centre High – 879 sq.ft. approx Leased $7,911 @9 Market News
29/04/2021 Industrial Kwai Chung Global Trade Centre High – 916 sq.ft. approx Leased $8,244 @9 Market News
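Since the transactions are rendered as a plain HTML table, pandas.read_html can do the header/row collection in one call (it needs lxml or html5lib installed). The snippet below uses a tiny made-up table; on the live page you would pass driver.page_source (decoded as above) instead:

```python
import io
import pandas as pd

# A tiny stand-in for the page's transaction table; with the real site
# you would feed driver.page_source here instead of this sample markup.
html = """
<table>
  <tr><th>Instrument Date</th><th>Usage</th><th>District</th></tr>
  <tr><td>29/04/2021</td><td>Industrial</td><td>Kwai Chung</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found in the markup;
# <th> cells in the first row become the column names automatically.
df = pd.read_html(io.StringIO(html))[0]
print(df)
```

This replaces the manual th/td loops entirely, although the BeautifulSoup approach gives finer control when cells need cleaning during extraction.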

I am still figuring out a way to trigger the “next page” :sweat_smile:
By the way, your solution is a very good tutorial for people who want to extract web data using webdriver + BeautifulSoup.

1 Like

Dear Brian,

For the DataFrame, I added the code below; now the output looks nice in the CSV file.

####### Extract the data in pandas dataframe (start) #############
import pandas as pd

# Select header text and turn it into a list for Pandas
header_soup = soup.find_all('th')
row_soup = soup.find_all("tr")

headers = []
for h in header_soup:
    headers.append(h.text)

# Select row cell text and turn it into a list for Pandas
rows = []
for row in row_soup:
    row_data = []
    for cell in row.find_all('td'):
        print(cell.text.strip())
        row_data.append(cell.text.strip())
    rows.append(row_data)

# Create the dataframe from your table data
df = pd.DataFrame(rows, columns=headers)
print(df)
df.to_csv("CP00.csv")  # output the result for investigation if any error
####################################################

Please let me know if you find a method to trigger the “next page” data :+1:
Many thanks

Thank you for the compliment! I must admit my solution was formed by searching, taking pieces of examples, and putting them together; when it works, I am a happy and proud boy.

Your code is very nice and so much more efficient!
My method was to append each element of each transaction to a separate list and then write those to a dataframe, which is much more cumbersome than your method, so thanks for sharing that.
I have two minor additions to the final part:

# Create the dataframe from your table data
df = pd.DataFrame(rows[1:], columns=headers)  # removes empty row
print(df)
# output the result for investigation if any error
df.to_csv("CP01.csv", sep="*")  # the transaction price uses a comma, so an alternative separator might be better

Regarding this:

  • Removing the empty row can also be done when cleaning the data in the dataframe, which might be more ‘correct’, since more cleaning will probably be applied to the dataframe anyway. It could also be left out when appending the data to the rows list. In this case I noticed the empty row while analysing the code; in normal coding I probably would not have seen it and would have dealt with it when cleaning the dataframe.
  • The alternative separator is not needed if the CSV is not used for anything else; this is just a habit of mine.

Regarding the “next page”: I scraped a few static websites where the page number was in the URL of the site; it could be manipulated with a for loop and put back into the URL, thus visiting all the pages. I looked up the maximum page number, so the loop stopped at the right place.
However, the site you are scraping does not have that :frowning:, so I found that Selenium is able to interact with the page and ‘click’ on elements. Since I am learning this part on the go too, I will look into it and post my solution as soon as I have it working.
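For reference, the static-site pattern described above looks roughly like this (the URL pattern and page count are placeholders, not this site’s real behaviour, since its paging is driven by JavaScript):

```python
# Placeholder URL pattern and page count: on a static site the page
# number sits in the query string, so a loop can build every address.
base = "https://example.com/transactions?pageindex={}"
page_max = 5  # read from the pager shown on the first page

urls = [base.format(n) for n in range(page_max)]
for url in urls:
    pass  # fetch url, parse the table, append the rows
```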

So, here we go:

For this site I came up with the following solution (after trial and lots of error figuring out how to identify the arrow to the next page). It is inserted after appending the data to the list rows:

###### click to next page #######
import random

# identify arrow to go to next page and select it by using webdriver
element = driver.find_element_by_xpath("//div[@class='right el-icon-arrow-right']")

# click the element
element.click()

# grant page some time to load AND randomise these intervals
time.sleep(random.randint(3, 9))
################################

Then repeat getting the source of the page, decode it, apply bs4 again, etc., and eventually create the dataframe.

Now, since it’s almost dinner time again (hmmm…it looks like I am having dinner a lot when I am online :flushed:), the things I did not solve yet are:

  • How to make it stop (the button is inactive on the final page, but I have not yet worked out how to stop the program without errors). We can identify the maximum number of pages, increment a counter after each ‘click to next page’ and stop when counter == page_max, or have the program check whether the button is active: if so, click; else stop.

  • How to make the correct parts repeat without making the code unreadable (I think by defining functions and for or while loops?). This seems like basic Python, but I focused on getting all the parts to work correctly first and did not worry about other things yet… (For example: you need the url and driver only once, and the same goes for the headers of the dataframe later on, but getting the html and actually scraping the data needs to be repeated for every page.)

  • How much data can be scraped and held in memory BEFORE it is necessary to write it into a dataframe (because of slowdowns from memory pressure, or possible crashes that would make all the scraped data disappear again)? Data can be appended after scraping each page, which sounds sensible and safe, but it costs time (although less than the time.sleep() parts).
    The same goes for actually saving the data in a CSV, to have a copy in case of crashes during scraping.
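The first two points could be sketched like this (my own function names, and the same find_element_by_xpath selector as above; only the HTML-parsing part is testable without a live driver):

```python
from bs4 import BeautifulSoup

def parse_rows(html):
    """Turn one page's HTML into a list of row lists (one list per <tr>)."""
    soup = BeautifulSoup(html, "html.parser")
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in soup.find_all("tr")]

def scrape_all_pages(driver, page_max):
    """Yield each page's rows; stop at page_max or when the arrow is gone."""
    import time
    for _ in range(page_max):
        yield parse_rows(driver.page_source)
        try:
            driver.find_element_by_xpath(
                "//div[@class='right el-icon-arrow-right']").click()
            time.sleep(5)
        except Exception:  # arrow not found, so this was the final page
            break

sample = "<table><tr><td>29/04/2021</td><td>Industrial</td></tr></table>"
print(parse_rows(sample))
```

Splitting the parsing into its own function means the scraping loop body stays one line per page, which addresses the readability worry above.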

So: it can be done, and I am happy to be able to assist in working towards a solution. Can you help me with tips on how to keep the code readable, and practical hints on the amount of data to process before storing and saving? I can really learn a lot from that!

Nice project to work on, I am having fun with this :smile:

1 Like

Dear Brian,

Your finding works well. I added the code below to test the result.

for c in range(1):
    #element = WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH,"//*[@id='__layout']/div/div[1]/div/div[3]/div[3]/div/div[2]/div[3]")))
    element = driver.find_element_by_xpath("//div[@class='right el-icon-arrow-right']")
    element.click()
    sleep(10)
    html = driver.page_source
    html_dec = html.encode('utf-8').decode('ascii', 'ignore')
    soup = BeautifulSoup(html_dec, "html.parser")
    header_soup = soup.findAll('th')
    row_soup = soup.find_all("tr")
    rows = []
    for row in row_soup:
        row_data = []
        for cell in row.findAll('td'):
            print(cell.text.strip())
            row_data.append(cell.text.strip())
        rows.append(row_data)
    df = pd.DataFrame(rows)
    print(df)
    df.to_csv("CP01.csv") ## output the result for investigation if any error

It looks nice for one page, but I need to add more cases to handle, like the ones you mentioned in your previous message, and I don’t want to repeat the code so often (to keep it readable). I am thinking of creating a function to reduce the repeated code. I would prefer to append the data to the CSV file before going to the next page, and also to clear all the lists to free up memory. We will soon have a final version of the code. Looking forward to running a full set of data :joy: I am also learning a lot from this project :grin:.
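Appending each page to the CSV before moving on could look like this (a stdlib-only sketch; the "*" separator follows the earlier code, while the function and file names are my own):

```python
import csv
import os

def append_page(path, rows, headers=None):
    """Append one page's rows to the CSV, writing headers only for a new file.

    Writing after every page means a crash loses at most one page of data.
    """
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="*")
        if new_file and headers:
            writer.writerow(headers)
        writer.writerows(rows)
```

After each call, rebinding with rows = [] gives the loop a fresh empty list and lets the old one be garbage-collected, which covers the memory concern too.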

Many thanks. :+1:

Dear Brian,

I tried to append the data to the DataFrame, but the output is strange.
Below is the code; you can see the output after the append. :sweat_smile:

for c in range(2):
    html = driver.page_source
    html_dec = html.encode('utf-8').decode('ascii', 'ignore')
    soup = BeautifulSoup(html_dec, "html.parser")
    header_soup = soup.findAll('th')
    row_soup = soup.find_all("tr")
    rows = []
    for row in row_soup:
        row_data = []
        for cell in row.findAll('td'):
            #print(cell.text.strip())
            row_data.append(cell.text.strip())
        rows.append(row_data)
    print(str(c))
    if c == 0:
        df = pd.DataFrame(rows[1:],columns=headers) #Remove empty row
        print(df)
    else:
        df2 = pd.DataFrame(rows[1:])
        print("Below Append 2")
        print(df2)
        mod_df = df.append(df2,sort=False)
        print("After Amend")
        print(mod_df)
    df.to_csv("CP00.csv",sep="*") ## output the result for investigation if any error
    #element = WebDriverWait(driver,20).until(EC.visibility_of_element_located((By.XPATH,"//*[@id='__layout']/div/div[1]/div/div[3]/div[3]/div/div[2]/div[3]")))
    element = driver.find_element_by_xpath("//div[@class='right el-icon-arrow-right']")
    element.click()
    sleep(10)



# Close webdriver
driver.close()

After the append, the table becomes the following: :sweat_smile:

        0           1           2  ... Unit Unit Price       Usage

0 NaN NaN NaN … – @9 Industrial
1 NaN NaN NaN … – @9 Industrial
2 NaN NaN NaN … – @9 Industrial
3 NaN NaN NaN … – @9 Industrial
4 NaN NaN NaN … – @9 Industrial
5 NaN NaN NaN … – @29 Office
6 NaN NaN NaN … – @24 Industrial
7 NaN NaN NaN … – @22 Office
8 NaN NaN NaN … 247 @16,327 Retail
9 NaN NaN NaN … 50 @15,625 Retail
0 29/04/2021 Office Sheung Wan … NaN NaN NaN
1 29/04/2021 Office Sheung Wan … NaN NaN NaN
2 29/04/2021 Office Sheung Wan … NaN NaN NaN
3 29/04/2021 Office Sheung Wan … NaN NaN NaN
4 29/04/2021 Industrial Kwun Tong … NaN NaN NaN
5 29/04/2021 Industrial Kwun Tong … NaN NaN NaN
6 29/04/2021 Carpark Tsuen Wan … NaN NaN NaN
7 29/04/2021 Industrial Kwun Tong … NaN NaN NaN
8 29/04/2021 Industrial Kwun Tong … NaN NaN NaN
9 29/04/2021 Office Admiralty … NaN NaN NaN

[20 rows x 22 columns] <==== the columns are wrongly added

Let’s see if you have any experience using DataFrames. :sweat_smile:
Many thanks

Yeah, the appended data (rows) do not have the same column names as the ones created earlier, so pandas creates new columns (which explains why the number of columns doubles). So the column names need to be ‘specified’ (probably not the correct phrase :flushed:) for the new data (rows).

How does the following work out for you? A drawback is that the CSV contains no index numbers. If I change it to index=True, each new page starts at index 0 again. But if you are going to clean the DataFrame from the CSV anyway, creating index numbers is a small effort, I think.

header_soup = soup.findAll('th')
row_soup = soup.find_all("tr")

headers = []
for h in header_soup:
    headers.append(h.text)

df = pd.DataFrame(columns=headers)
print(df)
df.to_csv("CP00.csv", sep="*", mode='a', header=True, index=False)

rows = []  
for row in row_soup:
    row_data = []
    for cell in row.findAll('td'):
        row_data.append(cell.text.strip())
    rows.append(row_data) 

# append with column names 'specified'
# (note: DataFrame.append was removed in pandas 2.0; pd.concat([df, new_df]) is the replacement)
df = df.append(pd.DataFrame(rows, columns=headers))
print("Full df", df)

df.to_csv("CP00.csv", sep="*", mode='a', header=False, index=False)

# Clear the dataframe of all data except the headers (this seems a rough
# method, but I could not find another way), so the same code for finding
# and appending the rows can be used for the next page without re-creating
# the same headers over and over again.
df = df[0:0]

####### click to next page#######
# Get the element; when it's not available, stop without errors, so you
# don't have to look for a maximum page number (I tested it on a search
# with 2 pages, it works).
try:
    element = driver.find_element_by_xpath("//div[@class='right el-icon-arrow-right']")
    # click the element
    element.click()
    time.sleep(5)
except NoSuchElementException:  # from selenium.common.exceptions
    print('All done')

# Close webdriver
driver.close()
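As an alternative to clearing the DataFrame between pages, each page’s rows can be collected and combined once at the end with pd.concat (the sample rows below are made up; since DataFrame.append was removed in pandas 2.0, concat is also the more future-proof choice):

```python
import pandas as pd

headers = ["Usage", "District"]  # as built from the <th> cells above
pages = [                        # made-up rows standing in for scraped pages
    [["Industrial", "Kwai Chung"]],
    [["Office", "Sheung Wan"]],
]

# Build one small frame per page, then combine them in a single step;
# ignore_index=True gives the final frame a fresh 0..n index.
frames = [pd.DataFrame(rows, columns=headers) for rows in pages]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Because every per-page frame is built with the same columns=headers, concat lines the columns up correctly, avoiding the doubled-column problem discussed above.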

I wonder if you could accomplish what you wanted to do using the Events functionality of the table maintenance generator: Environment → Modification → Events.

Dear Brian,

Your code is nice :+1:. Now I am trying to do the follow-up work, like removing duplicates and blank lines, etc. And what do you think is the best practice if I want to import the data into a database for further reporting, e.g. data transformation using Python? I will post my work when it is finished :grin:

Many thanks

Dear Stephanie,

Would you show me more?
I don’t understand how that would work :sweat_smile:

Many thanks

Hi Dr_Strange,

I think (for the method we are trying: scrape one page, write to df, write df to CSV, scrape the next page, etc.) best practice would be to scrape all the data first, and only when it is all in the .csv file use that as the source for a new df and start worrying about cleaning the data, duplicates, removing blank lines, etc.
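That cleaning pass over the finished CSV might look like this (the sample frame below stands in for pd.read_csv("CP00.csv", sep="*") on the real file):

```python
import pandas as pd

# Stand-in for pd.read_csv("CP00.csv", sep="*") on the scraped file
df = pd.DataFrame({
    "Usage":    ["Industrial", "Industrial", None],
    "District": ["Kwai Chung", "Kwai Chung", None],
})

clean = (df.dropna(how="all")        # drop rows where every cell is blank
           .drop_duplicates()        # drop repeated transactions
           .reset_index(drop=True))  # rebuild a clean 0..n index
print(clean)
```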

What is the purpose of your analysis of these data?

Did you manage to make the parts of the code repeat for more pages?

Looking forward to your work, I will post updates too :smiley:

Hi Brian,

I have added code to extract the total number of pages. My purpose in extracting the data is to insert it into a database and then use it to create reports (like property sales trends, etc.) with PowerBI. So the final version should be a program that extracts the data daily, transforms it, and then inserts it into the database if there is no duplicate record.

I am thinking of making the database import a separate program. :grin:

So I will face another problem: comparing against duplicated records in the database :joy:
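One common way to handle that is to let the database reject duplicates itself: put a UNIQUE constraint on the columns that identify a transaction and insert with INSERT OR IGNORE. This is a sketch with made-up table and column names, using the stdlib sqlite3 as a stand-in for whatever database you end up with:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("""
    CREATE TABLE transactions (
        instrument_date TEXT,
        property_name   TEXT,
        price           TEXT,
        UNIQUE (instrument_date, property_name, price)
    )
""")

rows = [
    ("29/04/2021", "Global Trade Centre", "$7,911"),
    ("29/04/2021", "Global Trade Centre", "$7,911"),  # duplicate, ignored
]
# INSERT OR IGNORE skips any row that would violate the UNIQUE constraint,
# so re-running the daily import never creates duplicate records.
conn.executemany("INSERT OR IGNORE INTO transactions VALUES (?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0])
```

Other databases have equivalents (e.g. ON CONFLICT DO NOTHING in PostgreSQL), so the Python side never needs to compare records itself.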