Hi Dr_Strange,
I tried it today and it is indeed possible to scrape the site with Selenium WebDriver, which renders everything into static HTML.
Grabbing the source with `html = driver.page_source` works, but printing the result raises a UnicodeEncodeError (I found it by using print(html)), which stops everything unless the offending characters are removed first (thank you, oh Google):
```
Traceback (most recent call last):
  File "C:<map location>\dr_strange.py", line 24, in <module>
    print(html)
  File "C:<map location>\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uff5c' in position 2891: character maps to <undefined>
```
On [Stack Overflow](https://stackoverflow.com/questions/62656579) I found the following workaround:

```python
html_dec = html.encode('utf-8').decode('ascii', 'ignore')
```
After this, the data can be parsed with BeautifulSoup and it works like a charm (note that this trick simply strips every non-ASCII character from the page).
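Just to make clear what the workaround actually does (a small demo, not part of the scraper), here it is applied to a string containing the fullwidth bar from the traceback:

```python
# Demo: encode('utf-8') turns the string into bytes, and decoding those
# bytes as ASCII with errors='ignore' silently drops every non-ASCII byte.
html = "Kowloon\uff5cCentral"  # contains U+FF5C, the character from the traceback
html_dec = html.encode('utf-8').decode('ascii', 'ignore')
print(html_dec)  # KowloonCentral
```

So be aware: any non-English text on the page is gone after this step.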
I did not try reading the next page yet (and the next, and so on), but I have done it for another website, so it should be possible.
I have to go and make dinner now, but I will do the rest tomorrow morning (add the other data, go through the pages, write everything to a dataframe)! (But please feel free to work on it too and keep us updated on your progress.)
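For the paging, the URL below already carries a `pageindex` query parameter, so it may be enough to rewrite that value per page. A small sketch (assuming the site actually honors `pageindex` this way; I have not verified that yet):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def page_url(url, pageindex):
    """Return the same URL with its pageindex query parameter replaced."""
    parts = urlparse(url)
    # keep_blank_values keeps the many empty parameters (minprice=, maxprice=, ...)
    query = parse_qs(parts.query, keep_blank_values=True)
    query['pageindex'] = [str(pageindex)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

That helper could then be called inside a loop, with a `time.sleep` between pages to let each one load.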
Here is the code so far to get the district of every transaction on the first page:
```python
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# List to collect data for the dataframe
District = []

# URL of the first results page
url = "https://oir.centanet.com/en/transaction/index?daterang=01%2F05%2F2019-29%2F04%2F2021&pageindex=0&pagesize=10&suggestlist=%5B%5D&posttype=0&districts=&sortby=0&usages=&pricetype=0&minprice=&maxprice=&renttype=0&minrent=&maxrent=&minarea=&maxarea=&centabldg=&sellindex=-1&rentindex=-1"

# Start the webdriver (Selenium 3 style; Selenium 4 would pass a Service object)
driver = webdriver.Chrome(r'C:\Program Files\Chromedriver\chromedriver.exe')
driver.get(url)

# Give the page time to load fully
time.sleep(3)

# Get the source of the current page
html = driver.page_source

# Strip non-ASCII characters to work around the UnicodeEncodeError
# https://stackoverflow.com/questions/62656579/why-im-getting-unicodeencodeerror-charmap-codec-cant-encode-character-u2
html_dec = html.encode('utf-8').decode('ascii', 'ignore')

# Parse the cleaned HTML with BeautifulSoup
soup = BeautifulSoup(html_dec, "html.parser")

# Collect all transaction blocks on the page
all_transactions = soup.find_all('div', class_='m-transaction-list-item')

# Pull the district out of each transaction
for transaction in all_transactions:
    # note: the class name on the page really is spelled "adress"
    district_elem = transaction.find('p', class_="adress").text.strip()
    District.append(district_elem)

# Close the browser window
driver.close()
```
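For the dataframe step tomorrow, the collected lists can be combined roughly like this (a sketch with made-up example values standing in for the scraped data):

```python
import pandas as pd

# Example values only, not real scraped results
District = ["Central", "Wan Chai"]

# One column per collected list; further lists (price, area, ...)
# can be added to the dict the same way
df = pd.DataFrame({"District": District})
print(df)
```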