How would you scrape this website using beautiful soup?

Here is the website:

When you zoom into the map, blue blobs appear. When you click on a blue blob, information about it pops up. I want to retrieve this information for each blob; I believe there are about 50,000 blobs.

I’ve seen another person scrape this information in the past, but I don’t know how they did it. I imagine it is relatively simple.

Thanks for your time


P.S. I have tried this code but it did not give me back anything, just a blank list (I am new to Python and Beautiful Soup):

The reason I chose to search using the word ‘image’ is that when you click a blue blob using ‘inspect’, an image tag appears for each blob with a slightly different id:

So I thought that if I got all the image tags I would get a list of all the blobs. But BeautifulSoup doesn’t seem to find the image tags, why not?

If you want to scrape information from a website, you first have to find where the information actually comes from. For static websites, all the information is in the HTML file, but for dynamic websites the information might come from some API. To figure this out, use the Network tab in your browser’s devtools. The first request (with type: HTML) is the one that you are reading with BeautifulSoup.

In your case, this HTML file does not contain any images, which is why your list is empty. You will find that each time you zoom in, a new set of images is loaded. And when you have zoomed in far enough, a request is made to get the ‘blobs’ in that part of the map. Maybe if you change the parameters of that request, you can get the data you want.


Thanks for your response @BenGitter

So as you move across the map, the website makes API requests, as shown below:

I would like to get the 3 pieces of information underlined in red for each blue blob

Do you know how I could do this?

Thanks for your time


It is JSON data, so Python can read it with the json module. json.loads() simply returns a dict version of the JSON data.
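For example, with a response body shaped like the map’s GeoJSON (the sample string here is just an illustration):

```python
import json

# json.loads turns a JSON string into regular Python dicts/lists
raw = '{"features": [{"properties": {"id": 1, "name": "blob"}}]}'
data = json.loads(raw)
print(data["features"][0]["properties"]["name"])  # blob
```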

Just to add a bit here, if you head over to the “Headers” subtab of one of those fetch requests from your last image, you’ll find the URL for the API and the query string parameters it’s using.

example params

idCapa: null
bbox: -3.688900642395,40.420963764193,-3.6792768764496,40.429664850237
zoom: 4

I think the parameter to pay attention to here is “bbox”, which is a comma-separated sequence of decimal coordinates making up the geographical bounding box the API will search for “blob” results. The parameter contains 2 coordinate pairs, each with the East/West (longitude) value first followed by the North/South (latitude) value. The first pair represents the bottom-left corner of the box and the second pair represents the top-right corner (as far as I can tell).
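To make that layout concrete, the example bbox above splits into its two corner pairs like this (plain Python, nothing site-specific):

```python
# bbox layout: lon(bottom-left), lat(bottom-left), lon(top-right), lat(top-right)
bbox = "-3.688900642395,40.420963764193,-3.6792768764496,40.429664850237"
west, south, east, north = map(float, bbox.split(","))
print((west, south))  # bottom-left corner
print((east, north))  # top-right corner
```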

If you can figure out the coordinates representing the bounding box area you would like to get “blob” results for, you can probably fetch them all in one request to the API endpoint using that parameter and then parse the returned JSON data into a Python dictionary (no beautiful soup required); unless you hit some API imposed restrictions, in which case you may need to divide up the bounding box area into smaller sections and make a call for each.

If you’re using the little coordinate box on the map that follows your mouse, you can convert the sexagesimal degrees/minutes/seconds to decimal degrees. Note that values tagged with an “O” (Oeste, i.e. West) at the end should have a negative decimal value when entered as a parameter.
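As a sketch (the helper name is my own, not from the site), the conversion looks like this:

```python
# Convert sexagesimal degrees/minutes/seconds to the decimal degrees the
# 'bbox' param expects. Directions "O" (Oeste/West) and "S" come out negative.
def dms_to_decimal(degrees, minutes, seconds, direction):
    decimal = degrees + minutes / 60 + seconds / 3600
    if direction in ("O", "W", "S"):
        decimal = -decimal
    return decimal

# e.g. 3° 41' 20" O is roughly -3.6889, matching the example bbox longitude
print(dms_to_decimal(3, 41, 20, "O"))
```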


Hi @kylec

I wrote the code below in a Jupyter notebook using your advice and it works; however, it only retrieves info for 1,000 blobs instead of the roughly 50,000–60,000 that exist.

What could I do to get all of the data?

BTW, I had written code earlier that called the API for every little zoomed rectangle on the map, but it takes several days to run (it has to call the API about 5,460,000 times to search the whole of Spain and its islands).

I know for sure this task can be done in a lot less time than a week, because I am aware of someone who wrote the code and retrieved the data in less than 3 days after work using VBA; I just can’t figure out how.

Thanks for your time


import requests
import csv

def get_data_from_api_3():
    headers = {
        'Connection': 'keep-alive',
        'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
        'sec-ch-ua-mobile': '?0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
        'sec-ch-ua-platform': '"Windows"',
        'Accept': '*/*',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'Niveles de Exposición',
        'Accept-Language': 'en-GB-oxendict,en;q=0.9,es-ES;q=0.8,es;q=0.7',
    }
    params = (
        ('idCapa', 'null'),
        ('bbox', '-8.84892992, 35.74657448, 6.49954908, 44.78973632'),
        # These coordinates go from the bottom left of the Canary Islands
        # to the top right of Spain and the Balearic Islands, ensuring we
        # capture all mobile towers in a single API call
        ('zoom', '4'),
    )
    response = requests.get('', headers=headers, params=params)  # (API URL omitted)
    return response.json()

mobile_towers_data = get_data_from_api_3()
jsondata = mobile_towers_data['features']

data_file = open('jsonoutput.csv', 'w', newline='')
csv_writer = csv.writer(data_file)

count = 0
for data in jsondata:
    if count == 0:
        # Write the column names once, then a row per tower
        header = data['properties'].keys()
        csv_writer.writerow(header)
        count += 1
    csv_writer.writerow(data['properties'].values())

data_file.close()


Good job! You’ve definitely got the idea.

So the issue here is that the API is imposing a limit of 1,000 results per request; that is the wall you are hitting.

Figuring out the limitations of an API is one of the trickier bits of scraping data from them. Sometimes there is documentation explaining what you can and cannot do with any particular endpoint; however that does not appear to be the case here (or at least I couldn’t find any), which is not particularly surprising as I’m not sure this API was really meant to be used for retrieving the entire dataset.

One option that might be good to think about before diving into more web requests is to contact the data owners/maintainers (which looks to be the Government of Spain: Ministry of Economic Affairs and Digital Transformation in this case) and request the dataset. This is a more common route than a lot of people realize (it’s especially easy if you’re in academia) and doesn’t run the risk of cutting into the maintainer’s bandwidth. This is data they seem to be encouraging the public to look at, so it seems likely they’d be willing to hand the dataset over to an interested researcher.

With that, if you want to continue with this approach, the next question is going to be how to divide up that big bounding box into smaller ones. You’d essentially make one request per little bounding box and then combine all the results together.
The simplest way would be to divide the big box into smaller, equally sized boxes. You’d have to use a box size small enough that even in the densest area of cell towers you wouldn’t hit that 1,000 data point limit. Given the area you’d like to cover, I’d say that could be somewhere around 2,000 little boxes (just a guesstimate from looking at some spots in Madrid), many of which will likely contain no results at all (oceans and undeveloped land are not likely to have many cell towers).
If you’ve got some knowledge of the data’s density distribution, you could instead create boxes whose area is inversely proportional to the density of cell towers. Sparse areas like undeveloped land and oceans could get much larger bounding boxes without risking that 1,000 result limit, while dense urban areas with a lot of cell towers would need to be divided up more finely to avoid it.
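One way to fill in the `generate_bounding_boxes` placeholder with the equal-grid idea (this is my own sketch; the grid dimensions are just an illustration):

```python
# Split the big bounding box into an n_x by n_y grid of smaller boxes, each
# formatted as the comma-separated string the 'bbox' parameter expects.
def generate_bounding_boxes(full_coordinates, n_x=50, n_y=40):
    x0, y0, x1, y1 = full_coordinates
    dx = (x1 - x0) / n_x
    dy = (y1 - y0) / n_y
    for i in range(n_x):
        for j in range(n_y):
            left, bottom = x0 + i * dx, y0 + j * dy
            yield f"{left},{bottom},{left + dx},{bottom + dy}"

# The big box covering Spain and its islands, from the earlier request
spain = (-8.84892992, 35.74657448, 6.49954908, 44.78973632)
boxes = list(generate_bounding_boxes(spain))
print(len(boxes))  # 2000 boxes at this grid size
```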

Also, while you can totally do this with the csv module and manually parsing the dictionaries (and there is nothing wrong with that), I would like to point out that each result is already returned in GeoJSON format. There is a library called geopandas that extends the pandas data manipulation library and is made for handling exactly this type of data. As an example that extends your previous code:

# First, modify your original request function to pass the bounding box
# coordinates into the 'bbox' param (and return the parsed JSON, as before)

import geopandas
import pandas
from time import sleep

def generate_bounding_boxes(full_coordinates):
    # generate your individual bounding boxes for each request here
    ...

# The big bounding box from your earlier request
full_coordinates = (-8.84892992, 35.74657448, 6.49954908, 44.78973632)

# Collect one GeoDataFrame per bounding box, then combine them at the end
frames = []

# Loop through each individual bounding box
for bbox in generate_bounding_boxes(full_coordinates):
    # Make an API request for the bbox coordinates
    data = get_data_from_api_3(bbox)
    # Create a GeoDataFrame from the response GeoJSON data
    bbox_gdf = geopandas.GeoDataFrame.from_features(data)
    print(f"Results in {bbox}: {len(bbox_gdf)}")
    # Append the data points to the list of frames
    frames.append(bbox_gdf)
    # IMHO, it's always a good idea to space out requests to free/public APIs
    sleep(1)

# Combine everything into one GeoDataFrame
gdf = pandas.concat(frames, ignore_index=True)

# Any data points that are on the perimeter of bounding boxes
# (i.e. two boxes touch), or if your bounding boxes have overlapping areas,
# could have been returned in multiple request calls, resulting in duplicates.
# Pandas makes it very easy to drop these duplicates from the DataFrame
gdf = gdf.drop_duplicates()
print(f"Total Results: {len(gdf)}")
print(f"Data Features: {gdf.columns}")
# Save the DataFrame to a CSV file
gdf.to_csv('towers.csv', index=False)

Edit: Reassessed my guesstimate


This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.