When you zoom into the map, blue blobs appear, and when you click on a blue blob, information about it pops up. I want to retrieve this information for each blob; there are about 50,000 blobs, I believe.
The reason I chose to search using the word ‘image’ is that when you click a blue blob using ‘inspect’, an image tag appears for each blob with a slightly different id:
If you want to scrape information from a website, you will have to find where the information is actually coming from. For static websites, all the information is in the HTML file, but for dynamic websites the information might be coming from some API. To figure this out, use the Network tab in your browser’s devtools. The first request (with type: HTML) is the one that you are reading with BeautifulSoup.
In your case, this html file does not contain any images, which is why your list is empty. You will find that each time you zoom in, a new set of images is loaded. And when you have zoomed in far enough, a request is made to get the ‘blobs’ in that part of the map. Maybe if you change the parameters of that request, you can get the data you want.
Just to add a bit here, if you head over to the “Headers” subtab of one of those fetch requests from your last image, you’ll find the URL for the API and the query string parameters it’s using.
I think the parameter to pay attention to here is “bbox”, which represents a comma-separated sequence of decimal coordinates that make up a geographical bounding box for which the API will search and return “blob” results. The parameter contains 2 pairs of coordinates, each with the East/West (longitude) coordinate first, followed by the North/South (latitude) coordinate. The first coordinate pair represents the bottom-left corner of the box and the second pair represents the top-right corner (as far as I can tell).
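To make the format concrete, here is a tiny sketch of building that parameter value from two corner pairs (the coordinates here are made up for illustration, not taken from the real map):

```python
# Build the 'bbox' query parameter from two (longitude, latitude) corner
# pairs. These coordinates are illustrative only.
bottom_left = (-3.75, 40.35)  # lon (E/W), lat (N/S) of the bottom-left corner
top_right = (-3.60, 40.50)    # lon, lat of the top-right corner

bbox = ",".join(str(c) for c in bottom_left + top_right)
print(bbox)  # -3.75,40.35,-3.6,40.5
```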
If you can figure out the coordinates of the bounding box area you would like to get “blob” results for, you can probably fetch them all in one request to the API endpoint using that parameter and then parse the returned JSON data into a Python dictionary (no Beautiful Soup required). If you hit some API-imposed restriction, you may need to divide the bounding box area into smaller sections and make a call for each.
If you’re using the little coordinate box on the map that follows your mouse you can convert the sexagesimal degrees/minutes/seconds to decimal degrees. Note that values tagged with an “O” at the end of them should have a negative decimal value when entered as a parameter.
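The conversion itself is a simple weighted sum. Here is a small helper sketch (the function name and the hemisphere letters handled are my own choices; the “O” case covers the Spanish “oeste”/west marker mentioned above):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert sexagesimal degrees/minutes/seconds to decimal degrees.

    Hemisphere markers 'S' (south), 'W' (west) and 'O' (oeste, as the
    map labels it) produce a negative decimal value.
    """
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W", "O"):
        value = -value
    return value

# e.g. 3°42'36" O (west of the prime meridian) -> roughly -3.71
print(dms_to_decimal(3, 42, 36, "O"))
```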
I wrote the code below in a Jupyter notebook using your advice and it works; however, it only retrieves info for 1,000 blobs instead of the roughly 50,000-60,000 that exist.
What could I do to get all of the data?
BTW, I had written code earlier which would call the API for every little zoomed rectangle on the map, but it takes several days to run (it has to call the API about 5,460,000 times in order to search the whole of Spain and its islands).
I know for sure this task can be done in a lot less time than a week, because I am aware of someone who wrote the code and retrieved the data in less than 3 days after work using VBA; I just can’t figure out how.
Thanks for your time
Jaime
import requests

def get_data_from_api_3():
    headers = {
        'Connection': 'keep-alive',
        'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
        'sec-ch-ua-mobile': '?0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
        'sec-ch-ua-platform': '"Windows"',
        'Accept': '*/*',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'Niveles de Exposición',
        'Accept-Language': 'en-GB-oxendict,en;q=0.9,es-ES;q=0.8,es;q=0.7',
    }
    params = {
        'idCapa': 'null',
        # These coordinates go from the bottom left of the Canary Islands
        # to the top right of Spain and the Balearic Islands, aiming to
        # capture all mobile towers in a single API call
        'bbox': '-8.84892992, 35.74657448, 6.49954908, 44.78973632',
        'zoom': '4',
    }
    response = requests.get('https://geoportal.minetur.gob.es/VCTEL/infoantenasGeoJSON.do', headers=headers, params=params)
    return response.json()
import csv

# The response is a GeoJSON FeatureCollection; iterate over its 'features'
jsondata = get_data_from_api_3()['features']

with open('blobs.csv', 'w', newline='') as f:
    csv_writer = csv.writer(f)
    count = 0
    for data in jsondata:
        if count == 0:
            header = data['properties'].keys()
            csv_writer.writerow(header)
        count += 1
        csv_writer.writerow(data['properties'].values())
So the issue here is that the API imposes a limit of 1,000 results per request; that’s the cap your single big request is running into.
Figuring out the limitations of an API is one of the trickier bits of scraping data from them. Sometimes there is documentation explaining what you can and cannot do with any particular endpoint; however that does not appear to be the case here (or at least I couldn’t find any), which is not particularly surprising as I’m not sure this API was really meant to be used for retrieving the entire dataset.
One option that might be good to think about before diving into more web requests is to contact the data owners/maintainers (which looks to be the Government of Spain: Ministry of Economic Affairs and Digital Transformation in this case) and request the dataset. This is a more common route than a lot of people realize (it’s especially easy if you’re in academia) and doesn’t run the risk of cutting into the maintainer’s bandwidth. This is data they seem to be encouraging the public to look at, so it seems likely they’d be willing to hand the dataset over to an interested researcher.
With that, if you want to continue with this approach, the next question is going to be how to divide up that big bounding box into smaller ones. You’d essentially make one request per little bounding box and then combine all the results together.
The simplest way would just be to divide the big box into smaller, equally sized boxes. You’d have to use a box size small enough that even in the densest area of cell towers you wouldn’t hit that 1,000 data point limit. Given the area you’d like to cover, I’d say that could work out to somewhere around 2,000 little boxes (just a guesstimate from looking at some spots in Madrid), many of which will likely contain no results at all (oceans and undeveloped land are not likely to have many cell towers).
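A sketch of that even-grid split could look like this (the function name and grid dimensions are my own; tune the grid size to whatever keeps every box under the limit):

```python
def split_bbox(bbox, n_cols, n_rows):
    """Split (min_lon, min_lat, max_lon, max_lat) into an n_cols x n_rows
    grid of smaller bounding boxes in the same 4-tuple format."""
    min_lon, min_lat, max_lon, max_lat = bbox
    lon_step = (max_lon - min_lon) / n_cols
    lat_step = (max_lat - min_lat) / n_rows
    boxes = []
    for i in range(n_cols):
        for j in range(n_rows):
            boxes.append((
                min_lon + i * lon_step,
                min_lat + j * lat_step,
                min_lon + (i + 1) * lon_step,
                min_lat + (j + 1) * lat_step,
            ))
    return boxes

# The full Spain bbox from the question, split into a 50 x 40 grid
spain = (-8.84892992, 35.74657448, 6.49954908, 44.78973632)
boxes = split_bbox(spain, 50, 40)
print(len(boxes))  # 2000
```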
If you’ve got access to knowledge about the data’s density distribution, you could create boxes whose area is inversely proportional to the density of cell towers. Sparse geographic areas like undeveloped land and open ocean could have much larger bounding boxes without risking that 1,000-result limit, while dense urban areas with many cell towers would need to be divided up more finely to stay under it.
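If you don’t know the density up front, one way to adapt to it automatically is a quadtree-style recursive split: request a box, and if it comes back at the cap, split it into four quadrants and recurse. This is a sketch under my own naming; `count_results` here stands in for whatever call tells you how many blobs a box contains (e.g. the length of one API response):

```python
def subdivide_until_under_limit(bbox, count_results, limit=1000):
    """Recursively split a (min_lon, min_lat, max_lon, max_lat) box into
    quadrants until each piece yields fewer results than the API cap.

    count_results(bbox) is a stand-in for one API call's result count.
    """
    if count_results(bbox) < limit:
        return [bbox]
    min_lon, min_lat, max_lon, max_lat = bbox
    mid_lon = (min_lon + max_lon) / 2
    mid_lat = (min_lat + max_lat) / 2
    quadrants = [
        (min_lon, min_lat, mid_lon, mid_lat),  # bottom-left
        (mid_lon, min_lat, max_lon, mid_lat),  # bottom-right
        (min_lon, mid_lat, mid_lon, max_lat),  # top-left
        (mid_lon, mid_lat, max_lon, max_lat),  # top-right
    ]
    boxes = []
    for quadrant in quadrants:
        boxes.extend(subdivide_until_under_limit(quadrant, count_results, limit))
    return boxes

# Demo with a fake counter: 3200 "towers" spread uniformly over a
# 1 x 1 degree box, so one quadrant-level split suffices
def fake_count(bbox):
    min_lon, min_lat, max_lon, max_lat = bbox
    return int((max_lon - min_lon) * (max_lat - min_lat) * 3200)

boxes = subdivide_until_under_limit((0.0, 0.0, 1.0, 1.0), fake_count)
print(len(boxes))  # 4
```

Note this costs extra requests at the dense spots (each over-limit box triggers four more calls), but it spares you from hand-tuning box sizes region by region.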
Also, while you can totally do this with the csv module and manually parsing the dictionaries (and there is nothing wrong with that), I would like to point out that each result is already returned in GeoJSON format. There is a library called geopandas that extends the pandas data manipulation library and is made for handling exactly this type of data. Here is an example that extends your previous code:
# First, modify your original request function to accept a bounding box
# and pass its coordinates into the 'bbox' param
import geopandas
import pandas as pd
from time import sleep

def generate_bounding_boxes(full_coordinates):
    # generate your individual bounding boxes for each request here
    pass

# Collect one GeoDataFrame per bounding box
frames = []

# Loop through each individual bounding box
for bbox in generate_bounding_boxes(full_coordinates):
    # Make an API request for the bbox coordinates
    # (get_data_from_api_3 already returns the parsed JSON)
    data = get_data_from_api_3(bbox)
    # Create a GeoDataFrame from the response GeoJSON data
    bbox_gdf = geopandas.GeoDataFrame.from_features(data)
    print(f"Results in {bbox}: {len(bbox_gdf)}")
    # Keep the data points for this box
    frames.append(bbox_gdf)
    # IMHO, it's always a good idea to space out requests to free/public APIs
    sleep(1)

# Combine the per-box results (DataFrame.append was removed in pandas 2.0,
# so use pd.concat instead)
gdf = pd.concat(frames, ignore_index=True)

# Any data points that are on the perimeter of bounding boxes
# (i.e. two boxes touch), or in overlapping areas if your boxes overlap,
# could have been returned by multiple requests, resulting in duplicates.
# Pandas makes it very easy to drop these duplicates from the DataFrame
gdf = gdf.drop_duplicates()

print(f"Total Results: {len(gdf)}")
print(f"Data Features: {gdf.columns}")

# Save the DataFrame to a CSV file
gdf.to_csv("mobile_tower_geodata.csv")