The code for the following question is available at https://repl.it/@kkoutoup/ParliamentUKscraper
Question
I’m trying to create a web scraper with Node.js, request and cheerio.
The scraper works in two phases (so far):
Phase One
The first function, scrapeCommitteesPage(), visits a page, scrapes all links, adds them to an array, and passes the array to a second function called getUniqueIDs().
//dependencies
const request = require('request');
const cheerio = require('cheerio');
//uk parliament - select committees page
const committeesListUrl = 'https://bit.ly/2YXZBC7';
//get all links to committee pages
const scrapeCommitteesPage = () => {
  request(committeesListUrl, (error, response, body) => {
    if (!error && response.statusCode === 200) {
      //pass response body to cheerio
      const $ = cheerio.load(body),
        //get all committee page links - ul.square-bullets-a-to-z li a
        committeeLinksArray = [],
        linkList = $('.square-bullets-a-to-z a'),
        parliamentUrl = 'https://www.parliament.uk';
      //push all links to array
      for (let i = 0; i < linkList.length; i++) {
        if (linkList[i]) {
          committeeLinksArray.push(`${parliamentUrl}${linkList[i].attribs.href}`);
        }
      }
      //pass array to uniqueIDs function
      getUniqueIDs(committeeLinksArray);
    } else {
      //log the request error, or the status code when there is no error object
      console.log(error || `Unexpected status code: ${response.statusCode}`);
    }//end 1st request else
  });//end request
};
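To show the link-building step on its own, here is a minimal sketch that runs without any network access. The hrefs are made-up placeholders standing in for the attribs.href values cheerio returns, and toAbsoluteLinks is a hypothetical helper name, not part of the scraper above. It uses Node's built-in URL class rather than string concatenation.

```javascript
// Hypothetical hrefs, standing in for the attribs.href values cheerio would return.
const sampleHrefs = [
  '/business/committees/example-a/',
  '/business/committees/example-b/',
];
const parliamentUrl = 'https://www.parliament.uk';

// Resolve each relative href against the site root.
const toAbsoluteLinks = (hrefs, base) =>
  hrefs
    .filter(Boolean) // skip anchors that have no href
    .map((href) => new URL(href, base).href);

const committeeLinks = toAbsoluteLinks(sampleHrefs, parliamentUrl);
console.log(committeeLinks);
```

Resolving against a base URL also copes with hrefs that are already absolute, which plain concatenation would not.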
Phase Two
The second function, getUniqueIDs(), loops through the array of links passed as an argument from the first function, and saves each committee's information (name, id, url) as an object in a committeeDetails array.
//get unique IDs
const getUniqueIDs = (committeeLinksArray) => {
  for (let i = 0; i < committeeLinksArray.length; i++) {
    if (committeeLinksArray[i]) {
      request(committeeLinksArray[i], (error, response, body) => {
        if (!error && response.statusCode === 200) {
          //pass response body to cheerio
          const $ = cheerio.load(body),
            committeeDetails = [],
            //save committee name, id and url for rss feed
            committeeName = $("meta[property='og:title']").attr("content"),
            uniqueID = $("meta[name='search:cmsPageInstanceId']").attr("content"),
            committeeRSSUrl = `https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=${uniqueID}&type=Committee_Detail_Mixed`;
          //push details object to committeeDetails array
          committeeDetails.push({
            'committee-name': committeeName,
            'committee-ID': uniqueID,
            'committee-RSS-URL': committeeRSSUrl
          });
          //pass details to visitCommitteePage function
          visitCommitteePage(committeeDetails);
        } else {
          //log the request error, or the status code when there is no error object
          console.log(error || `Unexpected status code: ${response.statusCode}`);
        }
      });
    }
  }//end for loop
};
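The RSS URL construction can be checked in isolation. This is a small sketch, assuming the feed URL pattern used in the code above; buildCommitteeRssUrl is a hypothetical helper name, and returning null for a missing ID is my own choice for the sketch.

```javascript
// Build the committee RSS feed URL from a page instance ID.
const buildCommitteeRssUrl = (uniqueID) => {
  if (!uniqueID) return null; // the meta tag was missing on the page
  const url = new URL('https://www.parliament.uk/g/rss/committee-feed/');
  url.searchParams.set('pageInstanceId', uniqueID);
  url.searchParams.set('type', 'Committee_Detail_Mixed');
  return url.href;
};

console.log(buildCommitteeRssUrl('105517'));
// → https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=105517&type=Committee_Detail_Mixed
```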
When I console.log the result to check that all the information is there, I get a separate single-element array for each committee. What I need is a single array of objects that will later be manipulated in a third function, visitCommitteePage().
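The behaviour can be reproduced without the network in a stripped-down sketch. Here fakeRequest is a hypothetical stand-in for request that calls back immediately with canned data, and logged records what each callback would pass on. The point it illustrates is that committeeDetails is created inside the callback, so every response gets its own new array.

```javascript
// Hypothetical stand-in for request(): calls back right away with canned data.
const fakeRequest = (url, callback) =>
  callback(null, { statusCode: 200 }, `body of ${url}`);

const logged = []; // one entry per call that would go to visitCommitteePage

const getUniqueIDsSketch = (links) => {
  for (const link of links) {
    fakeRequest(link, (error, response, body) => {
      const committeeDetails = []; // a new, empty array on every callback
      committeeDetails.push({ 'committee-name': link });
      logged.push(committeeDetails);
    });
  }
};

getUniqueIDsSketch(['/a', '/b']);
console.log(logged.length); // two separate arrays, each holding one object
```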
So the output I get looks like:
[ { 'committee-name': 'Defence Sub-Committee',
'committee-ID': '105517',
'committee-RSS-URL':
'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=105517&type=Committee_Detail_Mixed' } ]
[ { 'committee-name': 'Business, Energy and Industrial Strategy Committee',
'committee-ID': '115803',
'committee-RSS-URL':
'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=115803&type=Committee_Detail_Mixed' } ]
whereas what I want is an array of objects that would look like this:
[ { 'committee-name': 'Defence Sub-Committee',
'committee-ID': '105517',
'committee-RSS-URL':
'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=105517&type=Committee_Detail_Mixed' },
{ 'committee-name': 'Business, Energy and Industrial Strategy Committee',
'committee-ID': '115803',
'committee-RSS-URL':
'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=115803&type=Committee_Detail_Mixed' }
]
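To make the target concrete, here is a minimal sketch of the desired behaviour under stated assumptions: fakeFetchDetails is a hypothetical stand-in for the request plus cheerio step, and collectDetails is a hypothetical helper that declares the array once, outside the callbacks, and hands it on only after every callback has finished. It only illustrates the result I'm after and is not a tested fix for the scraper above.

```javascript
// Hypothetical stand-in for request + cheerio: calls back with canned details.
const fakeFetchDetails = (url, callback) =>
  callback(null, { 'committee-name': `Committee at ${url}`, 'committee-ID': url });

// Collect one details object per link into a single array, then hand it on once.
const collectDetails = (links, fetchDetails, done) => {
  const committeeDetails = []; // declared once, outside the loop and the callbacks
  let pending = links.length;
  if (pending === 0) return done(committeeDetails);
  links.forEach((link, i) => {
    fetchDetails(link, (error, details) => {
      if (!error) committeeDetails[i] = details; // keep the original link order
      pending -= 1;
      // once the last callback has run, drop failed slots and pass the array on
      if (pending === 0) done(committeeDetails.filter(Boolean));
    });
  });
};

let collected;
collectDetails(['/a', '/b'], fakeFetchDetails, (all) => {
  collected = all; // with real requests, use the array here, inside the callback
  console.log(all.length);
});
```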
How can I achieve that?
Apologies for the long question.