Pass multiple objects to array

Code for the following question available at https://repl.it/@kkoutoup/ParliamentUKscraper

Question
I’m trying to create a web scraper with node, request and cheerio.
The scraper works in two phases (so far):

Phase One
The first function scrapeCommitteesPage() visits a page, scrapes all links, adds them to an array, and passes the array to a second function called getUniqueIDs()

//dependencies
const request = require('request');
const cheerio = require('cheerio');

//uk parliament - select committees page
const committeesListUrl = 'https://bit.ly/2YXZBC7';

//get all links to committee pages
const scrapeCommitteesPage = () => {
  request(committeesListUrl, (error, response, body)=>{
    if(!error && response.statusCode == 200){
      //pass response body to cheerio
      const $ = cheerio.load(body),
      //get all committee page links - ul.square-bullets-a-to-z li a
            committeeLinksArray = [],
            linkList = $('.square-bullets-a-to-z a'),
            parliamentUrl = 'https://www.parliament.uk';
      //push all links to array
      for(let i=0;i<linkList.length;i++){
        if(linkList[i]){
          committeeLinksArray.push(`${parliamentUrl}${linkList[i].attribs.href}`);
        }
      }
      //pass array to uniqueIDs function
      getUniqueIDs(committeeLinksArray);
    }else{
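      //error is null when the request succeeded but the status was not 200
      console.log(error || `Unexpected status code: ${response.statusCode}`)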
    }//end 1st request else
  })//end request
}

Phase Two
The second function getUniqueIDs() loops through the array of links, passed as an argument from the first function, and saves information (name, id, url) into a committeeDetails object.

//get unique IDs
const getUniqueIDs = (committeeLinksArray) => {
  for(let i=0;i<committeeLinksArray.length;i++){
    if(committeeLinksArray[i]){
      request(committeeLinksArray[i], (error, response, body)=>{
        if(!error && response.statusCode == 200){
          //pass response body to cheerio
          const $ = cheerio.load(body),
                committeeDetails = [],
                //save committee name, id and url for rss feed
                committeeName = $("meta[property='og:title']").attr("content"),
                uniqueID = $("meta[name='search:cmsPageInstanceId']").attr("content"),
                committeeRSSUrl = `https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=${uniqueID}&type=Committee_Detail_Mixed`;
          //push details object to committeeDetails array      
          committeeDetails.push({
            'committee-name': committeeName,
            'committee-ID': uniqueID,
            'committee-RSS-URL': committeeRSSUrl
          })
          //pass details to visitCommitteePage function
          visitCommitteePage(committeeDetails)
        }else{
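          //error is null when the request succeeded but the status was not 200
          console.log(error || `Unexpected status code: ${response.statusCode}`);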
        }
      }) 
    }
  }//end for loop 
}

When I console.log the result to check that all the information is there, I get separate single-element arrays, one per committee. What I need is a single array of objects that will later be manipulated in a third function, visitCommitteePage().

So the output I get looks like:

[ { 'committee-name': 'Defence Sub-Committee',
    'committee-ID': '105517',
    'committee-RSS-URL':
     'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=105517&type=Committee_Detail_Mixed' } ]
[ { 'committee-name': 'Business, Energy and Industrial Strategy Committee',
    'committee-ID': '115803',
    'committee-RSS-URL':
     'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=115803&type=Committee_Detail_Mixed' } ]

whereas what I want is an array of objects that would look like this:

[ { 'committee-name': 'Defence Sub-Committee',
    'committee-ID': '105517',
    'committee-RSS-URL':
     'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=105517&type=Committee_Detail_Mixed' },
   { 'committee-name': 'Business, Energy and Industrial Strategy Committee',
    'committee-ID': '115803',
    'committee-RSS-URL':
     'https://www.parliament.uk/g/rss/committee-feed/?pageInstanceId=115803&type=Committee_Detail_Mixed' } 
]

How can I achieve that?

Apologies for the long question.


For each iteration of your loop, the request callback declares committeeDetails as a new empty array. You then push a single object to that array and pass it to visitCommitteePage, where it is logged as an array with a single element.

This means that you are never storing more than the most recent set of details in the array.

Move your declaration/assignment of committeeDetails to before your for loop. Move your visitCommitteePage call to after the loop. That way, you’ll be adding each successive object to the same array, and then logging the whole array as one.
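
A minimal sketch of that restructuring, using already-scraped values in place of the request calls, so the timing of the real requests is not modelled here. The names scrapedPages and collectDetails are made up for illustration and are not part of your code:

```javascript
// Hypothetical, already-scraped values standing in for the cheerio lookups.
const scrapedPages = [
  { name: 'Defence Sub-Committee', id: '105517' },
  { name: 'Business, Energy and Industrial Strategy Committee', id: '115803' },
];

// Hypothetical helper: builds one array holding every details object.
const collectDetails = (pages) => {
  // Declared once, before the loop, so every iteration pushes into the same array.
  const committeeDetails = [];
  for (const page of pages) {
    committeeDetails.push({
      'committee-name': page.name,
      'committee-ID': page.id,
    });
  }
  // Handed on once, after the loop, as a single array of objects.
  return committeeDetails;
};

console.log(collectDetails(scrapedPages).length); // 2
```

The only structural change from the code in the question is where the array is created and where it is handed on.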


That would fix the shape of the output, but your for loop is filled with asynchronous calls. The loop finishes before any request callback has run, so a visitCommitteePage call placed straight after the loop runs before any data has been pushed into committeeDetails, and it always logs an empty array. To solve that issue, you need to look deeper into callbacks/promises. It’s essentially the same idea, but you need to tell the log statement not to run until all of your scraping has finished.
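
One possible way to wait for everything, sketched with promises. This is only an illustration under assumptions: fetchCommitteeDetails is a hypothetical stand-in that fakes the delay with setTimeout instead of calling request and cheerio, the example.com URLs are placeholders, and taking the ID from the last path segment is invented for the demo.

```javascript
// Hypothetical stand-in for request + cheerio: resolves with one details
// object after a short delay, the way a real HTTP callback fires later.
// In the real scraper this promise would wrap request(url, callback)
// and call resolve (or reject) inside that callback.
const fetchCommitteeDetails = (url) =>
  new Promise((resolve) => {
    setTimeout(() => {
      // Assumption for the demo only: the ID is the last path segment.
      const uniqueID = url.split('/').pop();
      resolve({ 'committee-ID': uniqueID, 'committee-URL': url });
    }, 10);
  });

// Start every lookup, then wait until all of them have resolved.
// Promise.all resolves with the results in the same order as the input.
const getUniqueIDs = (committeeLinksArray, fetchOne = fetchCommitteeDetails) =>
  Promise.all(committeeLinksArray.map((link) => fetchOne(link)));

// Placeholder URLs, for illustration only.
getUniqueIDs(['https://example.com/a/105517', 'https://example.com/b/115803'])
  .then((committeeDetails) => {
    // Runs once, with a single array holding every object.
    console.log(committeeDetails);
  })
  .catch((error) => console.log(error));
```

With this shape, the visitCommitteePage call would go inside the .then callback, where the full array is guaranteed to be populated. Note that Promise.all rejects as soon as any one lookup fails, so a real scraper may want to handle failed pages individually.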


Many thanks for pointing me in the right direction!

After making the changes you suggested, I added a setTimeout() call with a delay of a few seconds to test whether the data was passed to the next step of the process, i.e. the visitCommitteePage() function, and it worked! I have to wrap my head around callbacks/promises now.
