Scraper with Puppeteer login returns just one element of the array

This code is supposed to loop through the URLs that get scraped by the scrapeProductPage function. But before looping, it needs to log in so that it can obtain the prices, which are only displayed to logged-in users. Instead of looping through the URLs, it just returns the scraped data from one page. The error I get is “MaxListenersExceededWarning: Possible EventEmitter memory leak detected”.

const request = require("request-promise");
const cheerio = require("cheerio");
const ObjectsToCsv = require("objects-to-csv");
const puppeteer = require('puppeteer');


const url = "https://www.example.com";

const scrapeResults = [];

async function scrapeProductPage() {
  try {
    const htmlResult = await request.get(url);
    const $ = cheerio.load(htmlResult);

    $("td.productListing-data > a[style='position:relative;float:left;']").each((index, element) => {
      let url = $(element).attr("href");
      url = "https://www.example.com/" + url;
      const scrapeResult = { url };
      scrapeResults.push(scrapeResult);
    });
    return scrapeResults;
  } catch (err) {
    console.error(err);
  }
}

async function scrapeDescription(productsWithImages) {
  process.setMaxListeners(0);
  const browser = await puppeteer.launch({
    headless: false
  });

  const page = await browser.newPage();
  await page.goto('https://www.example.com/login');

  await page.waitFor(500);

  await page.waitFor('input[name="email_address"]');
  await page.type('input[name="email_address"]', 'example@gmail.com');
  await page.type('input[name="password"]', '123test');
  await page.click('#btnLogin');

  return await Promise.all(
    productsWithImages.map(async job => {
      try {
        await page.goto(job.url, { waitUntil: "load" });
        const content = await page.content();
        const $ = cheerio.load(content);

        job.main_img = $('img#main_img').attr('src');
        job.name = $('h2').text();
        job.price = $("td.products_info_price").text();

        return job;
      } catch (error) {
        console.error(error);
      }
    })
  );
}



async function saveDataToCsv(data) {
  const csv = new ObjectsToCsv(data);
  console.log(csv);
}

async function scrapeWona() {
  const productsWithImages = await scrapeProductPage();
  const wonaFullData = await scrapeDescription(productsWithImages);
  await saveDataToCsv(productsWithImages);
}

scrapeWona();

Pretty hard to test without a login, and I can’t get one because you need an EIN to register.

Don’t you just have an array of links to all the single product pages from one listing page? I don’t see you navigating the pagination of the product page(s) (https://www.site.com/product/subproduct/page=2).
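If you do actually need more than the first listing page, the general idea would be something like this rough, untested sketch (the ?page=N query parameter and the selector are guesses based on your snippet, not checked against the real site):

// Untested sketch: loop over listing pages, assuming a ?page=N query parameter
// and reusing the same anchor selector you already use in scrapeProductPage().
async function scrapeAllListingPages(baseUrl, maxPages) {
  const results = [];
  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const html = await request.get(`${baseUrl}?page=${pageNum}`);
    const $ = cheerio.load(html);
    $("td.productListing-data > a[style='position:relative;float:left;']").each((index, element) => {
      results.push({ url: "https://www.example.com/" + $(element).attr("href") });
    });
  }
  return results;
}

But again, only do that if you really need every listing page.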

Do you really need to go to every single product page to get the price? That is a lot of activity to hit their servers with, not something I would suggest unless you have permission. Or at least implement a much longer delay.

I asked my boss on Slack whether what we are doing is ethical, but she never answered.

First off, does that mean yes to what I asked? Do you really have to hit every single product page to get the price? Does it not show in the list of products?

As I do not know any details, I have a hard time giving much specific advice.

Scraping is not illegal; how and what you scrape is the determining factor. If you effectively end up doing what might amount to a DoS attack, that is bad. Respect the owner and do not request more, or faster, than what is appropriate. Otherwise you might just end up getting your IP(s) blacklisted. If it has to take 10 minutes longer, so be it.

If you are obtaining resources you do not have the legal right to use, and you use them anyway, that is bad. Copyright, licenses, TOS violations, etc.; the “legalese” needs to be factored in (I’m not a lawyer, btw).

I watched a video on udemy.com called “Web Scraping in Nodejs”, created by Stefan Hyltoft. With the help of this video I was able to put together new code that worked.

@lasjorg: I am hitting every product page to get more information than just the price. I am also getting the descriptions, and they are only on the product pages. I will have to find out from my boss whether we really need to be doing this. The project is still quite young, and I can’t predict what we will be needing. Thank you for your helpful comments.

Happy to help.

Implement some delay and break the jobs up into chunks whose execution you can delay a bit. Even when it is not really needed, just think of it as a courtesy.
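Something along these lines would be the general idea (untested sketch; the chunk size and delay are arbitrary numbers, pick whatever seems polite):

// Untested sketch: visit the product pages one at a time on the single Puppeteer
// page, in small chunks, pausing between chunks. Assumes the same require("cheerio")
// you already have in your script. Chunk size and delay are arbitrary.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeInChunks(page, jobs, chunkSize = 5, delayMs = 5000) {
  const results = [];
  for (let i = 0; i < jobs.length; i += chunkSize) {
    const chunk = jobs.slice(i, i + chunkSize);
    for (const job of chunk) {
      await page.goto(job.url, { waitUntil: "load" });
      const $ = cheerio.load(await page.content());
      job.main_img = $('img#main_img').attr('src');
      job.name = $('h2').text();
      job.price = $("td.products_info_price").text();
      results.push(job);
    }
    await sleep(delayMs); // courtesy pause before the next chunk
  }
  return results;
}

As a side effect, going one job at a time on the single page means you are not kicking off all the page.goto calls at once, which is what the Promise.all over an async map does now.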

@lasjorg You commented that I should delay my scraping, so I’ve been adding a waitFor call after I go to the page, like this:

await page.goto(url);
await page.waitFor(10000); // pause 10 seconds before moving on

I haven’t found much information specifically about slowing down scraping, so I don’t know if this is sufficient. Is this all I need, or could I do better?
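Or would something like this be better, where I await a random pause between each product page? (Just a sketch; the numbers are guesses.)

// Sketch: resolve after a random delay so the requests are not evenly spaced.
const randomDelay = (min, max) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

for (const job of productsWithImages) {
  await page.goto(job.url, { waitUntil: "load" });
  // ...scrape the page content here...
  await randomDelay(5000, 15000); // wait 5 to 15 seconds before the next page
}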