Web Scraper Outputs Just One Item Instead of Many

I don’t understand the connection you have between the link/category and image. There also isn’t a one-to-one relationship in the number of elements: there are 100 images per page (per category) and 39 categories in total.

If the second map (link/category) ran as many times as the first map (images) then you would just get the image from the images array inside the second map using its index.

const image = images[i];

return {
  link,
  cat,
  image
};

But this will only give you 39 images.
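To illustrate, here is a sketch of that index-based pairing with hypothetical arrays standing in for the scraped data (`links`, `cats`, and `images` are made up for the example):

```javascript
// Hypothetical data: one link/category per category, many images overall.
const links = ["/cat/rings", "/cat/necklaces"];
const cats = ["Rings", "Necklaces"];
const images = ["ring1.jpg", "ring2.jpg", "neck1.jpg", "neck2.jpg"];

// Pairing by index inside the link/category map runs once per link,
// so only the first links.length entries of images are ever used.
const rows = links.map((link, i) => ({
  link,
  cat: cats[i],
  image: images[i],
}));

// rows.length === 2, even though there are 4 images, and the pairing is
// off: rows[1].image is "ring2.jpg" (a ring), not a necklace image.
console.log(rows);
```

That is why the relationship between the two arrays matters before the data can be flattened into CSV rows.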

Can you show what you want the CSV to look like?

I did a search for the image names in another thread and found the site. Not sure about linking it here? It is an online store.

@makamo66 We merged two of your threads to give some more context. Please continue your current questions in this thread. Sorry for any confusion.


You merged together two different code bases. The first one console logs all the output and the second one saves to a CSV file. It seems more confusing now than it was before.

@makamo66 Well, in any case, if you have questions, post them here. You also never answered the question I posted.

I don’t know what I want the CSV to look like. That’s why I didn’t answer.

Well, then I can’t really help you transform the data. But you see why I asked that question, right? You have many more images than links/categories.

const images = cheerio('td.productListing-data > a > img', html)
  .map((i, image) => cheerio(image).attr('src'))

const img = images.each((i, image) =>
  {image}
).get()

return {
  img,
}

The code above resulted in the entire array being output to each cell of the spreadsheet.

I would expect the spreadsheet to have many more cells filled with images than with links/categories.
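A minimal sketch of the difference, with a hypothetical `images` array standing in for the scraped values:

```javascript
const images = ["a.jpg", "b.jpg", "c.jpg"];

// Returning a single object that holds the whole array gives one row,
// and the full array ends up inside that one cell:
const oneRow = [{ img: images }];

// Returning one object per image gives one row (and one cell) per image:
const manyRows = images.map((img) => ({ img }));

console.log(manyRows.length); // 3
```

CSV writers typically emit one row per object, so the per-image shape is what produces "many more cells filled with images".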

Even when I changed the names of the parameters, it still didn’t work.

So I tried to learn how to do this on my own by taking two video courses about node and cheerio on udemy.com but I haven’t gotten any better at it. The code produces an array of image URLs when I do this:

const request = require("request-promise");
const cheerio = require("cheerio");

const url = "https://www.example.com";

const scrapeResults = [];
async function scrapeJobHeader() {
  try {
    const htmlResult = await request.get(url);
    const $ = cheerio.load(htmlResult);
    $("td.productListing-data > a").each((index, element) => {
      const resultTitle = $(element).children("img");
      const img_url = resultTitle.attr("src");
      const scrapeResult = { img_url };
      scrapeResults.push(scrapeResult);
    });
    return scrapeResults;
  } catch (err) {
    console.error(err);
  }
}

async function scrapeWebsite() {
  const jobsWithHeaders = await scrapeJobHeader();
  console.log(jobsWithHeaders);
}

scrapeWebsite();

but I get the error “TypeError: Cannot read property ‘replace’ of undefined” when I do this:

const img_url = resultTitle.attr("src").replace("images\\/more_color.png","");

The images that are returned include “images/more_color.png” but I just want to return the actual product images and not the PNGs.
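That error happens on the rows where `.attr("src")` comes back `undefined`, because `undefined` has no `.replace` method. A sketch of the failure and a guard, with hypothetical `src` values:

```javascript
// Some matched <a> elements have no <img> child, so .attr("src")
// returns undefined, and calling .replace on it throws a TypeError.
const srcs = ["images/more_color.png", undefined, "images/20191206/thumb/x.jpg"];

const cleaned = srcs.map((src) =>
  src ? src.replace("images/more_color.png", "") : src
);

console.log(cleaned); // [ '', undefined, 'images/20191206/thumb/x.jpg' ]
```

Note also that the `\\/` escape only makes sense inside a regex; in a plain string argument it produces `\/`, which never matches the actual `src` value.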

The results look like this:

{ img_url: 'images/more_color.png' },
  { img_url: undefined },
  { img_url: undefined },
  {
    img_url: 'images/20191206/thumb/AK1501-@RH-CRY-LOVE@22X06-825_3L@467400@200@01@200.jpg'
  },
  { img_url: 'images/more_color.png' },
  { img_url: undefined },
  { img_url: undefined },
  {
    img_url: 'images/20191206/thumb/AK1501-@GD-CRY-LOVE@22X06-825_3L@467399@200@01@200.jpg'
  },
  { img_url: 'images/more_color.png' },
  { img_url: undefined },
  { img_url: undefined },
  {
    img_url: 'images/20191206/thumb/AK1500-@GD-CRY-QUEEN@3X06-825_3L@467397@200@01@200.jpg'
  },
  { img_url: 'images/more_color.png' },
  { img_url: undefined },
  { img_url: undefined },

The next step would be to get rid of the undefined image URLs but I haven’t gotten that far yet.

I got rid of the undefined ones by using:

  if (img_url != undefined) {
    scrapeResults.push(scrapeResult);
  }

You can check the truthiness of the img src string and check that it does not end with .png before pushing to the array.

async function scrapeJobHeader() {
  try {
    const htmlResult = await request.get(baseURL);
    const $ = await cheerio.load(htmlResult);
    $('td.productListing-data > a ').each((index, element) => {
      const resultTitle = $(element).children('img');
      if (
        resultTitle.attr('src') &&
        !resultTitle.attr('src').endsWith('.png')
      ) {
        let img_url = resultTitle.attr('src');
        scrapeResults.push({ img_url });
      }
    });
    return scrapeResults;
  } catch (err) {
    console.error(err);
  }
}

Thank you lasjorg! That worked!

No problem, glad to help. I should have linked to the endsWith method docs.
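For reference, `String.prototype.endsWith` simply tests the tail of a string, which is why it works here as a cheap extension check:

```javascript
// The placeholder icon ends with .png, product thumbnails end with .jpg.
console.log("images/more_color.png".endsWith(".png")); // true
console.log("images/20191206/thumb/item.jpg".endsWith(".png")); // false
```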

I suggest you keep these three links handy; on the left-hand side of each is the list of all the methods on the different objects (Object, Array, String).



Happy coding!