Web Scraper Outputs Just One Item Instead of Many

Part of the page source looks like this:

<td align="center" valign="top" class="productListing-data" style="position:relative;padding-bottom: 5px;" width="25%">
<a style="position:relative;float:left;" href="product_info.php?products_id=474302&kind=2&cPath=172_93_96&description=
3PCS---Round-Metal-Link-Chain-Layered-Anklets-">
<img src="images/20200131/thumb/AK0065-@GDXX@3P-03H-75_3L@474302@350@01@200.jpg" title="
3PCS - Round Metal Link Chain Layered Anklets " width="200" border="0" height="200" alt="
3PCS - Round Metal Link Chain Layered Anklets ">
<td align="center" valign="top" class="productListing-data" style="position:relative;padding-bottom: 5px;" width="25%">
<a style="position:relative;float:left;" href="product_info.php?products_id=474303&kind=2&cPath=172_93_96&description=
3PCS---Round-Metal-Link-Chain-Layered-Anklets-">
<img src="images/20200131/thumb/AK0065-@RHXX@3P-03H-75_3L@474303@350@01@200.jpg" title="
3PCS - Round Metal Link Chain Layered Anklets " width="200" border="0" height="200" alt="
3PCS - Round Metal Link Chain Layered Anklets ">
<td align="center" valign="top" class="productListing-data" style="position:relative;padding-bottom: 5px;" width="25%">
<a style="position:relative;float:left;" href="product_info.php?products_id=479684&kind=2&cPath=172_93_96&description=Faceted-Bead-Pearl-Link-Anklet">
<img src="images/20200312/thumb/AK0015-@GD-NMLT2@02H-9_3L@479684@225@01@200.jpg" title="Faceted Bead Pearl Link Anklet" width="200" border="0" height="200" alt="Faceted Bead Pearl Link Anklet"><span class="small_cart" ></span></a><a href="product_info.php?products_id=479684&kind=2&cPath=172_93_96&description=Faceted-Bead-Pearl-Link-Anklet"><span style="display:-webkit-inline-box">479684</span><br /><a href="product_info.php?products_id=479684&kind=2&cPath=172_93_96&description=Faceted-Bead-Pearl-Link-Anklet"><font style="display: block;height:40px;text-transform: uppercase;" title="Faceted Bead Pearl Link Anklet">Faceted Bead Pearl Link Anklet</font></a>&nbsp;<a href="https://www.wonatrading.com/login">Login for Price</a>&nbsp;&nbsp;</td>

My node code looks like this:

const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://www.example.com';

rp(url)
  .then(function(html) {
    console.log($('td.productListing-data > a > img', html).attr('src'));
  })
  .catch(function(err) {
    //handle error
  });

When I execute the file, I get just one image when I should get all three:
C:\Users\Maureen\Desktop\scraper>node scraper.js
images/20200312/thumb/AK0015-@GD-NMLT2@02H-9_3L@479684@225@01@200.jpg

I found the site so that helps.

I haven’t really used Cheerio much and I don’t use jQuery, so this is just what I came up with after watching a video for two minutes. I tried without looking anything up first, but I had forgotten how each works (its callback receives (index, element), unlike plain forEach, which receives (element, index)), and I didn’t realize you have to wrap the element in the Cheerio function again inside the loop.

$('td.productListing-data > a > img', html).each((i, image) =>
  console.log($(image).attr('src'))
);

I would suggest learning more about how Cheerio works.
https://www.youtube.com/results?search_query=Cheeriojs

Thank you so much, lasjorg! That worked like a charm.

The function imgArray yields just one image, not an array. That is, the third column of the spreadsheet, img, lists just one image. Here is my code:

const rp = require('request-promise');
const otcsv = require('objects-to-csv');
const cheerio = require('cheerio');

const baseURL = 'https://www.example.com';

const getCategories = async () => {
  const html = await rp(baseURL);

  const imgArray = () => {
    cheerio('td.productListing-data > a > img', html).each((i, image) => {
      img = cheerio(image).attr('src');
    });
  };

  imgArray();

  const businessMap = cheerio('.category', html).map(async (i, e) => {
    const link = e.attribs.href;
    const innerHtml = await rp(link);
    const cat = e.children[0].data;

    return {
      link,
      cat,
      img,
    }
  }).get();
  return Promise.all(businessMap);
};

getCategories()
  .then(result => {
    const transformed = new otcsv(result);
    return transformed.toDisk('./spreadsheets/output.csv');
  })
  .then(() => console.log('SUCCESSFULLY COMPLETED THE WEB SCRAPING SAMPLE'));

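Neither each nor imgArray gives you the image data back: each only runs the callback for its side effects, and imgArray has no return statement. The only reason you have any data is that img is an undeclared variable, so it becomes an implicit global variable and ends up holding the image string from the last loop iteration.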

Declare img as an array outside the function and push to it. Or declare it inside the function and push to it, then return it after the loop, but that would require some refactoring.
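
A minimal sketch of the difference, written in plain JavaScript so it runs without Cheerio. The srcList array is a stand-in for the values that cheerio(image).attr('src') would yield on each iteration, and lastOnly and collectAll are hypothetical names used only for illustration.

```javascript
// Stand-in for the src values visited by the .each loop
// (taken from the page source quoted at the top of the thread).
const srcList = [
  'images/20200131/thumb/AK0065-@GDXX@3P-03H-75_3L@474302@350@01@200.jpg',
  'images/20200131/thumb/AK0065-@RHXX@3P-03H-75_3L@474303@350@01@200.jpg',
  'images/20200312/thumb/AK0015-@GD-NMLT2@02H-9_3L@479684@225@01@200.jpg',
];

// Mirrors the original problem: a single variable is reassigned on every
// iteration, so only the last value survives the loop.
function lastOnly(list) {
  let img;
  list.forEach((src) => {
    img = src;
  });
  return img;
}

// The suggested fix: declare an array, push inside the loop,
// and return the array once the loop has finished.
function collectAll(list) {
  const imgs = [];
  list.forEach((src) => {
    imgs.push(src);
  });
  return imgs;
}

console.log(lastOnly(srcList));
console.log(collectAll(srcList));
```

With Cheerio, the same pattern would apply inside the each callback: push cheerio(image).attr('src') onto the array instead of assigning it to img.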

Yes, the last image is the one that gets added to the spreadsheet.

I tried to do a push like you suggested, but it didn’t work.

I changed the code as follows and it still just gives me one item:

let img;
const images = cheerio('td.productListing-data > a > img', html).map(async (i, image) => {
img = cheerio(image).attr('src');
}).get();

You should probably start by looking at what cheerio(selector) returns and whether that is something you can map over. Remember that Cheerio is modeled on jQuery and works very similarly.

const images = cheerio('td.productListing-data > a > img', html)

console.log(typeof images);
console.log(Array.isArray(images));

I’d suggest you stick to the built-in methods when possible, because switching to plain JS sometimes requires knowing what Cheerio is doing differently.

I used:

const images = cheerio('td.productListing-data > a > img', html)
  .map((i, image) => cheerio(image).attr('src'))
  .get()

The result was that every cell of the spreadsheet contained an array.

I didn’t see you were using the built-in map, my bad. Only the argument list gives it away ((index, element) rather than (element, index)); I guess I wasn’t paying attention.

So are you getting the data you want now?

No, I’m not getting the desired result. Each cell contains the entire array when each cell should contain just one element of the array.

Well you have the array of images, but are you just adding that to the spreadsheet? What does result in the then look like?

I think we might need to see your new full code.

Edit: You may need to transform the data so that you give objects-to-csv the correct data. As I understand it (from glancing at the docs), each image needs to be inside its own object.

const rp = require('request-promise');
const otcsv = require('objects-to-csv');
const cheerio = require('cheerio');

const baseURL = 'https://www.example.com';

const getCategories = async () => {

 const html = await rp(baseURL);
const images = cheerio('td.productListing-data > a > img', html)
.map( (i, image) => cheerio(image).attr('src'))
.get()
 
  const businessMap = cheerio('.category', html).map(async (i, e) => {
    const link = e.attribs.href;
    const innerHtml = await rp(link);
    const cat = e.children[0].data;
	
    return {
      link,
      cat,
	  images,
    }
  }).get();
return Promise.all(businessMap);
};

getCategories()


  .then(result => {
    const transformed = new otcsv(result);
    return transformed.toDisk('./spreadsheets/output5.csv');
  })
  .then(() => console.log('SUCCESSFULLY COMPLETED THE WEB SCRAPING SAMPLE'));

I’m just spitballing here but I think you have to give it an array of objects where all the values are just strings.

Just using empty strings here because I can’t be bothered to fill it out:

const data = [
  {link: '', cat: '', image: ''},
  {link: '', cat: '', image: ''},
  {link: '', cat: '', image: ''}
];

You can’t just give it this:

const data = [
  {link: '', cat: '', images: ['', '', '']},
  {link: '', cat: '', images: ['', '', '']},
  {link: '', cat: '', images: ['', '', '']}
];
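
One possible way to get from the second shape to the first, sketched in plain JavaScript. It assumes the image at position i belongs to the category at position i, which may not match how the actual page is laid out. The toRows helper and the sample values are hypothetical.

```javascript
// Hypothetical helper: pair each category with the image at the same index,
// so that every value in the resulting objects is a plain string.
function toRows(categories, images) {
  return categories.map((category, i) => ({
    link: category.link,
    cat: category.cat,
    // Fall back to an empty string if there are fewer images than categories.
    image: images[i] || '',
  }));
}

// Made-up sample data, for illustration only.
const categories = [
  { link: 'https://www.example.com/a', cat: 'A' },
  { link: 'https://www.example.com/b', cat: 'B' },
  { link: 'https://www.example.com/c', cat: 'C' },
];
const images = ['images/a.jpg', 'images/b.jpg'];

const rows = toRows(categories, images);
console.log(rows);
```

An array shaped like rows is the kind of value that would then be passed to new otcsv(...), since each object holds only strings.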

Yes, I see what you mean. images is returning an array for each loop but I don’t know how to make it just return an object for each iteration.

I could loop through the array of course with a for statement and an index but I don’t see how I would access the results in the return statement.

I did this but of course it didn’t work:

for (i = 0; i < 7; i++) {
	img = images[i];
}  
    return {
	  img,
    }

This didn’t work either:

var i = 0;
while(img = images[i++]){
    return {
	  img,
    }
}

Trying the .each method now.

Tried this

const images = cheerio('td.productListing-data > a > img', html)
.each( (i, image) => {cheerio(image).attr('src')})
.get()

but got error:

UnhandledPromiseRejectionWarning: TypeError: Converting circular structure to JSON
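
A likely explanation, offered as an assumption rather than a confirmed diagnosis: each returns the selection itself, not the values computed in the callback, so the get() that follows yields the raw DOM nodes. Those nodes reference their parent, and the parent references them back, which is the kind of cycle JSON.stringify rejects. The plain JavaScript sketch below reproduces the same class of error with two hand-built objects; parent and child are illustrative stand-ins, not Cheerio's actual node type.

```javascript
// Hand-built stand-ins for DOM nodes: the child points to its parent,
// and the parent lists the child, which forms a cycle.
const parent = { name: 'a', children: [] };
const child = { name: 'img', attribs: { src: 'images/example.jpg' }, parent };
parent.children.push(child);

let message = '';
try {
  // Throws a TypeError because of the parent <-> child cycle.
  JSON.stringify(child);
} catch (err) {
  message = err.message;
}
console.log(message);

// Pulling out just the string attribute avoids the cycle entirely.
const src = child.attribs.src;
const safe = JSON.stringify({ image: src });
console.log(safe);
```

If that assumption holds, the map version shown earlier in the thread is the better starting point, because its get() returns strings rather than nodes.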