My code is long and convoluted, so I can’t post all of it. I’ll describe it with pseudo-code instead, although I realize that may make it harder for anyone to help.
With the loops nested as shown below, the browser visits /page=1 twice.
When I un-nest the m for loop, it goes to a second page for a URL that doesn’t have a second page.
const urls = [
  "https://www.example.com/jewelry/stainless-steel",
  "https://www.example.com/jewelry/watch"
];
async function findTotalPages(url) {
  // This function scrapes the page to find the number of products being listed,
  // divides it by 100 and uses ceil() to get the number of pages. There are 100 products per page.
  return NumberOfPages;
}
async function scrapeProductPage(url) {
  // This function finds out where the product page is and takes you there so that you can scrape it
  return scrapeResults;
}
async function scrapeDescription(url, page) {
  // This function scrapes the product page and returns the data that was scraped.
  return {url, cats, main_img, name, descript, price};
}
async function scrapeAll() {
  // The first loop finds the number of pages to append to the category URL for the pagination
  for (let m = 0; m < urls.length; m++) {
    pages = await findTotalPages(urls[m]);
    // } When I close the m loop here (un-nest it), it goes to a second page for a URL that doesn't have one, because the last URL in the array has 2 pages
    const descriptionPage = await browser.newPage();
    // The second loop provides the URL array for the next loop
    for (let k = 0; k < urls.length; k++) {
      // The third loop uses the page number to go to the scrapeProductPage function
      for (let j = 1; j <= pages; j++) {
        scrapeResults = await scrapeProductPage(urls[k] + "/page=" + j);
        // The fourth loop scrapes the description in the product page and outputs the scraped data as an array
        for (let i = 0; i < scrapeResults.length; i++) {
          result = await scrapeDescription(scrapeResults[i], descriptionPage);
          resultsArray = [...resultsArray, result];
        }
      }
    }
  }
}
1.) For the first iteration, findTotalPages finds the total number of pages for https://www.example.com/jewelry/stainless-steel, which in this case is 1
2.) The second and third loops go to https://www.example.com/jewelry/stainless-steel/page=1
3.) The fourth loop scrapes the description in the product page which looks something like this:
https://www.example.com/product_info.php?products_id=430373
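To make the walkthrough above easier to check, here is a minimal sketch that reproduces only the loop nesting of scrapeAll and records the URLs it would visit. The page counts (1 and 2) come from my description above. The pageCounts lookup and the traceVisits function are hypothetical stand-ins for the real scraping code, and the awaits are left out on the assumption that they do not change the order of iteration.

```javascript
const urls = [
  "https://www.example.com/jewelry/stainless-steel",
  "https://www.example.com/jewelry/watch",
];

// Assumed page counts from the description: 1 page and 2 pages.
const pageCounts = { [urls[0]]: 1, [urls[1]]: 2 };

// Hypothetical stub: returns the known page count instead of scraping.
function findTotalPages(url) {
  return pageCounts[url];
}

// Same m / k / j nesting as scrapeAll, but it only records the URLs.
function traceVisits() {
  const visited = [];
  for (let m = 0; m < urls.length; m++) {
    const pages = findTotalPages(urls[m]);
    for (let k = 0; k < urls.length; k++) {
      for (let j = 1; j <= pages; j++) {
        visited.push(urls[k] + "/page=" + j);
      }
    }
  }
  return visited;
}

console.log(traceVisits().join("\n"));
```

With these assumptions the trace lists .../stainless-steel/page=1 twice and also .../stainless-steel/page=2, which matches the behaviour I am seeing in the browser.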