2

I have a website to scrape, and the content I need is inside a div with the id left_container_scroll that contains multiple a tags. This div uses infinite scroll, and I can't get it to work. I am trying to make the program scroll inside that div.

I have tried to do something like this, but I get an error: Evaluation failed: ReferenceError: elem is not defined

    htmlTag = '#left_container_scroll';
    // I think I am doing something wrong here
    let elem = await page.evaluate((htmlTag) => {
        return document.querySelector(htmlTag);
    })
    previousHeight = await page.evaluate("elem.scrollHeight");
    await page.evaluate("window.scrollTo(0,elem.scrollHeight)");
    await page.waitForFunction(`elem.scrollHeight > ${previousHeight}`);

3 Answers

4

Some of this JavaScript code runs inside the browser, some inside the Node.js runtime, and they can't see each other's variables.

For example, page.evaluate("elem.scrollHeight") cannot see the elem variable you've set above, since that variable lives in the Node.js runtime while the code elem.scrollHeight is run inside the browser (there is a similar issue with htmlTag earlier).
To pass values from Node.js to the browser, you would usually give additional arguments to page.evaluate.
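
As a minimal sketch of that argument passing: the helper name readScrollHeight is hypothetical, and page is assumed to be a Puppeteer Page. The selector is handed over as an extra argument to page.evaluate instead of being referenced as an outer variable.

```javascript
// Hypothetical helper: reads the scrollHeight of the element matching `selector`.
// `page` is assumed to be a Puppeteer Page object.
async function readScrollHeight(page, selector) {
  // The arrow function is serialized and run in the browser, so it can only
  // use its own parameters. `selector` travels as an extra argument.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    return el ? el.scrollHeight : null; // null if nothing matches
  }, selector);
}
```

It would be called from the Node.js side as await readScrollHeight(page, '#left_container_scroll').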

Something like this might work (I haven't tested whether the scrolling works as intended, but at least Puppeteer runs the code):

    // returns a Puppeteer ElementHandle (not browser DOM element)
    let elem = await page.$(htmlTag)
    // passes the ElementHandle back to the browser code (Puppeteer converts it back to DOM element)
    let previousHeight = await page.evaluate(e => e.scrollHeight, elem)
    // again, pass ElementHandle
    await page.evaluate(e => window.scrollTo(0, e.scrollHeight), elem)
    // pass both ElementHandle and previousHeight to the browser side
    await page.waitForFunction((e, ph) => e.scrollHeight > ph, {}, elem, previousHeight)
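
One thing worth noting about the snippet above: window.scrollTo scrolls the page as a whole, not the div. If the infinite scroll listens for scrolling on the div itself (an assumption about the target site, suggested by the question), moving the element's own scrollTop may be what triggers loading. A rough sketch under that assumption; the helper name scrollContainerOnce and the default timeout are made up for illustration:

```javascript
// Hypothetical helper: scrolls the container itself (not the window) once and
// waits until its scrollHeight grows. `page` is assumed to be a Puppeteer Page.
async function scrollContainerOnce(page, selector, timeout = 10000) {
  const elem = await page.$(selector); // ElementHandle, or null if no match
  if (!elem) throw new Error(`No element matches ${selector}`);

  const previousHeight = await page.evaluate((e) => e.scrollHeight, elem);

  // Scroll inside the element by moving its own scrollTop to the bottom.
  await page.evaluate((e) => { e.scrollTop = e.scrollHeight; }, elem);

  // Resolves once new content has made the container taller; rejects if
  // nothing loads within `timeout` milliseconds.
  await page.waitForFunction(
    (e, ph) => e.scrollHeight > ph,
    { timeout },
    elem,
    previousHeight
  );
  return previousHeight;
}
```

Calling it repeatedly in a loop, and stopping when it rejects with a timeout, would approximate "scroll until nothing more loads".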

6 Comments

It still gives me this error: TimeoutError: waiting for function failed: timeout 30000ms exceeded
Oops, it looks like waitForFunction wants an extra options argument, so it should be page.waitForFunction((e, ph) => e.scrollHeight > ph, {}, elem, previousHeight) (fixed also above).
Do you know if the "infinite scroll" code on the page is working as intended? One way to debug this would be to pass {headless: false} to puppeteer.launch (so you can see what's happening and open the developer tools console), and then log something inside the waiting function, like await page.waitForFunction((e, ph) => { console.log("Current scrollHeight:", e.scrollHeight, " previousHeight:", ph); return e.scrollHeight > ph; }, {}, elem, previousHeight).
The infinite scroll is not working, it takes only the data that is loaded together with the page, it is not scrolling down to load new data ...
It's difficult to tell exactly what's happening, but one possibility that comes to mind is that our call to window.scrollTo happens before the infinite scroll code on the page has actually added its event listener for scroll events. I guess you could test this by adding window.scrollTo(0, document.body.scrollHeight) to the waitForFunction part (so it's called repeatedly by Puppeteer).
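
A sketch of what that last suggestion could look like; it is untested against the real site. The helper name waitWhileScrolling and the 500 ms polling interval are assumptions for illustration, and elem is the ElementHandle obtained from page.$ earlier. The scroll is re-issued on every poll, so a scroll listener that is attached late still gets a chance to see it.

```javascript
// Hypothetical helper: re-issues the scroll on every poll of waitForFunction.
// `page` is assumed to be a Puppeteer Page and `elem` an ElementHandle.
async function waitWhileScrolling(page, elem, previousHeight, timeout = 30000) {
  return page.waitForFunction(
    (e, ph) => {
      e.scrollTop = e.scrollHeight;                   // scroll the container itself
      window.scrollTo(0, document.body.scrollHeight); // and the page, as suggested
      return e.scrollHeight > ph;                     // done once content has grown
    },
    { timeout, polling: 500 }, // run the check every 500 ms instead of per frame
    elem,
    previousHeight
  );
}
```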
2

Made a quite simple solution last time I was webscraping, hopefully it will help out!

    let lastHeight = await page.evaluate('document.body.scrollHeight');
    while (true) {
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
        await page.waitForTimeout(2000); // sleep a bit
        let newHeight = await page.evaluate('document.body.scrollHeight');
        if (newHeight === lastHeight) {
            break;
        }
        lastHeight = newHeight;
    }

0

I would base the loop on the elements you want to extract, since the point of infinite scrolling is presumably to load more of them. Keep a count of the target elements, scroll, and compare the new count with the previous one: when the two are equal, nothing more has loaded, so you can break out of the loop and extract the data. In my case, I'd also add a hard cap (element_limit, e.g. 100) that breaks the loop regardless of whether more content is still loading. You may also want random timeouts of 1 to 5 seconds between scrolls to give the page time to load; not all pages behave the same, and the network connection is a factor too.
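
The approach described above could be sketched roughly as follows. This is only an illustration: the helper name scrollUntilStable, its option names, and the choice to scroll a container via scrollTop are assumptions, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: keeps scrolling a container until the number of items
// stops growing or at least `limit` items are loaded. Returns the final count.
async function scrollUntilStable(page, containerSel, itemSel, opts = {}) {
  const { limit = 100, minDelay = 1000, maxDelay = 5000 } = opts;
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  const countItems = () =>
    page.evaluate((sel) => document.querySelectorAll(sel).length, itemSel);

  let previousCount = await countItems();
  while (previousCount < limit) {
    // Scroll the container to its bottom to trigger the next batch.
    await page.evaluate((sel) => {
      const el = document.querySelector(sel);
      if (el) el.scrollTop = el.scrollHeight;
    }, containerSel);

    // Random pause between minDelay and maxDelay so the page can load.
    await sleep(minDelay + Math.random() * (maxDelay - minDelay));

    const currentCount = await countItems();
    if (currentCount === previousCount) break; // nothing new was loaded
    previousCount = currentCount;
  }
  return previousCount;
}
```

For the page in the question, a call might look like await scrollUntilStable(page, '#left_container_scroll', '#left_container_scroll a', { limit: 100 }), after which the a tags can be extracted in a single page.evaluate.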

