
I am trying to build a web crawler with Selenium. My program throws a StaleElementReferenceException. I thought this was because I crawl each page recursively, and when a page has no more links the function navigates to the next page rather than back to the parent page.

To address this, I introduced a tree data structure so I can navigate back to the parent whenever the current URL does not equal the parent URL. But that did not solve my problem.

Can anybody help me?

Code:

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class crawler {
    private static FirefoxDriver driver;
    private static String main_url = "https://robhammond.co/tools/seo-crawler";
    private static List<String> uniqueLinks = new ArrayList<String>();

    public static void main(String[] args) {
        driver = new FirefoxDriver();
        Node<String> root = new Node<>(main_url);   // Node is my own tree class
        scrape(root, main_url);
    }

    public static void scrape(Node<String> node, String url) {
        // Go back to the parent page if the current URL differs from it
        if (node.getParent() != null && (!driver.getCurrentUrl().equals(node.getParent().getData()))) {
            driver.navigate().to(node.getParent().getData());
        }
        driver.navigate().to(url);

        List<WebElement> allLinks = driver.findElements(By.tagName("a"));
        for (WebElement link : allLinks) {
            if (link.getAttribute("href").contains(main_url)
                    && !uniqueLinks.contains(link.getAttribute("href"))
                    && link.isDisplayed()) {
                uniqueLinks.add(link.getAttribute("href"));
                System.out.println(link.getAttribute("href"));
                scrape(new Node<>(link.getAttribute("href")), link.getAttribute("href"));
            }
        }
    }
}

And this is the output from the console:

D:\Programme\openjdk-12.0.1_windows-x64_bin\jdk-12.0.1\bin\java.exe "-javaagent:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\lib\idea_rt.jar=60461:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\admin\Desktop\SeleniumWebScraper\out\production\SeleniumWebScraper;D:\Downloads\selenium-server-standalone-3.141.59.jar de.company.crawler.crawler
1557924446770 mozrunner::runner INFO Running command: "C:\\Program Files\\Mozilla Firefox\\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\\Users\\admin\\AppData\\Local\\Temp\\rust_mozprofile.YqmEqE8y1pjv"
1557924447037 [email protected] WARN Loading extension '[email protected]': Reading manifest: Invalid extension permission: mozillaAddons
1557924447037 [email protected] WARN Loading extension '[email protected]': Reading manifest: Invalid extension permission: resource://pdf.js/
1557924447037 [email protected] WARN Loading extension '[email protected]': Reading manifest: Invalid extension permission: about:reader*
1557924448047 Marionette INFO Listening on port 60468
1557924448383 Marionette WARN TLS certificate errors will be ignored for this session
Mai 15, 2019 2:47:28 NACHM. org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
JavaScript warning: https://robhammond.co/js/jquery.min.js, line 4: Using //@ to indicate sourceMappingURL pragmas is deprecated. Use //# instead
https://robhammond.co/tools/seo-crawler#content
https://twitter.com/intent/tweet?text=SEO%20Crawler&url=https://robhammond.co/tools/seo-crawler&via=robhammond
Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: The element reference of <a href="/tools/"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/stale_element_reference.html
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: 'DESKTOP-admin', ip: '192.168.233.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '12.0.1'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 66.0.5, javascriptEnabled: true, moz:accessibilityChecks: false, moz:geckodriverVersion: 0.24.0, moz:headless: false, moz:processID: 19124, moz:profile: C:\Users\admin\AppData\Loca..., moz:shutdownTimeout: 60000, moz:useNonSpecCompliantPointerOrigin: false, moz:webdriverClick: true, pageLoadStrategy: normal, platform: WINDOWS, platformName: WINDOWS, platformVersion: 10.0, rotatable: false, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify}
Session ID: b3b87675-57c8-4b48-9a20-8df5e4d37503
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.createException(W3CHttpResponseCodec.java:187)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:122)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:49)
    at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:158)
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83)
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552)
    at org.openqa.selenium.remote.RemoteWebElement.execute(RemoteWebElement.java:285)
    at org.openqa.selenium.remote.RemoteWebElement.getAttribute(RemoteWebElement.java:134)
    at de.company.crawler.crawler.scrape(crawler.java:33)
    at de.company.crawler.crawler.scrape(crawler.java:38)
    at de.company.crawler.crawler.main(crawler.java:20)
Process finished with exit code 1

1 Answer

  1. When you navigate away from the first page, all WebElements in the allLinks list become stale: they are no longer attached to the DOM.

    I would recommend converting it from a list of WebElements to a list of plain Strings:

    List<String> allLinksHrefs = allLinks.stream().map(link -> link.getAttribute("href")).collect(Collectors.toList()); 

    and iterating through this new allLinksHrefs list instead (note that this requires an import of java.util.stream.Collectors).

  2. You can use a hash-based collection such as a HashSet for holding the uniqueLinks; this way duplicates are eliminated automatically (see the sketch after this list).
  3. The current approach can take days to complete; consider using Selenium Grid and running your scraper in parallel.
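
Putting suggestions 1 and 2 together, a rough sketch of how the scrape method could look follows. This is an illustration, not the asker's exact code: the isDisplayed() check is dropped because it needs a live WebElement, and a null check on href is added as a safety assumption, since getAttribute can return null.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.openqa.selenium.By;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Crawler {
    private static FirefoxDriver driver;
    private static final String MAIN_URL = "https://robhammond.co/tools/seo-crawler";
    // A HashSet rejects duplicates automatically and gives O(1) lookups
    private static final Set<String> uniqueLinks = new HashSet<>();

    public static void main(String[] args) {
        driver = new FirefoxDriver();
        scrape(MAIN_URL);
    }

    public static void scrape(String url) {
        driver.navigate().to(url);

        // Snapshot the hrefs as plain Strings BEFORE navigating anywhere else,
        // so later navigation cannot invalidate what the loop iterates over
        List<String> allLinksHrefs = driver.findElements(By.tagName("a")).stream()
                .map(link -> link.getAttribute("href"))
                .filter(href -> href != null && href.contains(MAIN_URL))
                .collect(Collectors.toList());

        for (String href : allLinksHrefs) {
            // Set.add returns false if the link was already seen
            if (uniqueLinks.add(href)) {
                System.out.println(href);
                scrape(href);
            }
        }
    }
}

Because the recursion now carries only plain Strings, navigating away inside the recursive call can no longer make anything in the loop stale.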

4 Comments

Thanks, I have tried your solution and it works with this URL, but when I try to crawl gmx.net I get the same error and I don't know why. Look at my code here.
Can anybody help me?
Inside your loop you are navigating to another page, which makes the rest of the links you iterate over stale. It's exactly the same problem you had above.
But why? I iterate over a list of strings that represent the links on the page, not over web elements...
