
I'm having trouble with some URLs from a web store called Kabum.

The url is http://www.kabum.com.br/cgi-local/kabum3/produtos/descricao.cgi?id=01:02:23:55:159

If I enter the address in the browser's address bar or click the link, I get the product page, but if I use Jsoup, I get a page containing only a meta refresh to the same address.

I tried setting the user agent and the referrer, and following the link in the meta tag, but I got the same page.

My code is here:

    Document doc;
    String url = "http://www.kabum.com.br/cgi-local/kabum3/produtos/descricao.cgi?id=01:02:23:55:159";
    try {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";
        String referrer = "http://www.google.com";
        doc = Jsoup.connect(url).timeout(20000).userAgent(ua).referrer(referrer).get();
        Elements meta = doc.select("html head meta");
        for (Iterator<Element> it = meta.iterator(); it.hasNext();) {
            Element element = it.next();
            if (element.attr("http-equiv").matches("refresh")) {
                String novaUrl = element.attr("content").replaceFirst("\\d?;url=", "");
                System.out.printf("redirecting to %s%n", novaUrl);
                doc = Jsoup.connect(novaUrl).userAgent(ua).referrer(referrer).get();
                break;
            }
        }
    } catch (IOException ex) {
        Logger.getLogger(Teste1.class.getName()).log(Level.SEVERE, null, ex);
        return;
    }
    System.out.println(doc);
  • There appears to be a lot of javascript on that page. Have you taken that into account? Commented May 17, 2012 at 18:00
  • I don't need the javascript, only the html. The html returned in jsoup is only this (can't format this?): <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" /> <meta http-equiv="refresh" content="0;url=kabum.com.br/cgi-local/kabum3/produtos/…" /> </head> <body></body> </html> Commented May 17, 2012 at 19:08
  • But the javascript may be generating a lot of the web page that you are trying to extract and see. If so, you may be out of luck with jsoup. Commented May 17, 2012 at 19:38
  • I tried disabling JavaScript with NoScript and the page still loads fine, so I don't think JavaScript is the problem. Commented May 17, 2012 at 20:44
  • Hm, this has me stumped. I don't know why Jsoup isn't working as expected. Commented May 17, 2012 at 20:56

2 Answers

You need to re-send the request with the cookies. The site returns a session cookie, which it expects to see in the next request.

    String url = "http://www.kabum.com.br/cgi-local/kabum3/produtos/descricao.cgi?id=01:02:23:55:159";
    Map<String, String> cookies = Jsoup.connect(url).execute().cookies();
    Document document = Jsoup.connect(url).cookies(cookies).get();
    System.out.println(document.html());

Note that you should use the same cookies on every subsequent request you'd like to fire in the same session.
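One way to keep the same cookies across a whole session is to hold them in a small helper and merge in whatever each response returns. The sketch below is only an illustration: `SessionCookies`, `update` and `snapshot` are hypothetical names, not part of Jsoup, and the cookie names and values in `main` are made up. With Jsoup you would pass `snapshot()` to `Jsoup.connect(url).cookies(...)` and feed the map from `execute().cookies()` back into `update(...)`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Jsoup class): remembers cookies between requests.
final class SessionCookies {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Merge the cookies returned by a response; newer values replace older ones.
    public void update(Map<String, String> fromResponse) {
        cookies.putAll(fromResponse);
    }

    // Copy of the current cookies, to be sent with the next request.
    public Map<String, String> snapshot() {
        return new LinkedHashMap<>(cookies);
    }

    public static void main(String[] args) {
        SessionCookies session = new SessionCookies();
        // Made-up values standing in for what a first response might return.
        session.update(Collections.singletonMap("session", "abc"));
        // A later response refreshes the session cookie.
        session.update(Collections.singletonMap("session", "def"));
        System.out.println(session.snapshot());
    }
}
```

Returning a copy from `snapshot()` keeps a caller from accidentally modifying the stored cookies while a request is being built.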


Very interesting.

Yeah, the following line: <meta http-equiv="refresh" content="0;url=kabum.com.br/cgi-local/kabum3/produtos/…" /> is telling the browser to refresh the current URL.

So it looks like the page tells the browser to keep refreshing the page until the server has satisfied whatever criteria it is looking for.

You'll have to figure out what criteria the server is looking for. The first things to check might be (1) the redirect limit that jsoup is set to (if it has "follow redirect" capability and can understand that meta tag), and (2) cookies.
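If you do follow the meta tag by hand, the target has to be pulled out of the `content` attribute and resolved against the page's own URL, since it may be relative. Here is a minimal sketch of that step using only the standard library; `MetaRefresh` and `target` are hypothetical names, the example URLs are placeholders, and the pattern covers only the common `seconds;url=...` form rather than every variant a browser accepts.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the redirect target from a meta refresh "content" value.
final class MetaRefresh {
    // Matches values such as "0;url=/next" or "5; URL='http://example.com/'".
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d*\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute target URL, or empty if the content carries no url part.
    public static Optional<String> target(String baseUrl, String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Resolve relative targets against the URL of the page that sent the tag.
        URI resolved = URI.create(baseUrl).resolve(m.group(1).trim());
        return Optional.of(resolved.toString());
    }

    public static void main(String[] args) {
        String base = "http://example.com/shop/page.cgi?id=1"; // placeholder URL
        System.out.println(target(base, "0;url=other.cgi?id=2"));
        System.out.println(target(base, "30")); // plain refresh, no redirect target
    }
}
```

Compared with a plain `replaceFirst("\\d?;url=", "")`, this handles an upper-case `URL=`, optional spaces and quotes, and relative targets.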
