I am following a tutorial on web scraping from the book "Web Scraping with Java". The following code throws a NullPointerException. Part of the problem is that (line = in.readLine()) is always null, so the while loop in getUrl never runs. I do not know why it is always null, however. Can anyone offer insight into this? The code should print the first paragraph of the Wikipedia article on CPython.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.net.*;
    import java.io.*;

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/CPython");
        }

        public static void scrapeTopic(String url){
            String html = getUrl("http://www.wikipedia.org/"+url);
            Document doc = Jsoup.parse(html);
            String contentText = doc.select("#mw-content-text > p").first().text();
            System.out.println(contentText);
        }

        public static String getUrl(String url){
            URL urlObj = null;
            try{
                urlObj = new URL(url);
            }
            catch(MalformedURLException e){
                System.out.println("The url was malformed!");
                return "";
            }
            URLConnection urlCon = null;
            BufferedReader in = null;
            String outputText = "";
            try{
                urlCon = urlObj.openConnection();
                in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
                String line = "";
                while((line = in.readLine()) != null){
                    outputText += line;
                }
                in.close();
            }catch(IOException e){
                System.out.println("There was an error connecting to the URL");
                return "";
            }
            return outputText;
        }
    }
  • Because this question was downvoted, is there a way I can make it clearer? I am trying to figure out how to make the in.readLine() call return something other than null. I do not know how to explain my question further, other than that the whole program should print the first paragraph of the Wikipedia article on CPython. Commented Jan 19, 2018 at 1:53
  • `InputStreamReader` does not return null. BufferedReader.readLine() does, and the reason is documented. Unclear what you're asking. Commented Jan 19, 2018 at 5:05

1 Answer

If you enter http://www.wikipedia.org//wiki/CPython in a web browser, it is redirected to https://en.wikipedia.org/wiki/CPython. URLConnection does not follow a redirect from http to https on its own, so the response it reads has no page content. So use

String html = getUrl("https://en.wikipedia.org/"+url);

instead of

String html = getUrl("http://www.wikipedia.org/"+url);

Then line = in.readLine() actually has something to read.
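
If you would rather keep the original address and see what the server is doing, here is a minimal sketch that turns off automatic redirect handling and follows the Location header by hand. The class name RedirectAwareFetch, the helper changesProtocol and the maxRedirects limit are my own names for illustration and are not from the book. As far as I know, HttpURLConnection only follows redirects automatically when the protocol stays the same, which is why the http to https hop leaves you with an empty body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RedirectAwareFetch {

    // Hypothetical helper: true when the two addresses use different schemes,
    // for example http://... redirecting to https://...
    static boolean changesProtocol(String from, String to) {
        String a = URI.create(from).getScheme();
        String b = URI.create(to).getScheme();
        return a == null || b == null || !a.equalsIgnoreCase(b);
    }

    // Fetches the page body, following up to maxRedirects redirects by hand.
    static String fetch(String address, int maxRedirects) throws IOException {
        URL url = new URL(address);
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false); // we handle 3xx ourselves
            int status = con.getResponseCode();
            if (status >= 300 && status < 400) {
                String location = con.getHeaderField("Location");
                con.disconnect();
                if (location == null) {
                    throw new IOException("HTTP " + status + " without a Location header");
                }
                URL next = new URL(url, location); // resolves relative targets too
                if (changesProtocol(url.toString(), next.toString())) {
                    System.out.println("Redirect changes protocol: " + url + " -> " + next);
                }
                url = next;
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                con.disconnect();
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }
        throw new IOException("Gave up after " + maxRedirects + " redirects");
    }

    public static void main(String[] args) throws IOException {
        // Same address the question builds; needs network access to run.
        String html = fetch("http://www.wikipedia.org//wiki/CPython", 5);
        System.out.println("Fetched " + html.length() + " characters");
    }
}
```

Checking getResponseCode() before reading is the useful part here: a 301 or 302 shows up immediately instead of as a silent empty string.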


8 Comments

Thank you. It appears that doc now holds HTML that makes sense, but I still get a NullPointerException at the line String contentText = doc.select("#mw-content-text > p").first().text();
Can you tell me how exactly the doc.select("#mw-content-text > p") call works? I think this is the part that is returning null.
Or specifically just the "#mw-content-text > p" part.
doc.select("#mw-content-text > p") selects the p elements that are direct children of the element with id mw-content-text. If you view the source of en.wikipedia.org/wiki/CPython in a web browser, you will see that the element mw-content-text has no p elements as direct children, so doc.select("#mw-content-text > p") matches nothing, first() returns null, and calling text() on that null throws the NullPointerException.
@Jacob "Web Scraping with Java" was published on August 26, 2013. The web page has changed since then, so the sample may no longer work.
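
To make the selector behaviour concrete, here is a small sketch that runs both selectors against an inline HTML string, so it does not depend on how Wikipedia is laid out today. The class name FirstParagraph, the method firstParagraph and the sample markup are made up for illustration. It assumes jsoup is on the classpath, and it swaps the child selector (>) for a descendant selector (a space), which is one possible fix rather than the book's.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraph {

    // Returns the text of the first p anywhere inside #mw-content-text,
    // or an empty string when nothing matches (instead of throwing).
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        // "#mw-content-text p" matches p at any depth;
        // "#mw-content-text > p" would match direct children only.
        Element p = doc.select("#mw-content-text p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        // Made-up markup: the paragraph sits one level below the container.
        String nested = "<div id=\"mw-content-text\">"
                + "<div class=\"wrapper\"><p>Hello</p></div></div>";

        // The child selector finds nothing here, so first() would be null.
        boolean childSelectorEmpty =
                Jsoup.parse(nested).select("#mw-content-text > p").isEmpty();
        System.out.println(childSelectorEmpty);      // prints true

        System.out.println(firstParagraph(nested));  // prints Hello
    }
}
```

On the live page the first matching paragraph may not be the one you want, so treat the selector as a starting point and check it against the current page source.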