InputStreamReader on a URL connection returning null

Question

I am following a tutorial on web scraping from the book "Web Scraping with Java". The following code gives me a NullPointerException. Part of the problem is that (line = in.readLine()) is always null, so the body of the while loop at line 33 never runs. However, I do not know why it is always null. Can anyone offer some insight? The code should print the first paragraph of the Wikipedia article on CPython.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.net.*;
    import java.io.*;

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/CPython");
        }

        public static void scrapeTopic(String url){
            String html = getUrl("http://www.wikipedia.org/"+url);
            Document doc = Jsoup.parse(html);
            String contentText = doc.select("#mw-content-text > p").first().text();
            System.out.println(contentText);
        }

        public static String getUrl(String url){
            URL urlObj = null;
            try{
                urlObj = new URL(url);
            }
            catch(MalformedURLException e){
                System.out.println("The url was malformed!");
                return "";
            }
            URLConnection urlCon = null;
            BufferedReader in = null;
            String outputText = "";
            try{
                urlCon = urlObj.openConnection();
                in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
                String line = "";
                while((line = in.readLine()) != null){
                    outputText += line;
                }
                in.close();
            }catch(IOException e){
                System.out.println("There was an error connecting to the URL");
                return "";
            }
            return outputText;
        }
    }

Because this question was downvoted, is there a way I can make it clearer? I am trying to figure out how to make the in.readLine() call return something other than null. I do not know how to explain the question further, other than that the result of the whole program should be the first paragraph of the Wikipedia article on CPython.
– Sam, Commented Jan 19, 2018 at 1:53
InputStreamReader does not return null. BufferedReader.readLine() does, and the reason is documented. Unclear what you're asking.
– user207421, Commented Jan 19, 2018 at 5:05

xingbin · Accepted Answer · 2018-01-19 03:34:35Z

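If you enter http://www.wikipedia.org//wiki/CPython in a web browser, you are redirected to https://en.wikipedia.org/wiki/CPython. A browser follows that redirect, but as far as I know URLConnection does not automatically follow a redirect that switches from http to https, so your code most likely reads the redirect response itself, which has nothing useful in its body. Use

    String html = getUrl("https://en.wikipedia.org/"+url);

instead of

    String html = getUrl("http://www.wikipedia.org/"+url);

Then line = in.readLine() actually reads something.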
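
If the final URL is not known in advance, the redirect can also be followed by hand. The following is only a minimal sketch under some assumptions: the class name RedirectAwareFetch, the helper names, the 10-second timeouts, and the cap of 5 hops are all made up for illustration, and the page is assumed to be UTF-8. It turns off automatic redirect handling, inspects the status code, and requests the address given in the Location header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class RedirectAwareFetch {

    // Arbitrary cap chosen for this sketch so a redirect loop cannot run forever.
    static final int MAX_REDIRECTS = 5;

    // True for the HTTP status codes that normally carry a Location header.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Resolves a Location value (absolute or relative) against the URL that was requested.
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    // Fetches the body at url, following redirects manually, including http -> https.
    static String fetch(String url) throws IOException {
        String current = url;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection con =
                (HttpURLConnection) URI.create(current).toURL().openConnection();
            con.setInstanceFollowRedirects(false); // every redirect is handled below
            con.setConnectTimeout(10_000);
            con.setReadTimeout(10_000);

            int status = con.getResponseCode();
            if (isRedirect(status)) {
                String location = con.getHeaderField("Location");
                con.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without Location header from " + current);
                }
                current = resolveLocation(current, location);
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " from " + current);
            }

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }
        throw new IOException("Too many redirects starting at " + url);
    }

    public static void main(String[] args) {
        try {
            String html = fetch("http://www.wikipedia.org//wiki/CPython");
            System.out.println("Fetched " + html.length() + " characters");
        } catch (IOException e) {
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Checking the status code also makes the failure visible: an unexpected response becomes an IOException with the code in its message, rather than an empty string that only fails later during parsing.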


8 Comments

Sam Over a year ago

Thank you. doc now appears to hold HTML that makes sense, but I still get a NullPointerException at the line String contentText = doc.select("#mw-content-text > p").first().text();

Sam Over a year ago

Can you tell me exactly how the doc.select("#mw-content-text > p") call works? I think this is the part that ends up returning null.

Sam Over a year ago

Or specifically just the "#mw-content-text > p" part

xingbin Over a year ago

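doc.select("#mw-content-text > p") selects the p elements that are direct children of the element whose id is mw-content-text. If you view the source of en.wikipedia.org/wiki/CPython in a web browser, you will see that, as far as I can tell, mw-content-text has no p elements as direct children (they are nested one level deeper). The selection is therefore empty, first() returns null, and calling text() on that null throws the NullPointerException.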

xingbin Over a year ago

@Jacob "Web Scraping with Java" was published on August 26, 2013. The web page has changed since then, so the sample may no longer work.
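
The difference between a child selector and a descendant selector, discussed in the comments above, can be tried without a network connection or Jsoup. The sketch below swaps in the JDK's built-in XPath engine as a stand-in for CSS selectors: the path step /p plays the role of "> p" (direct children only) and //p plays the role of a descendant match. The markup in SAMPLE is a hand-written assumption modeled on a wrapper div, not the real Wikipedia page, and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ChildVsDescendant {

    // Hypothetical stand-in for the page: the paragraphs sit inside a wrapper div,
    // not directly under the element with id mw-content-text.
    static final String SAMPLE =
        "<div id=\"mw-content-text\">"
      + "<div class=\"mw-parser-output\">"
      + "<p>First paragraph.</p><p>Second paragraph.</p>"
      + "</div></div>";

    // Evaluates an XPath expression against SAMPLE and returns the matching nodes.
    static NodeList select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath expression: " + expression, e);
        }
    }

    // Returns the text of the first match, or null when nothing matched.
    static String firstText(String expression) {
        NodeList nodes = select(expression);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        String child = "//*[@id='mw-content-text']/p";       // like "#mw-content-text > p"
        String descendant = "//*[@id='mw-content-text']//p"; // like "#mw-content-text p"

        System.out.println(select(child).getLength());      // 0: no direct p children
        System.out.println(select(descendant).getLength()); // 2: both nested paragraphs

        // Guard against the empty case before using the result.
        String text = firstText(child);
        System.out.println(text == null ? "no match" : text);
        System.out.println(firstText(descendant));
    }
}
```

In Jsoup, the analogous change would be to use the descendant form "#mw-content-text p" and to check the result of first() for null before calling text() on it.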
