3

I'm trying to read HTML from a URLConnection. In one case, the HTML file I'm trying to read includes five line breaks before the actual doctype declaration, and the input reader throws an EOFException.

    URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
    URLConnection getConn = pageUrl.openConnection();
    getConn.connect();
    DataInputStream dis = new DataInputStream(getConn.getInputStream());
    // some read method here

Has anyone run into a problem like this?

    URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
    URLConnection getConn = pageUrl.openConnection();
    getConn.connect();
    DataInputStream dis = new DataInputStream(getConn.getInputStream());
    String urlData = "";
    while ((urlData = dis.readUTF()) != null)
        System.out.println(urlData);

Exception thrown:

    java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
        at java.io.DataInputStream.readUTF(DataInputStream.java:572)
        at java.io.DataInputStream.readUTF(DataInputStream.java:547)

With a BufferedReader, readLine() just returns null and reading doesn't continue:

    pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
    URLConnection getConn = pageUrl.openConnection();
    getConn.connect();
    BufferedReader br = new BufferedReader(new InputStreamReader(getConn.getInputStream()));
    String urlData = "";
    while(true)
        urlData = br.readLine();
    System.out.println(urlData);

outputs null

2
  • Line breaks are not EOF. Perhaps post your reading code and the exception it is throwing? Commented Mar 20, 2011 at 22:25
  • I agree with the above comment from Brian R.: without the stack trace it's hard to tell what the problem is. Also, I'm not sure why you would need a DataInputStream to read HTML. It is mostly for reading Java primitive types in binary form, and its readLine() method is deprecated. If you want to read line by line, BufferedReader is a better choice. Commented Mar 20, 2011 at 22:33

3 Answers

1

You're using DataInputStream to read data that wasn't written with DataOutputStream. Look at the documented behavior of DataInputStream#readUTF(): it first reads two bytes to form an unsigned 16-bit integer giving the number of bytes that follow, and then decodes those bytes as a modified UTF-8 string. The data you're reading from the HTTP server is not encoded in this format.
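To make the framing concrete, here is a minimal offline sketch. The class and method names (ReadUtfDemo, writeUtf, readUtfHitsEof) are made up for illustration, and the HTML bytes are a stand-in for a response body, not the actual page. It shows the two-byte length prefix that writeUTF produces, and that readUTF() runs off the end of the stream when fed plain HTML or an empty body. An empty body fails while reading the length itself, which is consistent with the readUnsignedShort frame in the question's stack trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ReadUtfDemo {

    /** Encodes a string the way DataOutputStream.writeUTF does: 2-byte length, then the bytes. */
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    /** Returns true if readUTF() on these raw bytes fails with EOFException. */
    static boolean readUtfHitsEof(byte[] raw) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        try {
            in.readUTF();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // "hi" becomes four bytes: 0x00 0x02 (the length), then 'h' and 'i'.
        byte[] framed = writeUtf("hi");
        System.out.println("framed length: " + framed.length);

        // Plain HTML has no length prefix. The first two bytes ('\n' '\n')
        // are misread as a length of 0x0A0A, far more than the stream holds.
        byte[] html = "\n\n\n\n\n<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("HTML hits EOF: " + readUtfHitsEof(html));

        // An empty body fails even earlier, while reading the length itself.
        System.out.println("empty hits EOF: " + readUtfHitsEof(new byte[0]));
    }
}
```

The leading line breaks are not special here: any stream that was not produced by writeUTF will have its first two bytes misinterpreted as a length.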

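Instead, the HTTP server sends a status line and headers encoded in ASCII, per RFC 2616 sections 6.1 and 2.2, followed by the message body (the "entity"). URLConnection parses the status line and headers for you, so getInputStream() yields only the body. You need to determine how that body is encoded (for example, from the charset parameter of the Content-Type header) and then read it as text with a Reader.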
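As a rough sketch of reading the body as text, the helper below picks a charset out of a Content-Type value and decodes a stream line by line. The names BodyDecoder, charsetOf, and readAll are hypothetical, and the fallback charset is an assumption the caller has to supply. The parsing is deliberately simple and is not a full Content-Type parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class BodyDecoder {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the caller's fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Reads the whole stream as text, keeping a line break after every line. */
    static String readAll(InputStream in, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, cs))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

With a URLConnection named conn, this could be called as readAll(conn.getInputStream(), charsetOf(conn.getContentType(), StandardCharsets.ISO_8859_1)). Leading blank lines simply come back as empty strings from readLine(); only the end of the stream yields null.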

1

This works fine:

    package url;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;

    /**
     * UrlReader
     * @author Michael
     * @since 3/20/11
     */
    public class UrlReader {

        public static void main(String[] args) {
            UrlReader urlReader = new UrlReader();
            for (String url : args) {
                try {
                    String contents = urlReader.readContents(url);
                    System.out.printf("url: %s contents: %s\n", url, contents);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public String readContents(String address) throws IOException {
            StringBuilder contents = new StringBuilder(2048);
            BufferedReader br = null;
            try {
                URL url = new URL(address);
                br = new BufferedReader(new InputStreamReader(url.openStream()));
                String line;
                // Test for null before appending, so the end-of-stream marker
                // is never appended as the literal text "null".
                while ((line = br.readLine()) != null) {
                    // readLine() strips the line terminator, so add one back.
                    contents.append(line).append('\n');
                }
            } finally {
                close(br);
            }
            return contents.toString();
        }

        private static void close(Reader br) {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

0

This:

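    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    public class Main {
        public static void main(String[] args) throws MalformedURLException, IOException {
            URL pageUrl = new URL("http://www.google.com");
            URLConnection getConn = pageUrl.openConnection();
            getConn.connect();
            BufferedReader dis = new BufferedReader(
                    new InputStreamReader(getConn.getInputStream()));
            String myString;
            // readLine() returns null only at end of stream.
            while ((myString = dis.readLine()) != null) {
                System.out.println(myString);
            }
        }
    }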

Works perfectly. The URL you are supplying, however, returns nothing.

2 Comments

The supplied URL yields a 301 response ("Moved Permanently").
OK, thanks everyone. I didn't notice the 301, but I've fixed it now.
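For anyone hitting the same thing, one way to spot a redirect before reading the body is to check the status code and the Location header. This is only a sketch: the names RedirectCheck, isRedirect, resolveLocation, and redirectTarget are invented for illustration, and the set of status codes treated as redirects is an assumption, not an exhaustive list.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectCheck {

    /** True for the status codes this sketch treats as redirects. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location header value (absolute or relative) against the requested URI. */
    static URI resolveLocation(URI requested, String location) {
        return requested.resolve(location);
    }

    /** Returns the redirect target for the given URI, or null if the response is not a redirect. */
    static URI redirectTarget(URI requested) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) requested.toURL().openConnection();
        // Turn off automatic following so the redirect response itself is visible.
        conn.setInstanceFollowRedirects(false);
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (isRedirect(status) && location != null) {
                return resolveLocation(requested, location);
            }
            return null;
        } finally {
            conn.disconnect();
        }
    }
}
```

If redirectTarget returns a non-null URI, the page can be requested again from that address and its body read with a BufferedReader as in the answers above.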
