
So, I have an issue that really bothers me. I have a simple parser that I made in java. Here is the piece of relevant code:

    while ((line = br.readLine()) != null) {
        String splitted[] = line.split(SPLITTER);
        int docNum = Integer.parseInt(splitted[0].trim());
        // do something
    }

The input file is a CSV file whose first entry is an integer. When I start parsing, I immediately get this exception:

    Exception in thread "main" java.lang.NumberFormatException: For input string: "1"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at dipl.parser.TableParser.parse(TableParser.java:50)
        at dipl.parser.DocumentParser.main(DocumentParser.java:87)

I checked the file; it indeed has 1 as its first value (no other characters are in that field), but I still get the message. I suspect the file encoding may be the cause: it is UTF-8 with Unix line endings, and the program runs on Ubuntu 14.04. Any suggestions on where to look for the problem are welcome.

1 Comment

  • Nice one using copy and paste to put the error in the question! Commented Sep 26, 2016 at 11:11

1 Answer


You have a BOM in front of that number; if I copy what looks like "1" in your question and paste it into vim, I see that you have a FE FF (i.e., a BOM) in front of it. From that link:

The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format.

So that's the issue: consume the file with a reader appropriate for the transformation format the file is actually encoded in (UTF-8, UTF-16 big-endian, UTF-16 little-endian, etc.). See also this question and its answers for more about reading Unicode files in Java.
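As a concrete illustration, here is one way to handle it in Java: read the file with an explicit charset and strip a leading U+FEFF from the first line before splitting. This is a minimal sketch, assuming a comma delimiter and the class/method names (`BomSafeParser`, `stripBom`, `parse`) are made up for the example, not from the original code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomSafeParser {
    private static final String SPLITTER = ","; // assumed delimiter

    /** Removes a leading U+FEFF (the BOM after decoding) if present. */
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    static void parse(Path file) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    line = stripBom(line); // only the first line can carry a BOM
                    first = false;
                }
                String[] splitted = line.split(SPLITTER);
                int docNum = Integer.parseInt(splitted[0].trim());
                // do something with docNum
            }
        }
    }
}
```

For files that might be UTF-16, swap in the matching `StandardCharsets` constant (or a BOM-detecting reader such as Apache Commons IO's `BOMInputStream`) rather than hard-coding UTF-8.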


4 Comments

@Doval: Thank you, I was absolutely wrong to say it was a UTF-8 BOM, and you're quite right that on-the-wire, the BOM for UTF-8 is EF BB BF. But what we're looking at is the end result of reading the file and then seeing the output in the error message. The file might be in any transformation; all BOMs end up being FE FF once read.
But if it was read raw, then...oh, I don't know. :-) Could well have been UTF-16. :-) It'll all depend on how the file was read into the stream.
"all BOMs end up being FE FF once read" - Not quite. All BOMs end up being U+FEFF (which is not the same as 0xFE 0xFF since it's a code point rather than a sequence of bytes) once decoded. Before decoding, all you have is bytes, which may be in any encoding that can represent Unicode characters (mostly UTF-8 and UTF-16 but others exist).
@Kevin: Yes, that's what I meant.
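The distinction the comments draw between the code point U+FEFF and the bytes on disk can be checked with a short snippet (a sketch; the class name `BomDemo` is invented for the example): the same BOM-prefixed string encodes to different byte sequences under UTF-8 and UTF-16BE, yet both decode back to a string beginning with the single character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        String s = "\uFEFF1"; // BOM code point followed by "1"

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);       // EF BB BF 31
        byte[] utf16be = s.getBytes(StandardCharsets.UTF_16BE); // FE FF 00 31

        // Different byte sequences on the wire...
        System.out.println(utf8.length + " vs " + utf16be.length);

        // ...but both decode to the same string, starting with U+FEFF.
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16 = new String(utf16be, StandardCharsets.UTF_16BE);
        System.out.println(fromUtf8.equals(fromUtf16));          // true
        System.out.println(fromUtf8.charAt(0) == '\uFEFF');      // true
    }
}
```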
