
This is my method

public void readFile3() throws IOException {
    try {
        FileReader fr = new FileReader(Path3);
        BufferedReader br = new BufferedReader(fr);
        String s = br.readLine();
        int a = 1;
        while (a != 2) {
            s = br.readLine();
            a++;
        }
        Storage.add(s);
        br.close();
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}

For some reason I am unable to read the file, which contains only this line: "Name Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz"

When I debug the code, the String s is returned as "\ufffd\ufffdN a m e", and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.

  • fileformat.info/info/unicode/char/0fffd/index.htm Commented Jun 30, 2014 at 15:07
  • fileformat.info/info/unicode/char/0fffd/index.htm i.e., you have a non-Unicode character. @Ocelot20 beat me to it! Commented Jun 30, 2014 at 15:08
  • But what is causing that? There is only that one line of text in the file. Commented Jun 30, 2014 at 15:09
  • Encode your text file properly (in UTF-8, for example) and specify the encoding. Commented Jun 30, 2014 at 15:13
  • It's saved as .txt; I'm using the Sublime editor. Could that be why? Furthermore, I have two other files which are saved in the same way using Sublime, but this issue doesn't appear in those files. Commented Jun 30, 2014 at 15:18

3 Answers


\ufffd is the replacement character in Unicode; it is used when you try to read a code that has no representation in Unicode. I suppose you are on a Windows platform (or at least that the file you read was created on Windows). Windows supports many formats for text files; the most common is ANSI, where each character is represented by its ANSI code.

But Windows can also directly use UTF-16, where each character is represented by its Unicode code point as a 16-bit integer, so with 2 bytes per character. Those files use a special marker (a Byte Order Mark, in Windows dialect) to say:

  • that the file is encoded with 2 (or even 4) bytes per character
  • whether the encoding is little- or big-endian

(Reference : Using Byte Order Marks on MSDN)
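To illustrate those markers, here is a hedged sketch (the class name BomCheck and the return strings are this example's own choices) that inspects the first bytes of a file's content for the common byte order marks:

```java
public class BomCheck {

    // Guesses the encoding from the leading bytes (BOM) of a file.
    // FF FE = UTF-16 little-endian, FE FF = UTF-16 big-endian,
    // EF BB BF = UTF-8 (optional, written by some Windows tools).
    static String detectBom(byte[] head) {
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        return "unknown (no BOM)";
    }

    public static void main(String[] args) {
        // FF FE is the BOM Notepad writes at the front of a "Unicode" file.
        System.out.println(detectBom(new byte[] { (byte) 0xFF, (byte) 0xFE, 'N', 0 }));
    }
}
```

Reading the first few bytes of the problem file this way is a quick check of which case you are in before choosing a charset.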

Since you write that after the first two replacement characters you see N a m e and not Name, I suppose you have a UTF-16 encoded text file. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them... The excellent vim can read files with different encodings and convert between them.

If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: UTF-16 — Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark.
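To make that concrete, here is a minimal self-contained sketch (the class name ReadUtf16, the helper firstLine, and the temp-file setup are invented for illustration, not from the question): it writes a small UTF-16 sample file and reads it back with the right charset, so no replacement characters appear.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ReadUtf16 {

    // Reads the first line of a file, decoding it as UTF-16
    // (the BOM at the start of the file selects the byte order).
    static String firstLine(File f) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_16))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-16 sample file, similar to the one in the question.
        File f = File.createTempFile("cpu", ".txt");
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f),
                StandardCharsets.UTF_16)) {
            w.write("Name\nIntel(R) Core(TM) i5-2500 CPU @ 3.30GHz\n");
        }
        System.out.println(firstLine(f)); // prints "Name", no \ufffd characters
        f.delete();
    }
}
```

Had the file been opened with a plain FileReader (which uses the platform default charset), the BOM bytes would decode as garbage, which is exactly the "\ufffd\ufffdN a m e" symptom described above.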


4 Comments

“ANSI” is not a character encoding; you're probably thinking of “ASCII”, more formally known as ISO/IEC 646:1991, or "Latin-1", more formally known as ISO/IEC 8859-1.
Very little of this answer is actually specific to Windows. A broader explanation would be an improvement.
ANSI is the name used in Microsoft docs for the 8-bit encoding used in the Windows (graphical) subsystem. They define: OEM for the 8-bit encoding in console applications, UNICODE for UTF-16, and ANSI for the 8-bit encoding in windowed applications. ANSI is commonly cp1252 (a variation of ISO-8859-1) for Western European languages, and can be cp1250 (a variation of ISO-8859-2) for Eastern European ones. This is really specific to the Windows system.
OK, so that explains the weird naming of the standard; impersonating an official standards-setting body is pretty obnoxious, but not uncommon. Saying CP1251/2/3… rather than ANSI would make it clear that they're not asserting that this is a publication of the American National Standards Institute. The only context given in the question is the JVM, not Windows, and although UTF-16/UCS-2 text files are unusual on other systems, they can be encountered there. The word “suppose” doesn't adequately indicate that the entire answer is predicated on this assumption.

You must specify the encoding when reading the file; in your case it is probably UTF-16.

Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);

Check the documentation for more details: InputStreamReader class.



Check to see if the file is .odt, .rtf, or something other than .txt; this may be what's causing the extra characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded in UTF-8.

Perhaps you have non-ASCII characters such as '®' in your document.

3 Comments

I just tried to open the file in Notepad and it said that it's Unicode. I checked my other files and they were ANSI. If this is the case, then I have over 2000 folders with these types of files saved in that format. Is there a way to actually convert those to UTF-8 or ANSI?
If notepad said “ANSI” then it's broken, because ANSI isn't a character encoding. Perhaps you mean “ASCII”?
Or as indicated in response to the other answer, “it's just normal for Microsoft, as they're impersonating the American National Standards Institute”.
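To the batch-conversion question in the comments, a hedged sketch, assuming the files really are UTF-16 with a BOM (the class name ConvertToUtf8 and the directory-walking approach are this example's own choices, not from the thread). It rewrites every .txt file under a root directory in place as UTF-8:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.stream.Stream;

public class ConvertToUtf8 {

    // Rewrites every .txt file under root from UTF-16 to UTF-8.
    // Assumes the files are all UTF-16; run it on copies first.
    static void convertTree(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".txt"))
                 .forEach(p -> {
                     try {
                         // Decode as UTF-16 (the BOM picks the byte order) ...
                         String text = new String(Files.readAllBytes(p),
                                 StandardCharsets.UTF_16);
                         // ... and write the same text back as UTF-8.
                         Files.write(p, text.getBytes(StandardCharsets.UTF_8));
                     } catch (IOException e) {
                         System.err.println("Skipping " + p + ": " + e.getMessage());
                     }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        // Small self-contained demo on a temporary directory.
        Path dir = Files.createTempDirectory("convert-demo");
        Path f = dir.resolve("cpu.txt");
        Files.write(f, "Name".getBytes(StandardCharsets.UTF_16));
        convertTree(dir);
        System.out.println(new String(Files.readAllBytes(f),
                StandardCharsets.UTF_8)); // prints "Name"
    }
}
```

Note this blindly assumes every .txt file is UTF-16; mixing in files that are already UTF-8 or ANSI would corrupt them, so a BOM check before converting each file would be a sensible refinement.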
