My application is set up to store text as UTF-8. I read files from various other organizations, which might be in UTF-8, Latin-1, ASCII, etc. Do I need to do anything special to ensure that files in these various encodings are read into UTF-8 correctly? For example, do I need to figure out what character encoding each file is in and explicitly convert it to UTF-8?

Or is the following sufficient?

Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "UTF-8");

2 Answers

6

You have that wrong. You don't read into an encoding; you read from an encoding. The encoding you pass as the second argument to InputStreamReader should be the actual encoding of the source stream (the file).

Reader reader = new InputStreamReader(new FileInputStream("c:/file.txt"), "<encoding_of_file.txt>"); 

Once the data is in memory as Java chars or Strings, it is always UTF-16. When you want to write the data out (assuming you always want to write it as UTF-8), use:

Writer writer = new OutputStreamWriter(new FileOutputStream("destfile"), "UTF-8"); 
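Putting the two halves together, here is a minimal sketch of converting a file from a known source encoding to UTF-8. The class name `Transcoder` and the method `toUtf8` are made up for illustration, and the sketch assumes you already know the source encoding.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Transcoder {
    // Decodes `source` using `sourceCharset` and writes the text to `dest` as UTF-8.
    static void toUtf8(File source, Charset sourceCharset, File dest) throws IOException {
        try (Reader reader = new InputStreamReader(new FileInputStream(source), sourceCharset);
             Writer writer = new OutputStreamWriter(new FileOutputStream(dest), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int count;
            // The chars in the buffer are UTF-16 code units; the writer encodes them as UTF-8.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }
}
```

For a Latin-1 input you would call something like `Transcoder.toUtf8(new File("in.txt"), StandardCharsets.ISO_8859_1, new File("out.txt"))`, where both file names are placeholders.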

2

You need to tell the reader the encoding of the file.

If your input can be in many different encodings, then you might have a problem: you cannot reliably detect an encoding. See "How can I detect the encoding/codepage of a text file".

When you want to support different encodings, you basically have three options:

  • Store information about the encoding somewhere, such as <?xml version="1.0" encoding="UTF-8" ?> in XML files. Unfortunately, not all file formats have such metadata; "plain text" files, for example, do not.
  • "Detect"/guess the encoding with various heuristics. This might sometimes go wrong.
  • Ask the user. This is terrible user experience, because most users have absolutely no clue what encodings even are.
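As an illustration of the second option, here is a rough sketch of one common heuristic: try a strict UTF-8 decode first, and fall back to a caller-supplied encoding if the bytes are not valid UTF-8. The class `EncodingGuess` and its `decode` method are hypothetical names, and the choice of fallback is an assumption you have to make yourself. This can still guess wrong, since some non-UTF-8 byte sequences happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class EncodingGuess {
    // Returns the text decoded as UTF-8 if the bytes are well-formed UTF-8,
    // otherwise decodes them with the given fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: this is a guess, not a detection.
            return new String(bytes, fallback);
        }
    }
}
```

Pure ASCII input passes the UTF-8 check unchanged, so it needs no special handling here.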

2 Comments

Just so I understand it better, what would happen if you said the encoding was UTF-8, yet it was latin-1? Does all of latin-1 fit into UTF-8 and match up correctly? Or must you absolutely always set the file encoding of the source?
@BestPractices: Some encodings are compatible, most are not. If a file is ASCII but you read it as UTF-8, nothing bad happens (UTF-8 was designed this way). But if a file is Latin-1 and you read it as UTF-8, or the file is UTF-8 and you read it as Latin-1, then you get mojibake: en.wikipedia.org/wiki/Mojibake
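To see the mismatch concretely, here is a small sketch that encodes a string with one charset and decodes the bytes with another. `MojibakeDemo` and `decodeAs` are made-up names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class MojibakeDemo {
    // Encodes `text` with the `actual` charset, then decodes the bytes as `assumed`,
    // simulating a reader that was told the wrong encoding.
    static String decodeAs(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }
}
```

With this, `decodeAs("\u00e9", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)` turns "é" into the two characters "Ã©", because each UTF-8 byte is read as its own Latin-1 character. In the other direction, a Latin-1 "é" is not valid UTF-8, so the String constructor substitutes the replacement character U+FFFD. ASCII-only text comes through unchanged either way.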
