0

I'm reading a file which contains Unicode escape sequences among the text; here is an example:

\u201c@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup\u201d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\ud83d\ude0d\u2764\u2764\u2764

When I read it with a BufferedReader and write it back to another file with a FileWriter, the text becomes like this:

“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkupâ€ðŸ˜ðŸ˜ðŸ˜ðŸ˜â¤â¤â¤

due to the UTF-8 encoding, but what I want to have is:

“@hannah_hartzler: In line for the gate keeper! @nerk97 @ShannonWalkup”😍😍😍😍❤❤❤

My question is: how do I read and write the lines of text correctly, so that the right characters are printed?

I don't modify the lines of text; it's just a problem of conversion between Unicode and UTF-8. Here's my code:

FileReader fileReader = new FileReader("tweets.json");
BufferedReader bufferedReader = new BufferedReader(fileReader);
File tmp = new File("out.txt");
FileWriter fileWriter = new FileWriter(tmp);
BufferedWriter bw = new BufferedWriter(fileWriter);
...
String line = bufferedReader.readLine();
bw.write(line);
1
  • Either your platform default encoding is UTF-8, and your code just creates an exact copy of the original file, or it's not, and you're creating a copy of the file with a different encoding. The new file still contains the Unicode escape sequences. You need some code to extract the Unicode escape sequences and transform them to actual characters. Does the file really contain escape sequences, or is your text editor transforming special characters to escape sequences? Commented Oct 26, 2015 at 17:07
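If the file really does contain literal \uXXXX escape sequences (rather than already-decoded characters), a minimal unescaping helper could look like the sketch below. The method name and the idea of unescaping each line before writing it are assumptions for illustration, not part of the question's code.

// Hypothetical helper: turns literal \uXXXX escape sequences into the characters
// they denote. Surrogate pairs such as \ud83d\ude0d become two chars that a Java
// String combines into the intended emoji. No error handling for malformed input.
static String unescapeUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
        char c = s.charAt(i);
        if (c == '\\' && i + 5 < s.length() && s.charAt(i + 1) == 'u') {
            sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
            i += 6;
        } else {
            sb.append(c);
            i++;
        }
    }
    return sb.toString();
}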

2 Answers

1

When you open a file via new FileReader("tweets.json");, its contents get interpreted using the system’s default encoding. When you open the target file via new BufferedWriter(fileWriter), the characters get encoded via the system’s default encoding again. This might look like the file gets copied as-is, but unfortunately, things are not so simple.

When the file’s actual character encoding does not match the system’s default encoding, this misinterpretation might cause certain bytes to be classified as invalid, which leads to unspecified behavior: these “characters” might get filtered out or replaced by a replacement character, producing garbage or even characters that are invalid according to the real encoding in the target file.

As Andreas correctly pointed out, the first character has been copied without damage, but is incorrectly displayed, because whatever tool you used to open the file misinterpreted the contents again as Windows-1252. However, some of the other characters seem to be irreversibly damaged (but this could also be a result of copying them to this website)…

You may either use the constructors
new InputStreamReader(new FileInputStream("tweets.json"), StandardCharsets.UTF_8) and
new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8) to interpret a UTF-8 file correctly (a fuller sketch of this variant appears at the end of this answer) or, better, just copy the file without interpreting its contents:

Files.copy(Paths.get("tweets.json"), Paths.get("out.txt")); 

or, if you really want to do the copying loop manually

try(FileChannel in  = FileChannel.open(Paths.get("tweets.json"), READ);
    FileChannel out = FileChannel.open(Paths.get("out.txt"), WRITE, CREATE, TRUNCATE_EXISTING)) {
    long size = in.size(), trans = out.transferFrom(in, 0, size);
    for(long p = trans; p < size && trans > 0; p += trans)
        trans = out.transferFrom(in, p, size - p);
}

(assuming you do an import static java.nio.file.StandardOpenOption.*;)

If you copy the files this way, you ensure that no damage occurs. Then you may focus on using an editor that reads the copy with the right encoding.
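For reference, here is a minimal sketch of the character-stream variant mentioned above, assuming the file really is UTF-8. The file names are taken from the question; the try-with-resources structure and the line loop are additions for completeness, not the questioner's original code.

// Requires java.io.* and java.nio.charset.StandardCharsets.
// Reads and writes with an explicit UTF-8 charset instead of the platform default.
try(BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("tweets.json"), StandardCharsets.UTF_8));
    BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8))) {
    String line;
    while((line = in.readLine()) != null) {
        out.write(line);
        out.newLine();
    }
}

Note that readLine() strips the line terminators, so newLine() is used to restore them; this may normalize the original line endings.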


0

The Unicode character “ (\u201c) is encoded in UTF-8 as:

\xE2\x80\x9C 

Which in Windows-1252 looks like:

“ 

So your problem is not that the Java code isn't generating UTF-8, because it is, but that whatever tool you use to view the file content is reading it in Windows-1252.
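A quick way to reproduce this mismatch in isolation (a standalone sketch; the variable names are just for illustration):

// Encode the left double quotation mark as UTF-8, then decode those bytes as Windows-1252.
byte[] utf8 = "\u201c".getBytes(java.nio.charset.StandardCharsets.UTF_8);       // E2 80 9C
String misread = new String(utf8, java.nio.charset.Charset.forName("windows-1252"));
System.out.println(misread);  // prints â€œ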

If you use a program like Notepad++, you can change the encoding used by selecting the appropriate option on the Encoding pull-down menu.

FYI: Windows-1252 / ISO 8859-1 don't support smileys, so you can't use those encodings here.

