Java UTF-8 encoding produces incorrect output

Question

In Java, I've been trying to write a String to a file using UTF-8 encoding which will later be read by another program written in a different programming language. While doing so I noticed that the bytes created when encoding a String into a byte array didn't seem to have the correct byte values.

I narrowed the problem down to the symbol "£", which seems to produce incorrect bytes when encoded to UTF-8:

    byte[] byteArray = "£".getBytes(Charset.forName("UTF-8"));
    // Print out the byte array of the UTF-8 converted string.
    // Upcast byte values to print the bytes as unsigned.
    for (byte signedByte : byteArray) {
        System.out.print((signedByte & 0xFF) + " ");
    }

This outputs 6 bytes with the decimal values 239 190 130 239 189 163, which in hex is ef be 82 ef bd a3.

However, http://www.utf8-chartable.de/ says that the UTF-8 bytes for "£" in hex are c2 a3, so the output should be 194 163.

Other strings seem to produce correct bytes when encoded as UTF-8, so I'm wondering why Java is producing these 6 bytes for "£", and how I should go about properly converting my Strings to byte arrays using UTF-8 encoding.

I have also tried:

    OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8");
    out.write("£");
    out.close();

but this produced the same 6 bytes.

I copied and pasted your code, and it produces 194 163, as expected. I doubt you're actually encoding a single £ character to get 6 bytes. Which platform and JDK are you using? Are you sure the code executed is the code you posted?
– JB Nizet, Commented Mar 1, 2014 at 20:58
My full code is at pastebin.com/uV3m6Pri. I'm using jdk1.7.0_51 on 64-bit Windows 7.
– user3369258, Commented Mar 1, 2014 at 21:02
@user3369258: But what encoding is your source file in, and how are you compiling? Try replacing each occurrence of £ with \u00a3 in your code, and I'm sure you'll find it works.
– Jon Skeet, Commented Mar 1, 2014 at 21:04

Jon Skeet · Accepted Answer · 2014-03-01 21:00:21Z

I suspect the problem is that you're using a string literal in your Java code using an editor which writes it out in one encoding - but then you're compiling without specifying the same encoding. In other words, I suspect that your "£" string is not actually a single pound sign at all.
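To make the suspected mechanism concrete, here is a minimal sketch that simulates it at runtime. It assumes, purely for illustration, that the source file was saved as UTF-8 and that the compiler read it as windows-31j. Neither is confirmed by the question, and the class and method names are made up for this example. The guard is there because windows-31j is an optional charset that not every Java runtime ships.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a compiler reading the bytes of a source file with the wrong
    // charset, followed by the program encoding the resulting String as UTF-8.
    static byte[] misreadThenEncode(byte[] sourceBytes, Charset assumedByCompiler) {
        String misread = new String(sourceBytes, assumedByCompiler);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as lowercase, space-separated hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pound sign as an editor would store it in a UTF-8 source file.
        byte[] poundInUtf8 = {(byte) 0xC2, (byte) 0xA3};

        // Assumption: the compiler decoded the file as windows-31j.
        String assumed = "windows-31j";
        if (Charset.isSupported(assumed)) {
            byte[] result = misreadThenEncode(poundInUtf8, Charset.forName(assumed));
            System.out.println(toHex(result)); // ef be 82 ef bd a3
        } else {
            System.out.println(assumed + " is not available on this JVM");
        }
    }
}
```

Under that assumption, each of the two UTF-8 bytes is decoded as a separate character, and each of those characters then takes three bytes in UTF-8, which would account for the six bytes reported in the question.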

This should be easy to validate. For example:

    char[] chars = "£".toCharArray();
    for (char c : chars) {
        System.out.println((int) c);
    }

To take this out of the equation, you can specify the string using a pure-ASCII representation using a Unicode escape sequence:

    String pound = "\u00a3";
    // Now encode as before

I'm sure you'll then get the right bytes. For example:

    import java.nio.charset.Charset;

    class Test {
        public static void main(String[] args) throws Exception {
            String pound = "\u00a3";
            byte[] bytes = pound.getBytes(Charset.forName("UTF-8"));
            for (byte b : bytes) {
                System.out.println(b & 0xff); // 194, 163
            }
        }
    }
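The same idea should carry over to the file-writing attempt in the question. The following is a sketch rather than the asker's actual code: the class name, method name and temp-file prefix are invented for illustration. It writes the escaped pound sign through a UTF-8 writer and reads the raw bytes back to check them.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WritePound {
    // Writes a pound sign to the given file as UTF-8 and returns the raw bytes on disk.
    static byte[] writeAndReadBack(Path file) throws IOException {
        // The escape keeps the source pure ASCII, so the compiler's
        // source encoding cannot change what this literal means.
        String pound = "\u00a3";
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(pound);
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical scratch file used only for this demonstration.
        Path file = Files.createTempFile("pound", ".txt");
        try {
            for (byte b : writeAndReadBack(file)) {
                System.out.println(b & 0xff); // 194, 163
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Passing the charset explicitly to the writer controls how the String is encoded on output. It does nothing about how the compiler decoded the literal, which is why the question's OutputStreamWriter attempt produced the same six bytes.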

Thank you, writing it out using the Unicode escape sequence worked! Running your first code block printed: 65410 65379
Great answer! Didn't think about the mismatch between the file and compiler encodings.
To add to this solution: I believe the problem Jon Skeet described arose because my system locale was Japanese, which set my default file encoding to "MS932" and defaultCharset to "windows-31j" instead of "UTF-8". By setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, I got the JVM to start with its default encodings set to UTF-8 rather than the system defaults, and the program worked without using the Unicode escape sequence! Check out stackoverflow.com/questions/361975/…
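To check which defaults a given JVM is actually running with, something like the following could be used. It is a small sketch with an invented class name, and its output depends entirely on the machine, the JDK version, and any options such as the JAVA_TOOL_OPTIONS setting mentioned above.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    // Reports the file.encoding system property and the JVM's default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Output varies by platform and JVM options, so no expected value is shown.
        System.out.println(describe());
    }
}
```

As an alternative to changing the JVM-wide default, javac also accepts an -encoding option (for example -encoding UTF-8), which states the source file encoding explicitly at compile time.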
