Unicode Replacement with ASCII

Question

I created a text file on a Windows system, where I think the default encoding is ANSI. The contents of the file look like this:

This is\u2019 a sample text file \u2014and it can ....

I saved this file using the default Windows encoding, although other encodings such as UTF-8 and UTF-16 were also available.

Now I want to write a simple Java function that takes an input string and replaces each of these Unicode characters with the corresponding ASCII character.

For example, \u2019 should be replaced with "'", \u2014 should be replaced with "-", and so on.

Observation: when I create a string literal like this

 String s = "This is\u2019 a sample text file \u2014and it can ....";

my code works fine, but when I read the same text from the file it does not work. I am aware that Java Strings use UTF-16 encoding.

Below is the code that I am using to read the input file.

    FileReader fileReader = new FileReader(new File("C:\\input.txt"));
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String record = bufferedReader.readLine();

I also tried using an InputStream and setting the Charset to UTF-8, but got the same result.

Replacement code:

    public static String removeUTFCharacters(String data) {
        for (Entry<String, String> entry : utfChars.entrySet()) {
            data = data.replaceAll(entry.getKey(), entry.getValue());
        }
        return data;
    }

Map:

    utfChars.put("\u2019", "'");
    utfChars.put("\u2018", "'");
    utfChars.put("\u201c", "\"");
    utfChars.put("\u201d", "\"");
    utfChars.put("\u2013", "-");
    utfChars.put("\u2014", "-");
    utfChars.put("\u2212", "-");
    utfChars.put("\u2022", "*");

Can anybody help me understand the concept and the solution to this problem?

Just to be clear, are you saying there are six characters in your file that are literally '\', 'u', '2', '0', '1', '9'?
– ajb, Commented Jun 13, 2014 at 23:31
In the real world I am going to receive this file from external systems, and they told me I will receive Unicode escapes like "\u2019" in the input text file. For unit testing I tried to create the same type of file that I am going to receive.
– saurav, Commented Jun 13, 2014 at 23:33
Could you show us what 16-bit characters actually show up in the String after you read it? Something like for (int i = 0; i < record.length(); i++) System.out.printf("%04X ", (int) record.charAt(i));
– ajb, Commented Jun 13, 2014 at 23:46
It displayed this: 0054 0068 0069 0073 0020 0069 0073 005C 005C 0075 0032 0030 0031 0039 0020 0061 0020 0073 0061 006D 0070 006C 0065 0020 0074 0065 0078 0074 0020 0066 0069 006C 0065 0020 005C 005C 0075 0032 0030 0031 0034 0061 006E 0064 0020 0069 0074 0020 0063 0061 006E 0020 002E 002E 002E
– saurav, Commented Jun 13, 2014 at 23:57

erickson · Accepted Answer · 2015-10-01 16:21:05Z

Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.

Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. The Java regex syntax also treats \ as an escape character (and \u in particular introduces a Unicode escape), so the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."

To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
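As a quick check of the pattern described so far, here is a minimal sketch that compiles it and extracts the captured hex digits from the first match. The class name `EscapePatternDemo` and the helper `firstHexGroup` are illustrative names, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapePatternDemo {
    // In source, "\\\\u" is the regex \\u, which matches a literal backslash followed by 'u'.
    // The parentheses capture the four hexadecimal digits as group 1.
    static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Hypothetical helper: returns the hex digits of the first escape sequence,
    // or null if the input contains none.
    static String firstHexGroup(String input) {
        Matcher m = ESCAPE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "\\u2019" in source is the six characters backslash, u, 2, 0, 1, 9 at run time.
        System.out.println(firstHexGroup("This is\\u2019 a sample"));
        System.out.println(firstHexGroup("no escapes here"));
    }
}
```

The first call finds the escape and returns its digits; the second finds nothing and returns null.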

In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.

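    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String data = "This is\\u2019 a sample text file \\u2014and it can ...";
    // Match a literal backslash, 'u', and four hex digits (captured as group 1).
    Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
    Matcher m = p.matcher(data);
    StringBuffer buf = new StringBuffer(data.length());
    while (m.find()) {
        // Decode the captured hex digits into the character they name.
        String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
        // Quote the replacement so that '$' and '\' are taken literally.
        m.appendReplacement(buf, Matcher.quoteReplacement(ch));
    }
    m.appendTail(buf);
    System.out.println(buf);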
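The loop above produces the decoded Unicode characters. The question asks for ASCII substitutes, so one possible extension is to look up each decoded character in the question's map before appending it. The sketch below assumes that approach. The class name `AsciiFolder`, the method `toAscii`, and the choice to keep unmapped characters as decoded are assumptions, not something the answer specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiFolder {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Mapping taken from the question, keyed by the decoded character.
    private static final Map<Character, String> ASCII = new LinkedHashMap<>();
    static {
        ASCII.put('\u2019', "'");
        ASCII.put('\u2018', "'");
        ASCII.put('\u201c', "\"");
        ASCII.put('\u201d', "\"");
        ASCII.put('\u2013', "-");
        ASCII.put('\u2014', "-");
        ASCII.put('\u2212', "-");
        ASCII.put('\u2022', "*");
    }

    // Hypothetical helper: decodes each escape sequence, then substitutes the
    // ASCII equivalent when one is mapped.
    static String toAscii(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // Assumption: characters without a mapping are kept as decoded.
            String replacement = ASCII.getOrDefault(decoded, String.valueOf(decoded));
            m.appendReplacement(buf, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("This is\\u2019 a sample text file \\u2014and it can ...."));
    }
}
```

Doing the lookup inside the match loop avoids a second pass with `replaceAll`, which would treat the map keys as regular expressions.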

Thanks, it worked. It would be a great help if you could explain what is actually going on in the background.
@Saurav Note that I made a small change to fix a bug when the sequence \u0024 ($) is found in the input. I will comment the example to explain what's happening.
And one more thing: what will be the impact if I change the encoding of my file from the default to UTF-8 or UTF-16 when saving it?
@Saurav I also wasn't handling hexadecimal properly, so pick up that change too. When you create the Reader, you should use an InputStreamReader and specify the encoding to be whatever you used to save the file. Right now, you are using the system default encoding to read the file, so if you wrote it using a different encoding, it could break. However, I'm guessing that the whole point of using the escape sequences in the input file is so that the files can be encoded in US-ASCII; that is, they should never contain "special" characters, right?
Yeah, makes sense to me. Thank you very much for your crisp explanation. Your example motivated me to learn regex.
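The advice in the comments about reading with an explicit encoding could look roughly like the sketch below. The class name `ReadWithCharset` and the method `readFirstLine` are illustrative. US-ASCII is used on the assumption, raised in the comments, that the incoming files contain only escape sequences and no raw non-ASCII characters. Pass whichever charset the file was actually saved with.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadWithCharset {
    // Hypothetical helper: reads the first line of a file, decoding its bytes
    // with the given charset instead of the platform default.
    static String readFirstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadWithCharset <file>");
            return;
        }
        // Assumption: the file is pure ASCII, with non-ASCII characters written as escapes.
        System.out.println(readFirstLine(Paths.get(args[0]), StandardCharsets.US_ASCII));
    }
}
```

The try-with-resources block closes the reader even if `readLine` throws, which the `FileReader` code in the question does not do.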

Stanislas Klukowski · Accepted Answer · 2017-11-15 15:37:51Z

If you can use another library, you can use Apache Commons Text: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html

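    // The doubled backslash keeps the escape as literal text; with a single
    // backslash the compiler would decode it before unescapeJava ever ran.
    String dirtyString = "Colocaci\\u00F3n";
    String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
    // cleanString = "Colocación"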
