25

I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols: it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the whole word is treated as a single token.

For example, the source sentence may contain the Unicode directional quotation mark U+2018 (‘), and I would like to convert that to U+0027 ('). Eventually I will be stripping the remaining Unicode.

I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.

This is what I could come up with, but I'm sure I will make mistakes, miss things, etc.:

    // double quotation (")
    replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
    // single quotation (')
    replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));

replacements is a list of a custom Replacement class that I later iterate over to apply the replacements.

    for (Replacement replacement : replacements) {
        text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
    }

As you can see, I had to find:

  • LEFT SINGLE QUOTATION MARK
  • RIGHT SINGLE QUOTATION MARK
  • SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)
  • SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)
3
  • Are you looking for a library and/or example code in a particular language? Or are you looking for a pre-existing mapping of Unicode characters onto ASCII approximations? I'm not sure what the difference is between a regex and code you can reuse. Commented Jan 26, 2011 at 19:32
  • I am looking for a Java library. I can write regular expressions, but I'm sure I will miss something in the process. I was wondering if someone else has already made decisions for me. Have you been reading GEB, Mu Mind? Commented Jan 26, 2011 at 19:49
  • those unicode links are dead Commented Apr 11, 2014 at 15:16

7 Answers

16

I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.

Here's more info: Map Symbols & Punctuation to ASCII.


2 Comments

I translated that list to Scala and put it here: gist.github.com/dirkgr/6349f379740880209475
@schmmd has a more comprehensive version below.
9

I followed @marek-stoj's link and created a Scala application that cleans Unicode out of strings while maintaining the string length. It removes diacritics (accents) and uses the map suggested by @marek-stoj to convert non-ASCII Unicode characters to their ASCII approximations.

    import java.text.Normalizer

    object Asciifier {
      def apply(string: String): String = {
        var cleaned = string
        for ((unicode, ascii) <- substitutions) {
          // literal replace: replaceAll would interpret a backslash substitute as an escape
          cleaned = cleaned.replace(unicode, ascii)
        }

        // convert diacritics to a two-character form (NFD)
        // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
        cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD)

        // remove all characters that combine with the previous character
        // to form a diacritic. Also remove control characters.
        // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
        // Strings are immutable, so the result has to be assigned back.
        cleaned = cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "")

        // size must not change (this can fail if the input contains
        // control characters, which are removed above)
        require(cleaned.size == string.size)

        cleaned
      }

      val substitutions = Set(
        (0x00AB, '"'), (0x00AD, '-'), (0x00B4, '\''), (0x00BB, '"'), (0x00F7, '/'),
        (0x01C0, '|'), (0x01C3, '!'), (0x02B9, '\''), (0x02BA, '"'), (0x02BC, '\''),
        (0x02C4, '^'), (0x02C6, '^'), (0x02C8, '\''), (0x02CB, '`'), (0x02CD, '_'),
        (0x02DC, '~'), (0x0300, '`'), (0x0301, '\''), (0x0302, '^'), (0x0303, '~'),
        (0x030B, '"'), (0x030E, '"'), (0x0331, '_'), (0x0332, '_'), (0x0338, '/'),
        (0x0589, ':'), (0x05C0, '|'), (0x05C3, ':'), (0x066A, '%'), (0x066D, '*'),
        (0x200B, ' '), (0x2010, '-'), (0x2011, '-'), (0x2012, '-'), (0x2013, '-'),
        (0x2014, '-'), (0x2015, '-'), (0x2016, '|'), (0x2017, '_'), (0x2018, '\''),
        (0x2019, '\''), (0x201A, ','), (0x201B, '\''), (0x201C, '"'), (0x201D, '"'),
        (0x201E, '"'), (0x201F, '"'), (0x2032, '\''), (0x2033, '"'), (0x2034, '\''),
        (0x2035, '`'), (0x2036, '"'), (0x2037, '\''), (0x2038, '^'), (0x2039, '<'),
        (0x203A, '>'), (0x203D, '?'), (0x2044, '/'), (0x204E, '*'), (0x2052, '%'),
        (0x2053, '~'), (0x2060, ' '), (0x20E5, '\\'), (0x2212, '-'), (0x2215, '/'),
        (0x2216, '\\'), (0x2217, '*'), (0x2223, '|'), (0x2236, ':'), (0x223C, '~'),
        (0x2264, '<'), (0x2265, '>'), (0x2266, '<'), (0x2267, '>'), (0x2303, '^'),
        (0x2329, '<'), (0x232A, '>'), (0x266F, '#'), (0x2731, '*'), (0x2758, '|'),
        (0x2762, '!'), (0x27E6, '['), (0x27E8, '<'), (0x27E9, '>'), (0x2983, '{'),
        (0x2984, '}'), (0x3003, '"'), (0x3008, '<'), (0x3009, '>'), (0x301B, ']'),
        (0x301C, '~'), (0x301D, '"'), (0x301E, '"'), (0xFEFF, ' ')
      ).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) }
    }

1 Comment

You have a bug: replaceAll doesn't mutate string. You need to assign result of replaceAll back to cleaned.
7

Each Unicode character is assigned a category. There are two separate categories for quotes:

  • Punctuation, Initial quote (Pi)
  • Punctuation, Final quote (Pf)

With these lists, you should be able to handle all quotes appropriately, if you would like to code the regex manually.

Java's Character.getType gives you the category of a character, for example FINAL_QUOTE_PUNCTUATION.

Now you can get the category of each (punctuation) character and replace it with an appropriate ASCII substitute.

You can use the other punctuation categories accordingly. In 'Punctuation, Other' there are some characters, for example PRIME (′), which you may also want to substitute with an apostrophe.
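As a rough illustration of the category-based approach, the sketch below uses Character.getType to detect initial and final quote punctuation. The class name QuoteCategoryNormalizer, the method normalizeQuotes, and the rule of choosing ' or " by looking for the word SINGLE in Character.getName are assumptions made for this example, not part of the answer; the category alone does not say whether a quote is single or double.

```java
final class QuoteCategoryNormalizer {

    // Hypothetical helper: replaces every character in the Pi or Pf
    // category with an ASCII quote and copies everything else unchanged.
    static String normalizeQuotes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.INITIAL_QUOTE_PUNCTUATION
                    || type == Character.FINAL_QUOTE_PUNCTUATION) {
                // Assumption: the Unicode name tells single from double quotes.
                String name = Character.getName(cp);
                boolean single = name != null && name.contains("SINGLE");
                out.append(single ? '\'' : '"');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeQuotes("\u201CIt\u2019s fine,\u201D she said."));
    }
}
```

Note that the low-9 quotes U+201A and U+201E are in the 'Punctuation, Open' category, so a check like this leaves them untouched unless you handle them separately.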

2 Comments

I'm resorting to just using a custom map, with as many characters as I can define, because the Unicode categories assigned to basic characters seem inadequate. For example, the basic single and double quote characters (the ones you type into notepad using your keyboard for example) are categorized as "Punctuation Other", rather than the Punctuation Initial and Punctuation Final categories that you'd expect them to be categorized under.
@Triynko - the problem there is: there is only one "normal" (ASCII) single quote and one double quote, so marking it as either INITIAL or FINAL quote punctuation would also be wrong.
3

While this does not exactly answer your question, you can convert your Unicode text to US-ASCII, replacing non-ASCII characters with '?' symbols.

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;

    String input = "aáeéiíoóuú"; // 10 chars.

    Charset ch = Charset.forName("US-ASCII");
    CharsetEncoder enc = ch.newEncoder();
    enc.onUnmappableCharacter(CodingErrorAction.REPLACE);
    enc.replaceWith(new byte[]{'?'});

    ByteBuffer out = null;
    try {
        out = enc.encode(CharBuffer.wrap(input));
    } catch (CharacterCodingException e) {
        /* ignored, shouldn't happen */
    }

    String outStr = ch.decode(out).toString();
    // Prints "a?e?i?o?u?"
    System.out.println(outStr);

2 Comments

I remove diacritics with Normalizer.normalize(text, Normalizer.Form.NFD) followed by a replace with Pattern.compile("\\p{InCombiningDiacriticalMarks}+").
With this solution, basic punctuation marks like quotes that ought to be mapped are not mapped to the ASCII quote. Many other Unicode characters that you would say "this is basically the same thing as this ASCII character" will not get mapped properly. Therefore, I think that using a custom map with all reasonable replacements would achieve better results.
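The diacritic removal described in the first comment above could look roughly like this. The class name DiacriticStripper and the method strip are hypothetical; only Normalizer, the NFD form, and the \p{InCombiningDiacriticalMarks} pattern come from the comment.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticStripper {

    // Matches the combining marks that NFD splits off from base letters.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String strip(String text) {
        // NFD decomposes an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // Same ten characters as the input in the answer above.
        System.out.println(strip("a\u00E1e\u00E9i\u00EDo\u00F3u\u00FA")); // aaeeiioouu
    }
}
```

Letters that have no decomposition (for example ø or ß) pass through unchanged, so this complements a punctuation map rather than replacing it.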
3

Here's a Python package that does a good job. It's based on a Perl module Text::Unidecode. I assume this could be ported to Java.

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

http://pypi.python.org/pypi/Unidecode


2

What I've done for similar substitutions is create a Map (usually HashMap) with the Unicode characters as the keys and their substitute as the values.

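A Java sketch for a String input; the loop depends on what sort of character container you're using as a parameter to the method that does this, e.g. String, CharSequence, etc.

    // xlateMap is a Map<Character, Character> from Unicode characters to ASCII substitutes
    StringBuilder output = new StringBuilder(inputString.length());
    for (int i = 0; i < inputString.length(); i++) {
        char c = inputString.charAt(i);
        Character replacement = xlateMap.get(c);
        // Mapped characters are replaced; everything else is copied as-is
        output.append(replacement != null ? replacement : c);
    }
    return output.toString();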

Anything in the Map is replaced, anything not in the Map is unchanged and copied to output.
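A self-contained variant of this idea might look like the following. It is only a sketch: the class name AsciiMapper, the method toAscii, and the handful of table entries are made up for illustration. It keys the map by code point and maps to a String so that a single character such as the ellipsis can expand to several ASCII characters.

```java
import java.util.HashMap;
import java.util.Map;

final class AsciiMapper {

    // Illustrative subset only; a real table would hold many more entries.
    private static final Map<Integer, String> XLATE = new HashMap<>();
    static {
        XLATE.put(0x2018, "'");   // LEFT SINGLE QUOTATION MARK
        XLATE.put(0x2019, "'");   // RIGHT SINGLE QUOTATION MARK
        XLATE.put(0x201C, "\"");  // LEFT DOUBLE QUOTATION MARK
        XLATE.put(0x201D, "\"");  // RIGHT DOUBLE QUOTATION MARK
        XLATE.put(0x2013, "-");   // EN DASH
        XLATE.put(0x2014, "-");   // EM DASH
        XLATE.put(0x2026, "..."); // HORIZONTAL ELLIPSIS
    }

    static String toAscii(String input) {
        StringBuilder output = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String replacement = XLATE.get(cp);
            if (replacement != null) {
                output.append(replacement);   // mapped: substitute
            } else {
                output.appendCodePoint(cp);   // unmapped: copy unchanged
            }
        });
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u201CWait\u2026\u201D \u2013 it\u2019s here"));
    }
}
```

Because the values are strings, the output may be longer than the input, which matters if later processing relies on character offsets into the original text.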


2
    String lstring = "my string containing all different symbols";
    // replace() substitutes literally; no regex is needed for single characters
    lstring = lstring.replace("\u2013", "-")
        .replace("\u2014", "-")
        .replace("\u2015", "-")
        .replace("\u2017", "_")
        .replace("\u2018", "'")
        .replace("\u2019", "'")
        .replace("\u201a", ",")
        .replace("\u201b", "'")
        .replace("\u201c", "\"")
        .replace("\u201d", "\"")
        .replace("\u201e", "\"")
        .replace("\u2026", "...")
        .replace("\u2032", "'")
        .replace("\u2033", "\"");

