Unicode to ASCII conversion/mapping

Question

I need a conversion/mapping like the one performed by the CLCL clipboard manager.

It works like this:

I copy the following Unicode text: ūī
And CLCL converts it to: ui

Is there a technique for this kind of conversion? Or are there mapping tables that can be used so that, for example, the symbol ū is mapped to u?

UPDATE

Thanks to all for the help. Here is what I came up with, a hybrid of two solutions: one posted by Erik Schierboom and one taken from http://blogs.infosupport.com/normalizing-unicode-strings-in-c/#comment-8984

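    // Requires: using System.Globalization; using System.Linq; using System.Text;
    public static string ConvertUnicodeToAscii(string unicodeStr, bool skipNonConvertibleChars = false)
    {
        if (string.IsNullOrWhiteSpace(unicodeStr))
        {
            return unicodeStr;
        }

        // Decompose accented characters into a base character plus combining marks.
        var normalizedStr = unicodeStr.Normalize(NormalizationForm.FormD);

        if (skipNonConvertibleChars)
        {
            // Keep only ASCII; this also drops characters that have no decomposition.
            return new string(normalizedStr.ToCharArray().Where(c => (int) c <= 127).ToArray());
        }

        // Drop only the combining marks and keep everything else.
        return new string(
            normalizedStr.Where(
                c =>
                {
                    UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
                    return category != UnicodeCategory.NonSpacingMark;
                }).ToArray());
    }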
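Neither mode of the method above maps a letter such as Ł to L: it is either dropped or left as-is, because it has no canonical decomposition. The question also asks about mapping tables, so here is a minimal sketch of how a small fallback table could fill that gap. The class name `AsciiFolding`, the method `Fold`, and the table contents are hypothetical and deliberately incomplete; this is an illustration, not the approach CLCL is known to use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical, incomplete table for letters that FormD does not decompose.
    private static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        { '\u0141', "L" },  // Ł
        { '\u0142', "l" },  // ł
        { '\u00D8', "O" },  // Ø
        { '\u00F8', "o" },  // ø
        { '\u00DF', "ss" }, // ß
        { '\u00E6', "ae" }, // æ
    };

    public static string Fold(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // Drop the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            string mapped;
            if (c <= 127)
            {
                sb.Append(c);       // already ASCII
            }
            else if (Fallback.TryGetValue(c, out mapped))
            {
                sb.Append(mapped);  // table hit
            }
            else
            {
                sb.Append(c);       // unknown character: left in place
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Fold("\u0141ukasz"));   // Łukasz -> Lukasz
        Console.WriteLine(Fold("\u016B\u012B"));  // ūī -> ui
    }
}
```

Characters missing from the table are kept unchanged here; appending nothing in the final `else` branch would reproduce the skipping behaviour instead.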

What, several questions that say that this is impossible? Which are those questions? They are wrong and need proper answers. There are also several questions which show how this works.
– Konrad Rudolph, Commented Mar 28, 2013 at 13:56

Erik Schierboom · Accepted Answer · 2013-03-28 14:51:52Z

3

I have used the following code for some time:

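    // Requires: using System; using System.Linq; using System.Text;
    private static string NormalizeDiacriticalCharacters(string value)
    {
        if (value == null)
        {
            throw new ArgumentNullException("value");
        }

        // FormD splits accented characters into a base character plus combining marks;
        // the filter then keeps only the ASCII code units.
        var normalised = value.Normalize(NormalizationForm.FormD).ToCharArray();
        return new string(normalised.Where(c => (int)c <= 127).ToArray());
    }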
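To see why filtering on `c <= 127` works after normalization, it can help to look at what `FormD` actually produces. The sketch below prints the UTF-16 code units of the decomposed string; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Returns each UTF-16 code unit of the FormD-normalized string as "U+XXXX".
    public static string[] DecomposedCodeUnits(string value)
    {
        return value.Normalize(NormalizationForm.FormD)
                    .Select(c => "U+" + ((int)c).ToString("X4"))
                    .ToArray();
    }

    public static void Main()
    {
        // "ū" (U+016B) becomes "u" (U+0075) followed by a combining macron (U+0304).
        Console.WriteLine(string.Join(" ", DecomposedCodeUnits("\u016B"))); // U+0075 U+0304

        // "Ł" (U+0141) has no canonical decomposition, so it stays a single non-ASCII unit.
        Console.WriteLine(string.Join(" ", DecomposedCodeUnits("\u0141"))); // U+0141
    }
}
```

The combining macron lies outside the ASCII range, so the filter removes it and keeps the base letter. A character that does not decompose at all, such as Ł, is removed entirely.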

edited Mar 28, 2013 at 14:51

answered Mar 28, 2013 at 13:59

Erik Schierboom

7 Comments

Konrad Rudolph Over a year ago

I dislike the c <= 127 hack, it’s unnecessary. But yes, that’s the gist of it.

Erik Schierboom Over a year ago

Well, otherwise you could have returned a string that contains characters that fall outside the ASCII range, right?

Konrad Rudolph Over a year ago

Look at the question I marked this one as a duplicate of. The “right” way is to look at the Unicode category and drop only the non-spacing (combining) diacritic marks, retaining everything else. But to be honest that’s probably way less efficient and in my (admittedly limited) understanding of Unicode, your answer always yields the correct result.

Erik Schierboom Over a year ago

Sorry, I missed the duplicate question part. You are right of course.

net_prog Over a year ago

It works, with one caveat: characters that cannot be mapped are dropped. For example, "Łukasz" becomes "ukasz". The method used in the "duplicate of" question leaves such characters in the output. So it is probably a good idea to combine the two methods and add a bool parameter that controls whether such characters are kept or skipped.

SecurityMatt · Answer · 2013-03-30 13:30:42Z

0

In general, it is not possible to convert Unicode to ASCII because ASCII is a subset of Unicode.

That being said, it is possible to convert characters within the ASCII subset of Unicode to ASCII.

In C#, generally there's no need to do the conversion, since all strings are Unicode by default anyway, and all components are Unicode-aware, but if you must do the conversion, use the following:

    string myString = "SomeString";
    byte[] asciiString = System.Text.Encoding.ASCII.GetBytes(myString);
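For comparison with what the question asks for, here is a small sketch of what this call does to non-ASCII input. With the default `Encoding.ASCII` instance, each character that cannot be encoded is replaced by `?` rather than mapped to a similar-looking letter. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class AsciiEncodingDemo
{
    // Encodes to ASCII bytes and decodes again, so the effect is visible as text.
    public static string RoundTrip(string value)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(value);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("SomeString"));     // SomeString
        Console.WriteLine(RoundTrip("\u016B\u012B"));   // ūī -> ??
    }
}
```

So the encoding call alone yields `??` for `ūī`, not `ui`. The diacritics have to be stripped first, for example via normalization.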

edited Mar 30, 2013 at 13:30

answered Mar 28, 2013 at 13:57

SecurityMatt

4 Comments

Konrad Rudolph Over a year ago

This is not what OP meant.

SecurityMatt Over a year ago

@DavinTryon: Can you think of any ASCII characters that aren't contained in, say, UTF-8? I can think of many characters in UTF-8 that aren't in ASCII. For example the character 字 cannot be represented in US-ASCII.

Davin Tryon Over a year ago

Yes, but saying that it is a subset is not correct. UTF-8 (only one of the unicode formats) was explicitly created to be "backwards compatible" with ASCII.

SecurityMatt Over a year ago

@DavinTryon: What definition of subset are you using? Every codepoint in ASCII is contained in Unicode. ASCII is therefore completely contained within Unicode, or in other words, ASCII is a subset of Unicode. That's not to say Unicode predates ASCII, merely that it contains every element in ASCII (after all, that's what subset means).
