C++ Unicode string comparison with confusable characters, e.g. T (U+0054) should be == Τ (U+03A4), etc.

Question

I have Unicode strings that I want to compare, with the following requirement.

Confusable [1] characters should be considered the same character. For example, T (LATIN CAPITAL LETTER T, U+0054) should compare equal to Τ (GREEK CAPITAL LETTER TAU, U+03A4), etc.

[1] Example: http://unicode.org/cldr/utility/confusables.jsp?a=TESTt&r=None

http://www.unicode.org/Public/security/revision-03/confusablesSummary.txt

I will use the above file to write the code, but if a free library already does this, I would prefer to use it.

My idea is for the code to create a temporary ustring in which every confusable character is replaced with the corresponding Latin character.

In the real program I will be testing 10x5000x10000 strings, each containing one word.

Test program:

    std::locale::global(std::locale(""));
    std::cout.imbue(std::locale());
    Glib::ustring s1, s2;
    s1 = "TEST";
    s2 = "TΕST";
    // normalize() returns the normalized string; it does not modify in place
    s1 = s1.normalize(Glib::NORMALIZE_NFKD);
    s2 = s2.normalize(Glib::NORMALIZE_NFKD);
    std::cout << "1->true, 0->false (s1==s2) => " << (s1 == s2) << "\n";

Test program output:

    1->true, 0->false (s1==s2) => 0

Ubuntu locale command output:

    Ubuntu 12.04 64 bit>$ locale
    LANG=en_US.UTF-8
    LANGUAGE=
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

Thank you for your time!

When 160 characters don't suffice, Twitter users start posting on Stack Overflow... :-S
– Kerrek SB, Commented Sep 18, 2012 at 15:32
This looks like a simple matter of parsing the data file, building up a map from input characters to replacement characters, and applying that to the text file.
– Kerrek SB, Commented Sep 18, 2012 at 15:33
From "I will use the above file in order to make the code, but if there are already any free libraries I would prefer to use it" I think the question is "Are there any free libraries which provide the functionality I intend to implement?"
– JoeG, Commented Sep 18, 2012 at 15:42
@JoeGauterin Yes, you are right. A good candidate library would be ICU, with uspoof_getSkeleton ("Get the 'skeleton' for an identifier string"). From the ICU documentation on identifier skeletons: "A skeleton is a transformation of an identifier, such that all identifiers that are confusable with each other have the same skeleton. Using skeletons, it is possible to build a dictionary data structure for a set of identifiers, and then quickly test whether a new identifier is confusable with an identifier already in the set. The uspoof_getSkeleton() family of functions will produce the skeleton from an identifier."
– user1675224, Commented Sep 18, 2012 at 15:48
This isn't as simple as it sounds if you want to use the entire database from the Unicode Consortium: there isn't a 1-to-1 mapping between confusable characters. For example, U+2474 PARENTHESIZED DIGIT ONE "⑴" can be confused with the three-character sequence U+0028 U+0031 U+0029 "(1)".
– Adam Rosenfield, Commented Sep 18, 2012 at 15:50
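
The map-based approach discussed in these comments can be sketched as follows. This is only an illustration, assuming the text has already been decoded to UTF-32: the names `ConfusableMap`, `sample_confusables` and `fold_confusables` are hypothetical, and the three table entries are a hand-written excerpt rather than the parsed data file. Mapping each code point to a string instead of a single character covers the one-to-many case such as "⑴" to "(1)".

```cpp
#include <string>
#include <unordered_map>

// Hypothetical table type: one code point maps to a replacement sequence,
// because some confusables expand to several code points.
using ConfusableMap = std::unordered_map<char32_t, std::u32string>;

// Hand-written excerpt for illustration only. A real table would be built
// by parsing the Unicode confusables data file.
inline ConfusableMap sample_confusables() {
    return {
        {U'\u03A4', U"T"},    // GREEK CAPITAL LETTER TAU -> LATIN CAPITAL LETTER T
        {U'\u0395', U"E"},    // GREEK CAPITAL LETTER EPSILON -> LATIN CAPITAL LETTER E
        {U'\u2474', U"(1)"},  // PARENTHESIZED DIGIT ONE -> three code points
    };
}

// Replace every code point that has a table entry; copy the rest unchanged.
inline std::u32string fold_confusables(const std::u32string& in,
                                       const ConfusableMap& table) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        auto it = table.find(c);
        if (it != table.end()) {
            out += it->second;
        } else {
            out += c;
        }
    }
    return out;
}
```

Two strings would then be treated as confusable when their folded forms compare equal. Note that this sketch skips the normalization steps that the Unicode skeleton algorithm applies around the mapping, so it is not a substitute for a library implementation.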

ecatmur · Accepted Answer · 2012-09-18 16:33:34Z

As user1675224 says, you should use ICU rather than attempting to roll your own algorithm.

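For example, to use uspoof_areConfusableUTF8 (the UTF-8 counterpart of uspoof_areConfusable, which itself takes UTF-16 UChar buffers):

    #include <unicode/uspoof.h>

    UErrorCode status = U_ZERO_ERROR;
    USpoofChecker *sc = uspoof_open(&status);
    // A length of -1 tells ICU that the string is NUL-terminated
    int32_t result = uspoof_areConfusableUTF8(sc, s1.c_str(), -1,
                                              s2.c_str(), -1, &status);
    // result is non-zero if the two strings are confusable
    uspoof_close(sc);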

If you're comparing large numbers of strings against each other, you should convert them to their skeletons using uspoof_getSkeleton, and put that in a set or hash set.
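
A minimal sketch of that skeleton-set idea, under stated assumptions: `SkeletonIndex` and `toy_skeleton` are hypothetical names, and `toy_skeleton` is a stand-in that folds only one Greek letter so the example runs without ICU. In a real program the function passed to the constructor would presumably wrap a call such as uspoof_getSkeletonUTF8.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Stores the skeleton of every inserted string, so checking a new string
// is one hash lookup instead of a comparison against every stored string.
class SkeletonIndex {
public:
    using SkeletonFn = std::function<std::string(const std::string&)>;

    explicit SkeletonIndex(SkeletonFn fn) : skeleton_(std::move(fn)) {}

    // Returns true if no string with the same skeleton was inserted before.
    bool insert(const std::string& s) {
        return seen_.insert(skeleton_(s)).second;
    }

    // True if a previously inserted string has the same skeleton as s.
    bool contains_confusable(const std::string& s) const {
        return seen_.count(skeleton_(s)) != 0;
    }

private:
    SkeletonFn skeleton_;
    std::unordered_set<std::string> seen_;
};

// Toy stand-in for a real skeleton function (assumption, not ICU):
// folds only U+0395, encoded in UTF-8 as bytes 0xCE 0x95, to Latin 'E'.
inline std::string toy_skeleton(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i] == '\xCE' && s[i + 1] == '\x95') {
            out += 'E';
            ++i;  // skip the second byte of the two-byte sequence
        } else {
            out += s[i];
        }
    }
    return out;
}
```

With this layout each string's skeleton is computed once, which matters when the number of strings is large, as in the question.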

user1675224 · Accepted Answer · 2012-09-19 08:43:13Z

    std::string s1, s2;
    s1 = "TEst";
    s2 = "TΕst";  // the second letter is Greek, not Latin, so it takes 2 bytes in UTF-8
    std::cout << " s1.length()=" << s1.length() << "\n";
    std::cout << " s2.length()=" << s2.length() << "\n";

    UErrorCode status = U_ZERO_ERROR;
    USpoofChecker *sc = uspoof_open(&status);

    // Skeleton of s1
    char p[100];
    int result = uspoof_getSkeletonUTF8(sc, USPOOF_ANY_CASE,
                                        s1.data(), s1.length(), p, 100, &status);
    std::string skeleton1(p, result);
    std::cout << " result in bytes=" << result << " status=" << status << "\n";
    std::cout << " skeleton1=" << skeleton1 << "\n";
    std::cout << "1->true, 0->false (s1==skeleton1) => " << (s1 == skeleton1) << "\n";

    // Skeleton of s2
    char p2[100];
    int result2 = uspoof_getSkeletonUTF8(sc, USPOOF_ANY_CASE,
                                         s2.data(), s2.length(), p2, 100, &status);
    std::string skeleton2(p2, result2);
    std::cout << " result2 in bytes=" << result2 << " status=" << status << "\n";
    std::cout << " skeleton2=" << skeleton2 << "\n";
    std::cout << "1->true, 0->false (s2==skeleton2) => " << (s2 == skeleton2) << "\n";

    // The raw strings differ, but their skeletons match
    std::cout << "1->true, 0->false (s1==s2) => " << (s1 == s2) << "\n";
    std::cout << "1->true, 0->false (skeleton1==skeleton2) => " << (skeleton1 == skeleton2) << "\n";

    uspoof_close(sc);

Output:

     s1.length()=4
     s2.length()=5
     result in bytes=4 status=0
     skeleton1=TEst
    1->true, 0->false (s1==skeleton1) => 1
     result2 in bytes=4 status=0
     skeleton2=TEst
    1->true, 0->false (s2==skeleton2) => 0
    1->true, 0->false (s1==s2) => 0
    1->true, 0->false (skeleton1==skeleton2) => 1

Thank you.
