8

I have to check if the particular string begins with another one. Strings are encoded using utf8, and a comparison should be case insensitive.

I know that this is very similar to the topic "Case insensitive string comparison in C++", but I do not want to use the Boost library and I prefer portable solutions (if that is 'nearly' impossible, I prefer Linux-oriented solutions).

Is it possible in C++11 using its regexp library? Or just using simple string compare methods?

11
  • 1
    Why don't you want to use Boost? (It's practically standard on all development machines nowadays.) Commented May 4, 2012 at 7:30
  • 1
    Try a portable unicode compliant string library such as ICU. Though, I really don't see why you can use one portable solution and not another. Commented May 4, 2012 at 7:33
  • It might seem simple but there are far more issues than you may think. First, there are many different possible representations for visual characters: for instance, the character é has its own code point, but can also be achieved by using the character e followed by the acute accent code point. Your solution needs to be aware of that. Second, case-insensitive comparison usually takes the strings and uppercases/lowercases them. This is actually a locale-sensitive operation: for instance, the German letter ß is the shorthand for ss and its uppercase version is SS. Commented May 4, 2012 at 7:34
  • In other words, you certainly don't want to roll your own library for Unicode string manipulation, and since C++ doesn't have built-in features for that, you'll have to choose your poison. Commented May 4, 2012 at 7:35
  • 2
    OK. Learning to do it manually for educational purposes is a good reason. But once you get to the real world stl/boost are indispensable. Commented May 4, 2012 at 7:41

3 Answers 3

13

The only way I know of that is UTF8/internationalization/culture-aware is the excellent and well-maintained IBM ICU: International Components for Unicode. It's a C/C++ library for *nix or Windows into which a ton of research has gone to provide a culture-aware string library, including case-insensitive string comparison that's both fast and accurate.

IMHO, the two things you should never write yourself unless you're doing a thesis paper are encryption and culture-sensitive string libraries.


1 Comment

I am not sure those are the only two, but I completely agree that it's not something one is likely to get right!
3

Are there any restrictions on what can be in the string you're looking for? If it's user input, and can be any UTF-8 string, the problem is extremely complex. As others have mentioned, one character can have several different representations, so you'd probably have to normalize the strings first. Then: what counts as equal? Should 'E' compare equal to 'é' (as is usual in some circles in French), or not (which would conform to the "official" rules of the Imprimerie nationale)?

For all but the most trivial definitions, rolling your own will represent a significant effort. For this sort of thing, the library ICU is the reference. It contains all that you'll need. Note however that it works on UTF16, not UTF8, so you'll have to convert the strings first, as well as normalizing them. (ICU has support for both.)

4 Comments

It is quite unfortunate that it settled on UTF-16. I wish for a version of the library that would deal with UTF-8 directly instead :x
I have to filter a list of names and surnames, and they can contain characters from any Latin-based alphabet. I thought that every national character (like é) has its own uppercase variant, and should be equal only to it.
@MiniKarol The equivalence between characters is very locale dependent. In French, it's common (although IMHO not good practice) to omit accents on upper case, so 'E' would be the (ambiguous) upper case for 'e', 'é', 'è', 'ë' and 'ê'. In Swiss German, "Ae" is the standard upper case for 'ä'. (Note that the upper case requires two code points, where the lower case may be only a single code point.)
@MiniKarol Not to mention the German 'ß', whose upper case form depends on the word (at least according to Duden). You can ignore accents entirely by converting to Normalization Form D and ignoring the various combining accents in the text; this is a simple (but not too accurate) solution, but it will still not work in cases where the number of code points in upper case and lower case is different.
2

Using the standard library's regex classes, you could do something like the following snippet. Unfortunately, it's not UTF-8 aware. Changing str2 to std::wstring str2 = L"hello World" results in a lot of conversion warnings. Making str1 a std::wstring doesn't work at all, since std::regex doesn't accept wchar_t input (as far as I can see).

    #include <regex>
    #include <iostream>
    #include <string>

    int main()
    {
        // The input strings
        std::string str1 = "Hello";
        std::string str2 = "hello World";

        // Define the regular expression using case-insensitivity
        std::regex regx(str1, std::regex_constants::icase);

        // Only search at the beginning
        std::regex_constants::match_flag_type fl = std::regex_constants::match_continuous;

        // Display some output
        std::cout << std::boolalpha
                  << std::regex_search(str2.begin(), str2.end(), regx, fl)
                  << std::endl;

        return 0;
    }

4 Comments

That's right. But in my answer I said that std::regex doesn't work with wchar_t, so I hoped it would be a valid answer anyway, since it answers the first question with "no".
The thing is, UTF-8 uses a regular std::string, so char under the hood.
So let's say we have a UTF-8 implementation based on chars, which means one UTF-8 character could be represented as a string of 1 to 4 chars. Wouldn't UTF-8 words look like a "normal" string and thus be handled by the regex class? OK, case insensitivity wouldn't work, but theoretically...
It can work with the regex class for a number of regular expressions. For example, <name>(.*?)</name> will happily capture everything in the name tag, UTF-8 or not. However, it will fail as soon as you start using shortcuts. For example, \w is equivalent to [a-zA-Z0-9_] in the default locale, so it only covers ASCII, and will not match letters outside the basic Latin alphabet, accented letters, etc. Also, since it does not know about multibyte encodings, even <name>([^<]{1,26})</name> may not work as desired: it will capture from 1 to 26 bytes, not code points or characters.
