I am trying to detect certain combinations of Unicode characters (like ​) in order to clean up a string. A single Unicode character is detected, but a combination of Unicode characters is not.

I am using these strings to build an HTML page from another HTML page that needs cleaning up. I only want to clean strings that have only these kinds of Unicode characters, which are not even visible when the HTML page is shown in a browser.

Below is the sample code:

    void detect_Unicode(string& str)
    {
        if (!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") == string::npos)
            str.assign(" ");
        return;
    }

Input string:

1. " ​ ​ "
2. "are   there is something    ​ combination ​"
3. "   "
4. "​   ​"
5. "  â â"

Expected Output:

1. " "
2. "are   there is something    ​ combination ​"
3. " "
4. " "
5. " "

Please let me know other ways too.

  • If you can, use std::wstring Commented Jul 6, 2018 at 13:22
  • std::string doesn't contain Unicode characters but "encoded" bytes (possibly UTF-8), so for multibyte characters you have to use std::search instead of find_first_not_of. Commented Jul 6, 2018 at 13:22
  • @PaulSanders: wchar_t is not guaranteed to be 2 bytes, and even in that case a Unicode character might need several wchar_t units. Commented Jul 6, 2018 at 13:32
  • @Jarod42 Can you explain how I can use std::search with a string? Commented Jul 6, 2018 at 13:33
  • @Jarod42 "wchar is not guaranteed to be 2": I don't think I ever claimed that it was. Commented Jul 6, 2018 at 13:37
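
On the std::search question in the comments above, here is a minimal sketch of the byte-level approach, assuming the input really is UTF-8. The helper names (strip_sequence, is_blank_utf8, clean_up) are made up for illustration, and the choice of invisible characters (U+00A0 no-break space and U+200B zero-width space) is an assumption rather than something stated in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: erase every occurrence of the byte sequence `seq` from `s`.
std::string strip_sequence(std::string s, const std::string& seq)
{
    assert(!seq.empty());  // an empty pattern would match at every position
    auto it = s.begin();
    while ((it = std::search(it, s.end(), seq.begin(), seq.end())) != s.end())
        it = s.erase(it, it + seq.size());  // erase returns the next valid iterator
    return s;
}

// Hypothetical helper: true if `s` holds nothing but ASCII whitespace and the
// UTF-8 encodings of U+00A0 (C2 A0) and U+200B (E2 80 8B). Assumed character set.
bool is_blank_utf8(std::string s)
{
    s = strip_sequence(std::move(s), "\xC2\xA0");
    s = strip_sequence(std::move(s), "\xE2\x80\x8B");
    return s.find_first_not_of(" \t\n\r\f\v") == std::string::npos;
}

// Collapse a non-empty string made up only of such characters to one space.
void clean_up(std::string& str)
{
    if (!str.empty() && is_blank_utf8(str))
        str.assign(" ");
}
```

The multibyte sequences are matched as whole units, which is what find_first_not_of cannot do because it compares one byte at a time.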

1 Answer

OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).

On that basis, I humbly submit this:

    #include <string>
    #include <codecvt>
    #include <locale>

    std::string narrow (const std::wstring& ws)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.to_bytes (ws);
    }

    std::wstring widen (const std::string& s)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.from_bytes (s);
    }

    std::string detect_Unicode (const std::string& s)
    {
        std::wstring ws = widen (s);
        if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
            return " ";
        return s;
    }

    #include <iostream>

    int main ()
    {
        std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
        std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
        std::cout << "1.\t\"" << detect_Unicode (u8" ​ ​ ") << "\"\n";
        std::cout << "2.\t\"" << detect_Unicode (u8"are   there is something    ​ combination ​") << "\"\n";
        std::cout << "3.\t\"" << detect_Unicode (u8"   ") << "\"\n";
        std::cout << "4.\t\"" << detect_Unicode (u8"​   ​") << "\"\n";
        std::cout << "5.\t\"" << detect_Unicode (u8"  â â") << "\"\n";
    }

Output:

      Â â € ‹

    0. " "
    1. " ​ ​ "
    2. " "
    3. "   "
    4. "​   ​"
    5. "  â â"

Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.

An alternative, slightly radical, implementation of detect_Unicode() might be:

    for (auto wide_char : ws)
    {
        if (wide_char > 0xff)
            return " ";
    }
    return s;

But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.

Other notes:

  • std::wstring_convert and the <codecvt> facets such as std::codecvt_utf8 are deprecated in C++17, but since the standard library offers no other obvious choice you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
  • Depending on platform, std::wstring might not be the best choice but it's probably fine. You could also look at std::u16string and std::u32string.
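
If the deprecation is a concern, the conversion can also be hand-rolled. Below is a rough sketch that decodes UTF-8 into a std::u32string and then applies what looks like the check the question intends (a non-empty string containing only invisible characters becomes " "). The names decode_utf8 and detect_invisible are hypothetical, and the character set is an assumption: it uses U+00A0 and U+200B directly, on the guess that the Â, â, € and ‹ in the question's set are those two characters misread as Windows-1252.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points without <codecvt>.
// Malformed or truncated sequences become U+FFFD. Overlong forms and
// surrogates are not rejected, so this is a sketch, not a validator.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;  // stays 0 if `lead` is not a valid lead byte
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(U'\uFFFD'); ++i; }
    }
    return out;
}

// A non-empty string made up only of whitespace-like code points becomes " ";
// anything else is returned unchanged.
std::string detect_invisible(const std::string& s)
{
    // Assumed character set: ASCII whitespace, U+00A0 and U+200B.
    static const std::u32string blanks = U" \t\n\r\f\v\u00A0\u200B";
    const std::u32string cps = decode_utf8(s);
    if (!cps.empty() && cps.find_first_not_of(blanks) == std::u32string::npos)
        return " ";
    return s;
}
```

Because every element of the std::u32string is a whole code point, find_first_not_of behaves the way the original single-byte version was meant to.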

Live demo.

Inspiration taken from here.

6 Comments

This seems good to me, but it does not handle all the cases. For example, for the input std::cout << "1.\t\"" << detect_Unicode (u8" ​ ​ ") << "\"\n"; the output should be 1. " "
The other cases that your code handles can be done simply with this condition: if(!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039")==string::npos) str.assign(" ");
"...its output should be 1. " "": Why? "... can be done simply by this condition": How does that improve things?
In your live demo it works as expected, but for my string it did not; it seems my string is not UTF-8.
This "​" is not being detected for me.
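
One way to find out what the string actually contains is to dump its bytes before deciding on an encoding. This is only a debugging sketch, and hex_bytes is a made-up helper name.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of `s` as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

A zero-width space in UTF-8 shows up as e2 80 8b and a no-break space as c2 a0. If the dump shows c3 a2 e2 82 ac e2 80 b9 instead, the text has probably been decoded as Windows-1252 and re-encoded somewhere upstream, while a lone a0 byte would suggest a single-byte encoding such as Latin-1.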