2

How do I convert ú within a C++ application, where the application receives the character as the percent-encoded UTF-8 bytes %C3%BA, and store it as the Unicode code point equivalent %FA? I just want to know how I would go about writing code to perform this conversion.

  • utfcpp.sourceforge.net? Commented Aug 30, 2013 at 13:48
  • msdn.microsoft.com/en-us/library/dd374130(v=vs.85).aspx ? Commented Aug 30, 2013 at 13:52
  • Just for the record, with regard to your title: UTF-8 is Unicode. The standard way of specifying the code point would be U+00FA (with at least 4 hex digits, but up to 6). Commented Aug 30, 2013 at 13:58
  • You look up the rules for UTF-8, Unicode, URL encoding and so on, and you implement them in code. I don't know any other way to answer the question. It might help you progress if you said specifically where you are stuck. I would break the problem into three steps: URL-decode (convert %xy etc. to character values), UTF-8 to Unicode code point (this converts, for instance, C3 BA to FA; this is the difficult step), and URL-encode (put back the %'s). Each of these steps is simpler than the overall problem, so just pick the easiest and code that one first. Commented Aug 30, 2013 at 14:07
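The three steps described in the last comment can be sketched roughly as follows. This is a minimal illustration, not a vetted implementation: the function names (`percent_decode`, `utf8_decode`, `percent_encode_code_points`, `utf8_url_to_code_point_url`) are made up for this example, the decoder does not reject overlong encodings or surrogates, and it assumes every resulting code point is at most U+00FF so that it fits in a single `%XX`, since the question does not say how larger code points should be written.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Step 1: URL-decode. "%C3%BA" becomes the two raw bytes 0xC3 0xBA.
// Characters outside a %XY escape are copied through unchanged.
std::string percent_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '%') { out.push_back(in[i]); continue; }
        if (i + 2 >= in.size()) throw std::invalid_argument("truncated % escape");
        int hi = hex_value(in[i + 1]);
        int lo = hex_value(in[i + 2]);
        if (hi < 0 || lo < 0) throw std::invalid_argument("malformed % escape");
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;  // skip the two hex digits just consumed
    }
    return out;
}

// Step 2: UTF-8 bytes to code points (1- to 4-byte sequences).
// Assumption: no checks for overlong forms or surrogate code points.
std::vector<char32_t> utf8_decode(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0;
        int extra = 0;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        for (; extra > 0; --extra) {
            if (i >= bytes.size()) throw std::invalid_argument("truncated UTF-8 sequence");
            unsigned char cont = static_cast<unsigned char>(bytes[i++]);
            if ((cont & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
    }
    return out;
}

// Step 3: write each code point back as %XX.
// Assumption: only code points up to U+00FF are supported.
std::string percent_encode_code_points(const std::vector<char32_t>& cps) {
    std::string out;
    char buf[4];  // '%', two hex digits, terminating NUL
    for (char32_t cp : cps) {
        if (cp > 0xFF) throw std::out_of_range("code point does not fit in one %XX");
        std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned int>(cp));
        out += buf;
    }
    return out;
}

// All three steps chained together.
std::string utf8_url_to_code_point_url(const std::string& in) {
    return percent_encode_code_points(utf8_decode(percent_decode(in)));
}
```

Under these assumptions, `utf8_url_to_code_point_url("%C3%BA")` yields `%FA`. Note that this sketch re-encodes every character, including plain ASCII, as `%XX`; passing unreserved characters through unchanged would be a small change to step 3.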

2 Answers

8

I just wrote some code to do this yesterday...

I'm not saying this is the "perfect" way to do this, but it appears to work for all the test cases I've run through it (I wrote both directions for that purpose).

I'll leave it to you to translate "%NN" to an integer value.

#include <iostream>
#include <deque>

std::deque<int> unicode_to_utf8(int charcode)
{
    std::deque<int> d;
    if (charcode < 128)
    {
        d.push_back(charcode);
    }
    else
    {
        int first_bits = 6;
        const int other_bits = 6;
        int first_val = 0xC0;
        int t = 0;
        while (charcode >= (1 << first_bits))
        {
            t = 128 | (charcode & ((1 << other_bits) - 1));
            charcode >>= other_bits;
            first_val |= 1 << (first_bits);
            first_bits--;
            d.push_front(t);
        }
        t = first_val | charcode;
        d.push_front(t);
    }
    return d;
}

int utf8_to_unicode(std::deque<int> &coded)
{
    int charcode = 0;
    int t = coded.front();
    coded.pop_front();

    if (t < 128)
    {
        return t;
    }
    int high_bit_mask = (1 << 6) - 1;
    int high_bit_shift = 0;
    int total_bits = 0;
    const int other_bits = 6;

    while ((t & 0xC0) == 0xC0)
    {
        t <<= 1;
        t &= 0xff;
        total_bits += 6;
        high_bit_mask >>= 1;
        high_bit_shift++;
        charcode <<= other_bits;
        charcode |= coded.front() & ((1 << other_bits) - 1);
        coded.pop_front();
    }
    charcode |= ((t >> high_bit_shift) & high_bit_mask) << total_bits;
    return charcode;
}

int main()
{
    int charcode;

    for (;;)
    {
        std::cout << "Enter unicode value:" << std::endl;
        std::cin >> charcode;

        auto x = unicode_to_utf8(charcode);

        for (auto c : x)
        {
            std::cout << "\\x" << std::hex << c << " ";
        }
        std::cout << std::endl;

        int c = utf8_to_unicode(x);
        std::cout << "reversed:" << std::dec << c << std::hex << " in hex:" << c << std::endl;
    }
}

6 Comments

OP wants to go the other way, no?
The code contains BOTH directions - from a deque to unicode and from unicode to deque. It just doesn't happen to have the "required" code FIRST, I wasn't going to reformat my code...
Just a little note regarding naming; I suggest the names utf32_to_utf8 and utf8_to_utf32; the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16.
Yes, name isn't great, the REAL code that I use this in (in PHP, the above was just a hack to test the principle) is called utf8_to_html, and produces a "&#x1234;" string.
@MatsPetersson Thanks for the code above. I'm struggling to integrate it into my code as I'm new to C++. How would the string %C3%BA be converted using this code?
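One way to connect a string such as %C3%BA to the answer's code is to parse each "%NN" escape into an integer and collect the values in the `std::deque<int>` that `utf8_to_unicode` takes. The helper below is a hypothetical sketch (the name `percent_string_to_deque` is not from the answer), and it assumes that any character outside a `%NN` escape should be passed through as its own byte value.

```cpp
#include <cctype>
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns "%C3%BA" into the deque {0xC3, 0xBA}.
std::deque<int> percent_string_to_deque(const std::string& text)
{
    std::deque<int> bytes;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (text[i] == '%')
        {
            // The two characters after '%' must both be hex digits.
            const std::string digits = text.substr(i + 1, 2);
            if (digits.size() != 2 ||
                !std::isxdigit(static_cast<unsigned char>(digits[0])) ||
                !std::isxdigit(static_cast<unsigned char>(digits[1])))
            {
                throw std::invalid_argument("malformed % escape");
            }
            // Base 16 makes std::stoi read "C3" as 195.
            bytes.push_back(std::stoi(digits, nullptr, 16));
            i += 3;
        }
        else
        {
            // Assumption: unescaped characters are kept as single bytes.
            bytes.push_back(static_cast<unsigned char>(text[i]));
            i += 1;
        }
    }
    return bytes;
}
```

With the answer's functions in the same file, `auto d = percent_string_to_deque("%C3%BA"); int cp = utf8_to_unicode(d);` should leave 250 (0xFA) in `cp`, going by a hand trace of the answer's decoder. Since `utf8_to_unicode` pops the bytes it consumes, it can be called repeatedly until the deque is empty to decode a longer string.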
2

This is actually in the standard library:

#include <string>
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;

int main()
{
    std::string utf8_bytes = "ú";
    std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(utf8_bytes);
    return 0;
}

The other way around is done with conv_utf8_utf32.to_bytes.
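For completeness, the reverse direction might look like the sketch below. The wrapper name `code_points_to_utf8` is only for illustration; the converter type is the same one used above. Be aware that `std::wstring_convert` and `<codecvt>` are deprecated as of C++17, so newer compilers may emit warnings for this code.

```cpp
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert
#include <string>

// Illustrative wrapper: encodes UTF-32 code points as UTF-8 bytes.
std::string code_points_to_utf8(const std::u32string& code_points)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(code_points);
}
```

Passing a one-element `std::u32string` holding U+00FA gives back the two bytes 0xC3 0xBA.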

Example with printing in your %hex format using printf:

#include <string>
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert
#include <cstdio>

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;

int main()
{
    std::string utf8_bytes = "ú";

    // print the bytes in %hex format
    for (char byte : utf8_bytes) {
        // %02X zero-pads, so bytes below 0x10 still print as two digits
        printf("%%%02X", static_cast<unsigned char>(byte));
    }
    printf("\n");

    std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(utf8_bytes);

    // print the code points in %hex format
    for (char32_t chr : unicode_codepoints) {
        printf("%%%02X", static_cast<unsigned int>(chr));
    }
    printf("\n");

    return 0;
}
