How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

Question

I'm modifying a C++ program that processes text in ISO Latin 1 so that it stores its data in an SQLite database.
The problem is that SQLite works in UTF-8, and the Java modules that use the same database also work in UTF-8.

I want a way to convert the ISO Latin 1 characters to UTF-8 before storing them in the database. It needs to work on Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these two charsets.

How would I do that?

I would have suggested using GLib's wrapper for iconv, which converts easily between any two charsets, but if you are sure that you only need latin1->utf8, then @Evan's solution below is the simplest. Either way, ICU seems way too big for this.
– davka, Commented Apr 7, 2011 at 19:59

Community · Accepted Answer · 2017-05-23 11:53:55Z

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

    uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
    if (ch < 0x80) {
        append(ch);
    } else {
        append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
        append(0x80 | (ch & 0x3f));
    }

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
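
The question also asks for the way back. A minimal sketch of the reverse direction, assuming the input is UTF-8 that only contains characters Latin-1 can represent (the function name `utf8_to_iso_8859_1` and the choice to throw on anything else are assumptions of this sketch, not part of the answer above):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch of the reverse direction: UTF-8 back to ISO-8859-1.
// Throws on malformed input and on code points above U+00FF,
// which have no Latin-1 representation.
std::string utf8_to_iso_8859_1(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(in[i]);
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));  // ASCII passes through
        } else if ((ch == 0xC2 || ch == 0xC3) && i + 1 < in.size()) {
            // Only lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
            unsigned char next = static_cast<unsigned char>(in[++i]);
            if ((next & 0xC0) != 0x80) {
                throw std::invalid_argument("malformed UTF-8 sequence");
            }
            // Two payload bits from the lead byte, six from the next.
            out.push_back(static_cast<char>(((ch & 0x03) << 6) | (next & 0x3F)));
        } else {
            throw std::invalid_argument(
                "malformed UTF-8 or not representable in ISO-8859-1");
        }
    }
    return out;
}
```

Whether to throw, skip, or substitute a placeholder such as '?' for unrepresentable characters depends on the application; throwing is just the simplest choice to show here.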

As I said, if it's real Latin1. Windows CP1252 (sometimes incorrectly called Latin1) has additional characters (in a range reserved in ISO-8859 for control characters), most notably, versions of opening and closing quotes.
(ch & 0xc0) >> 6 is redundant. You can just write ch >> 6.
I really can't understand the table at the Wikipedia link. If I have the Latin-1 character Ç, it falls in the range that needs up to 11 bits, but how does the formula above work for it?
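
To make the bit arithmetic concrete for Ç (0xC7 in Latin-1): code points from U+0080 to U+07FF use the two-byte pattern 110xxxxx 10xxxxxx, which holds 11 payload bits. 0xC7 is 11000111 in binary, or 00011 000111 when padded to 11 bits. The top five bits go into the first byte (0xC0 | 0x03 = 0xC3) and the low six bits into the second (0x80 | 0x07 = 0x87). A small sketch of the same calculation (the helper name `encode_two_byte` is made up for this illustration):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: splits one Latin-1 byte in the range 0x80..0xFF
// into the two bytes of its UTF-8 encoding.
std::pair<std::uint8_t, std::uint8_t> encode_two_byte(std::uint8_t ch) {
    // Lead byte: marker bits 110xxxxx plus the top two bits of ch.
    std::uint8_t lead = static_cast<std::uint8_t>(0xC0 | (ch >> 6));
    // Continuation byte: marker bits 10xxxxxx plus the low six bits.
    std::uint8_t trail = static_cast<std::uint8_t>(0x80 | (ch & 0x3F));
    return {lead, trail};
}
```

Because a Latin-1 byte has only eight bits, only the top two of them ever reach the lead byte, so the lead byte is always 0xC2 or 0xC3.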

Lord Raiden · Accepted Answer · 2016-10-05 21:19:03Z

For C++ I use this:

    #include <cstdint>
    #include <string>

    std::string iso_8859_1_to_utf8(const std::string &str)
    {
        std::string strOut;
        for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
        {
            uint8_t ch = *it;
            if (ch < 0x80) {
                strOut.push_back(ch);                  // ASCII: copy unchanged
            }
            else {
                strOut.push_back(0xc0 | ch >> 6);      // lead byte
                strOut.push_back(0x80 | (ch & 0x3f));  // continuation byte
            }
        }
        return strOut;
    }

This solution does seem to work for me on Unix systems but somehow does not seem to work on Windows with Visual Studio. Does anyone have any ideas?

Seva Alekseyev · Accepted Answer · 2011-04-07 19:16:44Z

If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.

Compose a static translation table (char to UTF-8 sequence) and put together your own translation routine. Depending on what you use for string storage (char buffers, std::string, or something else) it will look somewhat different, but the idea is the same: walk through the source string and replace each character whose code is over 127 with its UTF-8 counterpart. Since this can increase the string length, doing it in place would be rather inconvenient. As a refinement, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.

If it's real Latin1, the translation table is trivial, Latin1 maps directly to the first 256 Unicode codepoints.
@ninjalj, this answer doesn't propose translating to codepoints but to UTF-8 sequences. Each sequence will be either one or two bytes.
@Mark Ransom: it's the same, it's trivial to generate the table without having to look at loads of character tables.
@Mark: which, incidentally, you would have to do to translate from/to CP1252
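
For the CP1252 case, the only part that differs from ISO-8859-1 is the range 0x80..0x9F, so the table stays small. A sketch of such a table, mapping bytes to Unicode code points (the names `kCp1252High` and `cp1252_to_code_point` are made up here, and using U+FFFD for the five bytes CP1252 leaves unassigned is a choice of this sketch):

```cpp
#include <cstdint>

// Code points for CP1252 bytes 0x80..0x9F. 0xFFFD (the Unicode
// replacement character) marks the five bytes CP1252 leaves unassigned;
// that placeholder is a choice made for this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Hypothetical helper: maps one CP1252 byte to its Unicode code point.
std::uint16_t cp1252_to_code_point(std::uint8_t ch) {
    if (ch >= 0x80 && ch <= 0x9F) {
        return kCp1252High[ch - 0x80];
    }
    return ch;  // everything else matches ISO-8859-1
}
```

Note that most of these code points lie above U+07FF, so they need three-byte UTF-8 sequences rather than the two-byte formula that is enough for real Latin-1.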

Nemanja Trifunovic · Accepted Answer · 2011-04-07 19:31:47Z

If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.

In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.

Of course, it needs to be real ISO Latin 1, not Windows CP1252.

3 Comments

Seva Alekseyev Over a year ago

Two translations instead of one?

Nemanja Trifunovic Over a year ago

One is not a translation - code units of ISO Latin 1 are exactly the same as the ones for UTF16, just of the different size. That's why I say he can probably supply the Latin1 string directly to the utf16to8 function.

MaestroMaus Over a year ago

I tried this solution. It failed on special characters (such as ë) sadly. Nice theory though.
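
A sketch of the widening step, with one detail that is easy to get wrong: each byte should go through `unsigned char` first. Where plain `char` is signed, converting a byte such as 0xEB (ë) straight to a 16-bit unit sign-extends it to 0xFFEB instead of 0x00EB. That could be one reason for failures on characters like ë, though the thread does not confirm what went wrong in that case. The function name `latin1_to_utf16` is made up for this example:

```cpp
#include <string>

// Widen ISO-8859-1 bytes to UTF-16 code units. Going through
// unsigned char keeps bytes >= 0x80 in the range 0x0080..0x00FF.
std::u16string latin1_to_utf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char c : in) {
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

The resulting string can then be handed to a UTF-16 to UTF-8 converter such as the utf16to8 function mentioned above.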
