5

I'm changing a C++ program, which processes text in ISO Latin 1 format, to store its data in an SQLite database.
The problem is that SQLite works in UTF-8, and the Java modules that use the same database also work in UTF-8.

I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing in the database. I need it to work in Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these 2 charsets.

How would I do that?

2
  • 2
    Are you using Windows Latin-1 or true ISO Latin 1? Commented Apr 7, 2011 at 19:10
  • I would have suggested using Glib's wrapper for iconv, which converts easily between any 2 charsets, but if you are sure that you need only latin1->utf8, then @Evan's solution below is the simplest. Either way, ICU seems way too big for this. Commented Apr 7, 2011 at 19:59

4 Answers 4

18

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

    uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
    if(ch < 0x80) {
        append(ch);
    } else {
        append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
        append(0x80 | (ch & 0x3f));
    }

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
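Since the question asks for conversion back and forth, here is a minimal sketch of the reverse direction. The function name utf8_to_iso_8859_1 and the choice to throw on input that has no Latin-1 equivalent are assumptions for illustration, not part of the original answer. It relies on the fact that U+0080 to U+00FF always encode as a two-byte sequence whose lead byte is 0xC2 or 0xC3.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 back to ISO-8859-1.
// Throws std::invalid_argument if the input is malformed or contains
// a code point above U+00FF (which Latin-1 cannot represent).
std::string utf8_to_iso_8859_1(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i]);
        if (b0 < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(b0));
        } else if (b0 == 0xC2 || b0 == 0xC3) {
            // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
            if (i + 1 >= in.size()) {
                throw std::invalid_argument("truncated UTF-8 sequence");
            }
            const std::uint8_t b1 = static_cast<std::uint8_t>(in[++i]);
            if ((b1 & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid continuation byte");
            }
            // Low 2 bits of the lead byte plus low 6 bits of the trail byte.
            out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
        } else {
            throw std::invalid_argument("not representable in ISO-8859-1");
        }
    }
    return out;
}
```

For example, Ç is 0xC7 in Latin-1 and the two bytes 0xC3 0x87 in UTF-8: the lead byte contributes the bits 11 and the trail byte contributes 000111, which together give 11000111 = 0xC7. Depending on the application, substituting a placeholder such as '?' instead of throwing may be preferable.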


6 Comments

As I said, if it's real Latin1. Windows CP1252 (sometimes incorrectly called Latin1) has additional characters (in a range reserved in ISO-8859 for control characters), most notably, versions of opening and closing quotes.
Oh, and there's no below on SO ;-P
(ch & 0xc0) >> 6 is redundant. You can just write ch >> 6.
@dan04: can't ever hurt to be explicit.
I really can't understand the table on the Wikipedia link. If I have the Latin-1 character Ç, that falls in the range below 11 bits, but how does the formula above work for it?
2

For C++ I use this:

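    #include <cstdint>
    #include <string>

    std::string iso_8859_1_to_utf8(const std::string &str) {
        std::string strOut;
        for (std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
            uint8_t ch = *it; // read as unsigned so bytes >= 0x80 are not negative
            if (ch < 0x80) {
                strOut.push_back(ch); // ASCII: copied unchanged
            } else {
                strOut.push_back(0xc0 | ch >> 6);     // lead byte: 0xC2 or 0xC3
                strOut.push_back(0x80 | (ch & 0x3f)); // continuation byte
            }
        }
        return strOut;
    }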

1 Comment

This solution does seem to work for me on Unix systems but somehow does not seem to work on Windows with Visual Studio. Does anyone have any ideas?
1

If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.

Compose a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or something else) it will look somewhat different, but the idea is the same: scan through the source string and replace each character whose code is over 127 with its UTF-8 counterpart string. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, and pass two performs the translation.
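The table idea with two passes could be sketched roughly as follows. All names here (Utf8Seq, build_table, latin1_to_utf8_table) are hypothetical, and the sketch assumes true ISO-8859-1 input and std::string storage. The table is generated in code rather than typed by hand, which works only because Latin-1 maps directly onto the first 256 code points; a CP1252 table would need explicit entries for 0x80 to 0x9F.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical table entry: the UTF-8 bytes for one Latin-1 character.
struct Utf8Seq {
    char bytes[2];
    unsigned char len; // 1 for ASCII, 2 for 0x80..0xFF
};

// Build the 256-entry table once (assumes true ISO-8859-1, not CP1252).
static std::array<Utf8Seq, 256> build_table() {
    std::array<Utf8Seq, 256> t{};
    for (unsigned c = 0; c < 256; ++c) {
        if (c < 0x80) {
            t[c].bytes[0] = static_cast<char>(c);
            t[c].len = 1;
        } else {
            t[c].bytes[0] = static_cast<char>(0xC0 | (c >> 6));
            t[c].bytes[1] = static_cast<char>(0x80 | (c & 0x3F));
            t[c].len = 2;
        }
    }
    return t;
}

std::string latin1_to_utf8_table(const std::string &in) {
    static const std::array<Utf8Seq, 256> table = build_table();

    // Pass 1: compute the exact size of the output.
    std::size_t n = 0;
    for (char ch : in) {
        n += table[static_cast<unsigned char>(ch)].len;
    }

    // Pass 2: translate into a buffer that never needs to grow.
    std::string out;
    out.reserve(n);
    for (char ch : in) {
        const Utf8Seq &s = table[static_cast<unsigned char>(ch)];
        out.append(s.bytes, s.len);
    }
    return out;
}
```

Indexing through unsigned char matters: on platforms where plain char is signed, bytes at or above 0x80 would otherwise become negative indices.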

4 Comments

If it's real Latin1, the translation table is trivial, Latin1 maps directly to the first 256 Unicode codepoints.
@ninjalj, this answer doesn't propose translating to codepoints but to UTF-8 sequences. Each sequence will be either one or two bytes.
@Mark Ransom: it's the same, it's trivial to generate the table without having to look at loads of character tables.
@Mark: which, incidentally, is what you would have to do to translate from/to CP1252
0

If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.

In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.

Of course, it needs to be real ISO Latin 1, not Windows CP 1252.
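A rough sketch of the widen-then-convert idea, using only the standard library. The function names latin1_to_utf16 and utf16_bmp_to_utf8 are hypothetical; the second one is a hand-written stand-in for the library step (UTF8-CPP's utf16to8 would take its place) and deliberately ignores surrogate pairs, which cannot occur in widened Latin-1. Note the cast through unsigned char: where plain char is signed, widening a byte such as 0xEB directly would sign-extend it to 0xFFEB, which is one possible (unconfirmed) reason for this approach failing on accented characters.

```cpp
#include <string>

// Widen each Latin-1 byte to one UTF-16 code unit.
// The cast through unsigned char avoids sign extension of bytes >= 0x80.
std::u16string latin1_to_utf16(const std::string &in) {
    std::u16string out;
    out.reserve(in.size());
    for (char ch : in) {
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    }
    return out;
}

// Hypothetical stand-in for a library conversion. Encodes each code unit
// on its own, so it is only valid for input without surrogate pairs.
std::string utf16_bmp_to_utf8(const std::u16string &in) {
    std::string out;
    for (char16_t cu : in) {
        const unsigned c = cu;
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Usage would be utf16_bmp_to_utf8(latin1_to_utf16(text)). The intermediate UTF-16 string is the extra copy mentioned above.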

3 Comments

Two translations instead of one?
One is not a translation: the code units of ISO Latin 1 have exactly the same values as the corresponding UTF-16 code units, just at a different size. That's why I say he can probably supply the Latin1 string directly to the utf16to8 function.
I tried this solution. It failed on special characters (such as ë) sadly. Nice theory though.
