
I'm working with machines that are 10+ years old and use ISO 8859-7 to represent Greek characters with a single byte each. I need to catch those characters and convert them to UTF-8 in order to inject them into a JSON payload to be sent via HTTPS. Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or the like.

Example: "ΟΛΑ": I get the char values [ 0xcf, 0xcb, 0xc1 ], and I need to write the string "\u039F\u039B\u0391".

PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".

  • Are you basically asking "what is the library I could use to convert one encoding into another, compatible with my ancient compiler?". This is kind of off-topic here, check softwarerecs.stackexchange.com Commented Jul 8, 2020 at 14:55
  • I'd like to implement this without external libraries. Commented Jul 8, 2020 at 14:59
  • It's not possible "in general", since encoding mappings are not fixed. Of course the hacky ad-hoc solution of just mapping 256 chars from the ISO encoding to UTF-8 would work. Unless you also want to do the reverse conversion. Commented Jul 8, 2020 at 15:12
  • "I'd like to implement this without external libraries" - Does libiconv count? It's so common that the functions are even included in gnu's libc so you don't even have to link with extra libraries on linux for example. Commented Jul 8, 2020 at 16:38

3 Answers


Given that there are so few values to map, a simple solution is to use a lookup table.

Pseudocode:

id_offset = 0x80     // 0x00 .. 0x7F same in UTF-8
c1_offset = 0x20     // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset

table = [
    u8"\u00A0",  // 0xA0
    u8"‘",       // 0xA1
    u8"’",
    u8"£",
    u8"€",
    u8"₯",
    // ... Refer to ISO 8859-7 for the full list of characters.
]

let S be the input string
let O be an empty output string
for each char C in S
    reinterpret C as unsigned char U
    if U less than id_offset
        // same in both encodings
        append C to O
    else if U less than table_offset
        // control code
        append char '\xC2' to O  // lead byte
        append char C to O
    else
        append string table[U - table_offset] to O

All that said, I recommend saving some time by using a library instead.
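For completeness, a C++ sketch of the lookup-table approach, assuming a partial table (only the first four entries of the 0xA0-0xFF range are shown here; the real table has 96 entries):

```cpp
#include <cstddef>
#include <string>

// Partial table: UTF-8 byte sequences for ISO 8859-7 bytes starting at
// 0xA0. Only four entries shown; a full table runs to 0xFF.
static const char* const kTable[] = {
    "\xC2\xA0",     // 0xA0 NO-BREAK SPACE
    "\xE2\x80\x98", // 0xA1 LEFT SINGLE QUOTATION MARK
    "\xE2\x80\x99", // 0xA2 RIGHT SINGLE QUOTATION MARK
    "\xC2\xA3",     // 0xA3 POUND SIGN
};

std::string to_utf8(const std::string& in) {
    const size_t kN = sizeof kTable / sizeof kTable[0];
    std::string out;
    for (size_t k = 0; k < in.size(); ++k) {
        unsigned char u = static_cast<unsigned char>(in[k]);
        if (u < 0x80) {
            out += in[k];               // ASCII: identical in UTF-8
        } else if (u < 0xA0) {
            out += '\xC2';              // C1 controls: 0xC2 lead byte,
            out += in[k];               // then the byte itself
        } else if (u - 0xA0 < kN) {
            out += kTable[u - 0xA0];    // table lookup
        } else {
            out += '?';                 // entry not in this partial table
        }
    }
    return out;
}
```

With the table filled in to 0xFF, the last branch disappears and every input byte has a fixed-size lookup.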


6 Comments

This could be a low-effort solution to fall back on when I'm out of options. I'm keeping this as a backup plan.
It's a good working solution. I just generated a std::unordered_map<unsigned char, std::string_view> using libiconv. That map can then be included separately without using iconv or any other lib.
@TedLyngmo Clever use of metaprogramming. I like it. Although I would prefer an array table in this case.
Thanks! I'll add a godbolt link to the result as a comment here when I get to a computer. An array is much better in this case, I agree.
@afe Here's the table you'll need: godbolt.org/z/5zanvc There are three ? (\x3f) in the high part. Those are codepoints not used in iso-8859-7.

One way could be to use the Posix libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc so no extra linkage is needed there. On your old machines you may need to install libiconv but I doubt it.

Converting may be as simple as this:

#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// A wrapper for the iconv functions
class Conv {
public:
    // Open a conversion descriptor for the two selected character sets
    Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
        if(cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error(std::strerror(errno));
    }
    Conv(const Conv&) = delete;
    ~Conv() { iconv_close(cd); }

    // the actual conversion function
    std::string convert(const std::string& in) {
        const char* inbuf = in.c_str();
        size_t inbytesleft = in.size();

        // make the "out" buffer big to fit whatever we throw at it and set pointers
        std::string out(inbytesleft * 6, '\0');
        char* outbuf = out.data();
        size_t outbytesleft = out.size();

        // the const_cast shouldn't be needed but my "iconv" function declares it
        // "char**" not "const char**"
        size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
                                         &inbytesleft, &outbuf, &outbytesleft);

        if(non_rev_converted == static_cast<size_t>(-1)) {
            // here you can add misc handling like replacing erroneous chars
            // and continue converting etc.
            // I'll just throw...
            throw std::runtime_error(std::strerror(errno));
        }
        // shrink to keep only what we converted
        out.resize(outbuf - out.data());
        return out;
    }

private:
    iconv_t cd;
};

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");

    // create a string from the ISO-8859-7 data
    unsigned char data[]{0xcf, 0xcb, 0xc1};
    std::string iso88597_str(std::begin(data), std::end(data));

    auto utf8 = cvt.convert(iso88597_str);
    std::cout << utf8 << '\n';
}

Output (in UTF-8):

ΟΛΑ 

Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:

Demo



Ok, I decided to do this myself instead of looking for a compatible library. Here's how I did it.

The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default).

I started with "Ο" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30). I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply.

The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as reference, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.

bool iso_to_utf8(char* in){
    bool wasISO=false;
    if(in == NULL)
        return wasISO;

    // count chars
    int i=strlen(in);
    if(!i)
        return wasISO;

    // create and size new buffer
    // (worst case every char doubles, plus one for the terminator)
    char *out = new char[2*i + 1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*i + 1);

    // ready to start from head of old buffer
    i=0;
    // index for new buffer
    int j=0;

    // for each char in old buffer
    while(in[i]!='\0'){
        if(in[i] >= 0){
            // 0x00..0x7F: already utf8-compliant, take it as it is
            // (assumes char is signed, as it is with GCC on x86)
            out[j++] = in[i];
        }else{
            // it's ISO
            wasISO=true;
            // get plain value
            int val = in[i] & 0xFF;
            // first byte to CF or CE
            out[j++] = val > 0xEF ? 0xCF : 0xCE;
            // second byte to plain value normalized
            out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
        }
        i++;
    }
    // add string terminator
    out[j]='\0';

    // paste into old char array
    // (the caller must guarantee it is large enough to hold the result)
    strcpy(in, out);
    delete[] out;
    return wasISO;
}

3 Comments

Does this work for the iso-8859-7 characters 0xa1, 0xa2, 0xa4, 0xa5 and 0xaf?
Since you asked, I guess it doesn't, but that's out of scope; I only focused on Greek letters, not symbols. Following the steps described it should be easy to add the missing characters.
I didn't test your version but it looks like making 3-byte utf8 sequences won't work. The map I linked to is both accurate for all iso-8859-7 characters and faster.
