
I'm working with machines that are 10+ years old and use ISO 8859-7 to represent Greek characters with a single byte each. I need to catch those characters and convert them to UTF-8 in order to inject them into a JSON payload to be sent via HTTPS. Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or the like.

Example: "ΟΛΑ": I get the char values [ 0xcf, 0xcb, 0xc1 ], and I need to write the string "\u039F\u039B\u0391".

PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".

  • Are you basically asking "what is the library I could use to convert one encoding into another, compatible with my ancient compiler?". This is kind of off-topic here, check softwarerecs.stackexchange.com Commented Jul 8, 2020 at 14:55
  • I'd like to implement this without external libraries. Commented Jul 8, 2020 at 14:59
  • It's not possible "in general", since encoding mappings are not fixed. Of course the hacky ad-hoc solution of just mapping 256 chars from the ISO encoding to UTF-8 would work. Unless you also want to do the reverse conversion. Commented Jul 8, 2020 at 15:12
  • "I'd like to implement this without external libraries" - Does libiconv count? It's so common that the functions are even included in gnu's libc so you don't even have to link with extra libraries on linux for example. Commented Jul 8, 2020 at 16:38

3 Answers


Given that there are so few values to map, a simple solution is to use a lookup table.

Pseudocode:

id_offset = 0x80     // 0x00 .. 0x7F same in UTF-8
c1_offset = 0x20     // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset

table = [
    u8"\u00A0",  // 0xA0
    u8"‘",       // 0xA1
    u8"’",
    u8"£",
    u8"€",
    u8"₯",
    // ... Refer to ISO 8859-7 for the full list of characters.
]

let S be the input string
let O be an empty output string
for each char C in S
    reinterpret C as unsigned char U
    if U less than id_offset
        // same in both encodings
        append C to O
    else if U less than table_offset
        // control code
        append char '\xC2' to O  // lead byte
        append char C to O
    else
        append string table[U - table_offset] to O

All that said, I recommend saving some time by using a library instead.
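For completeness, a C++ sketch of the lookup-table approach, assuming a partial table (only the first four entries of the 0xA0-0xFF range are shown here; the real table has 96 entries):

```cpp
#include <cstddef>
#include <string>

// Partial table: UTF-8 byte sequences for ISO 8859-7 bytes starting at
// 0xA0. Only four entries shown; a full table runs to 0xFF.
static const char* const kTable[] = {
    "\xC2\xA0",     // 0xA0 NO-BREAK SPACE
    "\xE2\x80\x98", // 0xA1 LEFT SINGLE QUOTATION MARK
    "\xE2\x80\x99", // 0xA2 RIGHT SINGLE QUOTATION MARK
    "\xC2\xA3",     // 0xA3 POUND SIGN
};

std::string to_utf8(const std::string& in) {
    const size_t kN = sizeof kTable / sizeof kTable[0];
    std::string out;
    for (size_t k = 0; k < in.size(); ++k) {
        unsigned char u = static_cast<unsigned char>(in[k]);
        if (u < 0x80) {
            out += in[k];               // ASCII: identical in UTF-8
        } else if (u < 0xA0) {
            out += '\xC2';              // C1 controls: 0xC2 lead byte,
            out += in[k];               // then the byte itself
        } else if (u - 0xA0 < kN) {
            out += kTable[u - 0xA0];    // table lookup
        } else {
            out += '?';                 // entry not in this partial table
        }
    }
    return out;
}
```

With the table filled in to 0xFF, the last branch disappears and every input byte has a fixed-size lookup.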


6 Comments

This could be a low-effort solution to fall back on when I'm out of options. I'm keeping this as a backup plan.
It's a good working solution. I just generated a std::unordered_map<unsigned char, std::string_view> using libiconv. That map can then be included separately without using iconv or any other lib.
@TedLyngmo Clever use of metaprogramming. I like it. Although I would prefer an array table in this case.
Thanks! I'll add a godbolt link to the result as a comment here when I get to a computer. An array is much better in this case, I agree.
@afe Here's the table you'll need: godbolt.org/z/5zanvc There are three ? (\x3f) in the high part. Those are codepoints not used in iso-8859-7.

One way could be to use the Posix libiconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc so no extra linkage is needed there. On your old machines you may need to install libiconv but I doubt it.

Converting may be as simple as this:

#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// A wrapper for the iconv functions
class Conv {
public:
    // Open a conversion descriptor for the two selected character sets
    Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
        if(cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error(std::strerror(errno));
    }
    Conv(const Conv&) = delete;
    ~Conv() { iconv_close(cd); }

    // the actual conversion function
    std::string convert(const std::string& in) {
        const char* inbuf = in.c_str();
        size_t inbytesleft = in.size();

        // make the "out" buffer big to fit whatever we throw at it and set pointers
        std::string out(inbytesleft * 6, '\0');
        char* outbuf = out.data();
        size_t outbytesleft = out.size();

        // the const_cast shouldn't be needed but my "iconv" function declares it
        // "char**" not "const char**"
        size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
                                         &inbytesleft, &outbuf, &outbytesleft);

        if(non_rev_converted == static_cast<size_t>(-1)) {
            // here you can add misc handling like replacing erroneous chars
            // and continue converting etc.
            // I'll just throw...
            throw std::runtime_error(std::strerror(errno));
        }
        // shrink to keep only what we converted
        out.resize(outbuf - out.data());
        return out;
    }

private:
    iconv_t cd;
};

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");

    // create a string from the ISO-8859-7 data
    unsigned char data[]{0xcf, 0xcb, 0xc1};
    std::string iso88597_str(std::begin(data), std::end(data));

    auto utf8 = cvt.convert(iso88597_str);
    std::cout << utf8 << '\n';
}

Output (in UTF-8):

ΟΛΑ 

Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:

Demo



Ok, I decided to do this myself instead of looking for a compatible library. Here's how I did it.

The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default).

I started with "Ο" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30). I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply.

The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as reference, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.

bool iso_to_utf8(char* in){
    bool wasISO=false;
    if(in == NULL)
        return wasISO;

    // count chars
    int i=strlen(in);
    if(!i)
        return wasISO;

    // create and size new buffer
    // (worst case every char doubles, plus one for the terminator)
    char *out = new char[2*i + 1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*i + 1);

    // ready to start from head of old buffer
    i=0;
    // index for new buffer
    int j=0;

    // for each char in old buffer
    while(in[i]!='\0'){
        if(in[i] >= 0){
            // 0x00..0x7F: already utf8-compliant, take it as it is
            // (assumes char is signed, as it is with GCC on x86)
            out[j++] = in[i];
        }else{
            // it's ISO
            wasISO=true;
            // get plain value
            int val = in[i] & 0xFF;
            // first byte to CF or CE
            out[j++] = val > 0xEF ? 0xCF : 0xCE;
            // second byte to plain value normalized
            out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
        }
        i++;
    }
    // add string terminator
    out[j]='\0';

    // paste into old char array
    // (the caller must guarantee it is large enough to hold the result)
    strcpy(in, out);
    delete[] out;
    return wasISO;
}

3 Comments

Does this work for the iso-8859-7 characters 0xa1, 0xa2, 0xa4, 0xa5 and 0xaf?
Since you asked, I guess it doesn't, but that's out of scope; I only focused on Greek letters, not symbols. Following the steps described it should be easy to add the missing characters.
I didn't test your version but it looks like making 3-byte utf8 sequences won't work. The map I linked to is both accurate for all iso-8859-7 characters and faster.
