
The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:

The first line specifies the character set used for both the wordlist and the affix file (should be all uppercase). For example: SET ISO8859-1 

That strikes me as both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would let you "taste" bytes up to the first newline byte (0x0A or 0x0D).

But I have no idea if the ASCII range is common to all character sets or not.
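
Here's roughly what I have in mind, assuming the answer is yes, as a quick Python sketch; the 1024-byte cap is just a sanity limit I picked, not part of any convention:

    # Read raw bytes up to the first newline (capped), decode them as ASCII,
    # and parse a "SET <charset>" declaration. Returns None on any failure.
    def taste_affix_encoding(path, max_bytes=1024):
        with open(path, "rb") as f:
            raw = f.read(max_bytes)
        for i, byte in enumerate(raw):
            if byte in (0x0A, 0x0D):  # stop at the first newline byte
                raw = raw[:i]
                break
        try:
            line = raw.decode("ascii")
        except UnicodeDecodeError:
            return None  # the first line isn't ASCII-compatible
        parts = line.split()
        if len(parts) == 2 and parts[0] == "SET":
            return parts[1]  # e.g. "ISO8859-1"
        return None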

  • Technically speaking, it's not, no. Practically speaking, most encodings you'll encounter in use today are. If there's a standard that tells you to do this, then it's safe to assume that all files meant to work with that standard share this constraint, no? Set a sensible limit on the line length when "tasting" so you aren't thrown off by random garbage files (including non-ASCII-based encodings). Commented Jun 23, 2017 at 14:08
  • I put "standard" in quotes, because it's not a real standard, as far as I know. It's probably more of a convention. I agree that most will probably have the ASCII characters in common, but I don't think the convention disallows non-ISO-8859-* character sets, for example. Commented Jun 23, 2017 at 14:12

1 Answer


No. EBCDIC is not ASCII-based, and it is still used in IBM mainframe software environments with extreme backwards-compatibility requirements.

More common are UTF-16 and UTF-32, which, although ASCII-based, are backwards-incompatible because of all the extra 00 bytes.

Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings in existence is their handling of accented and non-Latin letters.) So the program that reads these files only needs to handle a few possible encodings of the word SET:

  • 53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
  • E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
  • 00 53 00 45 00 54 for UTF-16BE
  • 53 00 45 00 54 00 for UTF-16LE
  • 00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
  • 53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE

The decoder could simply look for them all.
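
For illustration, a minimal Python sketch of that lookup; the codec names are Python's, and cp500 is assumed here as a representative EBCDIC code page (the affix-file convention itself specifies none of this):

    # Byte patterns for the word "SET" in each encoding listed above.
    # Checked in order; longer signatures come first as a precaution
    # against one pattern being a prefix of another.
    SET_SIGNATURES = [
        (b"\x00\x00\x00S\x00\x00\x00E\x00\x00\x00T", "utf-32-be"),
        (b"S\x00\x00\x00E\x00\x00\x00T\x00\x00\x00", "utf-32-le"),
        (b"\x00S\x00E\x00T", "utf-16-be"),
        (b"S\x00E\x00T\x00", "utf-16-le"),
        (b"SET", "ascii"),           # any ASCII-compatible encoding
        (b"\xE2\xC5\xE3", "cp500"),  # EBCDIC (cp500 as one common code page)
    ]

    def detect_set_encoding(first_bytes):
        # Return a codec guess based on how the leading "SET" is encoded,
        # or None if the file doesn't start with a recognizable "SET".
        for signature, codec in SET_SIGNATURES:
            if first_bytes.startswith(signature):
                return codec
        return None

Once the family is identified, the rest of the first line can be decoded with that codec to read the declared character set name.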
