
The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:

The first line specifies the character set used for both the wordlist and the affix file (should be all uppercase). For example: SET ISO8859-1 

That strikes me as both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would let you "taste" bytes up to the first newline byte (0x0A or 0x0D).

But I have no idea if the ASCII range is common to all character sets or not.
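
Here's roughly what I have in mind, assuming the answer is yes, as a quick Python sketch; the 1024-byte cap is just a sanity limit I picked, not part of any convention:

    # Read raw bytes up to the first newline (capped), decode them as ASCII,
    # and parse a "SET <charset>" declaration. Returns None on any failure.
    def taste_affix_encoding(path, max_bytes=1024):
        with open(path, "rb") as f:
            raw = f.read(max_bytes)
        for i, byte in enumerate(raw):
            if byte in (0x0A, 0x0D):  # stop at the first newline byte
                raw = raw[:i]
                break
        try:
            line = raw.decode("ascii")
        except UnicodeDecodeError:
            return None  # the first line isn't ASCII-compatible
        parts = line.split()
        if len(parts) == 2 and parts[0] == "SET":
            return parts[1]  # e.g. "ISO8859-1"
        return None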

  • Technically speaking, it's not, no. Practically speaking, most encodings you'll encounter in use today are. If there's a standard that tells you to do this, then it's safe to assume that all files meant to work with that standard share this constraint, no? Set a sensible limit on the line length when "tasting" so you aren't thrown off by random garbage files (including non-ASCII-based encodings). Commented Jun 23, 2017 at 14:08
  • I put "standard" in quotes, because it's not a real standard, as far as I know. It's probably more of a convention. I agree that most will probably have the ASCII characters in common, but I don't think the convention disallows non-ISO-8859-* character sets, for example. Commented Jun 23, 2017 at 14:12

1 Answer


No. EBCDIC is not ASCII-based, and it is still used in IBM mainframe software environments with extreme backwards-compatibility requirements.

More common are UTF-16 and UTF-32, which, although ASCII-based, are backwards-incompatible because of all the extra 00 bytes.

Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings in existence is their handling of accented and non-Latin letters.) So the program that reads these files only needs to handle a few possible encodings of the word SET:

  • 53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
  • E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
  • 00 53 00 45 00 54 for UTF-16BE
  • 53 00 45 00 54 00 for UTF-16LE
  • 00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
  • 53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE

The decoder could simply look for them all.
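
For illustration, a minimal Python sketch of that lookup; the codec names are Python's, and cp500 is assumed here as a representative EBCDIC code page (the affix-file convention itself specifies none of this):

    # Byte patterns for the word "SET" in each encoding listed above.
    # Checked in order; longer signatures come first as a precaution
    # against one pattern being a prefix of another.
    SET_SIGNATURES = [
        (b"\x00\x00\x00S\x00\x00\x00E\x00\x00\x00T", "utf-32-be"),
        (b"S\x00\x00\x00E\x00\x00\x00T\x00\x00\x00", "utf-32-le"),
        (b"\x00S\x00E\x00T", "utf-16-be"),
        (b"S\x00E\x00T\x00", "utf-16-le"),
        (b"SET", "ascii"),           # any ASCII-compatible encoding
        (b"\xE2\xC5\xE3", "cp500"),  # EBCDIC (cp500 as one common code page)
    ]

    def detect_set_encoding(first_bytes):
        # Return a codec guess based on how the leading "SET" is encoded,
        # or None if the file doesn't start with a recognizable "SET".
        for signature, codec in SET_SIGNATURES:
            if first_bytes.startswith(signature):
                return codec
        return None

Once the family is identified, the rest of the first line can be decoded with that codec to read the declared character set name.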
