4

I came across two special characters that seem not to be covered by the ISO-8859-1 character set, i.e. they don't make it through to my program.

The German ß and the Norwegian ø

I'm reading the files as follows:

FileInputStream inputFile = new FileInputStream(corpus[i]);
InputStreamReader ir = new InputStreamReader(inputFile, "ISO-8859-1");

Is there a way for me to read these characters without having to apply manual replacement as a workaround?

[EDIT]

This is how it looks on screen. Note that I have no problems with other accented characters, e.g. è and the like.

enter image description here

2
  • Are you absolutely sure that the eszett is not stored as 0xDF in the file and read into the program correctly (as char 0x00DF), but simply not displayed by the font you're using for output? Commented Apr 30, 2011 at 22:02
  • @Pete, I'm not sure. I'm using text copied and pasted from the Universal Declaration of Human Rights found here: ohchr.org/EN/UDHR/Pages/SearchByLang.aspx Commented Apr 30, 2011 at 22:07
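One way to separate a decoding problem from a font problem, as the first comment suggests, is to print the numeric value of each decoded char instead of the character itself. The following is only a sketch: the class name `CodePointDump` and the two sample bytes are made up for illustration and do not come from the question's corpus.

```java
import java.nio.charset.StandardCharsets;

class CodePointDump {
    // Formats each char as U+XXXX so the decoded values are visible
    // regardless of what the console font can display.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: the two bytes an ISO-8859-1 file would
        // contain for the eszett and the o with stroke.
        byte[] raw = {(byte) 0xDF, (byte) 0xF8};
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(describe(decoded)); // prints "U+00DF U+00F8"
    }
}
```

If the program reports U+00DF and U+00F8 for the text in question, the characters were read correctly and the output font or console encoding is the more likely culprit.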

3 Answers

3

Both characters are present in ISO-Latin-1 (check my name to see why I've looked into this).

If the characters are not read in correctly, the most likely cause is that the text in the file is not saved in that encoding, but in something else.

Depending on your operating system and the origin of the file, possible encodings could be UTF-8 or an OEM (DOS) code page like 850 or 437.

The easiest way to find out is to look at the file with a hex editor and report back the exact byte values that are saved for these two characters.
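If no hex editor is at hand, a few lines of Java can do the same job. This is a minimal sketch, assuming the file is small enough to load into memory; the class name `HexDump` is made up, and the file path is taken from the first command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexDump {
    // Renders raw bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes print as 80..FF.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexDump <file>");
            return;
        }
        // Read the undecoded bytes; no charset is involved at this point.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(toHex(bytes));
    }
}
```

As a rough guide for reading the output: ß is DF in ISO-8859-1, C3 9F in UTF-8, and E1 in code pages 437 and 850; ø is F8 in ISO-8859-1, C3 B8 in UTF-8, and 9B in code page 850.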


2 Comments

Is there a file encoding that will work for several Indo-European language groups, e.g. Germanic, Latin and Slavic?
@Baba pretty much every single time, you should just use UTF-8.
1

ISO-8859-1 covers ß and ø, so the file is probably saved in a different encoding. You should pass the file's actual encoding to new InputStreamReader().
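A sketch of what passing the encoding in could look like, with the charset as a parameter rather than a hard-coded string. The class `CharsetAwareReader` and its `readAll` helper are hypothetical names, not part of the question's code, and the charset name on the command line is whatever the file turns out to be saved in.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class CharsetAwareReader {
    // Decodes the whole stream with the given charset and closes it afterwards.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the helper easy to call.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CharsetAwareReader <file> <charset>");
            return;
        }
        // Example charset names: "ISO-8859-1", "UTF-8".
        Charset charset = Charset.forName(args[1]);
        System.out.println(readAll(new FileInputStream(args[0]), charset));
    }
}
```

Taking a `Charset` instead of a name string also avoids the checked `UnsupportedEncodingException` that the string-based `InputStreamReader` constructor declares.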

3 Comments

Thanks for the link. It seems that there can be cases of incomplete coverage, as mentioned on that page. For example, my Norwegian text might be using the Danish ø, which isn't covered.
Yes, there is incomplete coverage, but not for the characters you mentioned in your question.
No, Ø and ø are covered but Ǿ and ǿ are missing.
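The coverage claims in these comments can be checked from Java itself, since a `CharsetEncoder` can report whether it is able to encode a given string. A small sketch (the class name `Latin1Coverage` is made up; the characters are written as Unicode escapes so the result does not depend on the source file's own encoding):

```java
import java.nio.charset.StandardCharsets;

class Latin1Coverage {
    // True if every character of s can be represented in ISO-8859-1.
    static boolean inLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // 00D8 and 00F8 are O/o with stroke, 00DF is the eszett,
        // 01FE and 01FF are O/o with stroke and acute.
        String[] samples = {"\u00D8", "\u00F8", "\u00DF", "\u01FE", "\u01FF"};
        for (String s : samples) {
            System.out.printf("U+%04X in ISO-8859-1: %b%n",
                    (int) s.charAt(0), inLatin1(s));
        }
    }
}
```

The first three report true and the last two false, since ISO-8859-1 only covers U+0000 through U+00FF.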
0

Assuming that your file is UTF-8 encoded, try this:

InputStreamReader ir = new InputStreamReader(inputFile, "UTF-8"); 

1 Comment

That sounds unlikely: if the file were UTF-8 but read as ISO-8859-1, you would expect a pair of characters for each non-ASCII character rather than a single alternative character.
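The effect this comment describes is easy to reproduce. The sketch below (class and method names are made up for illustration) encodes a string as UTF-8 and then deliberately decodes the bytes as ISO-8859-1, which is what happens when a UTF-8 file is read with the reader from the question.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Simulates reading UTF-8 bytes through an ISO-8859-1 decoder.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 00DF is the eszett; its UTF-8 form is the two bytes C3 9F.
        String garbled = utf8ReadAsLatin1("\u00DF");
        System.out.println(garbled.length()); // prints 2
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X%n", (int) garbled.charAt(i));
        }
    }
}
```

Each non-ASCII character comes out as two chars (here U+00C3 followed by U+009F), so a single wrong character on screen points away from this particular mix-up.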