Reading a file with accented characters in Java

Question

I came across two special characters that seem not to be covered by the ISO-8859-1 character set, i.e. they don't make it through to my program.

The German ß and the Norwegian ø

I'm reading the files as follows:

FileInputStream inputFile = new FileInputStream(corpus[i]);
InputStreamReader ir = new InputStreamReader(inputFile, "ISO-8859-1");

Is there a way for me to read these characters without having to apply manual replacement as a workaround?

[EDIT]

This is how it looks on screen. Note that I have no problems with other accents, e.g. è and the like.


Are you absolutely sure that the eszett is 0xDF in the file, and that the problem is that the characters are not read into the program (as char 0x00DF), rather than that they are not displayed by the font you're using for output?
– Pete Kirkham, Commented Apr 30, 2011 at 22:02
@Pete, I'm not sure. I'm using text copied and pasted from the Universal Declaration of Human Rights found here: ohchr.org/EN/UDHR/Pages/SearchByLang.aspx
– user425727, Commented Apr 30, 2011 at 22:07
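
One way to tell a decoding problem from a display problem, as the comment above suggests, is to print the numeric code points of what was read instead of the characters themselves. The following is only a sketch: the class name `CodePointCheck` and the helper `describe` are made up for illustration, and the sample bytes stand in for the real file contents.

```java
import java.nio.charset.StandardCharsets;

class CodePointCheck {

    // Renders each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for file content: the bytes of "Gro" followed by 0xDF (ß in ISO-8859-1).
        byte[] sample = { 0x47, 0x72, 0x6F, (byte) 0xDF };
        String decoded = new String(sample, StandardCharsets.ISO_8859_1);
        // If U+00DF shows up here, the character was read correctly and
        // any remaining problem is on the output side (font, console encoding).
        System.out.println(describe(decoded));
    }
}
```

If the expected code point is present, the reader is doing its job and the output path is the place to look.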

Thorbjørn Ravn Andersen · Accepted Answer · 2011-04-30 21:58:06Z

Both characters are present in ISO-Latin-1 (check my name to see why I've looked into this).

If the characters are not read in correctly, the most likely cause is that the text in the file is not saved in that encoding, but in something else.

Depending on your operating system and the origin of the file, possible encodings could be UTF-8 or an OEM code page such as 850 or 437 (used by DOS and the Windows console).
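
To get a feel for how the candidate encodings differ, a small sketch like the one below prints the bytes each of them produces for ß and ø. The class and method names are invented for this example, and `IBM850` and `IBM437` are optional charsets that a given JVM may not ship, so the code checks for them first.

```java
import java.nio.charset.Charset;

class EncodingBytes {

    // Formats the bytes that `text` encodes to under `charset` as upper-case hex, e.g. "C3 9F".
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ß and ø written as Unicode escapes so the source file's own encoding doesn't matter.
        String[] samples = { "\u00DF", "\u00F8" };
        String[] names = { "ISO-8859-1", "UTF-8", "IBM850", "IBM437" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JVM");
                continue;
            }
            Charset cs = Charset.forName(name);
            for (String s : samples) {
                // A charset that cannot represent the character substitutes a
                // replacement byte, which typically shows up as 3F ('?').
                System.out.println(name + "  U+" + String.format("%04X", (int) s.charAt(0))
                        + " -> " + hexBytes(s, cs));
            }
        }
    }
}
```

Comparing this output with the bytes actually found in the file should narrow down which encoding the file was saved in.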

The easiest way is to look at the file with a hex editor and report back what exact values are saved for these two characters.
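
If no hex editor is at hand, a few lines of Java can do the same job. This is a rough sketch under some assumptions: `NonAsciiDump` and `nonAsciiBytes` are hypothetical names, `corpus.txt` is a placeholder path, and the file is assumed to be small enough to load into memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NonAsciiDump {

    // Returns one entry per non-ASCII byte (value >= 0x80), formatted as "offset:HEX".
    static List<String> nonAsciiBytes(byte[] data) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int value = data[i] & 0xFF; // treat the signed byte as 0..255
            if (value >= 0x80) {
                hits.add(String.format("%d:%02X", i, value));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // "corpus.txt" is a placeholder; pass the real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "corpus.txt");
        byte[] data = Files.readAllBytes(file);
        for (String hit : nonAsciiBytes(data)) {
            System.out.println(hit);
        }
    }
}
```

A lone DF or F8 would point to ISO-8859-1 (or a close relative), while a C3 followed by 9F or B8 would point to UTF-8.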

Is there a file encoding that will work for several Indo-European language groups, e.g. Germanic, Latin and Slavic?
@Baba Pretty much every single time, you should just use UTF-8.

Matt Ball · Answer · 2011-04-30 21:54:15Z


ISO-8859-1 covers ß and ø, so the file is probably saved in a different encoding. You should pass the file's actual encoding to new InputStreamReader().
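
As an illustration of passing the encoding explicitly, here is a minimal sketch that takes the charset as a parameter. The names `ReadWithCharset` and `readAll`, the placeholder path `corpus.txt` and the UTF-8 default are assumptions made for this example, not something taken from the question. It uses `Files.newBufferedReader` rather than `InputStreamReader`; that method reports malformed input with an exception instead of silently substituting characters, which makes a wrong guess easier to notice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadWithCharset {

    // Reads the whole file as text, decoding its bytes with the given charset.
    static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder defaults: replace with the real file and its real encoding.
        Path file = Paths.get(args.length > 0 ? args[0] : "corpus.txt");
        Charset charset = Charset.forName(args.length > 1 ? args[1] : "UTF-8");
        String text = readAll(file, charset);
        System.out.println("Read " + text.length() + " chars using " + charset);
    }
}
```

Running it once per candidate encoding and checking whether ß and ø arrive as U+00DF and U+00F8 is a quick way to confirm the right one.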


3 Comments

user425727 Over a year ago

Thanks for the link. It seems that there can be cases of incomplete coverage, as mentioned on that page. For example, my Norwegian text might be using the Danish ø, which isn't covered.

Matt Ball Over a year ago

Yes, there is incomplete coverage, but not for the characters you mentioned in your question.

Matt Ball Over a year ago

No, Ø and ø are covered but Ǿ and ǿ are missing.

WhiteFang34 · Answer · 2011-04-30 21:53:43Z


Assuming that your file is probably UTF-8 encoded, try this:

InputStreamReader ir = new InputStreamReader(inputFile, "UTF-8");


1 Comment

Neil Over a year ago

That sounds unlikely, as you would then expect UTF-8 character pairs for non-ASCII content rather than simply alternative characters.
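
The effect described in this comment can be reproduced in a few lines. The sketch below (with the made-up names `MojibakeDemo` and `utf8ReadAsLatin1`) encodes a character as UTF-8 and then decodes the bytes as ISO-8859-1, which is the mismatch that would occur if the file were UTF-8 but read with the code from the question.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String[] samples = { "\u00DF", "\u00F8" }; // ß and ø as Unicode escapes
        for (String s : samples) {
            String garbled = utf8ReadAsLatin1(s);
            StringBuilder codePoints = new StringBuilder();
            for (int i = 0; i < garbled.length(); i++) {
                codePoints.append(String.format(" U+%04X", (int) garbled.charAt(i)));
            }
            // Each input character turns into two characters, the first being U+00C3 (Ã).
            System.out.println(String.format("U+%04X", (int) s.charAt(0))
                    + " becomes " + garbled.length() + " chars:" + codePoints);
        }
    }
}
```

So a UTF-8 file read as ISO-8859-1 would show two characters where one was expected, not a single wrong character.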
