Issue about 65533 � in C# text file reading

Question

I created a sample app to load special characters that were copied and pasted from OpenOffice Writer into Notepad. The double quotes differ from ordinary ones, and the problem appears when I try to load the file like this:

var lines = File.ReadAllLines("..\\ter34.txt");

This produces the character with code 65533. The text file contains:

“

After reading, this has been changed to the symbol:

�

What encoding is the text file using? ANSI? ASCII? UTF8? UTF16?
– Matthew Watson, Commented Feb 22, 2013 at 10:43
The problem occurs only with ANSI. Everything else works correctly; with ANSI, “ gets changed to that symbol.
– Aravind Srinivas, Commented Feb 22, 2013 at 10:52
Just for those who might not know: (char)65533 is also known as U+FFFD and is the REPLACEMENT CHARACTER. It is often emitted when the data to be converted is corrupt, or when the encoding to convert into can't represent the correct character. See Wikipedia.
– Jeppe Stig Nielsen, Commented Feb 22, 2013 at 10:53

Jon Skeet · Accepted Answer · 2013-02-22 10:48:18Z

26

U+FFFD is the "Unicode replacement character", which is used if the data you try to read is invalid for the encoding which is being used to convert binary data to text.

For example, if you write a file out using ISO-8859-1, but then try to read it using UTF-8, then you could easily end up with some byte sequences which simply aren't valid UTF-8. Each invalid byte would be translated (by default) into U+FFFD.
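
The mismatch described above can be reproduced without any file, as a minimal sketch. The class and method names are made up for illustration, and the sample string "café" is an assumption rather than the asker's data. The single byte 0x93 is how Windows-1252 stores “, which is probably close to what the asker's ANSI file contains.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes raw bytes as UTF-8 using the default (replacement) fallback.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // "café" encoded as ISO-8859-1: the 'é' becomes the single byte 0xE9.
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes("caf\u00E9");

        // In UTF-8, 0xE9 starts a multi-byte sequence, but nothing valid follows it,
        // so the decoder substitutes U+FFFD.
        string decoded = DecodeAsUtf8(latin1Bytes);
        Console.WriteLine(decoded.IndexOf('\uFFFD') >= 0);   // True

        // 0x93 is a lone continuation byte in UTF-8, which is also invalid.
        Console.WriteLine(DecodeAsUtf8(new byte[] { 0x93 }) == "\uFFFD");   // True

        // The number from the question is just this character's code.
        Console.WriteLine((int)'\uFFFD');   // 65533
    }
}
```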

Basically, you need to provide the right encoding to File.ReadAllLines, as a second argument. That means you need to know the encoding of the file first, of course.
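
A small round trip shows the effect of that second argument. This is only a sketch: the helper name, the temporary file, and the sample text "naïve" are assumptions for illustration, and ISO-8859-1 is used because it is the encoding from the example above, not necessarily the one the asker's file uses.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadWithEncodingDemo
{
    // Writes text with one encoding, then reads it back with another.
    public static string[] WriteThenRead(string text, Encoding writeWith, Encoding readWith)
    {
        string path = Path.GetTempFileName();   // throwaway file, used only for this sketch
        try
        {
            File.WriteAllText(path, text, writeWith);
            return File.ReadAllLines(path, readWith);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string text = "na\u00EFve";   // "naïve"

        // Wrong encoding on read: the byte 0xEF is not valid UTF-8 here.
        string[] wrong = WriteThenRead(text, latin1, Encoding.UTF8);
        Console.WriteLine(wrong[0].IndexOf('\uFFFD') >= 0);   // True

        // Matching encoding on read: the text survives unchanged.
        string[] right = WriteThenRead(text, latin1, latin1);
        Console.WriteLine(right[0] == text);   // True
    }
}
```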


10 Comments

quetzalcoatl Over a year ago

Oddly enough, I always thought this was just a custom feature of some data streaming/transcoding library. And it is well-defined Unicode transcoding behavior? Great!

Aravind Srinivas Over a year ago

When I save the txt file in formats like UTF-8, Unicode, etc., it works correctly, but when I save it as ANSI, that symbol appears.

quetzalcoatl Over a year ago

Unicode files can represent many different characters, while ANSI depends on the selected code page and usually covers far fewer. When you try to save some 'extended' character to an ANSI file, there is a chance that the character simply cannot be translated to the ANSI code page you have selected (or defaulted to). In such cases, three things could happen: an exception could be thrown and crash everything, so you see there's a problem; OR those characters could be silently skipped (eeviill); OR some "replacement character" is written to the file instead, so you see there's a problem.
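
In .NET terms, those three outcomes roughly correspond to different EncoderFallback choices. The sketch below uses US-ASCII as a stand-in for a limited code page, which is an assumption made for illustration; it says nothing about what Notepad or OpenOffice actually do when saving, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class EncoderFallbackDemo
{
    // Builds an ASCII encoding whose handling of unencodable characters is chosen by the caller.
    public static Encoding AsciiWith(EncoderFallback fallback)
    {
        return Encoding.GetEncoding("us-ascii", fallback, DecoderFallback.ReplacementFallback);
    }

    public static void Main()
    {
        string text = "\u201Chi";   // left double quotation mark followed by "hi"

        // 1. Throw, so the problem is visible immediately.
        try
        {
            AsciiWith(EncoderFallback.ExceptionFallback).GetBytes(text);
        }
        catch (EncoderFallbackException ex)
        {
            Console.WriteLine("Cannot encode U+" + ((int)ex.CharUnknown).ToString("X4"));
        }

        // 2. Silently drop the character (empty replacement string).
        byte[] skipped = AsciiWith(new EncoderReplacementFallback("")).GetBytes(text);
        Console.WriteLine(skipped.Length);   // 2

        // 3. Write a visible substitute instead.
        byte[] replaced = AsciiWith(new EncoderReplacementFallback("?")).GetBytes(text);
        Console.WriteLine(Encoding.ASCII.GetString(replaced));   // ?hi
    }
}
```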

Jon Skeet Over a year ago

@user2046631: Right, so when you read the file you need to specify that encoding too. "ANSI" isn't a single encoding though - it's a broad term used for lots of encodings. You'll need to find out which one you actually mean.

Jeppe Stig Nielsen Over a year ago

@user2046631 You can possibly use File.ReadAllLines(@"..\ter34.txt", Encoding.GetEncoding("Windows-1252")) if the text file is in "Windows (Western European)" kind of ANSI. To rely on the ANSI of your own machine, use File.ReadAllLines(@"..\ter34.txt", Encoding.Default).
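
As a sketch of that suggestion, assuming the file really is Windows-1252: the bytes 0x93 and 0x94 are the curly quotes in that code page. The class and method names and the temporary file are made up for illustration. One further assumption is the runtime: on .NET Core 3.0 and later, code-page encodings have to be registered before GetEncoding can find them, while on .NET Framework the registration line should be removed.

```csharp
using System;
using System.IO;
using System.Text;

public static class Windows1252Demo
{
    public static Encoding GetWindows1252()
    {
        // Needed on .NET Core 3.0+ / .NET 5+; remove this line on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("Windows-1252");
    }

    // Writes raw bytes to a temporary file and reads them back as lines.
    // A null encoding means "use the File.ReadAllLines default", which is UTF-8.
    public static string[] ReadLines(byte[] fileBytes, Encoding encoding)
    {
        string path = Path.GetTempFileName();   // throwaway file, used only for this sketch
        try
        {
            File.WriteAllBytes(path, fileBytes);
            return encoding == null
                ? File.ReadAllLines(path)
                : File.ReadAllLines(path, encoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // “hi” as an "ANSI" (Windows-1252) editor would store it.
        byte[] ansiBytes = { 0x93, (byte)'h', (byte)'i', 0x94 };

        string asUtf8 = ReadLines(ansiBytes, null)[0];
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);   // True

        string as1252 = ReadLines(ansiBytes, GetWindows1252())[0];
        Console.WriteLine(as1252 == "\u201Chi\u201D");      // True
    }
}
```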
