I created a sample app that loads a text file containing special characters, which I copied and pasted from OpenOffice Writer into Notepad. The double quotes differ from plain ones, and the problem appears when I try to load the file like this:

var lines = File.ReadAllLines("..\\ter34.txt"); 

The loaded text then contains the character 65533. The text file contains:

This has been changed to the symbol:

  • What encoding is the text file using? ANSI? ASCII? UTF8? UTF16? Commented Feb 22, 2013 at 10:43
  • The problem occurs only with ANSI; everything else works correctly. It changes it to -- “ -- Commented Feb 22, 2013 at 10:52
  • Just for those who might not know: (char)65533 is also known as U+FFFD, the REPLACEMENT CHARACTER. It is often emitted when the data to be converted is corrupt, or when the target encoding can't represent the correct character. See Wikipedia. Commented Feb 22, 2013 at 10:53

1 Answer

U+FFFD is the "Unicode replacement character", which is used if the data you try to read is invalid for the encoding which is being used to convert binary data to text.

For example, if you write a file out using ISO-8859-1, but then try to read it using UTF-8, then you could easily end up with some byte sequences which simply aren't valid UTF-8. Each invalid byte would be translated (by default) into U+FFFD.
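A minimal sketch of that mismatch, assuming the text "café" as sample data (the class and method names here are purely illustrative, not part of any library): the bytes are produced with ISO-8859-1 and then decoded as UTF-8.

```csharp
using System;
using System.Text;

static class ReplacementCharSketch
{
    // ISO-8859-1 is used here only as an example of a single-byte encoding.
    public static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Decodes the given bytes as UTF-8 using the default (replacing) decoder.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        // In ISO-8859-1, U+00E9 (e with acute accent) is the single byte 0xE9.
        byte[] bytes = Latin1.GetBytes("caf\u00E9");

        // 0xE9 starts a multi-byte UTF-8 sequence, but no continuation
        // bytes follow it, so the decoder substitutes U+FFFD.
        string wrong = DecodeAsUtf8(bytes);
        string right = Latin1.GetString(bytes);

        Console.WriteLine((int)wrong[3]);          // 65533
        Console.WriteLine(right == "caf\u00E9");   // True
    }
}
```

Decoding the same bytes with the encoding that produced them round-trips the text; only the mismatched decoder yields 65533.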

Basically, you need to provide the right encoding to File.ReadAllLines, as a second argument. That means you need to know the encoding of the file first, of course.
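As a rough sketch of passing the encoding explicitly: the example below writes a temporary file in ISO-8859-1 and reads it back both ways. ISO-8859-1 is only an assumed stand-in for whatever encoding the real file uses, and the helper name `ReadLines` is hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadWithEncodingSketch
{
    // Thin wrapper: reads all lines, decoding the file's bytes with the given encoding.
    public static string[] ReadLines(string path, Encoding encoding)
    {
        return File.ReadAllLines(path, encoding);
    }

    static void Main()
    {
        // Assumption: the file was saved as ISO-8859-1. Substitute the real encoding.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "caf\u00E9\r\n", latin1);

            // Without an encoding argument, ReadAllLines decodes as UTF-8
            // (unless a byte order mark says otherwise).
            string[] defaultRead = File.ReadAllLines(path);
            string[] explicitRead = ReadLines(path, latin1);

            Console.WriteLine(defaultRead[0].IndexOf('\uFFFD') >= 0); // replacement char present?
            Console.WriteLine(explicitRead[0] == "caf\u00E9");        // original text recovered?
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The encoding passed when reading has to match the one used when the file was saved; guessing a different single-byte code page will not produce U+FFFD, but it can silently map bytes to the wrong characters.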

10 Comments

Oddly enough, I always thought this was just a custom feature of some data streaming/transcoding library. And it is well-defined Unicode transcoding behavior? Great!
When I save the txt file in formats like UTF-8, Unicode, etc., it works correctly, but when I save it as ANSI, that symbol appears.
Unicode files can represent many different characters, while ANSI depends on the selected code page and usually covers far fewer. When you try to save some 'extended' character to an ANSI file, there is a chance that the character simply cannot be translated to the ANSI code page you have selected (or defaulted to). In such cases, three things could happen: an exception could be thrown and crash everything, so you see there's a problem; or those characters could be silently skipped (evil); or some "replacement character" is written to the file instead, so you see there's a problem.
@user2046631: Right, so when you read the file you need to specify that encoding too. "ANSI" isn't a single encoding though - it's a broad term used for lots of encodings. You'll need to find out which one you actually mean.
@user2046631 You can possibly use File.ReadAllLines(@"..\ter34.txt", Encoding.GetEncoding("Windows-1252")) if the text file is in "Windows (Western European)" kind of ANSI. To rely on the ANSI of your own machine, use File.ReadAllLines(@"..\ter34.txt", Encoding.Default).