0

I get a UnicodeEncodeError writing text with a special character to a file:

 File "D:\SOFT\Python3\lib\encodings\cp1252.py", line 19, in encode
   return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 956: character maps to <undefined>

My code:

expFile = open(expFilePath, 'w')
# data var is what contains a special char
expFile.write("\n\n" + data)

The data probably contains some odd character from something like Microsoft Word that was pasted into the application's HTML form and persisted; now I am importing it. I can't even see it: it shows as a diamond in my DB editor when I query it, and as a placeholder in the text editor. The input should be checked more rigorously for character set compliance, but it is not.

Is there a way to encode the data so that any character is digestible for I/O processing?

Alternatively, is there a way to check whether my str complies with the character set expected by file I/O, so that I can replace any data that violates it?

  • This is beside the point, but what exactly does data contain? Commented Dec 6, 2016 at 16:44
  • If you really want to write arbitrary bytes, try b as a modifier for open to switch to binary mode. Commented Dec 6, 2016 at 16:46
  • It's probably some odd character from something like Microsoft Word that was pasted into the application's HTML form and processed; now I am importing it. I can't even see it: it shows as a diamond in my DB editor when I query it, and as a placeholder in the text editor. The input should be checked more rigorously for character set compliance, but it is not. Commented Dec 6, 2016 at 16:47
  • That should be expFile = open(expFilePath, 'w', encoding='UTF-8'). Please check the documentation for open. Commented Dec 6, 2016 at 19:03
  • Give the upvote to ShadowRanger. He wrote a nice explanation and I'm not a point hunter anyway. ;-) Commented Dec 6, 2016 at 19:12

2 Answers

2

Your problem is that opening in text mode on your Windows system defaulted to the locale code page, cp1252, an ASCII superset that only encodes a tiny fraction of the Unicode range.

To fix, supply a more comprehensive encoding that can support the whole Unicode range; open accepts a keyword argument to override the default encoding, so it's as simple as changing:

expFile = open(expFilePath, 'w') 

to

expFile = open(expFilePath, 'w', encoding='utf-8') 
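
As a rough illustration of the change above, here is a minimal sketch. The sample string, the temporary output path and the variable names are assumptions made for the example, not taken from the question. It first checks whether the text can be encoded as cp1252, shows what the standard errors="replace" handler would substitute if the file had to stay in cp1252, and then writes the text with an explicit UTF-8 encoding:

```python
import os
import tempfile

# Hypothetical sample text; '\ufffd' is the Unicode replacement character
# named in the traceback.
data = "before \ufffd after"

# Check: cp1252 has no mapping for U+FFFD, so a strict encode fails.
try:
    data.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# If cp1252 output were required, errors="replace" swaps each
# unencodable character for "?" instead of raising.
replaced = data.encode("cp1252", errors="replace")

# Hypothetical output path in a temporary directory.
exp_file_path = os.path.join(tempfile.mkdtemp(), "export.txt")

# An explicit encoding makes the write independent of the locale code page.
with open(exp_file_path, "w", encoding="utf-8") as exp_file:
    exp_file.write("\n\n" + data)

# Read the file back with the same encoding to confirm the round trip.
with open(exp_file_path, "r", encoding="utf-8") as exp_file:
    round_trip = exp_file.read()
```

The with block also closes the file when the write finishes, which the original snippet leaves to the caller. open accepts the same errors argument if replacement on write is preferred over changing the encoding.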

Depending on your needs, I'd choose either utf-8 or utf-16; the former is more compact for mostly ASCII text, and is commonly seen everywhere, while the latter matches Microsoft's typical encoding for storing portable (non-locale dependent) text, so it's possible a few Windows-specific text editors would recognize it/handle it more easily.
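
To make the size trade-off concrete, here is a small sketch with a made-up ASCII sample string that compares the encoded lengths. It assumes Python's "utf-16" codec, which prepends a two-byte byte order mark:

```python
# Hypothetical sample; any ASCII-only string behaves the same way.
text = "mostly ASCII text"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

# UTF-8 uses one byte per ASCII character; UTF-16 uses two bytes per
# character here, plus the byte order mark at the start.
print(len(text), len(utf8_bytes), len(utf16_bytes))
```

For text that is mostly non-ASCII the comparison can come out differently, so this only illustrates the "mostly ASCII" case mentioned above.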


0

You may need to determine the correct encoding of your file. You can use the chardet library to detect the encoding automatically.

import chardet

with open('data.txt', 'rb') as file:
    result = chardet.detect(file.read())

encoding = result['encoding']

with open('data.txt', 'r', encoding=encoding) as file:
    data = file.read()
