UnicodeEncodeError writing text with special character to file

Question

I get a UnicodeEncodeError writing text with a special character to a file:

  File "D:\SOFT\Python3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 956: character maps to <undefined>

My code:

expFile = open(expFilePath, 'w')
# data var is what contains a special char
expFile.write("\n\n" + data)

The data probably contains some weird character from something like Microsoft Word that got pasted into the application's HTML form and was persisted; now I am importing it. I can't even see it: it shows as a diamond in my DB editor when I query it, and as a placeholder in the text editor. The input should be checked more rigorously for character set compliance, but it is not.

Is there a way to encode the data to make any character digestible for I/O processing?

Alternatively, is there a way to check whether my str complies with the character set expected by file I/O, so that I can replace any data that violates it?

This is beside the point, but what exactly does data contain?
– Chris, Commented Dec 6, 2016 at 16:44
If you really want to write arbitrary bytes, try b as a modifier for open to switch to binary mode.
– languitar, Commented Dec 6, 2016 at 16:46
It's probably some weird character from something like Microsoft Word that got pasted into the application's HTML form and it got processed, now I am importing it. I can't even see it, shows as a diamond in my DB editor when I query it. It just has a placeholder in the text editor. The input should be more rigorously checked for character set compliance but it is not.
– amphibient, Commented Dec 6, 2016 at 16:47
That should be expFile = open(expFilePath, 'w', encoding='UTF-8'). Please check the documentation for open.
– Matthias, Commented Dec 6, 2016 at 19:03
Give the upvote to ShadowRanger. He wrote a nice explanation and I'm not a point hunter anyway. ;-)
– Matthias, Commented Dec 6, 2016 at 19:12

ShadowRanger · Accepted Answer · 2016-12-06 19:10:43Z

Your problem is that opening in text mode on your Windows system defaulted to the locale code page, cp1252, an ASCII superset that only encodes a tiny fraction of the Unicode range.
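A minimal sketch of that failure, assuming nothing beyond the character named in the traceback ('\ufffd', the Unicode replacement character). The variable names are illustrative only, and the first line prints whatever your own locale default is, which may not be cp1252:

```python
import locale

# The default text-mode encoding is taken from the locale; on the asker's
# Windows machine it was cp1252, but it may differ on yours.
print(locale.getpreferredencoding(False))

ch = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, the "diamond" from the question

# cp1252 has no byte assigned to U+FFFD, so encoding raises the same error.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent it without trouble.
utf8_bytes = ch.encode("utf-8")

print(cp1252_ok)   # False
print(utf8_bytes)  # b'\xef\xbf\xbd'
```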

To fix, supply a more comprehensive encoding that can support the whole Unicode range; open accepts a keyword argument to override the default encoding, so it's as simple as changing:

expFile = open(expFilePath, 'w')

to

expFile = open(expFilePath, 'w', encoding='utf-8')

Depending on your needs, I'd choose either utf-8 or utf-16; the former is more compact for mostly ASCII text, and is commonly seen everywhere, while the latter matches Microsoft's typical encoding for storing portable (non-locale dependent) text, so it's possible a few Windows-specific text editors would recognize it/handle it more easily.
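To address the second half of the question (checking a str before writing it), here is a rough sketch rather than a definitive recipe. The helper unencodable_chars, the sample string, and the temporary file path are all hypothetical; only open(..., encoding=...) and the errors='replace' handler of str.encode are standard Python:

```python
import os
import tempfile


def unencodable_chars(text, encoding):
    """Return (index, character) pairs that `encoding` cannot represent."""
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad


# Hypothetical sample: U+00E9 exists in cp1252, U+FFFD does not.
data = "caf\u00e9 \ufffd end"

bad = unencodable_chars(data, "cp1252")
print(bad)  # [(5, '\ufffd')]

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "export.txt")

# Option 1: write as UTF-8, which keeps every character intact.
with open(path, "w", encoding="utf-8") as exp_file:
    exp_file.write("\n\n" + data)

with open(path, encoding="utf-8") as exp_file:
    round_trip = exp_file.read()

# Option 2: stay with cp1252 and substitute '?' for anything it lacks.
lossy = data.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)  # café ? end
```

Option 2 is lossy by design, so it only makes sense when the consumer of the file truly requires cp1252.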

Sandeep G · Answer · 2024-01-25 14:11:50Z

You may need to determine the correct encoding of your file. You can use the chardet library to detect the encoding automatically.

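import chardet

# Read the raw bytes and let chardet guess the encoding
with open('data.txt', 'rb') as file:
    result = chardet.detect(file.read())

encoding = result['encoding']

# Re-open the file in text mode using the detected encoding
with open('data.txt', 'r', encoding=encoding) as file:
    data = file.read()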
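If installing chardet is not an option, a cruder standard-library fallback is to try a short list of candidate encodings in order. This is a different technique from the statistical detection chardet performs, and it can silently pick a wrong encoding. The function name decode_with_fallback and the candidate list are assumptions made for illustration:

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first candidate encoding that does not fail.

    Returns (text, encoding_used). A successful decode only means the bytes
    are valid in that encoding, not that the guess is correct.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes produced by cp1252 are not valid UTF-8 here, so the second
# candidate is the one that succeeds.
text, used = decode_with_fallback("caf\u00e9".encode("cp1252"))
print(text, used)  # café cp1252
```

Listing utf-8 first is deliberate: it is strict enough that non-UTF-8 input usually fails quickly, whereas latin-1 accepts any byte sequence and therefore belongs last.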
