19

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).

I can use "print" to display them, but when I use file.write() I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128) 

How can I parse this?

0

3 Answers

16

This error occurs when you pass a Unicode string containing non-ASCII characters (code points above 127) to something that expects an ASCII bytestring. In Python 2, the default encoding used when a Unicode string is implicitly converted to a bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert characters with code points above 127 produces the error.
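To make the failure concrete, here is a minimal sketch that reproduces the same error with the pound sign from the question and then encodes it successfully by naming an encoding explicitly. The variable names are illustrative only and are not taken from the original post.

```python
# The pound sign is U+00A3 (decimal 163), which is outside ASCII's 0-127 range.
price = u"\xa3123"

try:
    price.encode("ascii")  # ASCII cannot represent U+00A3
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)

# Naming an encoding that covers the character avoids the error.
utf8_bytes = price.encode("utf-8")
latin1_bytes = price.encode("latin-1")
```

The same character becomes different bytes depending on the encoding, which is why the choice has to be made explicitly.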

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.

The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors.

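Going the other way, from Unicode to bytes, you call encode() with an explicit encoding. For example:

    s = u'La Pe\xf1a'
    print s.encode('latin-1')

or, when writing to a file object:

    file.write(s.encode('latin-1'))

Both encode the string using Latin-1.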
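As a rough end-to-end sketch of the same idea, the snippet below encodes the string by hand, writes the resulting bytes to a file opened in binary mode, and reads them back. The temporary path is a hypothetical stand-in for whatever file you are actually writing to.

```python
import os
import tempfile

s = u"La Pe\xf1a"

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Encode explicitly and write the bytes; binary mode means no implicit conversion.
with open(path, "wb") as f:
    f.write(s.encode("latin-1"))

# Read the raw bytes back and decode them with the same encoding.
with open(path, "rb") as f:
    raw = f.read()

decoded = raw.decode("latin-1")
```

Whoever reads the file later has to use the same encoding that was used to write it, otherwise the bytes will be misinterpreted.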


4 Comments

The string it's outputting is a price like "£123"
which is not valid ASCII. The pound sign is code point 163, outside the ASCII range of 0 to 127.
You must specify an encoding that can encode those characters. Files do not contain characters; they contain bytes. Encodings convert characters to bytes.
Yes, when I say "you must do this" I understand perfectly that you aren't doing it yet. That's why you must do it: to fix the problem you describe. write() doesn't "understand Unicode" because (a) files do not contain characters, but bytes; and (b) there is more than one way to do the encoding and there is no particularly good way for it to choose on your behalf. Well, actually, it does: it picks the simplest possible encoding, one that only handles the few characters that everyone agrees upon, so that an error comes up if anything special is required.
3

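I tried this (Python 3, where open() accepts an encoding argument) and it works fine:

    with open(r"C:\rag\sampleoutput.txt", 'w', encoding="utf-8") as f:
        f.write(text)  # text is the string you want to write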
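A slightly fuller sketch of this approach, assuming you want code that behaves the same on Python 2 and Python 3: io.open() takes the same encoding argument as Python 3's built-in open(). The path and the text variable are hypothetical placeholders, not part of the original answer.

```python
import io
import os
import tempfile

text = u"\xa3123"  # e.g. a price string extracted from the parsed HTML

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sampleoutput.txt")

# The file object encodes to UTF-8 on write, so write() accepts Unicode text directly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Inspect the bytes that actually landed on disk.
with io.open(path, "rb") as f:
    raw = f.read()
```

With the encoding attached to the file object, there is no need to call encode() on every string before writing it.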

Comments

2

The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization

    import codecs
    import gettext
    import wx

    localedir = './locale'
    langid = wx.LANGUAGE_DEFAULT  # use OS default; or use LANGUAGE_JAPANESE, etc.
    domain = "MyApp"
    mylocale = wx.Locale(langid)
    mylocale.AddCatalogLookupPathPrefix(localedir)
    mylocale.AddCatalog(domain)

    translater = gettext.translation(domain, localedir, [mylocale.GetCanonicalName()], fallback = True)
    translater.install(unicode = True)
    # translater.install() installs the gettext _() translater function into our namespace...

    msg = _("A message that gettext will translate, probably putting Unicode in here")

    # use codecs.open() to convert Unicode strings to UTF8
    # (logfile_name must be set to the path of your log file)
    Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
    Logfile.write(msg + '\n')

Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).

So ... HTH...

GaJ

1 Comment

"Simple"? That's also showing a bunch of i18n machinery that OP doesn't care about - he's not trying to make sure that people see text in the right language, he's trying to grab text in a specific language from a specific source and put it in a file. So the only relevant part of your snippet is the first line and the last two, really. As for "hard to find", really? What did you Google for? I tried UnicodeEncodeError: 'ascii' codec can't encode character; the results seem helpful enough...
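For reference, a sketch of just the codecs part with the i18n setup stripped out might look like the following. The file path and message are hypothetical placeholders; on Python 3, the built-in open() with an encoding argument does the same job.

```python
import codecs
import os
import tempfile

msg = u"Price: \xa3123"

# Hypothetical log file location, used only for this illustration.
logfile_name = os.path.join(tempfile.mkdtemp(), "app.log")

# codecs.open() returns a file object that encodes Unicode text on write.
logfile = codecs.open(logfile_name, "w", encoding="utf-8")
try:
    logfile.write(msg + u"\n")
finally:
    logfile.close()

# Check the bytes that were written.
with open(logfile_name, "rb") as f:
    raw = f.read()
```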
