
I'm using BeautifulSoup to parse some XML files. One of the fields in these files frequently contains non-ASCII characters. I've tried, unsuccessfully, to write the Unicode text to a file using encode.

The process so far is basically:

  1. Get the name

    gamename = items.find('name').string.strip()

  2. Then incorporate the name into a tuple, which is later converted into a string:

    stringtoprint = userid, gamename.encode('utf-8')

    newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" + "\n"

  3. Then write that string to a file:

    listofgamesowned.write(newstring.encode("UTF-8"))

It seems that I shouldn't have to call .encode quite so often. I tried encoding directly upon parsing out the name, e.g. gamename = items.find('name').string.strip().encode('utf-8'), but that did not seem to work.

Currently - 'Uudet L\xc3\xb6yt\xc3\xb6retket'

is being printed and saved rather than Uudet Löytöretket.

It seems if this were a string I was generating then I'd use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, it's one element embedded in a string.

  • The source of the problem was trying to embed the Unicode string within another string and expecting it to be written out correctly. The answer was basically not to do that, but rather to build the string to be written by hand from the start, e.g. + "','" + etc + "','" ... Commented Jan 20, 2013 at 19:57
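The escapes come from `str()` on a tuple, which formats each element with `repr()`, and `repr()` of a byte string escapes every non-ASCII byte. The following is a minimal sketch of that effect and of the build-it-by-hand fix described above. The `userid` value is hypothetical, the name is written with `\u00f6` escapes so the snippet needs no source-encoding declaration, and on Python 3 the tuple version additionally shows a `b` prefix.

```python
userid = 42  # hypothetical value, for illustration only
gamename = u"Uudet L\u00f6yt\u00f6retket"  # text, as BeautifulSoup returns it

# str() on a tuple calls repr() on each element, so the encoded
# bytes end up in the result as literal \xc3\xb6 escape sequences.
as_tuple = str((userid, gamename.encode("utf-8")))

# Building the statement from text pieces keeps the characters intact.
# Note: no SQL escaping is done here; a name containing a quote
# character would break the statement.
newstring = u"INSERT INTO collections VALUES ({0}, '{1}');\n".format(userid, gamename)

# Encode exactly once, right before the bytes are written out.
encoded = newstring.encode("utf-8")
```

Writing `encoded` to a file opened in binary mode then produces the real UTF-8 bytes for ö rather than the backslash escapes.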

1 Answer


A unicode object is an in-memory representation of text. When you write it out you need to encode it to bytes, and when you read it back in you need to decode those bytes.

Uudet L\xc3\xb6yt\xc3\xb6retket is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read a string back from a file you need to decode it.

    >>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'
    Uudet Löytöretket
    >>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'.decode('utf-8')
    Uudet Löytöretket

Just remember to encode immediately before you output and decode immediately after you read it back.
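One way to keep the encoding and decoding at the file boundary is to let `io.open` do it, as in this rough sketch. It assumes UTF-8 is the desired file encoding; the temporary directory and the `listofgamesowned.sql` file name are made up for the example.

```python
import io
import os
import tempfile

gamename = u"Uudet L\u00f6yt\u00f6retket"  # text held in memory

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "listofgamesowned.sql")

# Text mode with an explicit encoding: text is encoded as it is written.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(gamename + u"\n")

# Binary mode shows the raw bytes that actually landed on disk.
with io.open(path, "rb") as raw:
    on_disk = raw.read()

# Text mode again: the bytes are decoded back to text as they are read.
with io.open(path, "r", encoding="utf-8") as src:
    read_back = src.read().strip()
```

With this approach the rest of the program only ever handles text, and no explicit `.encode` or `.decode` calls are needed.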


4 Comments

What do I do if I simply want to write "Uudet Löytöretket" to the file?
Write Uudet L\xc3\xb6yt\xc3\xb6retket and when you want to read the file make sure you are decoding it with utf-8.
Perhaps I'm a bit confused. The terminal's default encoding is utf-8 - so if I cat a file with "Uudet Löytöretket" in it, I am expecting to see "Uudet Löytöretket" as when the original file is catted and not "Uudet L\xc3\xb6yt\xc3\xb6retket" as is currently being displayed.
For what it's worth - it was a 'shortcut' type approach that I was using that hosed me. I liked simply adding a list to a string to keep the quotation marks added by moving from a list to a string.
