Confused by Python's unicode regex errors

Question

Can someone explain why the middle code excerpt throws an error in Python 2.7.x?

import re
walden = "Waldenström"
walden
print(walden)
s1 = "ö"
s2 = "Wal"
s3 = "OOOOO"
out = re.sub(s1, s3, walden)
print(out)
out = re.sub("W", "w", walden)
print(out)
# I need this one to work
out = re.sub('W', u'w', walden)  # ERROR
out = re.sub(u'W', 'w', walden)
print(out)
out = re.sub(s2, s1, walden)
print(out)

I'm very confused and have tried reading the manual.

zvone · Accepted Answer · 2016-10-09 23:59:14Z

walden is a str:

walden = "Waldenström"

This code replaces a character with a unicode string:

re.sub('W', u'w', walden)

The result of that should be u'w' + "aldenström". This is the part that fails.

In order to concatenate str and unicode, both have to be first converted to unicode. The result is unicode as well.

The problem is, the interpreter does not know which encoding the bytes of 'ö' are in. Python 2 does not guess: it falls back on its default ASCII codec, which cannot decode non-ASCII bytes, so the implicit conversion raises UnicodeDecodeError.
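The failing step can be reproduced explicitly. This is a minimal sketch, assuming the source file is saved as UTF-8; the variable names are only illustrative. It decodes the same bytes once with the ASCII codec (which is what Python 2 attempts implicitly when str and unicode are mixed) and once with the real encoding:

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, so these are the bytes
# that the str "Waldenström" holds in the question.
raw = u"Waldenström".encode("utf-8")

try:
    # Python 2 does the equivalent of this behind the scenes
    # whenever a str has to be combined with a unicode object.
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the actual encoding removes the ambiguity.
decoded = raw.decode("utf-8")
```

The ASCII attempt fails on the first byte of 'ö', while the UTF-8 decode succeeds.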

The solution is to convert the string to unicode yourself before doing the replacement:

re.sub('W', u'w', unicode(walden, encoding))

The encoding should be the one the source file is saved in, e.g.

re.sub('W', u'w', unicode(walden, 'utf-8'))
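The same idea can be wrapped in a small helper. This is only a sketch: `sub_text` is a hypothetical name, not part of the `re` module, and the UTF-8 default is an assumption about how the data was encoded. It uses `.decode()` rather than `unicode()` so that it also runs on Python 3:

```python
# -*- coding: utf-8 -*-
import re


def sub_text(pattern, repl, data, encoding="utf-8"):
    """Hypothetical helper: decode byte strings before substituting.

    `encoding` is an assumption; pass whatever encoding the bytes
    were actually written in.
    """
    # On Python 2, `bytes` is an alias for `str`, so this catches
    # the plain "Waldenström" literal from the question.
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return re.sub(pattern, repl, data)


walden = u"Waldenström".encode("utf-8")  # byte string, as in the question
out = sub_text(u"W", u"w", walden)       # result is text (unicode)
```

Decoding once at the boundary and working with unicode everywhere afterwards avoids the implicit conversion altogether.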

Thank you! That clarified a lot, and now the documentation makes better sense, too!
