
While reading this tutorial, I came across the following difference between the __unicode__ and __str__ methods:

Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.

How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char equal a byte? Or is this referring to (potentially) Unicode characters, which could be multiple bytes? For example, if we took the following:

Ω (omega symbol) 03 A9 or u'\u03a9' 

In Python, would this be considered one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or maybe I am confusing char with character?

  • Forget any tutorials that describe Python 2. Python 3 makes such things simpler, and Python 2 is at end of life (it will be supported for less than 3 more months). Do not try to understand things that are very different now (and already obsolete). Commented Oct 10, 2019 at 8:19
  • I'd suggest you read this write-up. It is from 2003 (by the Stack Overflow founder), but it addresses the exact doubts you have right now: joelonsoftware.com/2003/10/08/… Commented Oct 14, 2019 at 14:42

1 Answer

In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.

One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.

    >>> len(u'\u00e9'); print(u'\u00e9')
    1
    é
    >>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
    2
    é
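To compare the two spellings of 'é' programmatically, one option is the standard library's unicodedata.normalize. The following is a minimal Python 3 sketch, not part of the original answer; the variable names are illustrative only.

```python
import unicodedata

composed = '\u00e9'          # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = '\u0065\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render alike but differ in length and compare unequal.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# Normalizing both to the same form (NFC composes, NFD decomposes)
# makes them comparable.
equal_nfc = unicodedata.normalize('NFC', decomposed) == composed
equal_nfd = unicodedata.normalize('NFD', composed) == decomposed

print(lengths, equal_raw, equal_nfc, equal_nfd)
```

So "one character" can itself be ambiguous: len counts code points, not what a reader perceives as a single letter.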

The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.

    >>> u'\u03a9'.encode('utf-8')
    '\xce\xa9'
    >>> '\xce\xa9'.decode('utf-8')
    u'\u03a9'

In Python 3, that would be (with UTF-8 being the default encoding scheme):

    >>> '\u03a9'.encode()
    b'\xce\xa9'
    >>> b'\xce\xa9'.decode()
    'Ω'

Other byte sequences can be decoded to U+03A9 as well:

    >>> b'\xff\xfe\xa9\x03'.decode('utf16')
    'Ω'
    >>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
    'Ω'
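To tie this back to the original question, here is a small Python 3 sketch that counts characters and bytes separately for Ω. The helper byte_lengths is a hypothetical name introduced for illustration; the '-le' codec variants are used so that no byte order mark is added to the output.

```python
omega = '\u03a9'  # the single character Ω


def byte_lengths(text, encodings=('utf-8', 'utf-16-le', 'utf-32-le')):
    """Return the encoded size in bytes of text for each encoding (illustrative helper)."""
    return {enc: len(text.encode(enc)) for enc in encodings}


char_count = len(omega)           # number of code points in the string
sizes = byte_lengths(omega)       # number of bytes per encoding
utf8_bytes = list(omega.encode('utf-8'))  # the individual byte values

print(char_count, sizes, utf8_bytes)
```

The string is always one character; how many bytes it takes depends entirely on which encoding is chosen when it is converted to bytes.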