Char and bytes in python

Question

While reading this tutorial, I came across the following difference between the __unicode__ and __str__ methods:

Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.

How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char equal a byte? Or is this referring to (potentially) Unicode characters, which could be multiple bytes? For example, if we took the following:

Ω (omega symbol) 03 A9 or u'\u03a9'

In Python, would this be considered one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or maybe I am confusing char and character?

Forget any tutorials that describe Python 2. Python 3 makes such things simpler, and Python 2 is at end of life (it will be supported for less than 3 months): do not try to understand things that are very different now (and that are already obsolete).
– Giacomo Catenazzi, Commented Oct 10, 2019 at 8:19
I'd suggest you read this write-up. It is from 2003 (by the Stack Overflow founder), but it addresses exactly the doubts you have right now: joelonsoftware.com/2003/10/08/…
– jsbueno, Commented Oct 14, 2019 at 14:42

chepner · Accepted Answer · 2019-10-09 20:18:32Z

In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.

One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.

>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
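The two spellings of 'é' compare unequal as strings, even though they render identically. A minimal Python 3 sketch, assuming the standard-library unicodedata module is an acceptable way to reconcile them (the variable names are only illustrative):

```python
import unicodedata

precomposed = "\u00e9"       # é as a single code point, U+00E9
decomposed = "\u0065\u0301"  # e followed by a combining acute accent

# NFC composes where possible; NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(precomposed == decomposed)  # False: different code point sequences
print(nfc == precomposed)         # True: same string after composing
print(len(nfc), len(nfd))         # 1 2
```

Normalizing both sides to the same form before comparing is one common way to treat such strings as equal.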

The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.

>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'
>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'

In Python 3, that would be (with UTF-8 being the default encoding scheme):

>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
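Applied to the Ω from the question, this means one character and, once encoded as UTF-8, two bytes. A small Python 3 sketch of that distinction (the variable names are only illustrative):

```python
omega = "\u03a9"                # a str holding one Unicode character
encoded = omega.encode("utf-8")  # a bytes object holding its UTF-8 encoding

print(len(omega))     # 1: str length counts code points
print(len(encoded))   # 2: bytes length counts bytes
print(list(encoded))  # [206, 169], i.e. 0xCE 0xA9
print(encoded.decode("utf-8") == omega)  # True
```

The byte count only exists once an encoding has been chosen; the str itself is just a sequence of characters.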

Other byte sequences can be decoded to U+03A9 as well:

>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
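To see how the byte count depends on the encoding while the character count stays at one, here is a rough Python 3 sketch. The list of codecs is just a sample, and the sizes dictionary is a name introduced for illustration:

```python
omega = "\u03a9"
sizes = {}

for name in ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-le"):
    data = omega.encode(name)
    sizes[name] = len(data)
    # Every encoding round-trips back to the same single character.
    assert data.decode(name) == omega

# "utf-16" and "utf-32" prepend a byte order mark (2 and 4 bytes);
# the explicit little-endian variants do not.
print(sizes)
# {'utf-8': 2, 'utf-16': 4, 'utf-16-le': 2, 'utf-32': 8, 'utf-32-le': 4}
```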
