Python's .format() minilanguage and Unicode

Question

I'm trying to use some simple Unicode characters in a command-line program I'm writing, but drawing them into a table is difficult because Python appears to treat single-character symbols as multi-character strings.

For example, if I print(u"\u2714".encode("utf-8")) I see the Unicode check mark. However, if I try to add padding to that character (as one might in a tabular structure), Python seems to interpret this single-character string as a 3-character one. All three of these lines print the same thing:

print("|{:1}|".format(u"\u2714".encode("utf-8")))
print("|{:2}|".format(u"\u2714".encode("utf-8")))
print("|{:3}|".format(u"\u2714".encode("utf-8")))

Now I think I understand why this is happening: it's a multibyte string. My question is, how do I get Python to pad this string appropriately?

I'm currently working with Python 2.7, but we need to support 3 as well.
– Daniel Quinn, commented Oct 25, 2015 at 17:54

chucksmash · Accepted Answer · 2015-10-25 17:55:17Z

2

Make your format strings unicode:

from __future__ import print_function
print(u"|{:1}|".format(u"\u2714"))
print(u"|{:2}|".format(u"\u2714"))
print(u"|{:3}|".format(u"\u2714"))

outputs:

|✔|
|✔ |
|✔  |
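
The same idea extends to whole table rows. Below is a minimal sketch, assuming every cell is already a text (Unicode) string; `format_row` and its parameters are hypothetical names made up for illustration, not part of any library. It relies only on the nested `{:{width}}` field, which is available on both Python 2.7 and 3.

```python
def format_row(cells, widths):
    """Pad each text cell to its column width and join the cells with pipes."""
    padded = [
        u"{:{width}}".format(cell, width=width)
        for cell, width in zip(cells, widths)
    ]
    return u"|" + u"|".join(padded) + u"|"


# The check mark counts as one character, so it gets two spaces of padding.
row = format_row([u"\u2714", u"ok"], [3, 4])
```

Padding here counts code points, not terminal columns, which is fine for a symbol like the check mark but may not hold for wide or combining characters.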


3 Comments

poke Over a year ago

The print function is not required for this to work though.

chucksmash Over a year ago

@poke You're correct. OP mentioned in a comment that he was specifically targeting Python 2.7 and 3+ so importing and using unicode_literals, print_function and division are all good practice if not required.

poke Over a year ago

I absolutely agree with that :) My comment was more directed at another comment that has been removed since.
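
For readers targeting both 2.7 and 3, the `__future__` imports mentioned in these comments might be combined roughly as follows. This is only a sketch: `CHECK` and `cell` are hypothetical names, and the imports have to be the first statements in the module.

```python
from __future__ import division, print_function, unicode_literals

# With unicode_literals, plain string literals are text on Python 2.7 as well,
# so neither the format string nor the symbol needs a u prefix.
CHECK = "\u2714"


def cell(value, width):
    """Pad a text value to the given width between two pipes."""
    return "|{:{width}}|".format(value, width=width)


padded = cell(CHECK, 3)
half = 1 / 2  # true division on both versions because of the division import
```

Calling `print(padded)` then behaves as a function on both versions, although whether the symbol displays correctly still depends on the output encoding of the environment.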

Dan D. · Accepted Answer · 2015-10-25 17:54:22Z

1

Don't call encode('utf-8') at that point; do it later:

>>> u"\u2714".encode("utf-8")
'\xe2\x9c\x94'

The UTF-8 encoding is three bytes long. Look at how format works with Unicode strings:

>>> u"|{:1}|".format(u"\u2714")
u'|\u2714|'
>>> u"|{:2}|".format(u"\u2714")
u'|\u2714 |'
>>> u"|{:3}|".format(u"\u2714")
u'|\u2714  |'

Tested on Python 2.7.3.
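
To make the "do it later" advice concrete, here is a small sketch that pads first and encodes last. `pad_then_encode` is a hypothetical helper name, and the `utf-8` default is an assumption for the demo only. It also shows why the original attempt got no padding on 2.7: the encoded value is already three bytes long, so widths 1 through 3 are all satisfied.

```python
def pad_then_encode(text, width, encoding="utf-8"):
    """Pad the text string first, then encode the padded result to bytes."""
    return u"|{:{width}}|".format(text, width=width).encode(encoding)


check = u"\u2714"
text_length = len(check)                  # one code point
byte_length = len(check.encode("utf-8"))  # three bytes once encoded
cell = pad_then_encode(check, 3)          # padding was computed on the text
```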


4 Comments

Daniel Quinn Over a year ago

Exactly what I needed! Thank you.

jfs Over a year ago

@DanielQuinn: don't encode at all. Print Unicode directly instead. Otherwise, your code may produce mojibake if the environment uses a different character encoding.

Daniel Quinn Over a year ago

@J.F.Sebastian If I don't encode, Python2.7 explodes with a UnicodeEncodeError. If I do, then Python 3 prints out b'\xe2\x9c\x98'.

jfs Over a year ago

@DanielQuinn: If you have issues with printing Unicode then it is a different question (and hard-coding the character encoding is not the answer). Read the link from my previous comment. If you read the linked answer and you have failed to apply the solutions to your case then ask a separate question.
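
One way to picture the "print Unicode directly" advice is that the program hands text to a text stream and the stream does the encoding. The sketch below is an illustration only, not the solution from the linked answer: it uses an in-memory `io.BytesIO` wrapped in `io.TextIOWrapper`, the `utf-8` encoding is fixed just so the demo is reproducible, and `render_rows` is a hypothetical name. In a real program the stream would normally be standard output, with its encoding coming from the environment.

```python
import io


def render_rows(rows, stream):
    """Write padded text rows to a text stream; the stream does the encoding."""
    for mark, label in rows:
        stream.write(u"|{:2}|{:8}|\n".format(mark, label))


# In-memory stand-in for a terminal: bytes underneath, text on top.
raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

render_rows([(u"\u2714", u"passed"), (u"\u2718", u"failed")], text_stream)
text_stream.flush()

data = raw.getvalue()  # the bytes the stream produced from the padded text
```

The program itself never calls `.encode()`, so the padding is always computed on text, and the same code runs on Python 2.7 and 3.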
