I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.

From the docs:

The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.

I also attempt to embed metadata in the file name, so I need to be able to calculate the current byte length of the string using Python to make sure the metadata does not make the key too long (in which case I would have to use a separate metadata file).

How can I determine the byte length of the UTF-8-encoded string? Again, I am not interested in the character length — rather the actual byte length used to store the string.

3 Answers

def utf8len(s):
    return len(s.encode('utf-8'))

Works fine in Python 2 and 3.
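Applied to the S3 limit from the question, this becomes a simple guard; a minimal sketch (the key name below is hypothetical, and `S3_KEY_LIMIT` is just the 1024-byte figure from the docs):

```python
S3_KEY_LIMIT = 1024  # S3 limits key names to 1024 bytes of UTF-8

def utf8len(s):
    return len(s.encode('utf-8'))

# Hypothetical key embedding metadata in the file name.
key = "reports/2024/ventas-añadidas.csv"

if utf8len(key) > S3_KEY_LIMIT:
    # Too long: fall back to a separate metadata file,
    # as described in the question.
    pass
```

Note that `utf8len(key)` can exceed `len(key)` whenever the key contains non-ASCII characters, which is exactly why the character count alone is not a safe check.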


1 Comment

Thanks. I also found a website that shows you how to do it in several languages here: rosettacode.org/wiki/String_length#Byte_Length

Use the string's encode method to convert from a character string to a byte string, then use len() as normal:

>>> s = u"¡Hola, mundo!"
>>> len(s)          # characters
13
>>> len(s.encode('utf-8'))  # bytes
14

1 Comment

Please don't use str as a variable name! It will cause no end of grief.
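A related pitfall, not covered in the answer itself: if you need to *truncate* a string to fit a byte limit, slicing the encoded bytes can cut a multi-byte character in half. One way around this (a sketch, not from the original answer) is to decode the slice with errors='ignore', which silently drops any dangling partial character:

```python
def truncate_utf8(s, max_bytes):
    """Truncate s so its UTF-8 encoding fits in max_bytes,
    without splitting a multi-byte character."""
    b = s.encode('utf-8')[:max_bytes]
    # The slice may end mid-character; drop the partial bytes.
    return b.decode('utf-8', errors='ignore')
```

For example, '¡' encodes to two bytes, so truncating "¡Hola" to one byte yields an empty string rather than an invalid fragment.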

Encoding the string and using len() on the result works great, as other answers have shown. It does need to build a throw-away copy of the string, so if you're working with very large strings this might not be optimal (I don't consider 1024 bytes to be large, though). The structure of UTF-8 allows you to get the length of each character very easily without even encoding it, although it might still be easier to encode a single character. I present both methods here; they should give the same result.

def utf8_char_len_1(c):
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    if codepoint <= 0x10ffff:
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8_char_len_2(c):
    return len(c.encode('utf-8'))

utf8_char_len = utf8_char_len_1

def utf8len(s):
    return sum(utf8_char_len(c) for c in s)

3 Comments

Note that in exchange for not making a copy this takes about 180x as long as len(s.encode('utf-8')), at least on my Python 3.3.2 on a string of 1000 UTF-8 characters generated from the code here. (It'd presumably be of comparable speed if you wrote the same algorithm in C.)
@Dougal, thanks for running the test. That's useful information, essential for evaluating possible solutions. I had a feeling it might be slower, but didn't know the magnitude. Did you try both versions?
The version with utf8_char_len_2 is about 1.5x slower than utf8_char_len_1. Of course, we're talking about under a millisecond in every case, so if you're just doing it a few times it doesn't matter at all: 2 µs / 375 µs / 600 µs. That said, copying 1kb of memory is also unlikely to matter either. :)
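The timing comparison in the comments above can be reproduced with timeit; the exact ratios will vary by Python version and hardware, and the test string here is my own construction mixing 1-, 2-, 3-, and 4-byte characters:

```python
import timeit

def utf8_char_len_1(c):
    # Byte count per code point, from the structure of UTF-8.
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    return 4

def utf8len_chars(s):
    return sum(utf8_char_len_1(c) for c in s)

# 1000 characters covering 1-, 2-, 3-, and 4-byte encodings.
s = 'añ€𐍈' * 250

# Both approaches must agree on the byte length.
assert utf8len_chars(s) == len(s.encode('utf-8'))

t_encode = timeit.timeit(lambda: len(s.encode('utf-8')), number=1000)
t_chars = timeit.timeit(lambda: utf8len_chars(s), number=1000)
print(f'encode: {t_encode:.4f}s  per-char: {t_chars:.4f}s')
```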
