Calculate bytes of the unicode character in python

Question

I'm writing a Python script to read Unicode characters from a file and insert them into a database. I can only insert 30 bytes of each string. How do I calculate the size of the string in bytes before I insert into the database?

jfs · Accepted Answer · 2015-06-03 09:23:44Z

If you need to know the byte count of the whole file (the file size), just call:

    import os

    bytes_count = os.path.getsize(filename)

If you want to find out how many bytes a Unicode character requires, the answer depends on the character encoding:

    >>> print(u"\N{EURO SIGN}")
    €
    >>> u"\N{EURO SIGN}".encode('utf-8')  # 3 bytes
    '\xe2\x82\xac'
    >>> u"\N{EURO SIGN}".encode('cp1252')  # 1 byte
    '\x80'
    >>> u"\N{EURO SIGN}".encode('utf-16le')  # 2 bytes
    '\xac '
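For the 30-byte limit in the question, one way to apply this is to measure the encoded length and, if the string is too long, cut it on a character boundary. This is a minimal sketch for Python 3 that assumes the database stores the text as UTF-8; the helper names byte_length and truncate_to_bytes are hypothetical, not part of any library:

```python
def byte_length(text, encoding="utf-8"):
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


def truncate_to_bytes(text, max_bytes=30, encoding="utf-8"):
    """Cut `text` so that its encoded form fits in `max_bytes`.

    Slicing the encoded bytes may split a multi-byte character;
    errors="ignore" drops that incomplete trailing sequence on decode.
    (Assumption: the target column is UTF-8 and limited to 30 bytes.)
    """
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode(encoding, errors="ignore")


print(byte_length("\N{EURO SIGN}"))            # 3 in UTF-8
print(byte_length("\N{EURO SIGN}", "cp1252"))  # 1 in cp1252
print(byte_length(truncate_to_bytes("\N{EURO SIGN}" * 20)))  # 30
```

Truncating this way never leaves half of a multi-byte character in the result, but it can still separate a base letter from a combining mark that follows it.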

To find out how many Unicode characters a file contains, you don't need to read the whole file into memory at once (in case it is a large file):

    with open(filename, encoding=character_encoding) as file:
        unicode_character_count = sum(len(line) for line in file)

If you are on Python 2 then add from io import open at the top.
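The same line-by-line loop can also total the encoded bytes, which is the quantity the question asks about. The following is a rough Python 3 sketch, assuming a BOM-less encoding such as UTF-8 (with a BOM the per-line totals would not match the file size); count_chars_and_bytes is a hypothetical helper name and the sample file exists only for the demonstration:

```python
import os
import tempfile


def count_chars_and_bytes(filename, encoding="utf-8"):
    """Return (code point count, encoded byte count) for a text file."""
    char_count = 0
    byte_count = 0
    # newline="" disables newline translation, so line endings are counted
    # exactly as they are stored in the file.
    with open(filename, encoding=encoding, newline="") as file:
        for line in file:
            char_count += len(line)
            byte_count += len(line.encode(encoding))
    return char_count, byte_count


# Demonstration on a small temporary file written as raw UTF-8 bytes.
sample = "\N{EURO SIGN}uro\n\u00c5\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))

chars, size = count_chars_and_bytes(tmp.name)
file_size = os.path.getsize(tmp.name)
os.remove(tmp.name)

print(chars, size, file_size)  # 7 10 10
```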

The exact count for the same human-readable text may depend on Unicode normalization (different environments may use different settings):

    >>> import unicodedata
    >>> print(u"\u212b")
    Å
    >>> unicodedata.normalize("NFD", u"\u212b")  # 2 Unicode codepoints
    u'A\u030a'
    >>> unicodedata.normalize("NFC", u"\u212b")  # 1 Unicode codepoint
    u'\xc5'
    >>> unicodedata.normalize("NFKD", u"\u212b")  # 2 Unicode codepoints
    u'A\u030a'
    >>> unicodedata.normalize("NFKC", u"\u212b")  # 1 Unicode codepoint
    u'\xc5'

As the example shows, a single character (Å) may be represented using several Unicode codepoints.
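Normalization therefore changes the byte count as well as the codepoint count, which matters when a string has to fit into a fixed number of bytes. A small Python 3 sketch, assuming UTF-8 as the target encoding (utf8_size_by_form is a hypothetical helper name):

```python
import unicodedata


def utf8_size_by_form(text):
    """Map each normalization form to (codepoint count, UTF-8 byte count)."""
    sizes = {}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        sizes[form] = (len(normalized), len(normalized.encode("utf-8")))
    return sizes


angstrom = "\u212b"  # ANGSTROM SIGN
print(len(angstrom.encode("utf-8")))  # 3 bytes before normalization
for form, (codepoints, nbytes) in utf8_size_by_form(angstrom).items():
    print(form, codepoints, nbytes)
# NFC 1 2
# NFD 2 3
# NFKC 1 2
# NFKD 2 3
```

Normalizing to one form before measuring keeps the byte count consistent for text that looks the same.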

To find out how many user-perceived characters are in a file, you could use the \X regular expression (it matches eXtended grapheme clusters):

    import regex  # $ pip install regex

    with open(filename, encoding=character_encoding) as file:
        character_count = sum(len(regex.findall(r'\X', line)) for line in file)

Example:

    >>> import regex
    >>> char = u'A\u030a'
    >>> print(char)
    Å
    >>> len(char)
    2
    >>> regex.findall(r'\X', char)
    ['Å']
    >>> len(regex.findall(r'\X', char))
    1

shruti1810 · Accepted Answer · 2015-06-03 11:12:22Z

Suppose you have read the file's contents into a byte string called byteString. Then you can decode it and count the resulting Unicode characters (len(byteString) itself is the number of bytes):

    unicode_string = byteString.decode("utf-8")
    print(len(unicode_string))

answered Jun 3, 2015 at 5:29

1 Comment

jfs Over a year ago

uniChars is misleading (you want to call .decode() on a bytes object; you shouldn't call it on a Unicode text). You might mean bytestring instead.
