0

I'm writing a Python script to read Unicode characters from a file and insert them into a database. I can only insert 30 bytes of each string. How do I calculate the size of the string in bytes before I insert into the database?

2 Answers

5

If you need the byte count of the whole file (the file size), call:

    bytes_count = os.path.getsize(filename)  # requires: import os


If you want to find out how many bytes a Unicode character requires, that depends on the character encoding:

    >>> print(u"\N{EURO SIGN}")
    €
    >>> u"\N{EURO SIGN}".encode('utf-8') # 3 bytes
    '\xe2\x82\xac'
    >>> u"\N{EURO SIGN}".encode('cp1252') # 1 byte
    '\x80'
    >>> u"\N{EURO SIGN}".encode('utf-16le') # 2 bytes
    '\xac '
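For the 30-byte limit in the question, one possible approach is to measure the encoded length and cut the encoded bytes before inserting. This is only a sketch: it assumes the database column stores UTF-8, and `utf8_size` and `truncate_utf8` are hypothetical helper names, not part of any library.

```python
def utf8_size(text):
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_utf8(text, max_bytes=30):
    """Cut `text` so that its UTF-8 encoding fits into `max_bytes` bytes.

    Hypothetical helper: slicing the encoded bytes can split a multi-byte
    sequence, so the incomplete tail is dropped again while decoding.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = u"\N{EURO SIGN}" * 11      # 11 characters, 3 bytes each in UTF-8
shortened = truncate_utf8(sample)   # assumed limit: 30 bytes
print(utf8_size(sample), utf8_size(shortened))  # 33 30
```

Cutting on bytes never splits a code point here, but it can still separate a base letter from a combining mark that follows it. If the database uses a different encoding, the sizes change accordingly.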

To find out how many Unicode characters a file contains, you don't need to read the whole file into memory at once (which matters if the file is large):

    with open(filename, encoding=character_encoding) as file:
        unicode_character_count = sum(len(line) for line in file)

If you are on Python 2 then add from io import open at the top.

The exact count for the same human-readable text may depend on Unicode normalization (different environments may use different settings):

    >>> import unicodedata
    >>> print(u"\u212b")
    Å
    >>> unicodedata.normalize("NFD", u"\u212b") # 2 Unicode codepoints
    u'A\u030a'
    >>> unicodedata.normalize("NFC", u"\u212b") # 1 Unicode codepoint
    u'\xc5'
    >>> unicodedata.normalize("NFKD", u"\u212b") # 2 Unicode codepoints
    u'A\u030a'
    >>> unicodedata.normalize("NFKC", u"\u212b") # 1 Unicode codepoint
    u'\xc5'

As the example shows, a single character (Å) may be represented using several Unicode codepoints.
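Normalization also changes the byte count, which matters for a byte limit like the one in the question. A rough illustration, assuming UTF-8 as the target encoding (`sizes_by_form` is a made-up helper name):

```python
import unicodedata


def sizes_by_form(text, encoding="utf-8"):
    """Map each normalization form to (code point count, encoded byte count)."""
    result = {}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        result[form] = (len(normalized), len(normalized.encode(encoding)))
    return result


sizes = sizes_by_form(u"\u212b")  # ANGSTROM SIGN
print(sizes["NFC"])  # (1, 2)
print(sizes["NFD"])  # (2, 3)
```

So the same visible character may take 2 or 3 bytes in UTF-8, depending on how it was normalized before encoding.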

To find out how many user-perceived characters a file contains, you could use the \X regular expression (it matches eXtended grapheme clusters):

    import regex # $ pip install regex

    with open(filename, encoding=character_encoding) as file:
        character_count = sum(len(regex.findall(r'\X', line)) for line in file)

Example:

    >>> import regex
    >>> char = u'A\u030a'
    >>> print(char)
    Å
    >>> len(char)
    2
    >>> regex.findall(r'\X', char)
    ['Å']
    >>> len(regex.findall(r'\X', char))
    1

0

Suppose you have read the raw bytes from the file into a variable called byteString. Then you can decode them and count the resulting Unicode characters:

    unicode_string = byteString.decode("utf-8")  # bytes -> Unicode text
    print(len(unicode_string))  # number of Unicode code points, not bytes
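To make the difference between the two lengths explicit, here is a small sketch. The bytes literal stands in for data read from a file opened in binary mode, and UTF-8 is an assumed encoding:

```python
# Stand-in for raw bytes read from a file (assumed to be UTF-8 encoded).
byte_string = u"na\u00efve \N{EURO SIGN}".encode("utf-8")

unicode_string = byte_string.decode("utf-8")

byte_count = len(byte_string)     # size in bytes, relevant for a byte limit
char_count = len(unicode_string)  # number of Unicode code points

print(byte_count, char_count)  # 10 7
```

len() on the bytes object gives the size that counts against a byte limit, while len() on the decoded text gives the number of code points.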

1 Comment

uniChars is misleading (you want to call .decode() on a bytes object; you shouldn't call it on a Unicode text). You might mean bytestring instead.
