4

I am given a string of Hebrew characters (and some Arabic ones; I know neither language) in a file

צוֹר‎

When I load this string from file in Python3

fin = open("filename")
x = next(fin).strip()

The length of x appears to be 5

>>> len(x)
5

Its UTF-8 encoding is

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, in browsers, it is clear that the length of these Hebrew characters is 3.

How do I get the length properly? And why does this happen?

I am aware that Python 3 uses Unicode strings by default, so I did not expect an issue like this.

6
  • "it is clear that the length of these Hebrew characters is 3" — It is clear that the computer disagrees with you, can you explain your position? Commented Dec 18, 2017 at 2:03
  • len(re.findall('\w', x)) Commented Dec 18, 2017 at 2:05
  • 1
    I don't know how many characters are there -- I don't read Hebrew. But I do know that there are 5 unicode code points there. Try this in Python3: for ch in 'צוֹר‎': print(unicodedata.name(ch)) Commented Dec 18, 2017 at 2:06
  • Related: stackoverflow.com/questions/2247205/… Commented Dec 18, 2017 at 2:07
  • 1
    Consider also breaking the text into grapheme clusters pypi.python.org/pypi/uniseg Commented Dec 18, 2017 at 2:17
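The code-point inspection suggested in the comments can be run directly with the standard-library unicodedata module, using the exact bytes from the question:

```python
import unicodedata

# The bytes shown by x.encode("utf-8") in the question
s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
for ch in s:
    print(unicodedata.name(ch))
# HEBREW LETTER TSADI
# HEBREW LETTER VAV
# HEBREW POINT HOLAM
# HEBREW LETTER RESH
# LEFT-TO-RIGHT MARK
```

This makes the discrepancy concrete: five code points, of which only three are letters; the holam is a combining vowel point and U+200E is an invisible directional mark.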

4 Answers 4

6

The reason is that the text contains the control character \u200e, an invisible Left-to-Right mark (often used in text that mixes scripts to demarcate left-to-right and right-to-left runs). Additionally, it includes a vowel "character" (the little dot above the second letter, which shows how to pronounce it).

If you replace the LTR mark with the empty string, for instance, you get a length of 4:

>>> x = 'צוֹר‎'
>>> x
'צוֹר\u200e'  # note the control character escape sequence
>>> print(len(x))
5
>>> print(len(x.replace('\u200e', '')))
4

If you want to count only word and space characters, you can use re.sub to strip out everything else:

>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3

3 Comments

Nice answer! A follow-up question: if I have x = "צוֹר abc (123)" and I want to use index to access the 123, how could I do it? Naively 'a' is at 4, and '1' is at 9. The substitution you suggested removes the punctuation as well.
Hmm, well it depends what you are looking to do. The "correct" indices for the raw text would be 6 and 9 due to the control and accent characters. If you want a version of the text which explicitly excludes non-spacing marks and control characters only, you could do something like (borrowing from @MichaelButscher's answer): ''.join(c for c in x if unicodedata.category(c) not in ['Mn', 'Cf'])
Correction: should be indices 6 and 11 above ^
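One way to handle the indexing follow-up is to precompute, for the raw string, the positions of the code points that survive the Mn/Cf filter; `visible_indices` below is a hypothetical helper name, and the sample string rebuilds the follow-up's example with explicit escapes:

```python
import unicodedata

def visible_indices(s):
    """Raw indices of code points that are not nonspacing marks (Mn)
    or format controls (Cf)."""
    return [i for i, c in enumerate(s)
            if unicodedata.category(c) not in ('Mn', 'Cf')]

# 'צוֹר‎ abc (123)' written with explicit escapes
x = '\u05e6\u05d5\u05b9\u05e8\u200e abc (123)'
idx = visible_indices(x)
print(idx[4])   # raw index of the 5th visible character 'a' -> 6
print(idx[9])   # raw index of '1' -> 11
```

The n-th visible character lives at raw index `idx[n]`, so the "naive" positions 4 and 9 map to the corrected raw indices 6 and 11 from the comment above.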
4

Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
  • Lo: Letter, other (not uppercase, lowercase or such). These are "real" characters
  • Mn: Mark, nonspacing. This is some type of accent character combined with the previous character
  • Cf: Control, format. Here it switches back to left-to-right write direction
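Building on those categories, the "visible" length the browser shows can be obtained by dropping the Mn and Cf code points — a sketch:

```python
import unicodedata

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
# Keep only code points that are not nonspacing marks or format controls
visible = [c for c in s if unicodedata.category(c) not in ('Mn', 'Cf')]
print(len(visible))  # 3
```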

2 Comments

Nice way to distill the "real" characters. If I have other words after the Hebrew characters and I want to index them "correctly" (counting only the "real" characters), is there a way to do it?
@YoHsiao I only see two ways: iterate through the code points and inspect each one, or first convert the string using lemonheads' approach and then locate the filtered "real" characters and words.
0

Have you tried the io library?

>>> import io
>>> with io.open('text.txt', mode="r", encoding="utf-8") as f:
...     x = f.read()
>>> print(len(x))

You can also try codecs:

>>> import codecs
>>> with codecs.open('text.txt', 'r', 'utf-8') as f:
...     x = f.read()
>>> print(len(x))

3 Comments

Thanks for the advice! But these two give identical results: 5. In fact, if you open it with an editor that decodes it correctly, moving the cursor will show that there are some underlying characters that modify "backward". In other words, clicking on "right" button moves the cursor back and forth, just not always forward. Backward modification is just like the unicode for accents.
Using io and codecs is needed in Python 2 but generally not in Python 3.
@JoshLee that's what I thought as all fake example I made were working out of box. Just thought of throwing it out there.
0

Open the file with utf-8 encoding.

fin = open('filename', 'r', encoding='utf-8')

or

with open('filename', 'r', encoding='utf-8') as fin:
    for line1 in fin:
        print(len(line1.strip()))

Comments
