4

I am given a string of Hebrew characters (and some Arabic ones; I know neither language) in a file

צוֹר‎

When I load this string from file in Python3

fin = open("filename")
x = next(fin).strip()

The length of x appears to be 5

>>> len(x)
5

Its UTF-8 encoding is

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, in browsers, it is clear that the length of these Hebrew characters is 3.

How do I get the length properly? And why does this happen?

I am aware that Python 3 uses Unicode strings by default, so I did not expect an issue like this.

6
  • "it is clear that the length of these Hebrew characters is 3" — It is clear that the computer disagrees with you, can you explain your position? Commented Dec 18, 2017 at 2:03
  • len(re.findall('\w', x)) Commented Dec 18, 2017 at 2:05
  • 1
    I don't know how many characters are there -- I don't read Hebrew. But I do know that there are 5 unicode code points there. Try this in Python3: for ch in 'צוֹר‎': print(unicodedata.name(ch)) Commented Dec 18, 2017 at 2:06
  • Related: stackoverflow.com/questions/2247205/… Commented Dec 18, 2017 at 2:07
  • 1
    Consider also breaking the text into grapheme clusters pypi.python.org/pypi/uniseg Commented Dec 18, 2017 at 2:17
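The code-point inspection suggested in the comments can be run directly with the standard-library unicodedata module, using the exact bytes from the question:

```python
import unicodedata

# The bytes shown by x.encode("utf-8") in the question
s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
for ch in s:
    print(unicodedata.name(ch))
# HEBREW LETTER TSADI
# HEBREW LETTER VAV
# HEBREW POINT HOLAM
# HEBREW LETTER RESH
# LEFT-TO-RIGHT MARK
```

This makes the discrepancy concrete: five code points, of which only three are letters; the holam is a combining vowel point and U+200E is an invisible directional mark.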

4 Answers 4

6

The reason is that the text contains the control character \u200e, an invisible Left-to-Right mark (often used in text that mixes scripts to demarcate left-to-right and right-to-left runs). Additionally, it includes a vowel "character" (the little dot above the second letter, which shows how to pronounce it).

If you replace the LTR mark with the empty string, for instance, you get a length of 4:

>>> x = 'צוֹר‎'
>>> x
'צוֹר\u200e'  # note the control character escape sequence
>>> print(len(x))
5
>>> print(len(x.replace('\u200e', '')))
4

If you want to count only word and space characters, you can use re.sub to strip out everything else:

>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3

3 Comments

Nice answer! A follow-up question: if I have x = "צוֹר abc (123)" and I want to use index to access the 123, how could I do it? Naively 'a' is at 4, and '1' is at 9. The substitution you suggested removes the punctuation as well.
Hmm, well it depends what you are looking to do. The "correct" indices for the raw text would be 6 and 9 due to the control and accent characters. If you want a version of the text which explicitly excludes non-spacing marks and control characters only, you could do something like (borrowing from @MichaelButscher's answer): ''.join(c for c in x if unicodedata.category(c) not in ['Mn', 'Cf'])
Correction: should be indices 6 and 11 above ^
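One way to handle the indexing follow-up is to precompute, for the raw string, the positions of the code points that survive the Mn/Cf filter; `visible_indices` below is a hypothetical helper name, and the sample string rebuilds the follow-up's example with explicit escapes:

```python
import unicodedata

def visible_indices(s):
    """Raw indices of code points that are not nonspacing marks (Mn)
    or format controls (Cf)."""
    return [i for i, c in enumerate(s)
            if unicodedata.category(c) not in ('Mn', 'Cf')]

# 'צוֹר‎ abc (123)' written with explicit escapes
x = '\u05e6\u05d5\u05b9\u05e8\u200e abc (123)'
idx = visible_indices(x)
print(idx[4])   # raw index of the 5th visible character 'a' -> 6
print(idx[9])   # raw index of '1' -> 11
```

The n-th visible character lives at raw index `idx[n]`, so the "naive" positions 4 and 9 map to the corrected raw indices 6 and 11 from the comment above.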
4

Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
  • Lo: Letter, other (not uppercase, lowercase or such). These are "real" characters
  • Mn: Mark, nonspacing. This is some type of accent character combined with the previous character
  • Cf: Control, format. Here it switches back to left-to-right write direction
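Building on those categories, the "visible" length the browser shows can be obtained by dropping the Mn and Cf code points — a sketch:

```python
import unicodedata

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
# Keep only code points that are not nonspacing marks or format controls
visible = [c for c in s if unicodedata.category(c) not in ('Mn', 'Cf')]
print(len(visible))  # 3
```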

2 Comments

Nice way to distill the "real" characters. If I have other words after the Hebrew characters and I want to index them "correctly" (counting only the "real" characters), is there a way to do it?
@YoHsiao I only see two ways: iterate through the code points and inspect each one, or first convert the string using lemonheads' approach and then locate the filtered "real" characters and words.
0

Have you tried the io library?

>>> import io
>>> with io.open('text.txt', mode="r", encoding="utf-8") as f:
...     x = f.read()
>>> print(len(x))

You can also try codecs:

>>> import codecs
>>> with codecs.open('text.txt', 'r', 'utf-8') as f:
...     x = f.read()
>>> print(len(x))

3 Comments

Thanks for the advice! But these two give identical results: 5. In fact, if you open it with an editor that decodes it correctly, moving the cursor will show that there are some underlying characters that modify "backward". In other words, clicking on "right" button moves the cursor back and forth, just not always forward. Backward modification is just like the unicode for accents.
Using io and codecs is needed in Python 2 but generally not in Python 3.
@JoshLee that's what I thought as all fake example I made were working out of box. Just thought of throwing it out there.
0

Open the file with utf-8 encoding.

fin = open('filename', 'r', encoding='utf-8')

or

with open('filename', 'r', encoding='utf-8') as fin:
    for line1 in fin:
        print(len(line1.strip()))

Comments
