I am given a string of Hebrew characters (and some Arabic ones; I can read neither script) in a file:
צוֹר
When I load this string from the file in Python 3:

    fin = open("filename")
    x = next(fin).strip()

the length of x appears to be 5:

    >>> len(x)
    5

Its UTF-8 encoding is:

    >>> x.encode("utf-8")
    b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, in browsers, it is clear that these Hebrew characters have a length of 3.
How do I get the length properly? And why does this happen?
I am aware that Python 3 strings are Unicode by default, so I did not expect such an issue.
    import re
    import unicodedata

    len(re.findall(r'\w', x))

    for ch in 'צוֹר':
        print(unicodedata.name(ch))
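To make the discrepancy concrete, here is a minimal sketch that rebuilds the string from the exact bytes shown earlier and lists every code point with its Unicode category. `describe` and `visible_length` are hypothetical helper names introduced only for illustration. The counting rule (skip combining marks and format controls) is an assumption that happens to match what a browser displays for this string. It is an approximation, not full grapheme-cluster segmentation.

```python
import unicodedata

# Rebuild the string from the UTF-8 bytes quoted in the question.
raw = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'
x = raw.decode("utf-8")


def describe(s):
    """Return (code point, general category, name) for each character in s."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


def visible_length(s):
    """Hypothetical helper: count code points that are neither
    combining marks (categories Mn, Mc, Me) nor format controls (Cf)."""
    count = 0
    for ch in s:
        category = unicodedata.category(ch)
        if category.startswith("M") or category == "Cf":
            continue  # attaches to the previous letter, or is invisible
        count += 1
    return count


for code_point, category, name in describe(x):
    print(code_point, category, name)

print(len(x), visible_length(x))
```

The listing shows three letters (TSADI, VAV, RESH), one combining vowel point (HOLAM, category Mn) that is drawn on top of the VAV, and a trailing invisible LEFT-TO-RIGHT MARK (category Cf). `len(x)` counts all five code points, while the filtered count is 3.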