Python returns length of 2 for single non-ascii character string

Question

I am trying to get the span of selected words in a string. When working with the İ character, I noticed the following behavior of Python:

    len("İ")
    Out[39]: 1
    len("İ".lower())
    Out[40]: 2
    # when `upper()` is applied, the length stays the same
    len("İ".lower().upper())
    Out[41]: 2

Why do the uppercase and lowercase forms of the same character differ in length? That seems very confusing and undesirable to me.
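To illustrate why this matters when computing spans, here is a minimal sketch. The sample sentence and variable names are made up for illustration, and the case-insensitive regex search is only one possible workaround, not necessarily what the asker needs.

```python
import re

text = "\u0130stanbul is big"  # "İstanbul is big", a hypothetical sample sentence
lowered = text.lower()

# Lowercasing turns U+0130 into two code points, so the string grows by one
# and every index after the first character shifts by one.
print(len(text), len(lowered))  # 15 16

idx_original = text.find("big")    # position in the original string
idx_lowered = lowered.find("big")  # position in the lowercased copy

# Searching the original text case-insensitively keeps the offsets valid
# for the original string.
match = re.search("big", text, flags=re.IGNORECASE)
span = match.span()
print(idx_original, idx_lowered, span)
```

Offsets found in the lowercased copy should not be applied to the original string unless the two are known to have the same length.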

Does anyone know if there are other characters for which that will happen? Thank you!

EDIT:

On the other hand, for Î, for example, the length stays the same:

    len('Î')
    Out[42]: 1
    len('Î'.lower())
    Out[43]: 1

"Does anyone know if there are other characters for which that will happen?" That İ is the only one, currently, as far as I know. For a lowercase character that goes the other way (becomes longer after str.upper) there are hundreds, the best known of which is ß.
– wim, Commented Nov 25, 2020 at 17:26
Thanks for your comment, I was not aware of this behaviour either.
– lux7, Commented Nov 26, 2020 at 9:13
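The claim in the comment above can be checked by scanning every code point. This is a rough sketch; the helper name `expanding` is made up here, and the exact counts depend on the Unicode version bundled with the Python interpreter, so none are stated.

```python
import sys


def expanding(method_name):
    """Return the characters whose case-mapped form has more than one code point."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # method_name is "lower" or "upper"; call that str method on the character.
        if len(getattr(ch, method_name)()) > 1:
            result.append(ch)
    return result


longer_after_lower = expanding("lower")
longer_after_upper = expanding("upper")

print("longer after lower():", longer_after_lower)
print("longer after upper():", len(longer_after_upper), "characters")
```

The scan covers the whole Unicode range, so it may take a second or two to run.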

Red · Accepted Answer · 2020-11-25 18:30:26Z

That's because the lowercase form of 'İ' is 'i̇', which consists of 2 characters (code points):

    >>> import unicodedata
    >>> unicodedata.name('İ')
    'LATIN CAPITAL LETTER I WITH DOT ABOVE'
    >>> unicodedata.name('İ'.lower()[0])
    'LATIN SMALL LETTER I'
    >>> unicodedata.name('İ'.lower()[1])
    'COMBINING DOT ABOVE'

The second character is a combining dot that your browser might render overlapping the closing quote, so you may not be able to see it. But if you copy-paste it into your Python console, you should be able to see it.

If you try:

print('i̇'.upper())

you should get

İ
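The printed result looks like İ, but it is still two code points, which matches the length of 2 reported in the question. As a small sketch of one way to recover a single character, NFC normalization composes the pair again. Whether normalizing is appropriate depends on the application.

```python
import unicodedata

# "\u0130" is İ (LATIN CAPITAL LETTER I WITH DOT ABOVE).
round_trip = "\u0130".lower().upper()

# Inspect the individual code points of the round-tripped string.
codepoints = [f"U+{ord(c):04X}" for c in round_trip]
print(codepoints)  # ['U+0049', 'U+0307']

# NFC normalization combines "I" + COMBINING DOT ABOVE into the single
# precomposed character U+0130.
composed = unicodedata.normalize("NFC", round_trip)
print(len(round_trip), len(composed))  # 2 1
```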

Thanks a lot for your answer! The unicodedata.name('İ') function is very interesting indeed. What I still have difficulty understanding is why this happens. Isn't this undesired behaviour?

topgunner · Accepted Answer · 2020-11-25 16:54:47Z

I think the issue is that a lower case character for that symbol is undefined in ASCII.

The .lower() function probably performs a fixed offset to the ASCII number associated with the character, since that works for the English alphabet.

This is Unicode; it has nothing to do with ASCII encoding. And that's not what str.lower actually does.
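A short sketch supporting this comment: the distance between a character and its lowercase form is not a fixed offset. The function `ascii_offset_lower` is a hypothetical naive implementation written only for comparison; it is not how str.lower works.

```python
def ascii_offset_lower(s):
    """Hypothetical naive lowercasing: shift A-Z by 32, leave everything else."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


# Code point distance between each character and its lowercase form.
offsets = {}
for ch in ["A", "\u00ce", "\u0100", "\u0178"]:  # A, Î, Ā, Ÿ
    offsets[ch] = ord(ch.lower()) - ord(ch)
print(offsets)  # the distances differ: 32, 32, 1, -121

# The naive version leaves İ (U+0130) untouched, while str.lower maps it
# to "i" followed by COMBINING DOT ABOVE.
naive = ascii_offset_lower("\u0130")
actual = "\u0130".lower()
print(len(naive), len(actual))  # 1 2
```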
