4

I am trying to get the span of selected words in a string. When working with the İ character, I noticed the following behavior of Python:

len("İ")
Out[39]: 1
len("İ".lower())
Out[40]: 2
# when `upper()` is applied, the length stays the same
len("İ".lower().upper())
Out[41]: 2

Why does the length of the upper and lowercase value of the same character differ (that seems very confusing/undesired to me)?

Does anyone know if there are other characters for which that will happen? Thank you!

EDIT:

On the other hand for e.g. Î the length stays the same:

len('Î')
Out[42]: 1
len('Î'.lower())
Out[43]: 1
2
  • "Does anyone know if there are other characters for which that will happen?" İ is currently the only one, as far as I know. In the other direction (a lowercase character that becomes longer after str.upper) there are about a hundred, the best known of which is ß. Commented Nov 25, 2020 at 17:26
  • Thanks for your comment, I was not aware of this behaviour either. Commented Nov 26, 2020 at 9:13
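
The claim in the first comment can be checked with a brute-force scan over all code points. This is only a sketch: the helper name `length_changing` is made up for illustration, and the exact results depend on the Unicode version bundled with your Python build, so no counts are asserted here.

```python
import sys

def length_changing(method_name):
    """Return the single characters whose str.<method_name>() result
    has a different length than the character itself (hypothetical helper)."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if len(getattr(ch, method_name)()) != len(ch):
            found.append(ch)
    return found

changed_by_lower = length_changing("lower")
changed_by_upper = length_changing("upper")

# Print code points rather than the characters themselves,
# so combining marks cannot hide in the output.
print(len(changed_by_lower), [f"U+{ord(c):04X}" for c in changed_by_lower])
print(len(changed_by_upper), [f"U+{ord(c):04X}" for c in changed_by_upper[:10]])
```

If span offsets matter, it is probably safer to compute them on the original string rather than on its lowercased copy, since the two can differ in length.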

2 Answers

3

That's because 'İ' in lowercase is 'i̇', which consists of two code points, and len() counts code points rather than the glyphs you see on screen:

>>> import unicodedata
>>> unicodedata.name('İ')
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> unicodedata.name('İ'.lower()[0])
'LATIN SMALL LETTER I'
>>> unicodedata.name('İ'.lower()[1])
'COMBINING DOT ABOVE'

One of the two is a combining dot, which your browser might render on top of the closing quote, so you may not be able to see it. If you copy and paste it into your Python console, you should be able to see it.


If you try:

print('i̇'.upper()) 

you should get

İ 
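
Since the combining dot is easy to miss on screen, here is a small sketch of the same round trip written with explicit escapes. The variable names are only illustrative. It also suggests why the length stays at 2: NFC normalization cannot merge the lowercase pair into one code point, but it can merge the uppercase pair back into 'İ'.

```python
import unicodedata

lowered = "\u0130".lower()  # same as "İ".lower()
print([f"U+{ord(c):04X}" for c in lowered])   # ['U+0069', 'U+0307']

# NFC composes characters where a precomposed form exists;
# the lowercase pair stays as two code points.
print(len(unicodedata.normalize("NFC", lowered)))   # 2

# upper() maps each code point on its own, so the result is
# 'I' followed by the combining dot, not the single U+0130.
uppered = lowered.upper()
print([f"U+{ord(c):04X}" for c in uppered])   # ['U+0049', 'U+0307']

# Normalizing the uppercase result recovers the original character.
print(unicodedata.normalize("NFC", uppered) == "\u0130")   # True
```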

1 Comment

Thanks a lot for your answer! The unicodedata.name('İ') method is very interesting indeed. What I still have difficulty understanding is why this happens. Isn't this undesired behaviour?
-1

I think the issue is that a lower case character for that symbol is undefined in ASCII.

The .lower() function probably performs a fixed offset to the ASCII number associated with the character, since that works for the English alphabet.

1 Comment

This is Unicode and has nothing to do with ASCII encoding. And that's not what str.lower actually does.
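
To illustrate the point of the comment above, here is a quick sketch comparing the fixed-offset idea with what str.lower returns. The offset of 32 is the usual ASCII trick and is used here only for comparison; it is not how str.lower is implemented.

```python
import unicodedata

ch = "\u0130"  # 'İ'
print(ch.isascii())   # False: the character is outside ASCII
print(ord(ch))        # 304

# The fixed offset works for the ASCII letters A-Z ...
print(chr(ord("A") + 32))   # a

# ... but applied to 'İ' it lands on an unrelated capital letter.
shifted = chr(ord(ch) + 32)
print(unicodedata.name(shifted))   # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE

# str.lower instead returns two code points for this character.
print([unicodedata.name(c) for c in ch.lower()])
```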
