
I'm using the library unidecode to convert accented strings to ASCII-represented strings.

>>> accented_string = u'Málaga'  # accented_string is of type 'unicode'
>>> import unidecode
>>> unidecode.unidecode(accented_string)
'Malaga'

But the problem is that I'm reading the strings from a file. How do I send them to the unidecode library?

for name in strings:
    print unidecode.unidecode(u+name)  # ?????

I can't get my head around it. If I encode it, that just gives me the wrong encoding.

  • How are you reading strings? Commented Jul 30, 2018 at 11:34
  • From a csv file to a pandas data frame, then looping over every string value; the type is 'string' for every value. Commented Jul 30, 2018 at 11:37
  • Please include that code in your question too. Commented Jul 30, 2018 at 11:38
  • Ignore the "u" you see in the example; it's just Python 2 notation to tell you it's unicode. If your strings are not yet unicode, you'll need to know their encoding and convert them from str to unicode. Commented Jul 30, 2018 at 11:38
  • If this is not part of a large, existing program, I strongly recommend you install Python 3 today and start using it. Trying to figure out the Python 2 approach to character encodings in 2018 is an exercise in masochism. Commented Jul 30, 2018 at 11:40

3 Answers


I found a workaround which is quite simple: just decode the string you read to a unicode string, then pass it to the unidecode library.

>>> accented_string = 'Málaga'
>>> accented_string_u = accented_string.decode('utf-8')
>>> import unidecode
>>> unidecode.unidecode(accented_string_u)
'Malaga'
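For reference, the same decode step expressed in Python 3 (a minimal sketch; the byte string below is just the UTF-8 encoding of 'Málaga'):

```python
# Python 3: text that arrives as bytes must be decoded once before use.
raw = b'M\xc3\xa1laga'              # UTF-8 bytes for 'Málaga'
accented_string_u = raw.decode('utf-8')
print(accented_string_u)            # Málaga
# unidecode.unidecode(accented_string_u) would then return 'Malaga' as above.
```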

1 Comment

You're not decoding it "back" to unicode; you're decoding it for the first time within Python.

We still don't know the type of your pandas column, so here are two versions for Python 2:

  • If strings is already a sequence of Unicode strings (type(name) is unicode):

    for name in strings:
        print unidecode.unidecode(name)
  • If the elements of strings are regular Python 2 str (type(name) is str):

    for name in strings:
        print unidecode.unidecode(name.decode("utf-8"))

This will work if your strings are stored in the UTF-8 encoding. Otherwise you'll have to supply the appropriate encoding instead, e.g. "latin-1".
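If you're unsure which encoding applies, here is a quick Python 3 sketch (separate from unidecode itself) of why getting it right matters:

```python
# The same accented name stored under two different encodings.
utf8_bytes = 'Málaga'.encode('utf-8')      # b'M\xc3\xa1laga'
latin1_bytes = 'Málaga'.encode('latin-1')  # b'M\xe1laga'

# Decoding each with its own codec recovers identical text:
print(utf8_bytes.decode('utf-8') == latin1_bytes.decode('latin-1'))  # True

# Decoding UTF-8 bytes as latin-1 silently produces mojibake instead:
print(utf8_bytes.decode('latin-1'))  # MÃ¡laga
```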

In Python 3, the first version should work; you'll have to sort out your encoding issues before you get to this point, i.e. when you first read in your data from disk.
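A minimal Python 3 sketch of that last point: open the file with an explicit encoding, and every value is already a (unicode) str, so no decode step is needed. The file name and the UTF-8 assumption below are just for illustration:

```python
import csv
import os
import tempfile

# Write a small UTF-8 CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'names.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('Málaga\nKøbenhavn\n')

# Decode at read time by passing the encoding to open().
with open(path, encoding='utf-8', newline='') as f:
    strings = [row[0] for row in csv.reader(f)]

print(strings)            # ['Málaga', 'København']
print(type(strings[0]))   # <class 'str'> -- already unicode
```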



Use unicodedata.normalize:

import unicodedata

accented_string = u"Málaga"
unicodedata.normalize("NFKD", accented_string).encode("ascii", "ignore")

There are 4 normalized forms that you can use: "NFC", "NFKC", "NFD", and "NFKD".

Here are the details on using it, as given in the documentation:

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
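To make the last point concrete, a short Python 3 sketch (the "Søren" example is mine, chosen to show where ASCII-stripping loses characters):

```python
import unicodedata

s = 'Málaga'
nfd = unicodedata.normalize('NFD', s)

# NFD splits 'á' into 'a' + a combining accent, so the strings differ
# in length and do not compare equal, even though they render identically.
print(len(s), len(nfd))   # 6 7
print(s == nfd)           # False

# NFKD plus ASCII-with-ignore strips the combining accents:
print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))  # b'Malaga'

# But characters with no decomposition, like 'ø', are dropped entirely:
print(unicodedata.normalize('NFKD', 'Søren').encode('ascii', 'ignore'))  # b'Sren'
```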

3 Comments

This doesn't do the same as the unidecode library, e.g. for letters like "ç" or "ø".
Anyway, I think the OP has trouble decoding input from a file. I don't understand why you propose a different (maybe inferior) approach to "accent stripping". unidecode works really well if all you want is an ASCII representation.
This is an approach that worked for me for a similar issue, so I mentioned it.
