
I'm using the library unidecode to convert accented strings to ASCII-represented strings.

>>> accented_string = u'Málaga'  # accented_string is of type 'unicode'
>>> import unidecode
>>> unidecode.unidecode(accented_string)
'Malaga'

But the problem is that I'm reading the strings from a file. How do I send them to the unidecode library?

for name in strings:
    print unidecode.unidecode(u+name)  # ?????

I can't get my head around it. If I encode it, that just gives me the wrong encoding.

  • How are you reading strings? Commented Jul 30, 2018 at 11:34
  • From a csv file to a pandas data frame, then looping over every string value; the type is 'string' for every value. Commented Jul 30, 2018 at 11:37
  • Please include that code in your question too. Commented Jul 30, 2018 at 11:38
  • Ignore the "u" you see in the example; it's just Python 2 notation to tell you it's unicode. If your strings are not yet unicode, you'll need to know their encoding and convert them from str to unicode. Commented Jul 30, 2018 at 11:38
  • If this is not part of a large, existing program, I strongly recommend you install Python 3 today and start using it. Trying to figure out the Python 2 approach to character encodings in 2018 is an exercise in masochism. Commented Jul 30, 2018 at 11:40

3 Answers


I found a workaround which is quite simple: just decode the string you read to a unicode string, then pass it to the unidecode library.

>>> accented_string = 'Málaga'
>>> accented_string_u = accented_string.decode('utf-8')
>>> import unidecode
>>> unidecode.unidecode(accented_string_u)
'Malaga'
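For reference, the same decode step expressed in Python 3 (a minimal sketch; the byte string below is just the UTF-8 encoding of 'Málaga'):

```python
# Python 3: text that arrives as bytes must be decoded once before use.
raw = b'M\xc3\xa1laga'              # UTF-8 bytes for 'Málaga'
accented_string_u = raw.decode('utf-8')
print(accented_string_u)            # Málaga
# unidecode.unidecode(accented_string_u) would then return 'Malaga' as above.
```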

1 Comment

You're not decoding it "back" to unicode; you're decoding it for the first time within Python.

We still don't know the type of your pandas column, so here are two versions for Python 2:

  • If strings is already a sequence of Unicode strings (type(name) is unicode):

    for name in strings:
        print unidecode.unidecode(name)
  • If the elements of strings are regular Python 2 str (type(name) is str):

    for name in strings:
        print unidecode.unidecode(name.decode("utf-8"))

This will work if your strings are stored in the UTF-8 encoding. Otherwise you'll have to supply the appropriate encoding instead, e.g. "latin-1".
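If you're unsure which encoding applies, here is a quick Python 3 sketch (separate from unidecode itself) of why getting it right matters:

```python
# The same accented name stored under two different encodings.
utf8_bytes = 'Málaga'.encode('utf-8')      # b'M\xc3\xa1laga'
latin1_bytes = 'Málaga'.encode('latin-1')  # b'M\xe1laga'

# Decoding each with its own codec recovers identical text:
print(utf8_bytes.decode('utf-8') == latin1_bytes.decode('latin-1'))  # True

# Decoding UTF-8 bytes as latin-1 silently produces mojibake instead:
print(utf8_bytes.decode('latin-1'))  # MÃ¡laga
```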

In Python 3, the first version should work; you'll have to sort out your encoding issues before you get to this point, i.e. when you first read in your data from disk.
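A minimal Python 3 sketch of that last point: open the file with an explicit encoding, and every value is already a (unicode) str, so no decode step is needed. The file name and the UTF-8 assumption below are just for illustration:

```python
import csv
import os
import tempfile

# Write a small UTF-8 CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'names.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('Málaga\nKøbenhavn\n')

# Decode at read time by passing the encoding to open().
with open(path, encoding='utf-8', newline='') as f:
    strings = [row[0] for row in csv.reader(f)]

print(strings)            # ['Málaga', 'København']
print(type(strings[0]))   # <class 'str'> -- already unicode
```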



Use unicodedata.normalize:

import unicodedata

accented_string = u"Málaga"
unicodedata.normalize("NFKD", accented_string).encode("ascii", "ignore")

There are 4 normalized forms that you can use: "NFC", "NFKC", "NFD", and "NFKD".

Here are the details on using it, as given in the documentation:

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
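To make the last point concrete, a short Python 3 sketch (the "Søren" example is mine, chosen to show where ASCII-stripping loses characters):

```python
import unicodedata

s = 'Málaga'
nfd = unicodedata.normalize('NFD', s)

# NFD splits 'á' into 'a' + a combining accent, so the strings differ
# in length and do not compare equal, even though they render identically.
print(len(s), len(nfd))   # 6 7
print(s == nfd)           # False

# NFKD plus ASCII-with-ignore strips the combining accents:
print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))  # b'Malaga'

# But characters with no decomposition, like 'ø', are dropped entirely:
print(unicodedata.normalize('NFKD', 'Søren').encode('ascii', 'ignore'))  # b'Sren'
```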

3 Comments

This doesn't do the same as the unidecode library, e.g. for letters like "ç" or "ø".
Anyway, I think the OP has trouble decoding input from a file. I don't understand why you propose a different (maybe inferior) approach to "accent stripping". unidecode works really well if all you want is an ASCII representation.
This is an approach that worked for me for a similar issue, so I mentioned it.
