How to replace unicode characters by ascii characters in Python (perl script given)?

Question

I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

    #!/usr/bin/perl -w
    use open qw(:std :utf8);
    while(<>) {
        s/\x{00E4}/ae/;
        s/\x{00F6}/oe/;
        s/\x{00FC}/ue/;
        print;
    }

The script just changes Unicode umlauts to an alternative ASCII spelling. (So the complete output is in ASCII.) I would be grateful for any hints. Thanks!

The given Perl script will actually only substitute the first occurrence on each line, but that's surely an accident.
– tripleee, Commented Dec 15, 2013 at 16:52
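
The difference this comment points out can be sketched in Python with re.sub, whose count argument limits the number of replacements. This is only a minimal sketch, assuming Python 3; the sample string is made up for illustration.

```python
import re

line = "B\u00e4r und B\u00e4rin"  # "Bär und Bärin" (made-up sample)

# Like Perl's s/\x{00E4}/ae/ without /g: replace only the first match.
first_only = re.sub("\u00e4", "ae", line, count=1)

# Like s/\x{00E4}/ae/g: replace every match (count defaults to 0, meaning all).
every = re.sub("\u00e4", "ae", line)

print(first_only)  # Baer und Bärin
print(every)       # Baer und Baerin
```

A plain str.replace call would also replace every occurrence, which is probably what the script was meant to do.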

Ian Bicking · Accepted Answer · 2010-04-23 20:50:33Z

49

For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

    >>> title = u"Klüft skräms inför på fédéral électoral große"
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
    'Kluft skrams infor pa federal electoral groe'
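
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below; the only assumptions are that the input is already a str and that a str (rather than bytes) result is wanted, hence the extra decode step.

```python
import unicodedata

title = "Klüft skräms inför på fédéral électoral große"

# NFKD splits accented letters into a base letter plus a combining mark;
# encoding with errors="ignore" then drops everything that is not ASCII.
decomposed = unicodedata.normalize("NFKD", title)
ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_title)  # Kluft skrams infor pa federal electoral groe
```

Note that ß has no decomposition, so it is dropped entirely ("große" becomes "groe"), and the umlauts lose their dots rather than becoming ae/oe/ue.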


3 Comments

user3850 Over a year ago

which does not at all do what the original .pl does (mainly, properly transliterating German special characters)

user3850 Over a year ago

stripping the dots from German umlauts makes just about as much sense as stripping one leg from "x" and writing "y" or replacing "d" with "b" because they "kinda look the same".

Radio Controlled Over a year ago

No, you might get collisions because you map different strings to the same one.
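
The collision concern can be illustrated with a small sketch, assuming Python 3. Both helper names (strip_marks, expand_umlauts) are hypothetical and the word pairs are only examples.

```python
import unicodedata

def strip_marks(text):
    """Drop diacritics via NFKD, as in the recipe in this answer."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def expand_umlauts(text):
    """Hypothetical helper: German-style ae/oe/ue/ss expansion."""
    table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
    return text.translate(table)

# Stripping the dots maps two different words to the same string.
print(strip_marks("drücken"), strip_marks("drucken"))        # drucken drucken

# Expansion keeps that pair apart ...
print(expand_umlauts("drücken"), expand_umlauts("drucken"))  # druecken drucken

# ... but can still collide with words already spelled with "ue".
print(expand_umlauts("Müller"), expand_umlauts("Mueller"))   # Mueller Mueller
```

Neither mapping is reversible, so the ASCII form should not be relied on as a unique key.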

Dhia · Accepted Answer · 2016-05-04 14:15:12Z

Use the fileinput module to loop over standard input or a list of files,
decode the lines you read from UTF-8 to unicode objects,
then map any Unicode characters you desire with the translate method.

translit.py would look like this:

    #!/usr/bin/env python2.6
    # -*- coding: utf-8 -*-
    import fileinput

    table = {
        0xe4: u'ae',
        ord(u'ö'): u'oe',
        ord(u'ü'): u'ue',
        ord(u'ß'): None,
    }

    for line in fileinput.input():
        s = line.decode('utf8')
        print s.translate(table),

And you could use it like this:

    $ cat utf8.txt
    sömé täßt
    sömé täßt
    sömé täßt
    $ ./translit.py utf8.txt
    soemé taet
    soemé taet
    soemé taet

Update:

If you are using Python 3, strings are Unicode by default, so you don't need to decode them first, even if they contain non-ASCII or non-Latin characters. The solution then looks as follows:

    line = 'Verhältnismäßigkeit, Möglichkeit'
    table = {
        ord('ä'): 'ae',
        ord('ö'): 'oe',
        ord('ü'): 'ue',
        ord('ß'): 'ss',
    }
    line.translate(table)
    >>> 'Verhaeltnismaessigkeit, Moeglichkeit'
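
To get closer to the line-by-line filter in the question, the Python 3 table can be applied to a text stream. The following is a minimal sketch, not a drop-in replacement: the function name transliterate and the constant UMLAUTS are made up here, and the demo uses in-memory streams instead of real files.

```python
import io

# Same mapping as above, built with str.maketrans for brevity.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def transliterate(src, dst):
    """Copy text lines from src to dst, replacing every mapped character."""
    for line in src:
        dst.write(line.translate(UMLAUTS))

# Demo with in-memory streams, reusing the sample text from utf8.txt.
out = io.StringIO()
transliterate(io.StringIO("sömé täßt\nsömé täßt\n"), out)
print(out.getvalue(), end="")
# soemé taesst
# soemé taesst
```

Calling transliterate(sys.stdin, sys.stdout) would turn this into a filter, assuming standard input is decoded as UTF-8; on Python 3.7 or later, sys.stdin.reconfigure(encoding="utf-8") can force that. Unlike the Perl script, translate replaces every occurrence on a line, and characters outside the table, such as é, pass through unchanged.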

And to get ascii output the last line should be print s.translate(table).encode('ascii', 'ignore'), I guess.
The objective appears to be de-umlauting German text, leaving it understandable. The effect of ord(u'ß'): None in this code is to delete the ß ("eszett") character. It should be ord(u'ß'): u'ss'. Upvotes?? Accepted answer???
oh. come. on. i tried to show the different possibilities for the map.
You chose a very bad example of how to do something that the OP didn't indicate that he wanted or needed.
@john: if you would take the OP's question literally together with his comment above ('ignore'), it would have the exact same outcome, so stop nitpicking already.
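
The disagreement in these comments comes down to how translate treats the values in its table. Here is a small sketch, assuming Python 3, with a made-up sample string:

```python
text = "täßt sömé"

# A value of None deletes the character ...
deleting = {ord("ä"): "ae", ord("ö"): "oe", ord("ß"): None}
# ... while a string value replaces it.
replacing = {ord("ä"): "ae", ord("ö"): "oe", ord("ß"): "ss"}

deleted = text.translate(deleting)
replaced = text.translate(replacing)
print(deleted)   # taet soemé
print(replaced)  # taesst soemé

# Characters missing from the table (here "é") pass through unchanged,
# so a final encode with errors="ignore" is what guarantees ASCII output.
ascii_only = replaced.encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # taesst soem
```

With 'ignore', any character not covered by the table is silently dropped, so the table decides what gets transliterated and what gets lost.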

jfs · Accepted Answer · 2014-05-06 19:06:34Z

You could try unidecode to convert Unicode into ascii instead of writing manual regular expressions. It is a Python port of Text::Unidecode Perl module:

    #!/usr/bin/env python
    import fileinput
    import locale
    from contextlib import closing
    from unidecode import unidecode  # $ pip install unidecode

    def toascii(files=None, encoding=None, bufsize=-1):
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
            for line in file:
                print unidecode(line.decode(encoding)),

    if __name__ == "__main__":
        import sys
        toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)

It uses the FileInput class to avoid global state.

Example:

    $ echo 'äöüß' | python toascii.py utf-8
    aouss
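
The script above is Python 2. The FileInput pattern itself carries over to Python 3, roughly as sketched below. The function name convert_files and its convert parameter are hypothetical, and a small translate table stands in for unidecode so that the sketch runs without third-party packages.

```python
import fileinput
import os
import tempfile

def convert_files(files, convert, encoding="utf-8"):
    """Yield convert(line) for each line of the given files.

    A FileInput instance keeps its own state, unlike the module-level
    fileinput.input(), which shares one global instance.
    """
    hook = fileinput.hook_encoded(encoding)
    with fileinput.FileInput(files=files, openhook=hook) as stream:
        for line in stream:
            yield convert(line)

# Stand-in converter; unidecode.unidecode could be passed instead if installed.
table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

# Demo with a temporary UTF-8 file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("äöüß\n")
try:
    result = list(convert_files([tmp.name], lambda s: s.translate(table)))
finally:
    os.remove(tmp.name)

print(result)  # ['aeoeuess\n']
```

Because the hook decodes each file while reading, the explicit line.decode(encoding) step from the Python 2 version is no longer needed.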

Climbs_lika_Spyder · Accepted Answer · 2013-12-15 16:42:14Z

I use translitcodec

    >>> import translitcodec
    >>> print '\xe4'.decode('latin-1')
    ä
    >>> print '\xe4'.decode('latin-1').encode('translit/long').encode('ascii')
    ae
    >>> print '\xe4'.decode('latin-1').encode('translit/short').encode('ascii')
    a

You can change the encoding passed to decode to whatever your input uses. You may want a simple wrapper function to keep each call short:

    def fancy2ascii(s):
        return s.decode('latin-1').encode('translit/long').encode('ascii')

Radio Controlled · Accepted Answer · 2020-03-08 09:37:14Z

Quick and dirty (python2):

    def make_ascii(string):
        return (string.decode('utf-8')
                .replace(u'ü', 'ue')
                .replace(u'ö', 'oe')
                .replace(u'ä', 'ae')
                .replace(u'ß', 'ss')
                .encode('ascii', 'ignore'))
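
A Python 3 take on the same idea might look like this. It is only a sketch and differs from the version above in one labeled way: it takes and returns str instead of UTF-8 encoded bytes.

```python
def make_ascii(text):
    """Python 3 sketch of the replace chain; takes str, returns str."""
    for src, dst in (("ü", "ue"), ("ö", "oe"), ("ä", "ae"), ("ß", "ss")):
        text = text.replace(src, dst)
    # Drop anything still outside ASCII, as the original does.
    return text.encode("ascii", "ignore").decode("ascii")

print(make_ascii("Größe für Käse"))  # Groesse fuer Kaese
```

Uppercase umlauts (Ä, Ö, Ü) are not in the list, so they would be dropped by the final 'ignore' step, just as in the Python 2 version.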
