29

I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

    #!/usr/bin/perl -w
    use open qw(:std :utf8);

    while(<>) {
        s/\x{00E4}/ae/;
        s/\x{00F6}/oe/;
        s/\x{00FC}/ue/;
        print;
    }

The script just replaces Unicode umlauts with their alternative ASCII spellings, so the complete output is ASCII. I would be grateful for any hints. Thanks!

3
  • search SO for "transliteration" to find related questions. Commented Apr 23, 2010 at 18:30
  • stackoverflow.com/questions/816285/… Commented Apr 23, 2010 at 19:30
  • The given Perl script will actually only substitute the first occurrence on each line, but that's surely an accident. Commented Dec 15, 2013 at 16:52
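The point about first-occurrence substitution can be illustrated in Python. This is only a sketch: the function names `first_only` and `every_match` and the sample sentence are made up for the illustration. `re.sub` with `count=1` behaves like Perl's `s/…/…/` without the `/g` flag, while `str.replace` replaces every match.

```python
import re

def first_only(line):
    # Mirrors the Perl script as written: without /g, each
    # substitution touches only the first match on the line.
    line = re.sub("\u00e4", "ae", line, count=1)  # ä
    line = re.sub("\u00f6", "oe", line, count=1)  # ö
    line = re.sub("\u00fc", "ue", line, count=1)  # ü
    return line

def every_match(line):
    # Equivalent of adding /g in Perl: replace all occurrences.
    for src, dst in (("\u00e4", "ae"), ("\u00f6", "oe"), ("\u00fc", "ue")):
        line = line.replace(src, dst)
    return line

sample = "Käse für Bären und Mäuse"
print(first_only(sample))   # Kaese fuer Bären und Mäuse
print(every_match(sample))  # Kaese fuer Baeren und Maeuse
```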

5 Answers

49

For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

    >>> title = u"Klüft skräms inför på fédéral électoral große"
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
    'Kluft skrams infor pa federal electoral groe'
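For Python 3, the same recipe might look like the sketch below. The helper name `strip_accents_to_ascii` is hypothetical. Note that this strips accents rather than transliterating: "ü" becomes "u", not "ue", and "ß" has no decomposition, so it is dropped entirely.

```python
import unicodedata

def strip_accents_to_ascii(text):
    # NFKD splits "ü" into "u" plus a combining diaeresis; encoding
    # with errors="ignore" then drops every non-ASCII code point.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "Klüft skräms inför på fédéral électoral große"
print(strip_accents_to_ascii(title))
# Kluft skrams infor pa federal electoral groe
```

Because different inputs can collapse to the same output (for example "Bär" and "Bar"), this is lossy and may not be what you want for German text.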

3 Comments

This does not do at all what the original .pl does (mainly, properly transliterating German special characters).
Stripping the dots from German umlauts makes just about as much sense as stripping one leg from "x" and writing "y", or replacing "d" with "b" because they "kinda look the same".
No, you might get collisions because you map different strings to the same one.
18
  • Use the fileinput module to loop over standard input or a list of files,
  • decode the lines you read from UTF-8 to unicode objects
  • then map any unicode characters you desire with the translate method

translit.py would look like this:

    #!/usr/bin/env python2.6
    # -*- coding: utf-8 -*-

    import fileinput

    table = {
        0xe4: u'ae',
        ord(u'ö'): u'oe',
        ord(u'ü'): u'ue',
        ord(u'ß'): None,
    }

    for line in fileinput.input():
        s = line.decode('utf8')
        print s.translate(table),

And you could use it like this:

    $ cat utf8.txt
    sömé täßt
    sömé täßt
    sömé täßt
    $ ./translit.py utf8.txt
    soemé taet
    soemé taet
    soemé taet
Update:

If you are using Python 3, strings are Unicode by default, and you don't need to encode them even if they contain non-ASCII or even non-Latin characters. So the solution looks as follows:

    line = 'Verhältnismäßigkeit, Möglichkeit'

    table = {
        ord('ä'): 'ae',
        ord('ö'): 'oe',
        ord('ü'): 'ue',
        ord('ß'): 'ss',
    }

    line.translate(table)

    >>> 'Verhaeltnismaessigkeit, Moeglichkeit'
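To apply the Python 3 mapping line by line, as the Perl script does, something like the following sketch could work. The names `TABLE`, `transliterate`, and the `ascii_only` parameter are assumptions made for this illustration, and `io.StringIO` stands in for a real input file.

```python
import io

# The keys are written as escapes so the table does not depend on the
# source file's encoding: ä, ö, ü, ß.
TABLE = str.maketrans({"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"})

def transliterate(lines, ascii_only=False):
    """Yield each line with German umlauts and ß spelled out."""
    for line in lines:
        line = line.translate(TABLE)
        if ascii_only:
            # Drop whatever is still non-ASCII (for example "é").
            line = line.encode("ascii", "ignore").decode("ascii")
        yield line

# io.StringIO stands in for an open text file or sys.stdin here.
sample = io.StringIO("Verhältnismäßigkeit, Möglichkeit\nsömé täßt\n")
result = list(transliterate(sample, ascii_only=True))
print("".join(result), end="")
# Verhaeltnismaessigkeit, Moeglichkeit
# soem taesst
```

To use it as a filter, you could pass `sys.stdin` instead of the `StringIO` object. How stdin is decoded depends on the locale, so you may need to set the encoding to UTF-8 explicitly.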

8 Comments

And to get ascii output the last line should be print s.translate(table).encode('ascii', 'ignore'), I guess.
The objective appears to be de-umlauting German text, leaving it understandable. The effect of ord(u'ß'): None in this code is to delete the ß ("eszett") character. It should be ord(u'ß'): u'ss'. Upvotes?? Accepted answer???
oh. come. on. i tried to show the different possibilities for the map.
You chose a very bad example of how to do something that the OP didn't indicate that he wanted or needed.
@john: if you would take the OP's question literally together with his comment above ('ignore'), it would have the exact same outcome, so stop nitpicking already.
8

You could try unidecode to convert Unicode into ASCII instead of writing regular expressions by hand. It is a Python port of the Text::Unidecode Perl module:

    #!/usr/bin/env python
    import fileinput
    import locale
    from contextlib import closing
    from unidecode import unidecode  # $ pip install unidecode

    def toascii(files=None, encoding=None, bufsize=-1):
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
            for line in file:
                print unidecode(line.decode(encoding)),

    if __name__ == "__main__":
        import sys
        toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)

It uses the FileInput class to avoid global state.

Example:

    $ echo 'äöüß' | python toascii.py utf-8
    aouss


3

I use translitcodec

    >>> import translitcodec
    >>> print '\xe4'.decode('latin-1')
    ä
    >>> print '\xe4'.decode('latin-1').encode('translit/long').encode('ascii')
    ae
    >>> print '\xe4'.decode('latin-1').encode('translit/short').encode('ascii')
    a

You can change the encoding passed to decode to whatever your input uses. You may want a small helper function to shorten the call:

    def fancy2ascii(s):
        return s.decode('latin-1').encode('translit/long').encode('ascii')


-3

Quick and dirty (Python 2):

    def make_ascii(string):
        return (string.decode('utf-8')
                .replace(u'ü', 'ue')
                .replace(u'ö', 'oe')
                .replace(u'ä', 'ae')
                .replace(u'ß', 'ss')
                .encode('ascii', 'ignore'))
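A rough Python 3 counterpart of this quick-and-dirty approach is sketched below. It assumes the input is already a `str`, so the decode step disappears, and it decodes the result back to `str` at the end. The sample sentence is made up for the illustration.

```python
def make_ascii(text):
    # In Python 3 `text` is already a str, so no decode step is needed.
    for src, dst in (("ü", "ue"), ("ö", "oe"), ("ä", "ae"), ("ß", "ss")):
        text = text.replace(src, dst)
    # Anything still outside ASCII is silently dropped.
    return text.encode("ascii", "ignore").decode("ascii")

print(make_ascii("Größe über Äpfel"))
# Groesse ueber pfel
```

As the output shows, uppercase umlauts are not in the replacement list, so "Ä" is removed by the final `ignore` step instead of being transliterated.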
