How do I detect what language a text is written in using NLTK?
The examples I've seen use nltk.detect, but after installing NLTK on my Mac, I cannot find this package.
Have you come across the following code snippet?
```python
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
```

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
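The snippet above only computes the set of unusual words; the same idea can be turned into a crude English/not-English score. A minimal sketch, with a toy `ENGLISH_VOCAB` standing in for `nltk.corpus.words.words()` so it runs without the corpus download:

```python
# Toy stand-in for nltk.corpus.words.words() (assumption for illustration;
# the real vocabulary has over 200,000 entries).
ENGLISH_VOCAB = {"the", "war", "does", "not", "show", "who", "is",
                 "right", "just", "left", "a", "good", "day"}

def unusual_fraction(text):
    """Fraction of alphabetic tokens not found in the vocabulary."""
    tokens = {w.lower() for w in text.split() if w.isalpha()}
    if not tokens:
        return 0.0
    unusual = tokens.difference(ENGLISH_VOCAB)
    return len(unusual) / len(tokens)

# A low fraction suggests English; a high one suggests another language.
print(unusual_fraction("war does not show who is right"))  # 0.0
print(unusual_fraction("ein zwei drei vier"))              # 1.0
```

Note this is only a vocabulary-coverage heuristic: it can say "probably not English" but cannot tell you which other language the text is in.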
Or the following demo file?
This library is not part of NLTK either, but it certainly helps.
$ sudo pip install langdetect
Supported Python versions 2.6, 2.7, 3.x.
```python
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
```

https://pypi.python.org/pypi/langdetect
P.S.: Don't expect this to always work correctly:

```python
>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
```

detect("You made it home!") is giving me "fr". I'm wondering if there is anything better.

The results can also vary between runs on the same input:

```python
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.8571401485770536, en:0.14285811674731527]
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.8571403121803622, fr:0.14285888197332486]
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.999995562246093]
```

To make the results deterministic, set a fixed seed:

```python
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```

Although this is not in the NLTK, I have had great results with another Python-based library:
https://github.com/saffsd/langid.py
This is very simple to import and includes a large number of languages in its model.
Super late, but you could use the textcat classifier in nltk, here. This paper discusses the algorithm.
It returns a language code in ISO 639-3, so I would use pycountry to get the full name.
For example, load the libraries
```python
import nltk
import pycountry
from nltk.stem import SnowballStemmer
```

Now let's look at two phrases, and guess their language:
```python
phrase_one = "good morning"
phrase_two = "goeie more"

tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)

guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)
```

```
English
Afrikaans
```

You could then pass them into other nltk functions, for example:
```python
stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
```

```
walk
```

Disclaimer: obviously this will not always work, especially for sparse data.
Extreme example
```python
guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
```

```
Konkani (individual language)
```
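The Cavnar & Trenkle algorithm behind TextCat ranks character n-grams by frequency and compares rank orderings with an "out-of-place" distance. A toy sketch of that idea (the one-sentence "corpora" and function names are purely illustrative; real models are trained on large per-language text):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank character n-grams (1..n_max) by frequency, most common first."""
    text = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [g for g, _ in counts.most_common(top)]

def out_of_place(profile, candidate, penalty=300):
    """Sum of rank differences between two profiles; smaller = more similar."""
    ranks = {g: r for r, g in enumerate(candidate)}
    return sum(abs(r - ranks.get(g, penalty)) for r, g in enumerate(profile))

# Tiny illustrative "training" texts (assumption, not real language models)
english = ngram_profile("the quick brown fox jumps over the lazy dog the day is good")
german = ngram_profile("der schnelle braune fuchs springt ueber den faulen hund der tag ist gut")

# Classify an unknown phrase by the closest profile
unknown = ngram_profile("a good day for the dog")
guess = min([("english", english), ("german", german)],
            key=lambda item: out_of_place(unknown, item[1]))[0]
print(guess)
```

With these toy profiles the unknown phrase lands closer to the English profile; the real classifier works the same way, just with many languages and much larger models.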
The langid and langdetect libraries do the trick and are super easy to use: github.com/hb20007/hands-on-nltk-tutorial/blob/master/…

langdetect is not very reliable (see github.com/Mimino666/langdetect/issues/51 for instance), and langid choked on a test Japanese string when I tested it. YMMV. In 2019, if you are not tied to NLTK, I'd recommend you take a look at cld2, cld3 or fastText instead.