How do I detect what language a text is written in using NLTK?
The examples I've seen use nltk.detect, but after installing NLTK on my Mac, I cannot find this package.
Have you come across the following code snippet?
```python
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
```

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
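The snippet above only computes the set of unusual words; the same idea can be turned into a crude English/not-English score. A minimal sketch, with a toy `ENGLISH_VOCAB` standing in for `nltk.corpus.words.words()` so it runs without the corpus download:

```python
# Toy stand-in for nltk.corpus.words.words() (assumption for illustration;
# the real vocabulary has over 200,000 entries).
ENGLISH_VOCAB = {"the", "war", "does", "not", "show", "who", "is",
                 "right", "just", "left", "a", "good", "day"}

def unusual_fraction(text):
    """Fraction of alphabetic tokens not found in the vocabulary."""
    tokens = {w.lower() for w in text.split() if w.isalpha()}
    if not tokens:
        return 0.0
    unusual = tokens.difference(ENGLISH_VOCAB)
    return len(unusual) / len(tokens)

# A low fraction suggests English; a high one suggests another language.
print(unusual_fraction("war does not show who is right"))  # 0.0
print(unusual_fraction("ein zwei drei vier"))              # 1.0
```

Note this is only a vocabulary-coverage heuristic: it can say "probably not English" but cannot tell you which other language the text is in.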
Or the following demo file?
This library is not part of NLTK either, but it certainly helps.
$ sudo pip install langdetect
Supported Python versions 2.6, 2.7, 3.x.
```python
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
```

https://pypi.python.org/pypi/langdetect
P.S.: Don't expect this to always work correctly:

```python
>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
```

detect("You made it home!") is giving me "fr". I'm wondering if there is anything better.

The results can also vary between runs on the same input:

```python
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.8571401485770536, en:0.14285811674731527]
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.8571403121803622, fr:0.14285888197332486]
>>> detect_langs("Hello, I'm christiane amanpour.")
[it:0.999995562246093]
```

To make the results deterministic, set a fixed seed:

```python
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```

Although this is not in the NLTK, I have had great results with another Python-based library:
https://github.com/saffsd/langid.py
This is very simple to import and includes a large number of languages in its model.
Super late, but you could use the textcat classifier in nltk, here. This paper discusses the algorithm.
It returns a language code in ISO 639-3, so I would use pycountry to get the full name.
For example, load the libraries
```python
import nltk
import pycountry
from nltk.stem import SnowballStemmer
```

Now let's look at two phrases, and guess their language:
```python
phrase_one = "good morning"
phrase_two = "goeie more"

tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)

guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)
```

```
English
Afrikaans
```

You could then pass them into other nltk functions, for example:
```python
stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
```

```
walk
```

Disclaimer: obviously this will not always work, especially for sparse data.
Extreme example
```python
guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
```

```
Konkani (individual language)
```
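The Cavnar & Trenkle algorithm behind TextCat ranks character n-grams by frequency and compares rank orderings with an "out-of-place" distance. A toy sketch of that idea (the one-sentence "corpora" and function names are purely illustrative; real models are trained on large per-language text):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank character n-grams (1..n_max) by frequency, most common first."""
    text = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [g for g, _ in counts.most_common(top)]

def out_of_place(profile, candidate, penalty=300):
    """Sum of rank differences between two profiles; smaller = more similar."""
    ranks = {g: r for r, g in enumerate(candidate)}
    return sum(abs(r - ranks.get(g, penalty)) for r, g in enumerate(profile))

# Tiny illustrative "training" texts (assumption, not real language models)
english = ngram_profile("the quick brown fox jumps over the lazy dog the day is good")
german = ngram_profile("der schnelle braune fuchs springt ueber den faulen hund der tag ist gut")

# Classify an unknown phrase by the closest profile
unknown = ngram_profile("a good day for the dog")
guess = min([("english", english), ("german", german)],
            key=lambda item: out_of_place(unknown, item[1]))[0]
print(guess)
```

With these toy profiles the unknown phrase lands closer to the English profile; the real classifier works the same way, just with many languages and much larger models.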
The langid and langdetect libraries do the trick and are super easy to use: github.com/hb20007/hands-on-nltk-tutorial/blob/master/…

langdetect is not very reliable (see github.com/Mimino666/langdetect/issues/51 for instance), and langid choked on a test Japanese string when I tested it. YMMV. In 2019, if you are not tied to NLTK, I'd recommend you take a look at cld2, cld3 or fastText instead.