Preferably, I would use scikit-learn to build a language ID system, as I have done previously; see https://github.com/alvations/bayesline.
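For a rough idea of what that looks like, here is a minimal sketch of a character n-gram pipeline in scikit-learn (the training texts and labels below are placeholders for illustration, not the actual bayesline code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder documents and labels, just to show the shape of the API.
train_texts = [u'पूर्ण प्रत', u'মহিষের সন্']
train_labels = ['hi', 'ba']

# Character n-grams (1-3) are a common feature choice for language ID.
pipeline = Pipeline([('vec', TfidfVectorizer(analyzer='char', ngram_range=(1, 3))),
                     ('clf', MultinomialNB())])
pipeline.fit(train_texts, train_labels)
print pipeline.predict([u'पूर्ण'])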
That being said, it is entirely possible to build a language ID system using the simple classification modules from NLTK and unicode data.
There is no need to modify the NLTK code; the classifiers can be used as they are. (This might be useful as a guide to building a classifier in NLTK: nltk NaiveBayesClassifier training for sentiment analysis.)
To show that NLTK can be used out of the box for language ID with unicode data, see below.
Firstly, for language ID there is a minor but important difference between extracting features from unicode characters and from raw bytecodes:
from nltk.corpus import indian

# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode chars (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
[out]:
hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ

hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్
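Note how the byte slices cut into the middle of characters: each Devanagari, Bengali or Telugu character occupies 3 bytes in UTF-8, so hindi[:10] on the raw byte string yields barely 3 characters. A quick illustrative check (not part of the original output):

raw = 'पूर्ण'                     # a byte string in a utf-8 source file (Python 2)
print len(raw)                   # 15 bytes
print len(raw.decode('utf8'))    # 5 unicode characters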
Now that you have seen the difference between bytecode and unicode, let's train a classifier.
from itertools import chain
from nltk import NaiveBayesClassifier as nbc
from nltk.util import ngrams

# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]

# This is how you can extract character ngrams.
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

# The feature space: every character bigram seen in the training data.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))

# One boolean feature per bigram in the vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag)
               for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत'  # hindi
test2 = u'মহিষের সন্'  # bangla
test3 = u'सनातनवा'     # marathi
test4 = u'ఆడిట్ '       # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
[out]:
[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]

[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]

test sent: पूर्ण प्रत
tag: hi

test sent: মহিষের সন্
tag: ba

test sent: सनातनवा
tag: ma

test sent: ఆడిట్
tag: te
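A small caveat about efficiency: i in ngrams(sentence, 2) rescans the document's whole bigram list for every vocabulary entry. For longer documents it is worth building each document's bigram set once, e.g. (a refactor I am suggesting, not part of the original answer):

# Build each document's bigram set once; membership tests become O(1).
feature_set = []
for sentence, tag in training:
    sent_bigrams = set(ngrams(sentence, 2))
    feature_set.append(({i: (i in sent_bigrams) for i in vocabulary}, tag))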
Here's the full code:
# -*- coding: utf-8 -*-
from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc

# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode chars (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]

# This is how you can extract character ngrams.
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

# The feature space: every character bigram seen in the training data.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))

# One boolean feature per bigram in the vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag)
               for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत'  # hindi
test2 = u'মহিষের সন্'  # bangla
test3 = u'सनातनवा'     # marathi
test4 = u'ఆడిట్ '       # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
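One last caveat: the code above is Python 2. On Python 3 with a recent NLTK, indian.words() already returns unicode strings (so the .decode('utf8') step should be dropped), and ngrams() returns a lazy generator that has to be materialized before printing or repeated membership tests. Roughly, assuming NLTK 3:

# Python 3 / NLTK 3 sketch: strings are already unicode and ngrams() is lazy.
hindi = " ".join(indian.words('hindi.pos'))     # already unicode, no .decode()
print(list(ngrams(hindi[:10], 3)))              # materialize before printing
sent_bigrams = set(ngrams(hindi, 2))            # materialize before membership tests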