
I am not trying to build a whole new Naive Bayes classifier. There are plenty already; for example, scikit-learn has a Naive Bayes implementation and NLTK has its own NaiveBayesClassifier.

I have 1000+ sentences for training and 300+ sentences for the test set in my language (an Indic language). All I need to do is pick one of the classifiers (with Naive Bayes implemented), train it, and test its accuracy.

The problem is that the texts aren't in English; they are in Devanagari Unicode.

I am seeking suggestions on which classifier fits best, since the main issue I am having so far is with Unicode.

  • Did you try any of those classifiers? They will probably work on unicode data. Commented Jul 28, 2014 at 5:02
  • I used this github.com/codebox/bayesian-classifier @BrenBarn, but the Unicode training set wasn't accepted; it resulted in a "no text found" error. Commented Jul 28, 2014 at 5:20
  • did you try the naive bayes in nltk? Commented Jul 28, 2014 at 7:19
  • possible duplicate of NLTK and language detection Commented Jul 29, 2014 at 14:29
  • try adapting this code for language ID, github.com/alvations/bayesline ;) Commented Aug 3, 2014 at 22:11

3 Answers


The Naive Bayes classifiers in scikit-learn operate on numeric vectors, which (for example) we can get from a vectorizer. For text classification I often use TfidfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Among the parameters of the TfidfVectorizer constructor there is this one: encoding : string, ‘utf-8’ by default. If bytes or files are given to analyze, this encoding is used to decode.

You can use this parameter with your encoding, and you can also specify your own preprocessor and analyzer functions (which can also be useful).
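For instance, here is a minimal sketch of how this could look, assuming your sentences are already read in as Unicode strings; the sentences and labels below are made-up placeholders, not data from the question:

# -*- coding: utf-8 -*-
# Sketch: character n-gram TF-IDF features + Multinomial Naive Bayes.
# Replace the placeholder sentences/labels with your own training set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_sentences = [u'पहला वाक्य', u'दूसरा वाक्य', u'another sentence', u'one more sentence']
train_labels = ['class_a', 'class_a', 'class_b', 'class_b']

pipeline = Pipeline([
    # encoding='utf-8' only matters if raw bytes or files are passed in;
    # Unicode strings are used as-is. Character n-grams avoid any
    # language-specific tokenization for Devanagari text.
    ('tfidf', TfidfVectorizer(encoding='utf-8', analyzer='char_wb', ngram_range=(1, 3))),
    ('nb', MultinomialNB()),
])

pipeline.fit(train_sentences, train_labels)
print(pipeline.predict([u'तीसरा वाक्य']))

With the 1000+/300+ split from the question, you could then call pipeline.score(test_sentences, test_labels) to get the accuracy on the held-out sentences.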


Preferably, I would use scikit-learn to build a language ID system, as I have done previously; see https://github.com/alvations/bayesline.

That being said, it is totally possible to build a language ID system using the simple classification modules from NLTK and Unicode data.

There is no need to do anything special to the NLTK code; it can be used as it is. (This might be useful to you for building a classifier in NLTK: nltk NaiveBayesClassifier training for sentiment analysis.)

Now, to show that it's entirely possible to use NLTK out of the box for language ID with Unicode data, see below.

Firstly, for language ID there is a minor difference between using Unicode characters and bytecodes as features in feature extraction:

from nltk.corpus import indian

# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

[out]:

hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ

hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్

Now that you see the difference between using bytecodes and Unicode, let's train a classifier.

from itertools import chain
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]

# This is how you can extract ngrams.
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

# Build a character-bigram vocabulary and boolean feature sets.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत'  # hindi
test2 = u'মহিষের সন্'  # bangla
test3 = u'सनातनवा'  # marathi
test4 = u'ఆడిట్ '  # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print

[out]:

[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]

[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]

test sent: पूर्ण प्रत
tag: hi

test sent: মহিষের সন্
tag: ba

test sent: सनातनवा
tag: ma

test sent: ఆడిట్
tag: te

Here's the full code:

# -*- coding: utf-8 -*-
from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc

# NLTK reads the corpus as bytecodes.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Converts bytecodes to utf8.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out first 10 unicode char (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]

# This is how you can extract ngrams.
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

# Build a character-bigram vocabulary and boolean feature sets.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत'  # hindi
test2 = u'মহিষের সন্'  # bangla
test3 = u'सनातनवा'  # marathi
test4 = u'ఆడిట్ '  # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print

1 Comment

It's exactly what I am looking for; I can take it forward from here.

The question is very poorly formulated, but it may well be about language identification rather than sentence classification.

If this is the case, then there is a long way to go before you apply anything like Naive Bayes or other classifiers. Have a look at the character-gram approach used by Damir Cavar's LID, implemented in Python.
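For illustration only, here is a rough sketch of that character-gram idea (this is not Damir Cavar's LID code, and the languages and training strings below are tiny made-up placeholders): build a relative-frequency profile of character bigrams per language from training text, then score a new sentence against each profile and pick the best match.

# -*- coding: utf-8 -*-
# Sketch of character-gram language identification with toy data.
from collections import Counter

def char_ngrams(text, n=2):
    # Overlapping character n-grams of the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(text, n=2):
    # Relative frequency of each character n-gram in the text.
    counts = Counter(char_ngrams(text, n))
    total = float(sum(counts.values()))
    return {gram: count / total for gram, count in counts.items()}

# Placeholder training data: one (far too small) string per language.
training = {
    'hi': u'पूर्ण प्रतिबंध',
    'bn': u'মহিষের সন্ধান',
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(sentence, n=2):
    # Score each language by how frequent the sentence's n-grams are
    # in that language's profile; return the best-scoring language.
    grams = char_ngrams(sentence, n)
    scores = {lang: sum(p.get(g, 0.0) for g in grams) for lang, p in profiles.items()}
    return max(scores, key=scores.get)

print(identify(u'पूर्ण प्रत'))  # prints 'hi' with this toy data

A real system would use far more training text per language, several n-gram orders, and a better similarity measure, but the skeleton is the same, and the statistical models from NLTK or scikit-learn can be plugged into the same decision step.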

4 Comments

Yep, it's not about sentence classification; it's about language identification with whatever machine learning module is available.
OK, then my suggestion is to read this question: stackoverflow.com/questions/3182268/nltk-and-language-detection
Basically, you will do the identification somewhere else, not in NLTK or scikit-learn. You can plug the various statistical models from these libraries into the decision function of any identification solution, after the Unicode and the character-grams have been dealt with.
Will have a look at it and get back here.
