I'm trying to use NLTK to perform NLP classification on Arabic phrases. If I enter the native words as-is into the classifier, it complains about non-ASCII characters. Currently, I'm doing word.decode('utf-8') and then passing that as input to the trainer.
When I test the classifier, the results make some sense when there is an exact match. However, if I test a substring of one of the original training words, the results look somewhat random.
I just want to determine whether this is a bad classifier, or whether there's something fundamental about the encoding that degrades the performance of the classifier. Is this a reasonable way to feed non-ASCII text to classifiers?
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier

x = "الكتاب".decode('utf-8')
...
train = [
    (x, 'pos'),
]
cl = NaiveBayesClassifier(train)
t = "كتاب".decode('utf-8')
cl.classify(t)
```

The word in t is simply x with the first two letters removed. I'm running this with a much bigger dataset, of course.
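To illustrate what I suspect is happening, here is a minimal sketch (not textblob itself, just a stand-in for a bag-of-words feature extractor like the one NaiveBayesClassifier uses by default): each whole token becomes a feature, so a substring of a training word shares no features with it and the classifier can only fall back on its prior.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a default bag-of-words feature extractor:
# one boolean feature per whole token. Substrings of a token produce
# entirely different feature keys.

def word_features(text):
    # one "contains(word)" feature per whitespace-separated token
    return {u"contains({0})".format(w): True for w in text.split()}

train_word = u"\u0627\u0644\u0643\u062a\u0627\u0628"  # "الكتاب" (the book)
test_word = train_word[2:]                            # "كتاب" (book)

overlap = set(word_features(train_word)) & set(word_features(test_word))
print(overlap)  # empty: the test word matches no training feature
```

If this matches textblob's behavior, the randomness would come from feature mismatch rather than from the UTF-8 decoding itself.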