
I'm trying to use NLTK to perform NLP classification on Arabic phrases. If I feed the native words to the classifier as-is, it complains about non-ASCII characters. Currently I'm calling word.decode('utf-8') and passing the result to the trainer.

When I test the classifier, the results make sense when there is an exact match. However, if I test on a substring of one of the original training words, the results look somewhat random.

I just want to determine whether this is a bad classifier or whether something fundamental about the encoding is degrading its performance. Is this a reasonable way to feed non-ASCII text to classifiers?

#!/usr/bin/python
# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier

x = "الكتاب".decode('utf-8')
...
train = [
    (x, 'pos'),
]
cl = NaiveBayesClassifier(train)
t = "كتاب".decode('utf-8')
cl.classify(t)

The word in t is simply x with its first two letters removed. Of course, I'm running this on a much bigger dataset.

  • Please clarify the following points by editing your answer: (1) Python 2 or 3? (2) Do you read the words from a file or do you enter them as string literals? Please show some code. (3) What kind of substring operations are you talking about? Please also show some code. Commented Mar 30, 2017 at 8:56
  • Is the much bigger dataset also given as string literals in the source code? Or do you read it from a file? Commented Mar 30, 2017 at 9:47

1 Answer


Your post contains essentially two questions: the first concerns encoding, the second concerns predicting on substrings of words seen in training.

For encoding, you should use unicode literals directly, so you can omit the decode() part. Like this:

x = u"الكتاب" 

Then you already have a decoded representation.
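For instance, here is the training snippet from your question rewritten with unicode literals (a minimal sketch, Python 2, using the same one-example toy data):

# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier

# Unicode literals are already decoded, so no .decode('utf-8') is needed.
train = [
    (u"الكتاب", 'pos'),
]
cl = NaiveBayesClassifier(train)
print(cl.classify(u"كتاب"))  # exact matches work; substrings won't (see below)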

Concerning substrings: the classifier won't do that for you. If you ask for a prediction on a token that did not occur in training with exactly the same spelling, it will be treated as an unknown word, regardless of whether it is a substring of a word that occurred in training.

The substring case wouldn't be well-defined anyway: say you look up the single letter Alif – probably a whole lot of the training words contain it. Which one should be used? A random one? The one with the highest probability? The sum of the probabilities of all matching ones? There is no easy answer.
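You can see this by inspecting the features textblob builds. Its default feature extractor (basic_extractor, if I recall it correctly) produces one boolean contains(...) feature per training word, so an unseen substring simply switches every feature off and the prediction falls back to the label priors:

# Features are defined over the *training* vocabulary only.
print(cl.extract_features(u"الكتاب"))  # {u'contains(الكتاب)': True}
print(cl.extract_features(u"كتاب"))    # {u'contains(الكتاب)': False} -- token unseen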

I suspect that you are trying to match morphological variants of the same root. If that is the case, you should try a lemmatiser: before training, and again before prediction, preprocess every token by converting it to its lemma (which in Arabic is usually the root, I believe). NLTK does not ship a full morphological analyser for Arabic, but it does include an Arabic root stemmer that may be a usable approximation (see the sketch below).
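The stemmer in question is nltk.stem.isri.ISRIStemmer. It is a stemmer rather than a true lemmatiser, so it only approximates root extraction, but it requires no external resources. A rough sketch of the preprocessing step:

# -*- coding: utf-8 -*-
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()

def normalize(text):
    # Stem every token towards its root before training/prediction.
    return u" ".join(st.stem(tok) for tok in text.split())

print(normalize(u"الكتاب"))  # with luck, the same root as for u"كتاب"
print(normalize(u"كتاب"))

Apply the same normalisation to the training tuples and to whatever you pass to classify(), so that morphological variants collapse to the same token.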
