
I am having an issue training my Naive Bayes classifier. I have a feature set and targets that I want to use, but I keep getting errors. I've looked at other people with similar problems, but I can't seem to figure out the issue. I'm sure there's a simple solution, but I've yet to find it.

Here's an example of the structure of the data that I'm trying to use to train the classifier.

In [1] >> train[0]
Out[1]
({u'profici': [False],
  u'saver': [False],
  u'four': [True],
  u'protest': [False],
  u'asian': [True],
  u'upsid': [False],
  . . .
  u'captain': [False],
  u'payoff': [False],
  u'whose': [False]},
 0)

Where train[0] is the first tuple in a list and contains:

  • A dictionary of features and boolean values to indicate the presence or absence of words in document[0]

  • The target label for the binary classification of document[0]

Obviously, the rest of the train list has the features and labels for the other documents that I want to classify.

When running the following code

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(train)

I get the error message:

 TypeError: float() argument must be a string or a number 

Edit:

The features are created here, from a dataframe post_sent that contains the posts in column 1 and the sentiment classification in column 2.

import itertools

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist

stopwords = set(stopwords.words('english'))
tokenized = []
filtered_posts = []
punc_tokenizer = RegexpTokenizer(r'\w+')

# tokenizing and removing stopwords
for post in post_sent.post:
    tokenized = [word.lower() for word in punc_tokenizer.tokenize(post)]
    filtered = [w for w in tokenized if w not in stopwords]
    filtered_posts.append(filtered)

# stemming
tokened_stemmed = []
for post in filtered_posts:
    stemmed = []
    for w in post:
        stemmed.append(PorterStemmer().stem_word(w))
    tokened_stemmed.append(stemmed)

# frequency dist
all_words = list(itertools.chain.from_iterable(tokened_stemmed))
frequency = FreqDist(all_words)

# Feature selection
word_features = list(frequency.keys())[:3000]

# IMPORTANT PART #######################
# ------ featuresets creation ---------
def find_features(list_of_posts):
    features = {}
    wrds = set(post)
    for w in word_features:
        features[w] = [w in wrds]
    return features

# zipping inputs with targets
words_and_sent = zip(tokened_stemmed, post_sent.sentiment)

# IMPORTANT PART ##########################
# feature sets created here
featuresets = [(find_features(words), sentiment) for words, sentiment in words_and_sent]
  • The values of your feature dictionaries are all lists with a single value: [False]. Instead, they probably should directly be the boolean values True/False, without being wrapped in a list. Commented Apr 3, 2017 at 22:16
  • Ok, so now I have a different issue: `In [1] >> train[0]` now gives `Out[1] >> ([False, False, True, ... False], 0)`, which gives me the error AttributeError: 'list' object has no attribute 'iteritems'. Commented Apr 3, 2017 at 22:39

2 Answers


Thanks to help from both Vivek & Lenz, who explained the problem to me, I was able to reorganise my training set, and thankfully it now works. Thanks guys!

The problem was very well explained in Vivek's post. This is the code that reorganised the train data into the correct format.

# unwrap the single-element lists so each feature maps to a plain boolean
features_targ = []
for feature in range(0, len(featuresets)):
    dict_test = featuresets[feature]
    values = list(itertools.chain.from_iterable(dict_test[0].values()))
    keys = list(dict_test[0].keys())
    target = dict_test[1]
    dict_ion = {}
    for key in range(0, len(keys)):
        dict_ion[keys[key]] = values[key]
    features_targ.append((dict_ion, target))
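As a quick usage check, here is a minimal sketch that reuses the classifier setup from the question; training on the reorganised list should no longer raise the TypeError:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# train on the (dict-of-booleans, label) tuples produced above
MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(features_targ)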


You are setting up the train data wrong. As @lenz said in a comment, remove the brackets around the feature dict values and use single boolean values instead.

As given in the official documentation:

labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.

But you are setting each mapping (the value of a key in the dict) to a list.

Your correct train should look like:

[({u'profici': False,
   u'saver': False,
   u'four': True,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   . .
  }, 0),
 ..
 ..
 ({u'profici': True,
   u'saver': False,
   u'four': False,
   u'protest': False,
   u'asian': True,
   u'upsid': False,
   . .
  }, 1)]

You can take a look at more examples here: http://www.nltk.org/howto/classify.html
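If it helps, one way to produce the featuresets in this shape from the start is to return plain booleans from find_features. This is a minimal sketch that reuses the word_features, tokened_stemmed and post_sent names from the question's edit; note it also passes the function's own argument to set() rather than the outer loop variable post:

def find_features(list_of_posts):
    features = {}
    wrds = set(list_of_posts)      # use the argument, not the loop variable
    for w in word_features:
        features[w] = w in wrds    # plain boolean, not wrapped in a list
    return features

featuresets = [(find_features(words), sentiment)
               for words, sentiment in zip(tokened_stemmed, post_sent.sentiment)]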

4 Comments

Thank you Vivek, that's a lot clearer. I'm not particularly good with dictionary data types. Do you have any suggestions, code-wise, as to how I can transform it from what I have now to what I need it to be? Cheers
@DiarmaidFinnerty How did you create the feature sets in the first place? Update your question (or post a new one) to include the code that produces your labeled featureset; then it's going to be straightforward to show you how to fix it.
Hi lenz, I've added the code that creates the featuresets
Hi Vivek, thanks for the help. If you feel like it, you can edit your post to include the code that made it run correctly (posted as a separate answer). Appreciate all the help!
