I am having difficulty improving the results of a Naive Bayes classifier. My dataset has 39 columns (some categorical, some numerical). However, I only used the main variable, Text, which contains all the spam and ham messages.
Since this is spam filtering, I think this field should be informative. I removed stopwords and then applied CountVectorizer with fit_transform to it.
I am getting about 60% accuracy, which is very low. What might cause this low result? Is there anything I can do to improve it?
These are the columns out of 39 that I am considering:
Index(['Date', 'Username', 'Subject', 'Target', 'Country', 'Website','Text', 'Capital', 'Punctuation'], dtype='object')

Date is in date format (e.g. 2018-02-06)
Username is a string (e.g. Math)
Subject is a string (e.g. I need your help)
Target is a binary variable (1 = spam, 0 = not spam)
Country is a string (e.g. US)
Website is a string (e.g. www.viagra.com)
Text is the corpus of the email and it is a string (e.g. I need your HELP!!)
Capital is a string (e.g. HELP)
Punctuation is a string (e.g. !!)
What I have done is the following:
removing stopwords in Text:
def clean_text(text):
    lim_pun = [char for char in string.punctuation if char in "&#^_"]
    nopunc = [char for char in text if char not in lim_pun]
    nopunc = ''.join(nopunc)
    other_stop=['•','...in','...the','...you\'ve','–','—','-','⋆','...','C.','c','|',
                '...The','...The','...When','...A','C','+','1','2','3','4','5','6','7','8','9','10',
                '2016', 'speak','also', 'seen','[5].', 'using', 'get', 'instead', "that's",
                '......','may', 'e', '...it', 'puts', '...over', '[✯]','happens', "they're",'hwo',
                '...a', 'called', '50s','c;', '20', 'per', 'however,','it,', 'yet', 'one',
                'bs,', 'ms,', 'sr.', '...taking', 'may', '...of', 'course,', 'get', 'likely', 'no,']
    ext_stopwords=stopwords.words('english')+other_stop
    clean_words = [word for word in nopunc.split() if word.lower() not in ext_stopwords]
    return clean_words
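To illustrate what this function returns, here is a minimal self-contained sketch. It is not the exact function above: `STAND_IN_STOPWORDS` is a tiny hypothetical stop list used in place of the NLTK list plus `other_stop`, so that the example runs without NLTK data. The filtering logic is meant to be the same.

```python
import string

# Hypothetical stand-in for stopwords.words('english') + other_stop,
# kept tiny so the example runs without downloading NLTK data.
STAND_IN_STOPWORDS = {"i", "your", "the", "...the"}

def clean_text_demo(text):
    # Only the characters & # ^ _ are stripped; all other punctuation stays.
    lim_pun = [char for char in string.punctuation if char in "&#^_"]
    nopunc = "".join(char for char in text if char not in lim_pun)
    # Tokens are split on whitespace and compared in lowercase.
    return [word for word in nopunc.split()
            if word.lower() not in STAND_IN_STOPWORDS]

tokens = clean_text_demo("i need your help!! #free_offer")
print(tokens)
```

Note that `help!!` keeps its exclamation marks, because only `&`, `#`, `^` and `_` are removed before the text is split on whitespace.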
Then applying these changes to my dataset:
from sklearn.feature_extraction.text import CountVectorizer
import string
from nltk.corpus import stopwords

df=df.dropna(subset=['Subject', 'Text'])
df['Corpus']=df['Subject']+df['Text']
mex = CountVectorizer(analyzer=clean_text).fit_transform(df['Corpus'].str.lower())

Then I split my dataset into train and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(mex, df['Target'], test_size = 0.80, random_state = 0)

df includes 1110 emails, 322 of which are spam.
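As a sanity check on the split, here is a small sketch of the sizes implied by test_size = 0.80. It assumes that all 1110 rows survive the dropna step, and `split_sizes` is a hypothetical helper: it only does the arithmetic and does not call scikit-learn. Rounding the test count up reflects my understanding of how train_test_split treats a fractional test_size.

```python
def split_sizes(n_rows, test_percent):
    # Integer arithmetic avoids float rounding surprises.
    n_test = (n_rows * test_percent + 99) // 100  # test count, rounded up
    n_train = n_rows - n_test                     # the rest is used for training
    return n_train, n_test

# Assumption: 1110 rows remain after dropna, 322 of them spam.
n_train, n_test = split_sizes(1110, 80)
spam_share = 322 / 1110

print(n_train, n_test)
print(round(spam_share, 2))
```

If the row count is indeed 1110, test_size = 0.80 means the classifier is fit on roughly 222 emails and evaluated on roughly 888.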
Then I consider my classifier:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print(classifier.predict(X_train))
print(y_train.values)

# Train data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_train)
print(classification_report(y_train, pred))
print('Confusion Matrix: \n', confusion_matrix(y_train, pred))
print()
print("MNB Accuracy Score -> ", accuracy_score(y_train, pred)*100)

print('Predicted value: ', classifier.predict(X_test))
print('Actual value: ', y_test.values)

and evaluate the model on the test set:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test, pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print("MNB Accuracy Score -> ", accuracy_score(y_test, pred)*100)

I get approximately 60%, which is not good at all. Output:
              precision    recall  f1-score   support

         0.0       0.77      0.34      0.47       192
         1.0       0.53      0.88      0.66       164

    accuracy                           0.59       356
   macro avg       0.65      0.61      0.57       356
weighted avg       0.66      0.59      0.56       356

Confusion Matrix:
 [[ 66 126]
 [ 20 144]]

I do not know whether the problem is the stopwords or the fact that I am using only the Text (or Corpus) column. It would also be good to include capital letters and punctuation as variables in the model.
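One way I could imagine turning Capital and Punctuation into numeric variables is sketched below. This is only an illustration under assumptions: the helper name `extra_features` and the feature names are hypothetical, and I assume Capital holds whitespace-separated uppercase words and Punctuation holds the raw punctuation characters, as in the examples in the column list.

```python
def extra_features(capital, punctuation):
    # Non-string values (e.g. None or NaN) are treated as empty strings.
    capital = capital if isinstance(capital, str) else ""
    punctuation = punctuation if isinstance(punctuation, str) else ""
    return {
        "n_capital_words": len(capital.split()),  # e.g. "HELP" -> 1
        "n_punctuation": len(punctuation),        # e.g. "!!" -> 2
        "n_exclamation": punctuation.count("!"),
    }

row = extra_features("HELP", "!!")
print(row)
```

Counts like these are non-negative, so they could be stacked next to the CountVectorizer matrix (for example with scipy.sparse.hstack) before fitting MultinomialNB.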