I have to predict the category under which an ad was posted using the provided data, but I cannot get my model's accuracy above 74%. I am not sure what I am missing.
What I have done so far:
- Cleaned the text using re & nltk
- Applied a stemmer (PorterStemmer)
- Vectorized with CountVectorizer & TfidfTransformer
- Used MultinomialNB, LinearSVC & RandomForestClassifier
Here is my code:

    import json
    import re

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC, SVC

    x_train = []
    y_train = []
    with open("training-2.json", "r", encoding="utf-8") as file:
        file.readline()  # read and discard the first line of the file
        for line in file:
            data = json.loads(line)
            # Combine city, section and heading into one text feature
            joined_data = data["city"] + " " + data["section"] + " " + data["heading"]
            x_train.append(joined_data)
            y_train.append(data["category"])

    nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))  # build the set once, not per word
    ps = PorterStemmer()

    corpus = []
    for text in x_train:
        feature = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
        feature = feature.lower().split()
        feature = [ps.stem(word) for word in feature if word not in stop_words]
        corpus.append(" ".join(feature))

    text_clf = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", LinearSVC()),
    ])
    text_clf.fit(corpus, y_train)

After all of the above steps, the best accuracy I get is 74%, even though I have tried different models in the pipeline.
Sample data:
    {"city":"newyork","category":"cell-phones","section":"for-sale","heading":"New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl"}
    {"city":"newyork","category":"cell-phones","section":"for-sale","heading":"******* Brand New Original SAMSUNG GALAXY NOTE 2 BATTERY ******"}
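
To make the effect of the regex cleaning concrete, here is a minimal sketch that applies only the letters-only substitution and lowercasing to the two sample headings above. It uses just the standard library and skips stopword removal and stemming. `clean` is a hypothetical helper name for illustration, not part of my actual script.

```python
import re

# Sample headings copied from the data above.
headings = [
    "New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl",
    "******* Brand New Original SAMSUNG GALAXY NOTE 2 BATTERY ******",
]


def clean(text):
    """Hypothetical helper: keep ASCII letters only, lowercase, collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())


cleaned = [clean(h) for h in headings]
for original, result in zip(headings, cleaned):
    print(original, "->", result)
```

With this cleaning, the model numbers (7100, 7130, 8700) and the "2" in "NOTE 2" are removed entirely, and "C-S2" is reduced to the single-letter tokens "c" and "s".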