Need help to increase classification accuracy for classified ads posting

Question

I have to predict the category under which an ad was posted from the provided data, but I cannot get my model's accuracy above 74%. I am not sure what I am missing.

What I have done so far:

Cleaned the text using re & nltk
Used stemmer
Used CountVectorizer & TfidfTransformer
Used MultinomialNB, LinearSVC & RandomForestClassifier

Following is my code :

    import json
    import re

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC, LinearSVC

    nltk.download('stopwords')

    x_train = []
    y_train = []
    with open("training-2.json", "r", encoding="utf-8") as file:
        file.readline()  # read and discard the first line; only later lines are parsed
        for line in file:
            data = json.loads(line)
            joined_data = data["city"] + " " + data["section"] + " " + data["heading"]
            x_train.append(joined_data)
            y_train.append(data["category"])

    # Build the stemmer and the stop-word set once, not once per word
    ps = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    corpus = []
    for text in x_train:
        feature = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
        feature = feature.lower().split()
        feature = [ps.stem(word) for word in feature if word not in stop_words]
        corpus.append(" ".join(feature))

    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LinearSVC()),
    ])
    text_clf.fit(corpus, y_train)

After all of the steps above, the best accuracy I get is 74%, even though I have tried different models in the pipeline.

Sample Data :

{"city":"newyork","category":"cell-phones","section":"for-sale","heading":"New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl"}
{"city":"newyork","category":"cell-phones","section":"for-sale","heading":"******* Brand New Original SAMSUNG GALAXY NOTE 2 BATTERY ******"}

spectre · Accepted Answer · 2021-09-30 12:21:10Z

You can't conclude that performance won't improve after checking just 3 models. There are many more models you can try on your dataset to find the best-performing one.

The data cleaning can also be done with different libraries (depending on the data). I don't know what your dataset looks like, but I am sure you can try many more techniques than just CountVectorizer and TF-IDF.
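As a rough sketch of what trying more models could look like, the snippet below scores a few scikit-learn classifiers with cross-validation on the same TF-IDF features. The toy `texts` and `labels`, the choice of candidates, and the helper name `compare_models` are assumptions made for illustration; they are not taken from the question's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the ad headings and their categories.
texts = [
    "new battery for blackberry curve",
    "samsung galaxy note battery original",
    "iphone charger and case for sale",
    "unlocked android phone brand new",
    "blackberry pearl phone with charger",
    "samsung phone case and battery",
    "wooden dining table with four chairs",
    "leather sofa in good condition",
    "oak bookshelf and desk for sale",
    "queen bed frame with mattress",
    "dining chairs and wooden table",
    "sofa bed and leather chair",
]
labels = ["cell-phones"] * 6 + ["furniture"] * 6

# Candidate classifiers (an illustrative selection, not an exhaustive list).
candidates = {
    "linear_svc": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "complement_nb": ComplementNB(),
}


def compare_models(texts, labels, candidates, cv=3):
    """Return the mean cross-validated accuracy of each candidate classifier."""
    scores = {}
    for name, clf in candidates.items():
        # The vectorizer is inside the pipeline, so it is refit on every fold.
        pipe = make_pipeline(TfidfVectorizer(), clf)
        fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


scores = compare_models(texts, labels, candidates)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Keeping the vectorizer inside the pipeline means each fold is scored on text it has not seen during fitting, which gives a fairer comparison than vectorizing the whole corpus up front.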


I've given a data sample.
– Omair, Commented Oct 3, 2021 at 6:54

scikit-learn.org/stable/supervised_learning.html, scikit-learn.org/stable/tutorial/machine_learning_map/… Here are some links where you can find an extensive list of models to try. Try whatever models you can and see if results improve.
– spectre, Commented Oct 3, 2021 at 11:52

Also, from the code you have provided, you are not doing hyperparameter tuning, which might be the biggest cause of performance not improving. Try different models, perform hyperparameter tuning on all of them, and then pick the best ones.
– spectre, Commented Oct 3, 2021 at 11:53
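A minimal sketch of hyperparameter tuning on a text pipeline, assuming scikit-learn's `GridSearchCV` is an acceptable way to do it. The toy data and the parameter grid are illustrative assumptions; `TfidfVectorizer` is used here as the single-step equivalent of `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the ad headings and their categories.
texts = [
    "new battery for blackberry curve",
    "samsung galaxy note battery original",
    "iphone charger and case for sale",
    "unlocked android phone brand new",
    "blackberry pearl phone with charger",
    "samsung phone case and battery",
    "wooden dining table with four chairs",
    "leather sofa in good condition",
    "oak bookshelf and desk for sale",
    "queen bed frame with mattress",
    "dining chairs and wooden table",
    "sofa bed and leather chair",
]
labels = ["cell-phones"] * 6 + ["furniture"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Keys follow the "<step name>__<parameter>" convention of scikit-learn pipelines.
# The values below are illustrative starting points, not recommended settings.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Because the search wraps the whole pipeline, vectorizer settings and classifier settings are tuned together rather than one after the other.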

JordiCarrera · Accepted Answer · 2021-10-06 07:40:53Z

Here are a few things I'd look into:

1. Are the categories balanced in training-2.json? Class imbalance is a well-known issue in ML development, particularly whenever the class distribution of the training set does not match the distribution of the test set.
2. More interestingly, even if the classes are balanced (which, again, is a strong assumption that I recommend verifying), the input texts might not be: given that you're concatenating city, section, and heading, you may face issues if some city or section has many more datapoints than the rest, as the model may incorrectly correlate particular sections or cities with ad categories, and then unseen ads for those sections or cities would be misclassified in bulk. Are you sure you want the model to consider city and section? Should the category of the ad depend on the city? For instance, should the same ad go to "Sports" in San Francisco but to "Politics" in Atlanta? As far as I can tell, this is not the case (an ad about politics will always be about politics, regardless of the city where it is served), so adding that to the input is likely to be a confounder for the model. I'd recommend using only the heading for this task, given the available information.
3. Apply the data clean-up basics, like removing duplicates and near-duplicates.
4. More generally, do some EDA (exploratory data analysis) to detect potential issues. Related to point 2, sometimes there are well-defined clusters of documents that may bias the model towards specific latent sub-categories. For instance, if a category like Sports has 100 ads belonging to 3 clusters with 80, 15 and 5 documents, respectively, you can be quite certain that the first cluster will dominate the classification. That means you're not really training a classifier for "Sports", but rather for the first cluster, and ads resembling that cluster, regardless of their actual category, will be assigned to Sports, which can be another important source of noise. Again, I'd recommend balancing your dataset as much as possible over the full range of legitimate variance exhibited by your target domain.
5. Are there any issues with feature covariance or low frequency? Maybe you need to apply regularization to avoid overfitting?
6. What happens if you disable the stemmer? Stemmers are a rather crude form of preprocessing and they often introduce more errors than correct stemmings. If your dataset is big enough, I wouldn't use one. Consider using character-level features in the vectorizers instead (it's a parameter that you can set explicitly, and it is a much better way of accounting for morphological variants).
7. Using CountVectorizer and TfidfVectorizer with short texts like these is tricky because their output becomes a bit meaningless. Short titles tend to contain a single occurrence of most of their words (at least, of content words), which means that CountVectorizer has essentially no relevant input it can take advantage of (it will return a [0, 0, 0, 1, ...] vector for most datapoints, basically a dictionary encoding applying the identity function). TfidfVectorizer is also missing the TF term of the TF-IDF equation for the same reason, which basically ends up giving you an inverse probability matrix that penalizes all relevant category-defining words. So, I would probably either 1) use dictionary-based encoding or 2) fit the vectorizer over a modified X_train object to which I have added a document for every category, where each of those documents contains the concatenated heading text of all the ads in that category (remember you can only fit the vectorizer like this at training time, but nothing prevents you from using it to transform test inputs once it has been fitted). In this way, the TF term will be significant again and it will be boosted according to the terms' appropriate strength in each category (in the Sports category, "see the game" will be frequent terms), and the IDF term will now be fairer (terms that are frequent across all categories are probably unrelated to any particular one of them).
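One possible reading of option 2 in the last point can be sketched as follows. The toy `headings` and `categories`, the helper name `category_documents`, and the choice of `LinearSVC` are assumptions for illustration; the linked notebook may implement the idea differently. In scikit-learn, `fit` learns the vocabulary and the IDF weights, so the extra per-category documents influence those weights, while `transform` still encodes each ad on its own.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: ad headings only (no city or section) and their categories.
headings = [
    "new battery for blackberry curve",
    "samsung galaxy note battery original",
    "unlocked android phone brand new",
    "samsung phone case and battery",
    "wooden dining table with four chairs",
    "leather sofa in good condition",
    "queen bed frame with mattress",
    "sofa bed and leather chair",
]
categories = ["cell-phones"] * 4 + ["furniture"] * 4


def category_documents(headings, categories):
    """Build one long document per category by concatenating its headings."""
    grouped = defaultdict(list)
    for heading, category in zip(headings, categories):
        grouped[category].append(heading)
    return [" ".join(texts) for texts in grouped.values()]


# Fit on the original ads plus one concatenated document per category.
# This is done at training time only, because it needs the labels.
extra_docs = category_documents(headings, categories)
vectorizer = TfidfVectorizer()
vectorizer.fit(headings + extra_docs)

# Individual ads (training or unseen) are transformed with the fitted vectorizer.
X_train = vectorizer.transform(headings)
clf = LinearSVC().fit(X_train, categories)

new_ads = ["replacement battery for samsung phone"]
pred = clf.predict(vectorizer.transform(new_ads))
print(pred)
```

A variant worth comparing is fitting the vectorizer on `extra_docs` alone, in which case the IDF weights reflect how many categories a term appears in rather than how many ads.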

Thank you for your great explanation. I just don't understand the last point clearly; could you explain it again? I will try the described steps and get back to you. By the way, this dataset is from HackerRank.
– Omair, Commented Oct 7, 2021 at 11:50

@Omair Sorry for the delay (local holiday!). I think my last point is easier to understand with an example, so I've prepared a Jupyter notebook with a detailed explanation: github.com/JordiCarreraVentura/language_science/blob/main/…
– JordiCarrera, Commented Oct 12, 2021 at 13:31
