
I have a data set consisting of census data (age, sex, employment type, race, education level etc.). My task is to write an algorithm that predicts whether a data point (30, male, white etc.) will have a gross annual income of above $50k.

So far I implemented a KNN algorithm that runs for 30 hours but achieves ~90% accuracy on the test data. I was hoping to achieve higher accuracy using an SVM, Naive Bayes, or anything else that might work here.

I'm looking for an algorithm that will be relatively simple to implement (about as hard as KNN) in Python and is likely to achieve good accuracy. What is the best choice in this case? If KNN is the best choice, which algorithm would be easiest to implement for comparison purposes?

  • If the prediction is continuous, why not try regression? Commented May 1, 2019 at 4:27
  • I only need to predict whether it's above or below 50k, rather than giving an estimate, so it can be treated as a classification problem. Also, I have the income for the test data as "<=50k" or ">50k". Commented May 1, 2019 at 4:38
  • You could always try a simple NN with Keras. Here's a quick starter post. Commented May 1, 2019 at 5:08
  • Try Random Forest and ensemble methods. Commented May 1, 2019 at 5:09

1 Answer


It is hard to tell a priori which algorithm will perform better. For traditional classification tasks such as yours, random forests, gradient boosted machines, and SVMs usually give the best results.

I'm not sure what you mean by an algorithm that is "relatively simple to implement", but if you use scikit-learn, a lot of algorithms are already implemented, and fitting one takes a line or two of code, so you can try them all!
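For example, here's a rough sketch of comparing several scikit-learn classifiers. With your census data you would load the CSV and one-hot encode the categorical columns first; a synthetic dataset stands in here so the snippet is self-contained:

```python
# Sketch: comparing several scikit-learn classifiers on a binary task.
# A synthetic dataset stands in for the census data (which would need
# its categorical columns encoded, e.g. with OneHotEncoder).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)        # one line to train...
    acc = model.score(X_test, y_test)  # ...and one to evaluate accuracy
    print(f"{name}: {acc:.3f}")
```

All of these should train in minutes rather than hours on a census-sized dataset, which also makes it cheap to compare them against your KNN baseline.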
