Is it possible to implement a classifier according to quarters? What about missing data?

Question

I have a dataset with several individuals and features. I'm studying behavior over the year (for instance, averages or iterations of money gained, jobs, etc.).

My ultimate goal is to implement a classifier, since I have a specific feature for every person (which is equal to 0, 1, or 2). When I first tried to implement an SVM, I ended up with bad results because I did not have enough data / features: I have too many class 1 individuals and not enough 0s and 2s, so my classifier almost always put people into category 1. Therefore, I tried to increase my number of rows by separating my data into quarters (i.e. JAN, FEB, MAR, then APR, MAY, JUN, then JUL, AUG, SEP, and finally OCT, NOV, DEC).

I was wondering two things:

Would that be a good idea? Do I have to be cautious of a particular hypothesis that could impact my results?
In case it is a good idea: I have data available for some quarters of the year, but sometimes it is missing (for instance, let's imagine I don't have "Age" available for my last quarter). Do I have to drop the feature? Or would it be wiser to abandon the last quarter? Or is it possible to make the classifier work despite that lack of information, without actually deleting anything?

Neil Slater · Accepted Answer · 2017-09-25 13:59:06Z

Would that be a good idea?

That is hard to tell from your description. It is not an immediately bad idea. If it results in a better classifier (according to cross-validation), then it has probably worked.

The main things that would concern me about splitting behaviour data by quarter and treating the resulting rows as independent are:

Your data samples will very likely be correlated when they share a person. You can work around this by careful splitting between training and cross-validation / test sets. Do not make a fully random split, but split by person: any individual's records should appear in only one of the training, cross-validation or test sets (assuming your goal is to take similar data in production from users who are not in your current database, and predict their class).
There could be seasonal variation in the records that reduces the effectiveness of the split. So a "type 1" person's records in APR-JUN might look like a "type 0" person's records from JAN-MAR.
How will you receive data in production, when you want to classify new users? If you only want to work on single-quarter data, then your new classifier is fine. If you have more data, you have to deal with your classifier maybe predicting a different target value for the same person depending on the quarter. You could combine these predictions in some way, but if you do so, you should also do this in test to see what the impact is, which may be counter-productive (you end up with the same number of test examples as if you had not done the split). It might also be OK; perhaps it will add some regularisation.
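
One simple way to combine per-quarter predictions into a single prediction per person is a majority vote. The following is only a minimal sketch: the function name `combine_by_person`, the person IDs and the predicted classes are hypothetical, and majority voting is one possible combination rule, not something the approach requires.

```python
from collections import Counter, defaultdict


def combine_by_person(person_ids, quarter_predictions):
    """Majority vote over each person's per-quarter predicted classes."""
    votes = defaultdict(list)
    for person, predicted in zip(person_ids, quarter_predictions):
        votes[person].append(predicted)
    # most_common(1) returns the most frequent class; on a tie, the class
    # that was seen first (the earliest quarter) is returned.
    return {
        person: Counter(preds).most_common(1)[0][0]
        for person, preds in votes.items()
    }


# Hypothetical example: one predicted class per (person, quarter) row.
person_ids = ["a", "a", "a", "a", "b", "b", "b"]
quarter_predictions = [1, 1, 0, 1, 2, 2, 1]
combined = combine_by_person(person_ids, quarter_predictions)
```

If you evaluate on the combined predictions, the number of test examples equals the number of people in the test set, not the number of quarterly rows.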

Do I have to be cautious of a particular hypothesis that could impact my results?

You have to be very cautious about testing your classifier, because you could get data leakage from the cross-validation and test sets to the training set, which would make you think the classifier is generalising well when in fact it is not. The fix for this is described above: split by person when deciding the train/cv/test split.
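
A split by person can be sketched in plain Python as follows. This is an illustration under assumptions, not a definitive implementation: the helper name `split_by_person`, the 60/20/20 fractions and the person IDs are all made up for the example, and it does not stratify by class, which you may want given the class imbalance.

```python
import random


def split_by_person(person_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign row indices to train/cv/test so each person is in exactly one set."""
    people = sorted(set(person_ids))
    random.Random(seed).shuffle(people)  # shuffle people, not rows
    n_train = round(fractions[0] * len(people))
    n_cv = round(fractions[1] * len(people))
    set_of = {}
    for i, person in enumerate(people):
        if i < n_train:
            set_of[person] = "train"
        elif i < n_train + n_cv:
            set_of[person] = "cv"
        else:
            set_of[person] = "test"
    split = {"train": [], "cv": [], "test": []}
    # Every row follows its person, so quarters of one person never get separated.
    for row_index, person in enumerate(person_ids):
        split[set_of[person]].append(row_index)
    return split


# Ten hypothetical people with four quarterly rows each.
person_ids = [f"p{i}" for i in range(10) for _ in range(4)]
split = split_by_person(person_ids)
```

If you use scikit-learn, its group-aware splitters such as `GroupKFold` and `GroupShuffleSplit` serve the same purpose when you pass the person ID as the group.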

I have some data available for some quarters of the year but sometimes it is missing (for instance let's imagine I don't have "Age" available for my last quarter) ; do I have to drop the feature ?

Handling missing data is a complicated topic in its own right, and there are lots of options. You can start with:

If data is missing at random (i.e. there is no reason to suspect that the missingness is related to the target variable, or that it only affects certain types of record), you can substitute the mean value of that feature from the training set, or impute it based on a statistical model from the other features.
If data is missing for reasons that might impact the target variable, then you should give that information to the classifier, because it might be an important feature in its own right. You can take the mean or a more complex imputed value as before for the original feature, but you should also add a new boolean feature "feature X was missing".
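
Both options can be sketched together: fill the gap with the training-set mean and keep a flag recording that the value was missing. This is a minimal sketch with hypothetical names (`fit_mean`, `impute_with_indicator`) and made-up "Age" values, where `None` marks a missing entry.

```python
def fit_mean(train_values):
    """Mean of the observed (non-missing) training values only."""
    observed = [v for v in train_values if v is not None]
    return sum(observed) / len(observed)


def impute_with_indicator(values, fill_value):
    """Return (value, was_missing) pairs, filling gaps with fill_value."""
    return [(fill_value, 1) if v is None else (v, 0) for v in values]


# Hypothetical "Age" column; None marks a quarter where it was not recorded.
train_age = [30, None, 50, 40]
test_age = [None, 25]

# The mean comes from the training set only, then is reused for the test set,
# so no information from the test set leaks into the features.
mean_age = fit_mean(train_age)
train_features = impute_with_indicator(train_age, mean_age)
test_features = impute_with_indicator(test_age, mean_age)
```

If you use scikit-learn, `SimpleImputer(strategy="mean", add_indicator=True)` gives comparable behaviour.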

Whether or not you should use the partial data or drop it is not possible to say in general. If you are not sure, then try both and pick the version with the best cross-validation result.

Thanks a lot for this very complete response. I especially like the idea of adding a boolean feature for a missing feature! I still have a question though: if I split my individuals between train/cv/test sets, since I have 4 quarters (so 4 rows which correspond to the same person but at a different quarter), I can only use 3 of these rows. Since my objective was to get more data, wouldn't it be a pity not to use it? (Although I understand that having the same person in more than one of the train/cv/test sets would most likely affect my classifier.)
– MBB, Commented Sep 25, 2017 at 15:33
@MBB: I am not sure I understand your question. If some people only have data for 3 quarters, it would not prevent you trying the idea, as your new classifier will work with per-quarter data. You just have to put all 3 of those records for that person into one of train/cv/test; don't spread them across the different sets.
– Neil Slater, Commented Sep 25, 2017 at 19:01
My bad, I misunderstood one of your points! Thank you again for the clarification and the explanation; everything is clear now!
– MBB, Commented Sep 27, 2017 at 16:12
