I have a dataset with several individuals and features. I'm studying behavior over the year (for instance, averages or iterations of money gained, jobs, etc.).
My ultimate goal is to build a classifier, since every person has a specific label (equal to 0, 1, or 2). When I first tried an SVM, I got bad results because I did not have enough data / features: there are too many individuals in class 1 and not enough in classes 0 and 2, so my classifier almost always put people into class 1. To increase my number of rows, I therefore tried splitting my data into quarters (i.e., JAN, FEB, MAR; then APR, MAY, JUN; then JUL, AUG, SEP; and finally OCT, NOV, DEC).
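To make the split concrete, here is a minimal sketch of what I mean. The record layout, the names `monthly`, `labels`, `quarter_of`, and `to_quarterly_rows`, and the numbers are all made up for illustration; my real data has more features, and averaging is just one possible per-quarter summary.

```python
from collections import defaultdict

# Hypothetical monthly records: (person_id, month 1-12, money_gained).
monthly = [
    ("p1", 1, 100.0), ("p1", 2, 120.0), ("p1", 3, 110.0),
    ("p1", 4, 90.0), ("p1", 5, 95.0), ("p1", 6, 85.0),
    ("p2", 1, 50.0), ("p2", 2, 60.0), ("p2", 3, 70.0),
]
# Hypothetical class labels (0, 1, or 2), one per person for the whole year.
labels = {"p1": 1, "p2": 0}


def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


def to_quarterly_rows(records, labels):
    """Average each person's monthly values within each quarter."""
    buckets = defaultdict(list)
    for person, month, value in records:
        buckets[(person, quarter_of(month))].append(value)
    rows = []
    for (person, quarter), values in sorted(buckets.items()):
        rows.append({
            "person": person,
            "quarter": quarter,
            "mean_money": sum(values) / len(values),
            "label": labels[person],  # same label repeated on every row
        })
    return rows


rows = to_quarterly_rows(monthly, labels)
```

Each person now contributes up to four rows instead of one, but all rows from the same person carry the same label, so the new rows are not independent observations.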
I was wondering two things:
- Is this a good idea? Is there a particular assumption I should be careful about that could affect my results?
- If it is a good idea: some features are available for some quarters of the year but missing for others (for instance, imagine I don't have "Age" available for my last quarter). Do I have to drop the feature? Would it be wiser to drop the last quarter instead? Or is it possible to make the classifier work despite the missing information, without deleting anything?
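To illustrate the two deletion options from the second question, here is a small sketch. The rows, the values, and the helper names `drop_feature` and `drop_incomplete_rows` are hypothetical; `None` stands for a value that was not recorded.

```python
# Hypothetical quarterly rows; "age" is missing for the last quarter.
rows = [
    {"person": "p1", "quarter": 1, "age": 34, "mean_money": 110.0},
    {"person": "p1", "quarter": 4, "age": None, "mean_money": 95.0},
    {"person": "p2", "quarter": 1, "age": 51, "mean_money": 60.0},
    {"person": "p2", "quarter": 4, "age": None, "mean_money": 72.0},
]


def drop_feature(rows, feature):
    """Option 1: keep every row but remove the incomplete feature."""
    return [{k: v for k, v in row.items() if k != feature} for row in rows]


def drop_incomplete_rows(rows, feature):
    """Option 2: keep the feature but remove rows where it is missing."""
    return [row for row in rows if row[feature] is not None]


without_age = drop_feature(rows, "age")          # 4 rows, no "age" column
without_q4 = drop_incomplete_rows(rows, "age")   # 2 rows, "age" kept
```

Option 1 keeps all the rows but loses a feature everywhere; option 2 keeps the feature but loses a whole quarter of rows. What I would like to know is whether there is a third way that loses neither.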