I have what should be a rather easy classification task, with a set of features and a class output, that I would like to solve with a machine learning algorithm.

But I have issues and doubts about the data generation. The features my algorithm uses as inputs are processed beforehand by other algorithms and, more importantly, they also feed back into the algorithm I want to change.

Basically, the better my algorithm gets, the fewer false positives there should be. But with fewer false positives, the data I have to work with becomes more and more imbalanced, which makes the algorithm harder to train. I could deliberately reduce the performance of my algorithm in order to generate data, but then I am not sure whether the data I get is meaningful at all, because of the feedback loop.

To me this seems like a chicken-and-egg problem.

2 Answers

Are you perhaps doing ensembles?

Usually, for an imbalanced dataset, the easiest approach is to oversample or undersample the data: you either repeat samples from the classes with few examples or drop samples from the classes with very high frequency, so that you end up with a balanced dataset.
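As a minimal sketch, assuming your data sits in NumPy arrays `X` and `y` and the minority class is labeled `1` (both are placeholders here), random oversampling with scikit-learn's `resample` could look like this:

```python
# Random-oversampling sketch; X, y and the 0/1 labels are placeholders.
import numpy as np
from sklearn.utils import resample

X = np.random.randn(1000, 5)                        # placeholder features
y = np.r_[np.zeros(950), np.ones(50)].astype(int)   # placeholder imbalanced labels

X_min, y_min = X[y == 1], y[y == 1]   # minority class
X_maj, y_maj = X[y == 0], y[y == 0]   # majority class

# Oversample the minority class (with replacement) up to the majority size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])
```

Undersampling works the same way, just with `replace=False` on the majority class and `n_samples=len(X_min)`.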

Another technique is to weight the classes inversely to their frequencies.
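For example, most scikit-learn classifiers accept a `class_weight` argument; `class_weight='balanced'` reweights classes inversely proportional to their frequencies (the choice of classifier below is just illustrative):

```python
# Class-weighting sketch; LogisticRegression is only an illustrative choice.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight='balanced')  # weights ~ 1 / class frequency
clf.fit(X, y)                                      # X, y as in the snippet above
```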

Yet another option is to build a model that generates artificial samples, as in generative adversarial networks.
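A full GAN is beyond the scope of an answer, but a toy sketch of the idea (assuming PyTorch; the layer sizes, learning rates, and the `X_minority` placeholder are all arbitrary choices, not a recipe) might look like this:

```python
# Toy GAN sketch for synthesizing extra minority-class feature vectors.
import torch
import torch.nn as nn

n_features = 10   # placeholder: dimensionality of your feature vectors
latent_dim = 8    # placeholder: size of the generator's noise input

generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, n_features),
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

X_minority = torch.randn(100, n_features)  # placeholder: real minority-class samples

for epoch in range(200):
    # Discriminator step: tell real minority samples apart from generated ones
    fake = generator(torch.randn(len(X_minority), latent_dim)).detach()
    d_loss = (bce(discriminator(X_minority), torch.ones(len(X_minority), 1))
              + bce(discriminator(fake), torch.zeros(len(fake), 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator
    gen_out = generator(torch.randn(len(X_minority), latent_dim))
    g_loss = bce(discriminator(gen_out), torch.ones(len(X_minority), 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Draw synthetic minority-class samples to augment the training set
synthetic_samples = generator(torch.randn(500, latent_dim)).detach()
```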

This does sound like a bad idea, since you are selecting your data beforehand and are therefore likely to introduce sample bias. Have you looked at anomaly detection approaches?
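In that framing you would treat the rare events as outliers rather than as a second class. A minimal sketch with scikit-learn's `IsolationForest` (just one possible detector; the data and the `contamination` value are placeholders) could be:

```python
# Anomaly-detection sketch; IsolationForest is only one possible detector.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.randn(1000, 5)         # placeholder: your feature vectors

detector = IsolationForest(contamination=0.05, random_state=0)  # expected outlier rate is a guess
detector.fit(X)                       # fit on the (mostly normal) data
labels = detector.predict(X)          # +1 = inlier, -1 = anomaly / rare event
```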
