
My data consist of comments (saved in files), and only a few of them are labelled as positive. I would like to use semi-supervised and PU (positive-unlabelled) classification to classify these comments into positive and negative classes. Is there any public implementation of semi-supervised and PU learning in Python (scikit-learn)?

1 Answer


You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard of the PU paper, but I think that for all practical purposes you will be much better off labelling some points and then using semi-supervised methods. If finding negative points is hard, I would use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs. positive and then only look at the examples that score strongly for the unlabelled class, or learn a one-class SVM (or similar) and look for negative points among the outliers.
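
To make the one-class SVM suggestion concrete, here is a minimal sketch using scikit-learn's sklearn.svm.OneClassSVM. The comment strings, the TF-IDF features, and the nu/gamma values are illustrative assumptions, not something from this thread; on real data they would need tuning.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical data: a handful of known-positive comments and an unlabelled pool.
positive_comments = ["great product, works as advertised", "really happy with this"]
unlabelled_comments = ["arrived broken and support never replied",
                       "does the job", "total waste of money"]

# Vectorise everything with a shared vocabulary.
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(positive_comments + unlabelled_comments)
X_pos = X_all[:len(positive_comments)]
X_unl = X_all[len(positive_comments):]

# Fit the one-class SVM on positives only; nu and gamma are guesses.
oc_svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
oc_svm.fit(X_pos)

# Low decision scores = far from the positive class, i.e. candidate negatives.
scores = oc_svm.decision_function(X_unl)
putative_negatives = np.argsort(scores)[:len(unlabelled_comments) // 2]
print("Putative negatives:", [unlabelled_comments[i] for i in putative_negatives])

The lowest-scoring outliers could then be hand-checked and added as negative labels for an ordinary supervised classifier.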

If you are interested in actually solving the task, I would much rather invest time in manual labelling than in implementing fancy methods.
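
The other heuristic from the answer, training an ordinary classifier on positive vs. unlabelled and mining the strongest "unlabelled" scores for putative negatives, could look roughly like the sketch below. The data and the choice of logistic regression are assumptions for illustration only.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data again.
positive_comments = ["love it", "excellent quality, would buy again"]
unlabelled_comments = ["never arrived", "it is fine I guess", "worst purchase ever"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(positive_comments + unlabelled_comments)
# Treat positives as class 1 and the whole unlabelled pool as class 0 for now.
y = np.array([1] * len(positive_comments) + [0] * len(unlabelled_comments))

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Probability of the "unlabelled" class (column 0) for the unlabelled pool;
# the highest-scoring comments are candidates for negative labels.
proba_unl = clf.predict_proba(X[len(positive_comments):])[:, 0]
order = np.argsort(proba_unl)[::-1]
print("Most confidently non-positive:", [unlabelled_comments[i] for i in order[:2]])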


Comments

Thanks Andreas, manual labelling would be a last resort. I was looking at github.com/larsmans/scikit-learn/tree/pu-learning from larsmans and some contributions from pemistahl. Does scikit-learn have a one-class SVM implementation?
The code is three years old, but you could ask larsmans about it. He doesn't seem to have followed up on it, though ;) What kind of scale are you talking about? How many labelled / unlabelled? I think manual labelling should be your first, not last, resort. How will you evaluate any results you get without ground-truth annotations?
"+1", thanks, I am aiming at 100k+ test data. For as of now, my test data set consists of 300 documents which I manually labelled them as positive and negative. While my training data set consist of 50 documents for each positive and negative classes. I don't have ground truth but can reliably label the positive documents. So, I could have one-class in training data set for 100k+ test data set and I think one-class SVM and PU might be good option to try. If I am going with one-class SVM or any other one-class semi-supervised learning how many labels do I need for training?
