
My data consist of comments (saved in files), and only a few of them are labelled as positive. I would like to use semi-supervised and PU (positive-unlabelled) classification to classify these comments into positive and negative classes. Is there any public implementation of semi-supervised and PU learning in Python (scikit-learn)?

1 Answer


You could try to train a one-class SVM and see what kind of results that gives you. I haven't heard of the PU paper, but I think that for all practical purposes you will be much better off labelling some points and then using semi-supervised methods. If finding negative points is hard, I would use heuristics to find putative negative points (which I think is similar to the techniques in the PU paper). You could either classify unlabelled vs. positive and then only look at the examples that score strongly for the unlabelled class, or learn a one-class SVM (or similar) and look for negative points among the outliers.
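
To make the one-class SVM suggestion concrete, here is a minimal sketch using scikit-learn's sklearn.svm.OneClassSVM. The comment strings, the TF-IDF features, and the nu/gamma values are illustrative assumptions, not something from this thread; on real data they would need tuning.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical data: a handful of known-positive comments and an unlabelled pool.
positive_comments = ["great product, works as advertised", "really happy with this"]
unlabelled_comments = ["arrived broken and support never replied",
                       "does the job", "total waste of money"]

# Vectorise everything with a shared vocabulary.
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(positive_comments + unlabelled_comments)
X_pos = X_all[:len(positive_comments)]
X_unl = X_all[len(positive_comments):]

# Fit the one-class SVM on positives only; nu and gamma are guesses.
oc_svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
oc_svm.fit(X_pos)

# Low decision scores = far from the positive class, i.e. candidate negatives.
scores = oc_svm.decision_function(X_unl)
putative_negatives = np.argsort(scores)[:len(unlabelled_comments) // 2]
print("Putative negatives:", [unlabelled_comments[i] for i in putative_negatives])

The lowest-scoring outliers could then be hand-checked and added as negative labels for an ordinary supervised classifier.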

If you are interested in actually solving the task, I would much rather invest time in manual labelling than in implementing fancy methods.
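
The other heuristic from the answer, training an ordinary classifier on positive vs. unlabelled and mining the strongest "unlabelled" scores for putative negatives, could look roughly like the sketch below. The data and the choice of logistic regression are assumptions for illustration only.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data again.
positive_comments = ["love it", "excellent quality, would buy again"]
unlabelled_comments = ["never arrived", "it is fine I guess", "worst purchase ever"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(positive_comments + unlabelled_comments)
# Treat positives as class 1 and the whole unlabelled pool as class 0 for now.
y = np.array([1] * len(positive_comments) + [0] * len(unlabelled_comments))

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Probability of the "unlabelled" class (column 0) for the unlabelled pool;
# the highest-scoring comments are candidates for negative labels.
proba_unl = clf.predict_proba(X[len(positive_comments):])[:, 0]
order = np.argsort(proba_unl)[::-1]
print("Most confidently non-positive:", [unlabelled_comments[i] for i in order[:2]])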


Comments

Thanks Andreas, manual labelling would be a last resort. I was looking at github.com/larsmans/scikit-learn/tree/pu-learning from larsmans and some contributions from pemistahl. Does scikit-learn have a one-class SVM implementation?
The code is three years old, but you could ask larsmans about it. He doesn't seem to have followed up on it, though ;) What kind of scale are you talking about? How many labelled / unlabelled? I think manual labelling should be your first, not last, resort. How will you evaluate any results you get without ground-truth annotations?
"+1", thanks, I am aiming at 100k+ test data. For as of now, my test data set consists of 300 documents which I manually labelled them as positive and negative. While my training data set consist of 50 documents for each positive and negative classes. I don't have ground truth but can reliably label the positive documents. So, I could have one-class in training data set for 100k+ test data set and I think one-class SVM and PU might be good option to try. If I am going with one-class SVM or any other one-class semi-supervised learning how many labels do I need for training?
