I have 50,000 documents of 1000 words or more, each ranked between 0 and 2000. They all deal with a similar topic. I'd like to create an algorithm that can learn to score new documents.

What approach do you think I should take?

I am a newbie in the field of machine learning, so if you could point me to introductory material about the particular solution you have in mind, I would be glad.

Comment: Nearest neighbor regression comes to mind. (Feb 11, 2017 at 0:45)
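A minimal sketch of that nearest-neighbor idea, assuming scikit-learn and TF-IDF features (both choices are mine, not the commenter's, and the toy data stands in for the real 50,000 documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Toy stand-ins for the real documents and their 0-2000 scores.
docs = [
    "database tuning and indexing",
    "spam offers and free prizes",
    "query planning in databases",
    "win a free prize now",
]
scores = [1800, 50, 1500, 20]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# A new document's score is the mean score of its k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, scores)

new_X = vectorizer.transform(["query planning and database indexing"])
print(knn.predict(new_X))
```

With 50,000 documents you would pick `n_neighbors` by cross-validation rather than fixing it at 2.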

3 Answers

First, come up with some features for each document. Something like the frequency of popular words associated with the topic might work. Compute these features for all the documents, then apply an algorithm. Some ways you could apply them are:

1) k-means - cluster the documents on the basis of the features. Each cluster should be predominantly associated with a particular score range. Then see which cluster a new document is assigned to and give it that cluster's score.

2) Supervised learning - use neural networks, multiclass SVMs, etc. to classify a new document into a particular class (score) using the model trained on your labeled documents.

All of these treat the score as a discrete class label. However, since you are dealing with a large score range (0-2000), you could also try regression, which gives a continuous value that can be rounded to the nearest discrete score.
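A minimal sketch of approach 1 with scikit-learn (the data, the cluster count, and the score-by-cluster-mean rule are all illustrative assumptions, not part of the answer above):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy documents with known scores; real data would be the 50,000 labeled docs.
docs = [
    "great database software",
    "great database engineering",
    "spam and more spam",
    "free spam offers",
]
scores = np.array([1900, 1700, 100, 50])

vec = CountVectorizer()
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Associate each cluster with the mean score of its members.
cluster_scores = {c: scores[km.labels_ == c].mean() for c in np.unique(km.labels_)}

# A new document gets the score of the cluster it is assigned to.
label = km.predict(vec.transform(["database software engineering"]))[0]
print(cluster_scores[label])
```

Note that k-means ignores the scores while clustering, so this only works if documents with similar scores also have similar features; the supervised options in 2) use the scores directly.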

Check out the Coursera course on Machine Learning for a great introduction!


Basically, this is a classification problem.

You would want to model y (rank) ~ x (document): when you get a new document, you need an estimated rank.

Things you may want to consider.

  • Do I need 2000 class labels, or is there a way to reduce them to, say, 3 classes? (class labels == scores)
  • How do I represent my documents? In short, they must be represented numerically. (Approaches: one-hot encoding, TF-IDF, embeddings, etc.)
  • Finally, which classification model should I use?

Linear algebra, followed by Andrew Ng's ML course, is a good starting point.


This requires that the dataset fits in RAM, but it does what I want using sklearn:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

corpus = [
    'postgresql is great software for database engineering',
    'postgresql is spam',
    'spam & egg',
    'postgresql is the same as pg',
    'more database is great stuff',
]
test = [
    'postgresql is spam & egg',
    'pg is a great database software',
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
y = np.array([[10, 1, 0, 7, 6]]).T  # document scores

clf = Ridge(alpha=1.0)
clf.fit(X, y)

Z = vectorizer.transform(test)
print(clf.predict(Z))
