I have 50,000 documents of 1000 words or more, each ranked between 0 and 2000. They all deal with a similar topic. I'd like to create an algorithm that can learn to score new documents.

What approach do you think I should take?

I am a newbie in the field of machine learning, so if you could point me to introductory material about the particular solution you have in mind, I would be glad.

Comment: Nearest neighbor regression comes to mind. (Feb 11, 2017 at 0:45)
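A minimal sketch of that nearest-neighbor idea, assuming scikit-learn and TF-IDF features (both choices are mine, not the commenter's, and the toy data stands in for the real 50,000 documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Toy stand-ins for the real documents and their 0-2000 scores.
docs = [
    "database tuning and indexing",
    "spam offers and free prizes",
    "query planning in databases",
    "win a free prize now",
]
scores = [1800, 50, 1500, 20]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# A new document's score is the mean score of its k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, scores)

new_X = vectorizer.transform(["query planning and database indexing"])
print(knn.predict(new_X))
```

With 50,000 documents you would pick `n_neighbors` by cross-validation rather than fixing it at 2.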

3 Answers

First, come up with some features for each document. Something like the frequency of popular words associated with the topic might work. Compute these features for all the documents, then apply an algorithm. Some ways you could apply them are:

1) k-means - cluster the documents on the basis of the features. Each cluster should be predominantly associated with a particular score range. Then see which cluster a new document is assigned to and give it that cluster's score.

2) Supervised learning - use neural networks, multiclass SVMs, etc. to classify a new document into a particular class (score) using the model trained on your labeled documents.

All of these treat the score as a discrete class label. However, since you are dealing with a large score range (0-2000), you could also try regression, which gives a continuous value that can be rounded to the nearest discrete score.
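A minimal sketch of approach 1 with scikit-learn (the data, the cluster count, and the score-by-cluster-mean rule are all illustrative assumptions, not part of the answer above):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy documents with known scores; real data would be the 50,000 labeled docs.
docs = [
    "great database software",
    "great database engineering",
    "spam and more spam",
    "free spam offers",
]
scores = np.array([1900, 1700, 100, 50])

vec = CountVectorizer()
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Associate each cluster with the mean score of its members.
cluster_scores = {c: scores[km.labels_ == c].mean() for c in np.unique(km.labels_)}

# A new document gets the score of the cluster it is assigned to.
label = km.predict(vec.transform(["database software engineering"]))[0]
print(cluster_scores[label])
```

Note that k-means ignores the scores while clustering, so this only works if documents with similar scores also have similar features; the supervised options in 2) use the scores directly.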

Check out the Coursera course on Machine Learning for a great introduction!


Basically, this is a classification problem.

You would want to model y (rank) ~ x (document): when you get a new document, you need an estimated rank.

Things you may want to consider.

  • Do I need 2000 class labels, or is there a way to reduce them to, say, 3 classes? (class labels == scores)
  • How do I represent my documents? In short, they must be represented numerically. (Approaches: one-hot encoding, TF-IDF, embeddings, etc.)
  • Finally, which classification model should I use?

Linear algebra, followed by Andrew Ng's ML course, is a good starting point.


This requires that the dataset fits in RAM, but it does what I want using sklearn:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

corpus = [
    'postgresql is great software for database engineering',
    'postgresql is spam',
    'spam & egg',
    'postgresql is the same as pg',
    'more database is great stuff',
]
test = [
    'postgresql is spam & egg',
    'pg is a great database software',
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
y = np.array([[10, 1, 0, 7, 6]]).T  # document scores

clf = Ridge(alpha=1.0)
clf.fit(X, y)

Z = vectorizer.transform(test)
print(clf.predict(Z))
