Self-learner here.

I am building a web application that predicts events.

Let's consider this quick example:

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))

How can I keep the state of neigh so that when I enter a new value like neigh.predict([[1.2]]), I don't need to re-train the model? Is there any good practice, or a hint, for starting to solve the problem?

2 Answers


You've chosen a slightly confusing example for a couple of reasons. First, when you say neigh.predict([[1.2]]), you aren't adding a new training point, you're just doing a new prediction, so that doesn't require any changes at all. Second, KNN algorithms aren't really "trained" -- KNN is an instance-based algorithm, which means that "training" amounts to storing the training data in a suitable structure. As a result, this question has two different answers. I'll try to answer the KNN question first.

K Nearest Neighbors

For KNN, adding new training data amounts to appending new data points to the structure. However, it appears that scikit-learn doesn't provide any such functionality. (That's reasonable enough -- since KNN explicitly stores every training point, you can't just keep giving it new training points indefinitely.)

If you aren't using many training points, a simple list might be good enough for your needs! In that case, you could skip sklearn altogether, and just append new data points to your list. To make a prediction, do a linear search, saving the k nearest neighbors, and then make a prediction based on a simple "majority vote" -- if out of five neighbors, three or more are red, then return red, and so on. But keep in mind that every training point you add will slow the algorithm.
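To make that concrete, here is a minimal sketch of the list-based approach (the class name, the squared-Euclidean distance, and k=3 are illustrative choices, not anything from sklearn):

from collections import Counter

class SimpleKNN:
    """Bare-bones KNN: "training" is just appending to a list."""

    def __init__(self, k=3):
        self.k = k
        self.points = []  # list of (feature_vector, label) pairs

    def add(self, x, y):
        self.points.append((x, y))  # no refitting needed

    def predict(self, x):
        # Linear search: sort every stored point by squared distance to x.
        nearest = sorted(
            self.points,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)),
        )[: self.k]
        # Simple majority vote among the k nearest neighbors.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

model = SimpleKNN(k=3)
for x, y in zip([[0], [1], [2], [3]], [0, 0, 1, 1]):
    model.add(x, y)
print(model.predict([1.1]))  # -> 0
model.add([1.5], 1)          # new training point, no retraining step
print(model.predict([1.2]))  # -> 1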

If you need to use many training points, you'll want to use a more efficient structure for nearest neighbor search, like a K-D Tree. There's a scipy K-D Tree implementation that ought to work. The query method allows you to find the k nearest neighbors. It will be more efficient than a list, but it will still get slower as you add more training data.
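As a rough sketch of how that might look (using scipy.spatial.KDTree; note that scipy's tree is immutable, so adding a training point in this sketch means rebuilding the tree, which is still cheap for moderate data sizes):

import numpy as np
from collections import Counter
from scipy.spatial import KDTree

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
tree = KDTree(X)

def predict(x, k=3):
    # query() returns the distances and indices of the k nearest neighbors.
    _, idx = tree.query(x, k=k)
    votes = Counter(y[i] for i in np.atleast_1d(idx))
    return votes.most_common(1)[0][0]

print(predict([1.1]))  # -> 0

# scipy's KDTree is built once, so new training data means rebuilding it.
X = np.vstack([X, [[1.5]]])
y = np.append(y, 1)
tree = KDTree(X)
print(predict([1.2]))  # -> 1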

Online Learning

A more general answer to your question is that you are (unbeknownst to yourself) trying to do something called online learning. Online learning algorithms allow you to use individual training points as they arrive, and discard them once they've been used. For this to make sense, you need to be storing not the training points themselves (as in KNN) but a set of parameters, which you optimize.

This means that some algorithms are better suited to this than others. sklearn provides just a few algorithms capable of online learning. These all have a partial_fit method that will allow you to pass training data in batches. The SGDClassifier with 'hinge' or 'log' loss is probably a good starting point.
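For instance, here is a minimal sketch of that workflow (assuming a recent scikit-learn, where the logistic loss is spelled "log_loss" rather than "log"; the toy data is just the example from the question):

import numpy as np
from sklearn.linear_model import SGDClassifier

# loss="log_loss" in scikit-learn >= 1.1; older versions spell it "log"
clf = SGDClassifier(loss="log_loss")

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])

# The first call to partial_fit must list every class the model will ever see.
clf.partial_fit(X, y, classes=np.array([0, 1]))
print(clf.predict([[1.1]]))

# Later, as new labeled points arrive, update the parameters in place --
# no retraining from scratch, and the old points can be discarded.
clf.partial_fit(np.array([[1.5]]), np.array([1]))
print(clf.predict([[1.2]]))

Because only the weight vector is stored, the model stays the same size no matter how many training points you feed it.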



Or maybe you just want to save your model after fitting:

import joblib  # older scikit-learn versions bundled this as sklearn.externals.joblib

joblib.dump(neigh, FName)  # FName is a placeholder for your save path

and load it when needed:

neigh = joblib.load(FName)
neigh.predict([[1.1]])

2 Comments

This seems to be based on a different interpretation of the question than mine. @user3378649, could you clarify which interpretation is correct? If you only meant that you wanted to save the model before your app ends and load it when it starts up again, then this answer is good. (But mine will still be applicable if you want to add more training data.)
@enderle This should be the right answer! I have a 1 GB training dataset, and I don't want to re-train the model every time the user hits the "predict" button.
