Will saving a trained model this way give me a model trained on every chunk of data or just the last chunk?
```python
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split

# text_clf is a scikit-learn text classifier / pipeline defined earlier
df = pd.read_csv(..., chunksize=10000)  # '...' is the path to the csv

for chunk in df:
    text = chunk['body']
    label = chunk['user_id']
    print(text.shape, label.shape)
    X_train, X_test, y_train, y_test = train_test_split(text, label, test_size=0.3)
    text_clf.fit(X_train, y_train)

filename = 'finalized_model.sav'
joblib.dump(text_clf, filename)

# load the model from disk
loaded_model = joblib.load(filename)
```

For example, if the first chunk had labels 1 and 2, and the second chunk had labels 3 and 4, will the final model be able to predict just 3 and 4, or 1 and 2 as well, given that the testing data contains all the labels? Any help would be appreciated.
UPDATE: The chunking is used to read the text from the csv. I have updated my code.
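
One way to check which labels the saved model actually knows, assuming `text_clf` is (or ends in) a scikit-learn classifier, is to inspect `classes_` on the reloaded model:

```python
# Quick check on the reloaded model: classes_ lists the labels it was
# trained to predict (i.e. the labels seen by the last call to fit).
loaded_model = joblib.load('finalized_model.sav')
print(loaded_model.classes_)
print(loaded_model.predict(X_test))  # predictions can only come from classes_
```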

Using `chunksize` while reading the dataframe could have a negative impact on saving the updated training (because of the different labels in the chunks used). I was about to say that this is a duplicate of this post; however, the OP asked about saving the updated training, which I assume is meant to be done automatically via `.partial_fit()`!
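
For reference, here is a minimal sketch of the incremental approach the comment points to, using a stateless `HashingVectorizer` and an `SGDClassifier` as stand-ins for the OP's `text_clf` (the file name `data.csv` is a placeholder; the column names are taken from the question). `partial_fit` must be told the full set of classes up front so that labels from earlier chunks are kept:

```python
import pandas as pd
import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no fitting needed, so it can be applied chunk by chunk.
vectorizer = HashingVectorizer()
clf = SGDClassifier()

# partial_fit needs every possible label on the first call, otherwise later
# chunks containing unseen labels will raise an error.
all_labels = pd.read_csv('data.csv', usecols=['user_id'])['user_id'].unique()

for chunk in pd.read_csv('data.csv', chunksize=10000):
    X = vectorizer.transform(chunk['body'])
    y = chunk['user_id']
    clf.partial_fit(X, y, classes=all_labels)  # updates the same model each time

joblib.dump(clf, 'finalized_model.sav')
```

With this pattern, the model written by `joblib.dump` has been updated on every chunk, not just the last one, whereas repeated calls to `fit()` start training over from scratch each time.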