
I have been running a few ML models on the same set of data for a binary classification problem with a class proportion of 33:67.

I used the same algorithms and the same set of hyperparameters for both yesterday's and today's runs.

Please note that I also have the random_state parameter in each estimator function, as shown below:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    np.random.seed(42)
    svm = SVC()  # I replace the estimator here for different algorithms
    svm_cv = GridSearchCV(svm, op_param_grid, cv=10, scoring='f1')
    svm_cv.fit(X_train_std, y_train)

Q1) Why do the results change even when I have random_state configured?

Q2) Is there anything else I should do to reproduce the same results on every run?

Please find below the results that differ; here auc-Y denotes yesterday's run.

[Screenshot of the results table: AUC values from yesterday's run (auc-Y) vs. today's run]

1 Answer

Not every seed is the same.

Here is a definitive function that sets ALL of your seeds; with it you can expect complete reproducibility:

    import os
    import random

    import numpy as np
    import torch

    def seed_everything(seed=42):
        """Seed everything."""
        random.seed(seed)
        os.environ['PYTHONHASHSEED'] = str(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True

You have to import torch, numpy, etc. (the imports are included above).
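For example (a sketch, assuming the function above is defined in your script), call it once at the very top, before any data splitting or model fitting:

    seed_everything(42)   # set every seed once, at the start of the script
    # ... any subsequent data splits / model fits should now be repeatable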

UPDATE: How to set a global random seed for sklearn models:

Given that sklearn does not have its own global random seed but uses the numpy random seed, we can set it globally as above:

np.random.seed(seed) 
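Applied to the setup in the question, a minimal sketch might look as follows (this is an illustration, not the asker's exact code; op_param_grid, X_train_std and y_train are assumed to be defined as in the question):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    np.random.seed(42)                 # global numpy seed picked up by sklearn internals

    # seed the estimator and the CV splitter explicitly as well
    svm = SVC(random_state=42)         # random_state of SVC is only used when probability=True
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    svm_cv = GridSearchCV(svm, op_param_grid, cv=cv, scoring='f1')
    svm_cv.fit(X_train_std, y_train)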

Here is a little experiment with the scipy library; the behaviour is analogous in sklearn (which also draws random numbers, usually for weights or splits):

    import numpy as np
    from scipy.stats import norm

    print('Without seed')
    print(norm.rvs(100, size=5))
    print(norm.rvs(100, size=5))

    print('With the same seed')
    np.random.seed(42)
    print(norm.rvs(100, size=5))
    np.random.seed(42)  # reset the random seed back to 42
    print(norm.rvs(100, size=5))

    print('Without seed')
    np.random.seed(None)
    print(norm.rvs(100, size=5))
    print(norm.rvs(100, size=5))

outputting and confirming:

    Without seed
    [100.27042599 100.9258397  100.20903163  99.88255017  99.29165699]
    [100.53127275 100.17750482  98.38604284 100.74109598 101.54287085]
    With the same seed
    [101.36242188 101.13410818 102.36307449  99.74043318  98.83044407]
    [101.36242188 101.13410818 102.36307449  99.74043318  98.83044407]
    Without seed
    [101.2933838  100.52176902 101.38602156 100.72865231  99.02271004]
    [100.19080241  99.11010957  99.51578106 101.56403284 100.37350788]
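A similar check can be run with sklearn itself. This sketch is not from the original answer; it uses a toy dataset with roughly the question's 33:67 class balance to show that, with the seeds fixed, repeated cross-validation runs give identical scores:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.svm import SVC

    # toy data with roughly the 33:67 class balance from the question
    X, y = make_classification(n_samples=300, weights=[0.67], random_state=0)

    def run(seed):
        np.random.seed(seed)                 # global numpy seed
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        return cross_val_score(SVC(random_state=seed), X, y, cv=cv, scoring='f1')

    print(np.allclose(run(42), run(42)))     # True: identical scores on repeated runs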
  • Hi, thanks for the response. Upvoted. So I shouldn't be using random_state? I only use scikit-learn and classic machine learning models like Linear and Logistic Regression, SVM, RF, Boosting... No deep learning. Commented Jan 12, 2020 at 11:56
  • So I don't have to worry about torch etc., right? Commented Jan 12, 2020 at 11:57
  • This is a general answer, but yes, for your specific case sklearn suffices. The question is whether you have any dependencies; I don't know your whole code. Commented Jan 12, 2020 at 11:57
  • Updated my code. Can you advise now? Commented Jan 12, 2020 at 11:59
  • This solves the problem but leaves the question of why the random_state argument of scikit-learn models doesn't ensure repeatability unanswered. Commented Jun 22, 2021 at 19:46
