I'm trying to write a unit test for some of my code that uses scikit-learn. However, my unit tests seem to be non-deterministic.
AFAIK, the only places in my code where scikit-learn uses any randomness are in its LogisticRegression model and its train_test_split, so I have the following:

    RANDOM_SEED = 5
    self.lr = LogisticRegression(random_state=RANDOM_SEED)
    X_train, X_test, y_train, test_labels = train_test_split(docs, labels, test_size=TEST_SET_PROPORTION, random_state=RANDOM_SEED)

But this doesn't seem to work -- even when I pass a fixed docs and a fixed labels, the prediction probabilities on a fixed validation set vary from run to run.
I also tried adding a numpy.random.seed(RANDOM_SEED) call at the top of my code, but that didn't seem to work either.
Is there anything I'm missing? Is there a way to pass a seed to scikit-learn in a single place, so that seed is used throughout all of scikit-learn's invocations?
Instantiate a prng = numpy.random.RandomState(RANDOM_SEED) instance, then pass that as random_state=prng to each individual function. If you just pass RANDOM_SEED, each individual function will restart and give the same numbers in different places, causing bad correlations. You want exactly one RandomState instance to be shared by all scikit-learn functions in the same pipeline.
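
A minimal sketch of the shared-instance approach, under some assumptions: the data comes from make_classification as a synthetic stand-in for the question's docs and labels, TEST_SET_PROPORTION is set to a hypothetical 0.25 (the question does not give its value), and run_pipeline is an illustrative helper name, not part of scikit-learn. The RandomState is created inside the helper so that each run starts from the same state.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

RANDOM_SEED = 5
TEST_SET_PROPORTION = 0.25  # hypothetical value; not stated in the question


def run_pipeline(seed):
    """Split the data, fit the model, and return held-out probabilities."""
    # Exactly one RandomState per run, shared by every call that
    # accepts a random_state argument.
    prng = np.random.RandomState(seed)

    # Synthetic stand-in for the question's `docs` and `labels`.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SET_PROPORTION, random_state=prng
    )
    lr = LogisticRegression(random_state=prng)
    lr.fit(X_train, y_train)
    return lr.predict_proba(X_test)


# Two runs that each start from a fresh RandomState built from the same seed.
first = run_pipeline(RANDOM_SEED)
second = run_pipeline(RANDOM_SEED)
print(np.allclose(first, second))
```

One thing to watch in a unit test: a RandomState instance advances every time something draws from it, so it should be re-created from the seed at the start of each test (for example in setUp) rather than built once at module level. Otherwise the results depend on how many tests ran before.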