
I'm trying to write a unit test for some of my code that uses scikit-learn. However, my unit tests seem to be non-deterministic.

AFAIK, the only places in my code where scikit-learn uses any randomness are in its LogisticRegression model and its train_test_split, so I have the following:

    RANDOM_SEED = 5
    self.lr = LogisticRegression(random_state=RANDOM_SEED)
    X_train, X_test, y_train, test_labels = train_test_split(
        docs, labels, test_size=TEST_SET_PROPORTION, random_state=RANDOM_SEED)

But this doesn't seem to work -- even when I pass a fixed docs and a fixed labels, the prediction probabilities on a fixed validation set vary from run to run.

I also tried adding a numpy.random.seed(RANDOM_SEED) call at the top of my code, but that didn't seem to work either.

Is there anything I'm missing? Is there a way to pass a seed to scikit-learn in a single place, so that seed is used throughout all of scikit-learn's invocations?

  • It's very likely that there is something else wrong in your code! Using a seed in LR and in the splitting will be enough to make sure it behaves deterministically! Commented Nov 22, 2016 at 19:49
  • I'm not sure if it will solve your determinism problem, but this isn't the right way to use a fixed seed with scikit-learn. Instantiate a prng=numpy.random.RandomState(RANDOM_SEED) instance, then pass that as random_state=prng to each individual function (see the sketch after these comments). If you just pass RANDOM_SEED, each individual function will restart the PRNG and give the same numbers in different places, causing bad correlations. Commented Nov 22, 2016 at 21:01
  • @RobertKern Can you elaborate? I don't quite understand what you are trying to explain. But of course using an int seed is a valid approach to making these functions deterministic. Maybe you are talking about problems with distributed seeding, but even if so, I can't see where that would come from here, and there are also much better approaches than that. Commented Nov 22, 2016 at 21:07
  • Determinism isn't the only important thing. Statistical independence is also important, and you don't get that by passing the same integer seed to multiple scikit-learn functions in the same pipeline. You want exactly one RandomState instance to be shared by all functions in the pipeline. Commented Nov 22, 2016 at 21:33
  • @RobertKern That depends on the environment / task (and of course the PRNG), but it does not apply to the OP's problem here. Commented Nov 22, 2016 at 21:37
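
To make Robert Kern's suggestion concrete, here is a minimal sketch of sharing a single RandomState instance across the pipeline. It uses the iris data as a stand-in for the asker's docs and labels, and the variable names are illustrative:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    RANDOM_SEED = 5
    # One PRNG instance, created once per run and passed to every
    # scikit-learn call that accepts random_state.
    prng = np.random.RandomState(RANDOM_SEED)

    iris = load_iris()
    X, y = iris.data, iris.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=prng)
    lr = LogisticRegression(random_state=prng)
    lr.fit(X_train, y_train)
    print(lr.score(X_test, y_test))

Note that a RandomState instance is mutated by every call that draws from it, so runs are reproducible only if the whole pipeline is re-run starting from a freshly seeded prng.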

1 Answer

    from sklearn import datasets, linear_model
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    X, y = iris.data, iris.target

    RANDOM_SEED = 5
    lr = linear_model.LogisticRegression(random_state=RANDOM_SEED)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=RANDOM_SEED)
    lr.fit(X_train, y_train)
    lr.score(X_test, y_test)

This produced 0.93333333333333335 several times in a row, so the way you did it seems OK. Another way is to set np.random.seed(), or to use Sacred for documented randomness. Using random_state is what the scikit-learn docs describe:

If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
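
As an illustration of what that quote means in practice, here is a minimal sketch using sklearn.utils.check_random_state, the helper scikit-learn itself uses to turn a random_state argument into a RandomState object (the jitter function and its parameters are made up for the example):

    import numpy as np
    from sklearn.utils import check_random_state

    def jitter(X, scale=0.01, random_state=None):
        # Accept None, an int seed, or a RandomState instance and normalize it,
        # the same way scikit-learn estimators handle random_state.
        rng = check_random_state(random_state)
        # Draw from the local RandomState instead of the global numpy state.
        return X + scale * rng.standard_normal(X.shape)

    X = np.zeros((3, 2))
    print(jitter(X, random_state=0))  # reproducible: same output every run
    print(jitter(X, random_state=np.random.RandomState(0)))  # same output again

Because the function never touches the global numpy state, a unit test can pass a fixed seed (or a seeded RandomState) and get identical results on every run, without relying on numpy.random.seed at module level.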
