Is Numpy giving unexpected results?

Question

I have a dataset X such that X.shape yields (10000, 9). I want to choose a subset of X with the following code:

    import numpy as np

    X = np.asarray(np.random.normal(size = (10000,9)))
    train_fraction = 0.7  # fraction of X that will be marked as train data
    train_size = int(X.shape[0]*train_fraction)  # fraction converted to number
    test_size = X.shape[0] - train_size  # remaining rows will be marked as test data
    train_ind = np.asarray([False]*X.shape[0])
    train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True  # mark True at 70% of the places

The problem is that np.sum(train_ind) is not the expected value of 7000. Instead it gives a different value on each run, such as 5033.

I initially thought that np.random.randint(low = X.shape[0], size = (train_size,)) might be the culprit. But when I do np.random.randint(low = X.shape[0], size = (train_size,)).shape I get (7000,).

Where am I going wrong?

There are better ways to initialize a boolean numpy array; have a look here. I suggest the second-best answer, not the accepted one.
– Jürg W. Spaak, Commented Aug 10, 2017 at 10:56

Jürg W. Spaak · Accepted Answer · 2017-08-10 10:45:36Z

2

Use np.random.choice(np.arange(0,X.shape[0]), size = train_size, replace = False) to draw the indices instead.

The problem is that np.random.randint samples with replacement, so it is not injective: an index such as 1 might appear twice. That index is then set to True twice, while some other index is never set to True.

The np.random.choice function ensures that every number occurs at most once (if you set replace = False).
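The difference between the two sampling modes can be checked with a minimal sketch. It assumes NumPy's newer Generator API (np.random.default_rng) in place of the legacy np.random.randint used in the question; the names rng, n_rows, mask_with and mask_without are illustrative and not from the original post. With replacement, the expected number of distinct indices is n_rows * (1 - (1 - 1/n_rows)**train_size), which comes out to roughly 5034 here and is in line with the 5033 reported in the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n_rows, train_size = 10000, 7000

# Sampling WITH replacement: some indices repeat, so fewer than
# train_size distinct positions end up set to True.
idx_with = rng.integers(low=0, high=n_rows, size=train_size)
mask_with = np.zeros(n_rows, dtype=bool)
mask_with[idx_with] = True

# Sampling WITHOUT replacement: every index is distinct, so exactly
# train_size positions end up set to True.
idx_without = rng.choice(n_rows, size=train_size, replace=False)
mask_without = np.zeros(n_rows, dtype=bool)
mask_without[idx_without] = True

# Expected number of distinct indices when drawing with replacement.
expected_distinct = n_rows * (1 - (1 - 1 / n_rows) ** train_size)

print(mask_with.sum(), round(expected_distinct), mask_without.sum())
```

Using np.zeros(n_rows, dtype=bool) is one common way to build the all-False mask; it behaves the same as np.asarray([False]*X.shape[0]) from the question.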



3 Comments

Clock Slave Over a year ago

This works. Thanks. You mention: 'basically the number 1 might appear twice. This means that index 1 will be set to True twice'. Yes, agreed. I realised that might be the problem. But when I run the code I pasted and take the sum, I get values above 7000 as well. Maybe I should state that more explicitly in the question. Your post explains why the sum could be less than 7000, but my main concern is what is going on in the cases where the sum is above 7000.

Jürg W. Spaak Over a year ago

Yes, I was wondering about that as well, it's kind of weird... My guess is that you ran the line train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True several times without resetting train_ind; I don't see how else this could happen. Tell me if that's not the case.

Clock Slave Over a year ago

I ran the following code to see if I could reproduce the case where I got the sum to be in excess of 7000.

    num_cases = 0
    for i in range(10000):
        train_ind = np.asarray([False]*X.shape[0])
        train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True
        if sum(train_ind) > 7000:
            print(sum(train_ind))
            num_cases+=1
    print(num_cases)

At the end of this loop num_cases was zero. I guess I must have run the assignment line twice without re-initializing the train_ind array in between. Edited the question. Thanks.
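The explanation reached in these comments can be reproduced with a short sketch. It again assumes the Generator API (np.random.default_rng) instead of the legacy np.random.randint, and the names rng, after_first and after_second are illustrative. A single assignment can mark at most train_size positions, but repeating the assignment on the same mask adds newly drawn positions on top of the ones already set, so the count can climb above 7000.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
n_rows, train_size = 10000, 7000

# First assignment on a fresh all-False mask: at most train_size Trues.
train_ind = np.zeros(n_rows, dtype=bool)
train_ind[rng.integers(0, n_rows, size=train_size)] = True
after_first = int(train_ind.sum())

# Second assignment WITHOUT resetting the mask: positions marked in the
# first pass stay True, and new draws add further positions.
train_ind[rng.integers(0, n_rows, size=train_size)] = True
after_second = int(train_ind.sum())

print(after_first, after_second)
```

Re-creating train_ind before every assignment, as the loop above does, keeps the count at or below train_size.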
