I have a dataset X such that X.shape yields (10000, 9). I want to choose a subset of X with the following code:

    X = np.asarray(np.random.normal(size = (10000,9)))
    train_fraction = 0.7  # fraction of X that will be marked as train data
    train_size = int(X.shape[0]*train_fraction)  # fraction converted to number
    test_size = X.shape[0] - train_size  # remaining rows will be marked as test data
    train_ind = np.asarray([False]*X.shape[0])
    train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True  # mark True at 70% of the places

The problem is that np.sum(train_ind) is not the expected value of 7000. Instead, it gives varying values such as 5033.

I initially thought that np.random.randint(low = X.shape[0], size = (train_size,)) might be the culprit. But when I do np.random.randint(low = X.shape[0], size = (train_size,)).shape I get (7000,).

Where am I going wrong?

  • There are better ways to initialize a boolean NumPy array; have a look here. I suggest the second-best answer, not the accepted one. Commented Aug 10, 2017 at 10:56
  • @JürgMerlinSpaak Thanks. This was helpful. Commented Aug 10, 2017 at 10:57

1 Answer

Use np.random.choice(np.arange(0,X.shape[0]), size = train_size, replace = False) to draw the indices.

The problem is that np.random.randint samples with replacement, so the indices it returns are not unique (the mapping is not injective): the number 1, for example, might appear twice. Index 1 is then set to True twice, while some other index is never set to True.
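A rough sketch of why the count lands near 5033 rather than 7000: when k indices are drawn with replacement from n possible values, each index stays unmarked with probability (1 - 1/n)^k, so the expected number of distinct indices is n(1 - (1 - 1/n)^k), which is about 5034 for n = 10000 and k = 7000. The seed and the names n, k, idx, and mask below are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for reproducibility
n, k = 10000, 7000  # number of rows and number of indices drawn

# randint samples with replacement, so idx may contain duplicates
idx = np.random.randint(0, n, size=k)
n_distinct = np.unique(idx).size

# Expected number of distinct values among k draws from n
expected_distinct = n * (1 - (1 - 1 / n) ** k)

# Setting the mask through duplicated indices marks each row only once
mask = np.zeros(n, dtype=bool)
mask[idx] = True

print(n_distinct, mask.sum(), round(expected_distinct))
```

The number of True entries always equals the number of distinct indices drawn, which is why the sum falls short of 7000.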

The np.random.choice function ensures that every number occurs at most once (if you set replace = False).
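Putting this together with the setup from the question, a minimal sketch might look like the following. The seed and the names train_rows, X_train, and X_test are assumptions added for illustration; np.zeros with dtype=bool is used here in place of np.asarray([False]*X.shape[0]).

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for reproducibility
X = np.random.normal(size=(10000, 9))
train_fraction = 0.7
train_size = int(X.shape[0] * train_fraction)

# Draw train_size distinct row indices (no repeats because replace=False)
train_rows = np.random.choice(X.shape[0], size=train_size, replace=False)

# Boolean mask with exactly train_size True entries
train_ind = np.zeros(X.shape[0], dtype=bool)
train_ind[train_rows] = True

X_train = X[train_ind]   # rows marked as train data
X_test = X[~train_ind]   # remaining rows are test data

print(train_ind.sum(), X_train.shape, X_test.shape)
```

Because the indices are unique, np.sum(train_ind) equals train_size on every run, provided train_ind is re-initialized before each assignment.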

3 Comments

This works. Thanks. You mention: 'basically the number 1 might appear twice. This means that index 1 will be set to True twice'. Yes, agreed. I realised that might be the problem. But when I run the code I pasted and take the sum, I get values above 7000 as well. Maybe I should state that more explicitly in the question. Your post answers why the sum could be less than 7000, but my main concern is what is going on in the cases where the sum is above 7000.
Yes, I was wondering about that as well, it's kind of weird... My guess is that you ran the line train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True several times without resetting train_ind; I don't see how else this could happen. Tell me if that's not the case.
I ran the following code to see if I could reproduce the case where I got the sum to be in excess of 7000.

    num_cases = 0
    for i in range(10000):
        train_ind = np.asarray([False]*X.shape[0])
        train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True
        if sum(train_ind) > 7000:
            print(sum(train_ind))
            num_cases+=1
    print(num_cases)

At the end of this loop I got num_cases to be zero. I guess I might have run the assignment line twice without re-initializing the train_ind array. Edited the question. Thanks
