3

I have a list of words

    count = 100
    list = ['apple', 'orange', 'mango']

For the count above, is it possible to use the random function to select apple 40% of the time, orange 30% of the time, and mango 30% of the time?

For example:

For count=100: 40 times apple, 30 times orange, and 30 times mango.

The selection has to happen randomly.

2 Answers

4

Based on an answer to the question about generating discrete random variables with specified weights, you can use numpy.random.choice to get 20 times faster code than with random.choice:

    from collections import Counter
    from numpy.random import choice

    sample = choice(['apple', 'orange', 'mango'], p=[0.4, 0.3, 0.3], size=1000000)
    print(Counter(sample))

Outputs:

Counter({'apple': 399778, 'orange': 300317, 'mango': 299905}) 

Not to mention that it is actually easier than "to build a list in the required proportions and then shuffle it".

Also, shuffle would always produce exactly 40% apples, 30% oranges and 30% mangoes, which is not the same as saying "produce a sample of a million fruits according to a discrete probability distribution". The latter is what both choice solutions do (and the bisect one too). As can be seen above, there are about 40% apples, etc., when using numpy.
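The distinction can be seen in a small sketch that uses only the standard library. It assumes Python 3.6 or later (for random.choices); the variable names exact and sampled are illustrative only and do not come from either answer.

```python
import random
from collections import Counter

fruits = ['apple', 'orange', 'mango']
weights = [0.4, 0.3, 0.3]
count = 100

# Exact proportions: build the list in fixed proportions, then shuffle the order.
exact = ['apple'] * 40 + ['orange'] * 30 + ['mango'] * 30
random.shuffle(exact)

# Probabilistic sampling: every draw is independent, so the counts vary per run.
sampled = random.choices(fruits, weights=weights, k=count)

print(Counter(exact))    # counts are always 40 / 30 / 30
print(Counter(sampled))  # counts are near 40 / 30 / 30, but not guaranteed
```

Which of the two is appropriate depends on whether the proportions must hold exactly or only on average.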

3

The easiest way is to build a list in the required proportions and then shuffle it.

    >>> import random
    >>> result = ['apple'] * 40 + ['orange'] * 30 + ['mango'] * 30
    >>> random.shuffle(result)

Edit for the new requirement that the count is really 1,000,000:

    >>> count = 1000000
    >>> pool = ['apple'] * 4 + ['orange'] * 3 + ['mango'] * 3
    >>> for i in range(count):
    ...     print(random.choice(pool))

A slower but more general alternative approach is to bisect a cumulative probability distribution:

    >>> import bisect
    >>> choices = ['apple', 'orange', 'mango']
    >>> cum_prob_dist = [0.4, 0.7]
    >>> for i in range(count):
    ...     print(choices[bisect.bisect(cum_prob_dist, random.random())])
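As a hedged sketch of the same idea in Python 3, the cumulative distribution can be computed from the weights instead of being typed by hand. The helper name weighted_pick and its optional u parameter are hypothetical, introduced here only so the mapping from a random number to a choice is easy to inspect.

```python
import bisect
import random
from itertools import accumulate

choices = ['apple', 'orange', 'mango']
weights = [0.4, 0.3, 0.3]

# Running totals of the weights, roughly [0.4, 0.7, 1.0]. The last entry is
# dropped so that any u in [0, 1) maps to a valid index despite rounding.
cum_prob_dist = list(accumulate(weights))[:-1]

def weighted_pick(u=None):
    """Return one choice; u is a number in [0, 1), drawn at random if omitted."""
    if u is None:
        u = random.random()
    # bisect finds which interval of the cumulative distribution u falls into.
    return choices[bisect.bisect(cum_prob_dist, u)]

sample = [weighted_pick() for _ in range(10)]
print(sample)
```

Passing u explicitly, for example weighted_pick(0.5), makes it possible to check that values between 0.4 and 0.7 select orange.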

2 Comments

But if count=1000000, the list size will increase, right? I'm actually trying to simulate a data set of 1,000,000 rows per day over a period of one month. Would it be good to use the same logic?
The concept is perfectly general and there are many ways to build on it. I edited the answer to show how to use random.choice() to make one selection at a time from a pool where the elements are in the proper proportions. You could also make a cumulative distribution and use bisect for the selection, but that would have been overkill for the way you described your problem.
