61

I am looking for a simple function that can generate an array of specified random values based on their corresponding (also specified) probabilities. I only need it to generate float values, but I don't see why it shouldn't be able to generate any scalar. I can think of many ways of building this from existing functions, but I think I probably just missed an obvious SciPy or NumPy function.

E.g.:

>>> values = [1.1, 2.2, 3.3]
>>> probabilities = [0.2, 0.5, 0.3]
>>> print(some_function(values, probabilities, size=10))
(2.2, 1.1, 3.3, 3.3, 2.2, 2.2, 1.1, 2.2, 3.3, 2.2)

Note: I found scipy.stats.rv_discrete but I don't understand how it works. Specifically, I do not understand what this (below) means nor what it should do:

numargs = generic.numargs
[ <shape(s)> ] = ['Replace with reasonable value', ]*numargs

If rv_discrete is what I should be using, could you please provide me with a simple example and an explanation of the above "shape" statement?

5 Answers

99

Drawing from a discrete distribution is built directly into NumPy. The function is numpy.random.choice (hard to find, because the NumPy docs never mention it in connection with discrete distributions).

import numpy as np

elements = [1.1, 2.2, 3.3]
probabilities = [0.2, 0.5, 0.3]
np.random.choice(elements, 10, p=probabilities)
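In more recent NumPy versions (1.17+), the same draw can also be made through the Generator API; a small sketch, where the explicit seed is my own addition for reproducibility:

import numpy as np

# Generator-based equivalent of np.random.choice; the seed is arbitrary, used only for reproducibility.
rng = np.random.default_rng(42)
rng.choice([1.1, 2.2, 3.3], size=10, p=[0.2, 0.5, 0.3])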

4 Comments

Great! But, the correct syntax is: np.random.choice(elements, 10, p=list(probabilities))
Nice. I think this version came out after I posted my original question (I think this was first released in 1.7.0 which I believe came in 2013).
Very nice! It also seems to work without casting to a list: np.random.choice(elements, 10, p=probabilities).
In addition to the comments by Sina and zeycus: elements and probabilities could have been ordinary lists instead of numpy arrays and the code would work the same.
26

Here is a short, relatively simple function that returns weighted values; it uses NumPy's digitize, accumulate, and random_sample.

import numpy as np
from numpy.random import random_sample

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(random_sample(size), bins)]

values = np.array([1.1, 2.2, 3.3])
probabilities = np.array([0.2, 0.5, 0.3])
print(weighted_values(values, probabilities, 10))
# Sample output: [ 2.2 2.2 1.1 2.2 2.2 3.3 3.3 2.2 3.3 3.3]

It works like this:

  1. First, using accumulate, we create the bins.
  2. Then we create a bunch of random numbers (between 0 and 1) using random_sample.
  3. We use digitize to see which bins these numbers fall into.
  4. And we return the corresponding values.
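
A roughly equivalent sketch using np.cumsum and np.searchsorted (the helper name weighted_values_v2 and the use of the newer default_rng generator are my additions, not part of the original answer); normalizing the cumulative sum also means the weights do not have to sum to exactly 1:

import numpy as np

def weighted_values_v2(values, probabilities, size, rng=None):
    # np.cumsum plays the role of np.add.accumulate; searchsorted plays the role of digitize.
    rng = np.random.default_rng() if rng is None else rng
    cum = np.cumsum(probabilities)      # e.g. [0.2, 0.7, 1.0]
    cum = cum / cum[-1]                 # normalize, so unnormalized weights also work
    draws = rng.random(size)            # uniform numbers in [0, 1)
    return np.asarray(values)[np.searchsorted(cum, draws, side='right')]

print(weighted_values_v2([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))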

6 Comments

Yes, this is basically what I was thinking of, but I just thought there might be a built-in function that does exactly that. From the sound of it, there is no such thing. I must admit I would not have done it as elegantly. Thanks!
NumPy directly offers numpy.cumsum(), which can be used instead of np.add.accumulate() (np.add() is not very commonly used, so I recommend using cumsum()).
+1 for the useful numpy.digitize()! However, SciPy actually offers a function that directly answers the question—see my answer.
PS:… As noted by Tim_Y, using SciPy's function is much slower than using your "manual" solution (on 10k elements).
Do the probabilities need to be normalized for this ?
19

You were going in a good direction: the built-in scipy.stats.rv_discrete() quite directly creates a discrete random variable. Here is how it works:

>>> import numpy
>>> from scipy.stats import rv_discrete
>>> values = numpy.array([1.1, 2.2, 3.3])
>>> probabilities = [0.2, 0.5, 0.3]
>>> distrib = rv_discrete(values=(range(len(values)), probabilities))  # This defines a SciPy probability distribution
>>> distrib.rvs(size=10)  # 10 samples from range(len(values))
array([1, 2, 0, 2, 2, 0, 2, 1, 0, 2])
>>> values[_]  # Conversion to the specific discrete values (the fact that values is a NumPy array is used for the indexing)
[2.2, 3.3, 1.1, 3.3, 3.3, 1.1, 3.3, 2.2, 1.1, 3.3]

The distribution distrib above thus returns indices into the values array.

More generally, rv_discrete() accepts a sequence of integer values as the first element of its values=(…, …) argument and returns those values directly; in that case, there is no need for a separate conversion to specific (float) values. Here is an example:

>>> values = [10, 20, 30]
>>> probabilities = [0.2, 0.5, 0.3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> distrib.rvs(size=10)
array([20, 20, 20, 20, 20, 20, 20, 30, 20, 20])

where the (integer) input values are returned directly with the desired probabilities.
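As a side note, the first example above relies on the interpreter's _ variable; a self-contained version of the float case (my consolidation of the two snippets above, not the answerer's exact code) could look like:

import numpy as np
from scipy.stats import rv_discrete

values = np.array([1.1, 2.2, 3.3])
probabilities = [0.2, 0.5, 0.3]

# Sample indices 0, 1, 2 with the given probabilities, then index into values.
distrib = rv_discrete(values=(range(len(values)), probabilities))
indices = distrib.rvs(size=10)
print(values[indices])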

7 Comments

NOTE: I tried running timeit on it, and it appears to be a good 100x slower than fraxel's purely numpy version. Do you by any chance know why that is?
Wow, interesting! On 10k elements, I even get a factor of 300x. I had a quick look at the code: many checks are performed, but I guess they cannot explain such a big difference in running time; I did not go deep enough into the SciPy code to see where the difference comes from…
@TimY my naive guess is that the slowness is due to more work being done in pure Python, less work being done (under the hood) in C. (the mathematical/scientific packages in Python tend to wrap C code.)
Suppose I were to start with an equation for my probability distribution. It seems silly to have to use that to generate a probability for each value, feed those to rv_discrete, and then get back from rv_discrete an approximation of the distribution I started with. Is there any way to use user-defined equations directly with SciPy?
@dbliss Now I see that you had in mind the case of a discrete distribution with an infinite number of possible values (which does not fit into this question). rv_discrete() does not have an option for this. I am not sure what the standard method for doing this is. (I can only think of slightly complicated variations of the usual method that transforms a uniform random variable into a variable with a non-uniform distribution, where the cumulative probability is only calculated for the most common values and extended beyond that when needed.)
6

The simplest DIY way would be to sum the probabilities into a cumulative distribution. This way, you split the unit interval into sub-intervals with lengths equal to your original probabilities. Now generate a single random number uniform on [0, 1) and see which interval it lands in.
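A minimal sketch of this idea using only the standard library (the function name and details are my own illustration, not code from this answer):

import random
from bisect import bisect_right
from itertools import accumulate

def sample_discrete(values, probabilities, size):
    # Cumulative sums split [0, total) into sub-intervals with lengths
    # proportional to the probabilities; bisect finds the interval a draw lands in.
    cum = list(accumulate(probabilities))   # e.g. [0.2, 0.7, 1.0]
    total = cum[-1]                         # also works for unnormalized weights
    return [values[bisect_right(cum, random.random() * total)] for _ in range(size)]

print(sample_discrete([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))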

1 Comment

Appreciate this math-y way that depends less on Python packages.
4

You could also use Lea, a pure Python package dedicated to discrete probability distributions.

>>> distrib = Lea.fromValFreqs((1.1,2),(2.2,5),(3.3,3))
>>> distrib
1.1 : 2/10
2.2 : 5/10
3.3 : 3/10
>>> distrib.random(10)
(2.2, 2.2, 1.1, 2.2, 2.2, 2.2, 1.1, 3.3, 1.1, 3.3)

Et voilà!

Comments
