61

I am looking for a simple function that can generate an array of specified random values based on their corresponding (also specified) probabilities. I only need it to generate float values, but I don't see why it shouldn't be able to generate any scalar. I can think of many ways of building this from existing functions, but I think I probably just missed an obvious SciPy or NumPy function.

E.g.:

>>> values = [1.1, 2.2, 3.3]
>>> probabilities = [0.2, 0.5, 0.3]
>>> print(some_function(values, probabilities, size=10))
(2.2, 1.1, 3.3, 3.3, 2.2, 2.2, 1.1, 2.2, 3.3, 2.2)

Note: I found scipy.stats.rv_discrete but I don't understand how it works. Specifically, I do not understand what this (below) means nor what it should do:

numargs = generic.numargs
[ <shape(s)> ] = ['Replace with reasonable value', ]*numargs

If rv_discrete is what I should be using, could you please provide me with a simple example and an explanation of the above "shape" statement?

5 Answers

99

Drawing from a discrete distribution is built directly into NumPy. The function is numpy.random.choice (hard to find, because the NumPy docs never mention it in connection with discrete distributions).

import numpy as np

elements = [1.1, 2.2, 3.3]
probabilities = [0.2, 0.5, 0.3]
np.random.choice(elements, 10, p=probabilities)
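In more recent NumPy versions (1.17+), the same draw can also be made through the Generator API; a small sketch, where the explicit seed is my own addition for reproducibility:

import numpy as np

# Generator-based equivalent of np.random.choice; the seed is arbitrary, used only for reproducibility.
rng = np.random.default_rng(42)
rng.choice([1.1, 2.2, 3.3], size=10, p=[0.2, 0.5, 0.3])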

4 Comments

Great! But, the correct syntax is: np.random.choice(elements, 10, p=list(probabilities))
Nice. I think this version came out after I posted my original question (I think this was first released in 1.7.0 which I believe came in 2013).
Very nice! It also seems to work without casting to a list: np.random.choice(elements, 10, p=probabilities).
In addition to the comments by Sina and zeycus: elements and probabilities could have been ordinary lists instead of numpy arrays and the code would work the same.
26

Here is a short, relatively simple function that returns weighted values; it uses NumPy's digitize, accumulate, and random_sample.

import numpy as np
from numpy.random import random_sample

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(random_sample(size), bins)]

values = np.array([1.1, 2.2, 3.3])
probabilities = np.array([0.2, 0.5, 0.3])
print(weighted_values(values, probabilities, 10))
# Sample output: [ 2.2 2.2 1.1 2.2 2.2 3.3 3.3 2.2 3.3 3.3]

It works like this:

  1. First, using accumulate, we create the bins.
  2. Then we create a bunch of random numbers (between 0 and 1) using random_sample.
  3. We use digitize to see which bins these numbers fall into.
  4. And we return the corresponding values.
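
A roughly equivalent sketch using np.cumsum and np.searchsorted (the helper name weighted_values_v2 and the use of the newer default_rng generator are my additions, not part of the original answer); normalizing the cumulative sum also means the weights do not have to sum to exactly 1:

import numpy as np

def weighted_values_v2(values, probabilities, size, rng=None):
    # np.cumsum plays the role of np.add.accumulate; searchsorted plays the role of digitize.
    rng = np.random.default_rng() if rng is None else rng
    cum = np.cumsum(probabilities)      # e.g. [0.2, 0.7, 1.0]
    cum = cum / cum[-1]                 # normalize, so unnormalized weights also work
    draws = rng.random(size)            # uniform numbers in [0, 1)
    return np.asarray(values)[np.searchsorted(cum, draws, side='right')]

print(weighted_values_v2([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))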

6 Comments

Yes, this is basically what I was thinking of, but I just thought there might be a built-in function that does exactly that. From the sound of it, there is no such thing. I must admit I would not have done it as elegantly. Thanks!
NumPy directly offers numpy.cumsum(), which can be used instead of np.add.accumulate() (np.add() is not very commonly used, so I recommend using cumsum()).
+1 for the useful numpy.digitize()! However, SciPy actually offers a function that directly answers the question—see my answer.
PS:… As noted by Tim_Y, using SciPy's function is much slower than using your "manual" solution (on 10k elements).
Do the probabilities need to be normalized for this ?
19

You were going in a good direction: the built-in scipy.stats.rv_discrete() quite directly creates a discrete random variable. Here is how it works:

>>> import numpy
>>> from scipy.stats import rv_discrete
>>> values = numpy.array([1.1, 2.2, 3.3])
>>> probabilities = [0.2, 0.5, 0.3]
>>> distrib = rv_discrete(values=(range(len(values)), probabilities))  # This defines a SciPy probability distribution
>>> distrib.rvs(size=10)  # 10 samples from range(len(values))
array([1, 2, 0, 2, 2, 0, 2, 1, 0, 2])
>>> values[_]  # Conversion to the specific discrete values (the fact that values is a NumPy array is used for the indexing)
[2.2, 3.3, 1.1, 3.3, 3.3, 1.1, 3.3, 2.2, 1.1, 3.3]

The distribution distrib above thus returns indices into the values array.

More generally, rv_discrete() accepts a sequence of integer values as the first element of its values=(…, …) argument and returns those values directly; in that case, there is no need for a separate conversion to specific (float) values. Here is an example:

>>> values = [10, 20, 30]
>>> probabilities = [0.2, 0.5, 0.3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> distrib.rvs(size=10)
array([20, 20, 20, 20, 20, 20, 20, 30, 20, 20])

where the (integer) input values are returned directly with the desired probabilities.
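As a side note, the first example above relies on the interpreter's _ variable; a self-contained version of the float case (my consolidation of the two snippets above, not the answerer's exact code) could look like:

import numpy as np
from scipy.stats import rv_discrete

values = np.array([1.1, 2.2, 3.3])
probabilities = [0.2, 0.5, 0.3]

# Sample indices 0, 1, 2 with the given probabilities, then index into values.
distrib = rv_discrete(values=(range(len(values)), probabilities))
indices = distrib.rvs(size=10)
print(values[indices])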

7 Comments

NOTE: I tried running timeit on it, and it appears to be a good 100x slower than fraxel's purely numpy version. Do you by any chance know why that is?
Wow, interesting! On 10k elements, I even get a factor of 300x. I had a quick look at the code: many checks are performed, but I guess they cannot explain such a big difference in running time; I did not go deep enough into the SciPy code to see where the difference comes from…
@TimY my naive guess is that the slowness is due to more work being done in pure Python, less work being done (under the hood) in C. (the mathematical/scientific packages in Python tend to wrap C code.)
Suppose I were to start with an equation for my probability distribution. It seems silly to have to use that to generate a probability for each value, feed those to rv_discrete, and then get back from rv_discrete an approximation of the distribution I started with. Is there any way to use user-defined equations directly with SciPy?
@dbliss Now I see that you had in mind the case of a discrete distribution with an infinite number of possible values (which does not fit into this question). rv_discrete() does not have an option for this. I am not sure what the standard method for doing this is. (I can only think of slightly complicated variations of the usual method that transforms a uniform random variable into a variable with a non-uniform distribution, where the cumulative probability is only calculated for the most common values and extended beyond that when needed.)
6

The simplest DIY way would be to sum the probabilities into a cumulative distribution. This way, you split the unit interval into sub-intervals with lengths equal to your original probabilities. Now generate a single random number uniform on [0, 1) and see which interval it lands in.
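A minimal sketch of this idea using only the standard library (the function name and details are my own illustration, not code from this answer):

import random
from bisect import bisect_right
from itertools import accumulate

def sample_discrete(values, probabilities, size):
    # Cumulative sums split [0, total) into sub-intervals with lengths
    # proportional to the probabilities; bisect finds the interval a draw lands in.
    cum = list(accumulate(probabilities))   # e.g. [0.2, 0.7, 1.0]
    total = cum[-1]                         # also works for unnormalized weights
    return [values[bisect_right(cum, random.random() * total)] for _ in range(size)]

print(sample_discrete([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))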

1 Comment

Appreciate this math-y way that depends less on Python packages.
4

You could also use Lea, a pure Python package dedicated to discrete probability distributions.

>>> distrib = Lea.fromValFreqs((1.1,2),(2.2,5),(3.3,3))
>>> distrib
1.1 : 2/10
2.2 : 5/10
3.3 : 3/10
>>> distrib.random(10)
(2.2, 2.2, 1.1, 2.2, 2.2, 2.2, 1.1, 3.3, 1.1, 3.3)

Et voilà!

Comments
