How to sample a Pandas DataFrame using a normal distribution by using random_state and NumPy Generators

Question

I am trying to write Pandas code that would allow me to sample a DataFrame using a normal distribution. The most convenient way would be to use the random_state parameter of the sample method, but somehow employ numpy.random.Generator.normal so that the random samples are drawn from a normal (Gaussian) distribution.

    import pandas as pd
    import numpy as np
    import random

    # Generate a list of unique random numbers
    temp = random.sample(range(1, 101), 100)
    df = pd.DataFrame({'temperature': temp})

    # Sample normal
    rng = np.random.default_rng()
    df.sample(n=10, random_state=rng.normal())

This obviously doesn't work. There is an issue with random_state=rng.normal().

I don't think you correctly understand what a random state is. See stackoverflow.com/questions/28064634/… for an explanation. So that we can help you further, could you define mathematically what drawing random samples using a normal distribution means in your case?
– deep-learnt-nerd, Commented Jan 30 at 12:51
I do understand Generators, but what I don't understand is why I cannot use a specific one instead of the default uniform generator. Your link explains reproducibility with pseudorandom numbers and seeds, which is not the issue in my question.
– pjercic, Commented Jan 30 at 20:54

mozway · Accepted Answer · 2025-01-31 09:41:55Z

Passing a Generator to sample only changes how the random number generator is initialized; it doesn't change the distribution that is used. Random sampling is uniform (choice is used internally [source]), and you can't change that directly with the random_state parameter.
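This can be checked empirically. The following is a minimal sketch, not part of the original answer: the 100-row frame, the seed and the number of draws are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in 100-row frame (assumption; the question builds its df differently).
df = pd.DataFrame({"temperature": np.arange(1, 101)})

# The Generator only supplies the randomness; sample() still draws rows uniformly.
rng = np.random.default_rng(0)
drawn = df.sample(n=100_000, replace=True, random_state=rng)

# How often each row label was drawn; with uniform sampling every row
# should land near 100_000 / 100 = 1000 draws.
counts = drawn.index.value_counts()
print(counts.min(), counts.max())

# A float such as rng.normal() is not an accepted random_state value.
try:
    df.sample(n=10, random_state=rng.normal())
except (ValueError, TypeError) as exc:
    print(type(exc).__name__, exc)
```

The per-row counts stay close to each other, which is what uniform sampling predicts regardless of which Generator is passed.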

Also note that normal sampling doesn't really make sense for discrete values (like the rows of a DataFrame).

Now, if you want to sample the rows of your DataFrame in a non-uniform way (for example, with weights that follow a normal distribution), you can use the weights parameter to pass a custom weight for each row.

Here is an example with normal weights (although I'm not sure if this makes much sense):

    rng = np.random.default_rng()
    # Take absolute values, since sampling weights cannot be negative
    weights = abs(rng.normal(size=len(df)))
    sampled = df.sample(n=10000, replace=True, weights=weights)

Another example based on this Q/A. Here we'll give higher probabilities to the rows from the middle of the DataFrame:

    from scipy.stats import norm

    N = len(df)
    # Gaussian weights centered on the middle row, with a standard deviation of 5 rows
    weights = norm.pdf(np.arange(N) - N // 2, scale=5)
    df.sample(n=10, weights=weights).sort_index()

Output (mostly rows around 50):

        temperature
    43           94
    44           50
    47           80
    48           99
    50           63
    51           52
    52            1
    53           20
    54           41
    63            3

[Figure: probabilities of sampling with a bias for the center, with the sampled points marked]
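The per-row probabilities behind this bias can be recomputed by normalizing the weights. A minimal sketch, assuming a stand-in 100-row frame, a fixed seed, and a NumPy-only Gaussian that is proportional to norm.pdf(..., scale=5) (the constant factor cancels once the weights are normalized):

```python
import numpy as np
import pandas as pd

# Stand-in 100-row frame (assumption; any frame of the same length works).
df = pd.DataFrame({"temperature": np.arange(1, 101)})
N = len(df)

# Unnormalized Gaussian centered on the middle row, standard deviation 5 rows.
offsets = np.arange(N) - N // 2
weights = np.exp(-0.5 * (offsets / 5) ** 2)

# Probability of each row being picked on a single draw.
probs = pd.Series(weights / weights.sum(), index=df.index)
print(probs.idxmax())            # 50, the middle row
print(probs.loc[40:60].sum())    # share of the probability mass within 10 rows of the center

# Draw 10 distinct rows with these weights (seed chosen arbitrarily).
rng = np.random.default_rng(0)
sampled = df.sample(n=10, weights=weights, random_state=rng).sort_index()
print(sampled.index.tolist())
```

Most of the probability mass sits within a couple of standard deviations of the middle row, which is why the sampled labels cluster around 50.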

Thank you for the explanation and the example. You have really got my idea right, however quirky it sounds - this is the solution I was looking for. BUT I was really looking forward to using Generators instead of weights, for code simplicity. Not to mention there IS a normal Generator in NumPy. I mean, if a Generator exists, why couldn't I use it? That was my line of reasoning.

@pjercic actually that's not exactly true: rng.normal is not a Generator, rng is. Internally, when you pass a Generator to sample, it uses rng.choice, which is uniform (unless you pass weights). So you really cannot make sample use rng.normal (it wouldn't make sense, since the normal distribution is neither discrete nor bounded). The best you can do is really to use the weights.
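If the goal is mainly code simplicity, the weighted draw can be wrapped once and driven by a Generator directly. This is only a sketch: sample_rows_gaussian is a hypothetical helper name, not a pandas or NumPy function, and the stand-in frame, the seed and the default scale of 5 rows are assumptions. It calls Generator.choice with explicit probabilities, mirroring the weighted choice described in the comment above.

```python
import numpy as np
import pandas as pd


def sample_rows_gaussian(df, n, rng, scale=5.0):
    """Hypothetical helper: draw n distinct rows, favouring the middle of df."""
    offsets = np.arange(len(df)) - len(df) // 2
    weights = np.exp(-0.5 * (offsets / scale) ** 2)  # Gaussian-shaped weights
    p = weights / weights.sum()                      # choice() needs probabilities summing to 1
    positions = rng.choice(len(df), size=n, replace=False, p=p)
    return df.iloc[np.sort(positions)]               # rows returned in positional order


# Stand-in 100-row frame and arbitrary seed (assumptions for illustration).
df = pd.DataFrame({"temperature": np.arange(1, 101)})
rng = np.random.default_rng(42)
picked = sample_rows_gaussian(df, n=10, rng=rng)
print(picked)
```

The Generator is still only the source of randomness here; the normal shape comes entirely from the probabilities passed as p.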
