simulated samples for central limit theorem

Question

I am trying to help students visualize the central limit theorem and wanted to do this with simulated data.

I created a population dataset with three variables:

    from random import seed
    from numpy.random import normal, negative_binomial, binomial
    import pandas as pd

    data = pd.DataFrame({
        "Variable A": normal(0, 1, 100000),
        "Variable B": negative_binomial(1, 0.5, 100000),
        "Variable C": binomial(1, 0.5, 100000)
    })

I then wrote a function that allows me to specify different sample sizes, whether the sampling is conditional, and whether I collect multiple repeated samples.

    def iterated_sample(data_frame, type = "random", sample_size = [20, 50, 100, 200, 500, 1000, 2000], number_of_samples = 1):
        def single_sample(data_frame, type, sample_size):
            if (type == "random"):
                single_sample_data = list(map(lambda x: data_frame.sample(x), sample_size))
            else:
                condition_normal = data_frame["Variable B"] != 0
                condition_poisson = data_frame["Variable C"] < 1
                single_sample_data = list(map(lambda x: data_frame[condition_normal & condition_poisson].sample(x), sample_size))
            return single_sample_data
        result = list(map(lambda x: single_sample(data_frame, type, sample_size), range(number_of_samples)))
        return result

The problem is that the result is a list of lists that is kind of a mess. I want to make a separate list of lists for each sample size.

So my first thought was to jump to list comprehension:

    df = iterated_sample(data_frame = data, number_of_samples = 10)
    sample_20 = [[el for el in element if len(el) == 20] for element in df]
    sample_50 = [[el for el in element if len(el) == 50] for element in df]
    ...
    sample_2000 = [[el for el in element if len(el) == 2000] for element in df]

This is absolutely gross. Is there a way I can avoid having to write a list comprehension for each of the sample sizes? Or how could I adjust iterated_sample(), as it's pretty stupid and can be improved significantly?

Welcome to Code Review! I changed the title so that it describes what the code does per site goals: "State what your code does in your title, not your main concerns about it." Feel free to edit and give it a different title if there is something more appropriate.
– Sᴀᴍ Onᴇᴌᴀ ♦, Commented Feb 10, 2023 at 8:22

J_H · Accepted Answer · 2023-02-10 05:41:47Z

help students visualize the central limit theorem

Excellent!

 "Variable A": normal(0, 1, 100000), "Variable B": negative_binomial(1, 0.5, 100000), "Variable C": binomial(1, 0.5, 100000)

From both software engineering and pedagogic perspectives, I'm not excited about those three names. Part of this is driven by my desire to refer to e.g. binomial rows with data.C; the two-word name "Variable C" is not a valid Python identifier, so it doesn't support that dotted attribute access. The other part is to avoid the indirection: please offer a citation to an author who used A, B, C, or spell them out with e.g. data.binomial. No biggie, that's my input; clearly the existing form also works.
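To illustrate the naming point, here is a minimal sketch on a smaller population. The column names `normal`, `negative_binomial`, and `binomial` are only my suggestion, and the `default_rng` seed and the population size of 1,000 are assumptions made to keep the demo quick and reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
population_size = 1_000  # assumption: smaller than the original 100000

# Hypothetical single-word column names that say which distribution was drawn.
data = pd.DataFrame(
    {
        "normal": rng.normal(0, 1, population_size),
        "negative_binomial": rng.negative_binomial(1, 0.5, population_size),
        "binomial": rng.binomial(1, 0.5, population_size),
    }
)

# Single-word names support dotted access as well as bracket access.
assert data.binomial.equals(data["binomial"])
print(data.binomial.mean())
```

One caveat: dotted access only works when the column name does not collide with an existing DataFrame attribute or method, so a column called `mean` would still need brackets.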

def iterated_sample(data_frame, type = "random", ...

We usually try to avoid shadowing builtins such as dir & list. Here, I recommend you go with the usual convention of appending a suffix to the identifier: type_.
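A small sketch of why the shadowing matters. The two function names here are made up for the illustration and are not part of the original code.

```python
def describe_shadowed(value, type="random"):
    # Inside this function the name `type` is the string parameter,
    # so the builtin type() is no longer reachable under that name.
    try:
        return type(value)
    except TypeError as error:
        return f"TypeError: {error}"


def describe(value, type_="random"):
    # With the trailing underscore, the builtin stays available.
    return type(value).__name__, type_


print(describe_shadowed(3))  # the call fails because a str is not callable
print(describe(3))  # ('int', 'random')
```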

Also, df is a conventional identifier, but yeah I get it, you're spelling it out in a teaching context, very good.

 def single_sample(data_frame, ...

This is a nested def. There is nothing wrong with that, exactly. I will just confess that I am prejudiced against such nesting. If it is very very simple, and especially if there's motivation for sharing the same local variables, it can be a net win. The downsides are coupling (shared variables), and it is completely inaccessible for unit tests to probe.

Students copy what they see. This code is for teaching purposes, and there would be no downside to breaking out single_sample as a non-nested function. I recommend you do that.

As a separate concern, I would really like to see a """docstring""" describing the single responsibility of single_sample. It is admirably short. Yet I find I cannot articulate its single responsibility.
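Here is one possible reading of that responsibility, written as a top-level function with a docstring. This is a sketch under my own assumptions, not a drop-in replacement: the name `draw_samples`, the `condition` parameter, and the single-column demo population are all hypothetical.

```python
import numpy as np
import pandas as pd


def draw_samples(population, sample_sizes, condition=None):
    """Draw one random sample of rows from `population` per requested size.

    If `condition` (a boolean Series aligned with `population`) is given,
    rows are drawn only from those where it is True.
    Returns a list of DataFrames, in the same order as `sample_sizes`.
    """
    eligible = population if condition is None else population[condition]
    return [eligible.sample(size) for size in sample_sizes]


rng = np.random.default_rng(seed=0)  # arbitrary seed for the demo population
population = pd.DataFrame({"binomial": rng.binomial(1, 0.5, 1_000)})

samples = draw_samples(population, [20, 50])
print([len(sample) for sample in samples])  # [20, 50]

# Conditional sampling: only rows where the binomial draw is 0.
zeros_only = draw_samples(population, [10], condition=population["binomial"] < 1)
```

Because it is no longer nested, a unit test can call it directly. Note that `DataFrame.sample` raises a `ValueError` if a requested size exceeds the number of eligible rows, which is worth mentioning to students when a condition is applied.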

We have some anonymous lambdas in this function. Consider giving them informative names. Students copy what they see. Offer them self-descriptive examples, in the hope that they, too, will write such code.
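For instance, the `lambda x: data_frame.sample(x)` could become a small named helper. The name `sample_of_size` and the toy population below are my own placeholders.

```python
import pandas as pd


def sample_of_size(population, size):
    """Return `size` randomly chosen rows of `population`."""
    return population.sample(size)


population = pd.DataFrame({"binomial": [0, 1] * 50})  # toy stand-in for the real data
sample_sizes = [5, 10, 20]

# Before: list(map(lambda x: population.sample(x), sample_sizes))
samples = [sample_of_size(population, size) for size in sample_sizes]
print([len(sample) for sample in samples])  # [5, 10, 20]
```

The call site now reads almost like the sentence you would say to a student: "take a sample of each size".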

Often result is a good identifier choice. But here consider choosing a name from the domain, such as sample.

tiny style nit:

df = iterated_sample(data_frame = data, number_of_samples = 10)

I encourage you to blacken your source code every now and again, to improve PEP-8 conformance. Alternatively, consider linting occasionally, and manually adopting suggested edits.

sample_20 = [[el for el in element if len(el) == 20] for element in df]

No, please don't do that.

We picked out elements from the data_frame. And then we picked out more than one el from each element? Honestly, this is just lazy naming. Please help me to reason about this code. Tell me the meaning of each element, and each el. Use the language of the problem domain.

Often ... for row in df] will make sense. And then we can speak of row.C or row.binomial or whatever.

Is there a way I can avoid having to write a list comprehension for each of the sample sizes?

Yes, certainly. A simple loop should suffice:

    sample_sizes = [20, 50, 100, 200, 500, 1000, 2000]
    for size in sample_sizes:
        ...
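Fleshed out a little, the loop could collect the repeated samples in a dictionary keyed by sample size, so there is no need for one variable per size. This is only a sketch: the name `samples_by_size`, the toy population, and the shortened size list are my assumptions.

```python
import pandas as pd

population = pd.DataFrame({"binomial": [0, 1] * 500})  # toy stand-in population
sample_sizes = [20, 50, 100]  # shortened list, to keep the demo small
number_of_samples = 3

# One dictionary entry per sample size, each holding the repeated samples.
samples_by_size = {}
for size in sample_sizes:
    samples_by_size[size] = [
        population.sample(size) for _ in range(number_of_samples)
    ]

print({size: len(samples) for size, samples in samples_by_size.items()})
# {20: 3, 50: 3, 100: 3}
```

Looking up `samples_by_size[50]` then replaces the `sample_50` variable.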

Also, notice that instead of e.g.

 df2["sample_20"] = ...

you can write

    size = 20
    df2[f"sample_{size}"] = ...
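Putting the f-string together with the loop, one way to get something plottable for the central limit theorem is a table of sample means with one column per size. This is a sketch under my own assumptions: it draws with NumPy's `Generator.choice` rather than `DataFrame.sample`, and the names `sample_means` and `means_by_column` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
population = rng.negative_binomial(1, 0.5, 10_000)  # a skewed population
sample_sizes = [20, 200, 2000]
number_of_samples = 100

means_by_column = {}
for size in sample_sizes:
    # Each entry is the mean of one sample drawn without replacement.
    means_by_column[f"sample_{size}"] = [
        rng.choice(population, size=size, replace=False).mean()
        for _ in range(number_of_samples)
    ]

sample_means = pd.DataFrame(means_by_column)
print(list(sample_means.columns))  # ['sample_20', 'sample_200', 'sample_2000']
print(sample_means.std())
```

The spread of the sample means shrinks as the sample size grows, which is the behavior the students are meant to see.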

Overall?

Crafting good source code is hard, because communicating technical ideas to other people is hard.

And it is more difficult to write code that will pass muster with random reviewers and the perspectives they hold.

And it is more difficult still to write good pedagogic code. It is just amazing what students will copy-n-paste, and incorporate into their work going forward. What "works" for correct & maintainable production code is seldom "good enough" for teaching purposes. Alas, teachers have such a high standard to meet when assembling a lesson! But the pay-off is the change you see in your students when they absorb what is important.

Is this good enough to ship, to teach? Yes!

Should we learn from experience, from how it is received in the classroom, and revise it for subsequent academic years? Yes, again!

This is absolutely wonderful! Thank you for all of the advice and for all of your encouragement! I agree that a loop would be better than the list comprehension, it is super gross. But since they can be so darn inefficient, I was hoping I could come up with some alternative. But, I guess if it works... *shrugs*
– Damon C. Roberts, Commented Feb 10, 2023 at 10:26
