help students visualize the central limit theorem
Excellent!
"Variable A": normal(0, 1, 100000), "Variable B": negative_binomial(1, 0.5, 100000), "Variable C": binomial(1, 0.5, 100000)
From both software engineering and pedagogic perspectives, I'm not excited about those three names. Part of this is driven by my desire to refer to e.g. binomial rows with data.C; a two-word name like "Variable C" is not a valid identifier, so it cannot be used in such dotted attribute access. The other part is to avoid the indirection: please offer a citation to an author who used A, B, C, or spell them out with e.g. data.binomial. No biggie, that's my input; clearly the existing form also works.
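A minimal sketch of the difference, assuming the data lives in a pandas DataFrame built from NumPy draws. The question's construction is only partly quoted, so the `rng` generator, the reduced `size`, and the spelled-out column names below are illustrative choices rather than the author's code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible
size = 1_000  # smaller than the original 100000, to keep the example quick

# Column names that are valid Python identifiers support attribute access.
data = pd.DataFrame(
    {
        "normal": rng.normal(0, 1, size),
        "negative_binomial": rng.negative_binomial(1, 0.5, size),
        "binomial": rng.binomial(1, 0.5, size),
    }
)

# A two-word name is still reachable, but only through indexing.
spaced = data.rename(columns={"binomial": "Variable C"})

print(data.binomial.mean())         # dotted access works
print(spaced["Variable C"].mean())  # no dotted equivalent exists for this name
print("Variable C".isidentifier(), "binomial".isidentifier())  # False True
```

The `str.isidentifier()` check at the end is a quick way to see which names can be used after a dot.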
def iterated_sample(data_frame, type = "random", ...
We usually try to avoid shadowing builtins such as dir, list, and (here) type. I recommend you go with the usual convention of appending a trailing underscore to the identifier: type_.
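To illustrate why the trailing underscore matters, here is a hypothetical stub. Only the parameter names `data_frame`, `type_`, and `number_of_samples` echo the code under review; the default of 10 and the body are made up for this sketch:

```python
def iterated_sample(data_frame, type_="random", number_of_samples=10):
    """Hypothetical stub: only the signature matters for this illustration."""
    # Because the parameter is spelled type_, the builtin type() is still
    # available inside the function body.
    return f"{type(data_frame).__name__} sampled {number_of_samples} times ({type_})"


print(iterated_sample([1, 2, 3]))
# list sampled 10 times (random)
```

Had the parameter been named `type`, the call `type(data_frame)` inside the body would try to call the string "random" and raise a TypeError.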
Also, df is a conventional identifier, but yeah I get it, you're spelling it out in a teaching context, very good.
def single_sample(data_frame, ...
This is a nested def. There is nothing wrong with that, exactly. I will just confess that I am prejudiced against such nesting. If the nested function is very simple, and especially if there is motivation for sharing the same local variables, it can be a net win. The downsides are coupling (shared variables) and the fact that the nested function is completely inaccessible to unit tests.
Students copy what they see. This code is for teaching purposes, and there would be no downside to breaking out single_sample as a non-nested function. I recommend you do that.
As a separate concern, I would really like to see a """docstring""" describing the single responsibility of single_sample. It is admirably short. Yet I find I cannot articulate its single responsibility.
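A sketch of what the un-nested form could look like. The responsibility written in the docstring ("return the mean of one random sample") is my guess, made only so the example is concrete; the parameters `values`, `sample_size`, `rng`, and `seed` are likewise hypothetical, and plain lists stand in for the data frame:

```python
import random
import statistics


def single_sample(values, sample_size, rng):
    """Return the mean of one random sample of `sample_size` items from `values`.

    Assumed responsibility, for illustration only; sampling is without
    replacement, and `rng` is a random.Random instance so tests can seed it.
    """
    return statistics.fmean(rng.sample(values, sample_size))


def iterated_sample(values, sample_size, number_of_samples, seed=None):
    """Return `number_of_samples` sample means, each from a fresh sample."""
    rng = random.Random(seed)
    return [single_sample(values, sample_size, rng) for _ in range(number_of_samples)]


# Because single_sample lives at module level, a test can call it directly.
population = list(range(100))
mean = single_sample(population, sample_size=20, rng=random.Random(0))
assert min(population) <= mean <= max(population)
print(len(iterated_sample(population, sample_size=20, number_of_samples=10, seed=0)))
```

Passing the random generator in explicitly replaces the shared local variables of the nested version, which removes the coupling and makes the helper deterministic under test.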
We have some anonymous lambdas in this function. Consider giving them informative names. Students copy what they see. Offer them self-descriptive examples, in the hope that they, too, will write such code.
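The lambdas in question are not quoted here, so the following is a generic sketch on made-up data: the same computation written once with an anonymous lambda and once with a named, documented function (`sample_mean` is a hypothetical name).

```python
import statistics

samples = [[2, 4, 6], [1, 1, 1], [10, 20]]  # made-up data

# Anonymous version: the reader must decode the lambda to learn its purpose.
means_anonymous = list(map(lambda sample: statistics.fmean(sample), samples))


def sample_mean(sample):
    """Return the arithmetic mean of one sample."""
    return statistics.fmean(sample)


# Named version: the call site now says what is being computed.
means_named = list(map(sample_mean, samples))

print(means_named)  # [4.0, 1.0, 15.0]
```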
Often result is a good identifier choice. But here consider choosing a name from the domain, such as sample.
tiny style nit:
df = iterated_sample(data_frame = data, number_of_samples = 10)
I encourage you to run black on your source code every now and again, to improve PEP 8 conformance; for instance, PEP 8 asks for no spaces around the = of a keyword argument. Alternatively, consider linting occasionally, and manually adopting suggested edits.
sample_20 = [[el for el in element if len(el) == 20] for element in df]
No, please don't do that.
We picked out elements from the data_frame. And then we picked out more than one el from each element? Honestly, this is just lazy naming. Please help me to reason about this code. Tell me the meaning of each element, and each el. Use the language of the problem domain.
Often ... for row in df] will make sense. And then we can speak of row.C or row.binomial or whatever.
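As an illustration of domain naming, here is the same filtering shape with descriptive identifiers. I do not know the real structure, so the layout below (one entry per variable, each holding lists of observations) and every name in it are assumptions:

```python
# Assumed layout (hypothetical): one entry per variable, each entry holding
# that variable's samples, and each sample being a list of observations.
samples_by_variable = [
    [[0.1, 0.2], [0.3, 0.4, 0.5]],  # samples drawn from the first variable
    [[1, 0], [1, 1, 0]],            # samples drawn from the second variable
]
target_size = 3  # stands in for the 20 used in the quoted line

samples_of_target_size = [
    [sample for sample in variable_samples if len(sample) == target_size]
    for variable_samples in samples_by_variable
]

print(samples_of_target_size)  # [[[0.3, 0.4, 0.5]], [[1, 1, 0]]]
```

With names like these, a reader can tell what is being kept and what is being discarded without running the code.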
Is there a way I can avoid having to write a list comprehension for each of the sample sizes?
Yes, certainly. A simple loop should suffice:
sample_sizes = [20, 50, 100, 200, 500, 1000, 2000]
for size in sample_sizes:
    ...
Also, notice that instead of e.g.
df2["sample_20"] = ...
you can write
size = 20
df2[f"sample_{size}"] = ...
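Putting the loop and the f-string key together might look like the sketch below. A plain dict stands in for df2 (pandas column assignment uses the same `df2[key] = ...` form), and the population, the seeded generator, and the choice of storing a sample mean are all assumptions made for illustration:

```python
import random
import statistics

rng = random.Random(0)  # seeded for reproducibility
population = [rng.random() for _ in range(5_000)]  # stand-in for one data column
sample_sizes = [20, 50, 100, 200, 500, 1000, 2000]

# One loop replaces seven near-identical statements.
means_by_size = {}
for size in sample_sizes:
    sample = rng.sample(population, size)
    means_by_size[f"sample_{size}"] = statistics.fmean(sample)

print(list(means_by_size))
# ['sample_20', 'sample_50', 'sample_100', 'sample_200', 'sample_500', 'sample_1000', 'sample_2000']
```

Adding another sample size then means editing one list rather than copying a line.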
Overall?
Crafting good source code is hard, because communicating technical ideas to other people is hard.
And it is more difficult to write code that will pass muster with random reviewers and the perspectives they hold.
And it is more difficult still to write good pedagogic code. It is just amazing what students will copy-n-paste, and incorporate into their work going forward. What "works" for correct & maintainable production code is seldom "good enough" for teaching purposes. Alas, teachers have such a high standard to meet when assembling a lesson! But the pay-off is the change you see in your students when they absorb what is important.
Is this good enough to ship, to teach? Yes!
Should we learn from experience, from how it is received in the classroom, and revise it for subsequent academic years? Yes, again!