In the general case, the problem of finding the sample of words with the "most even" distribution of letters is NP-hard. Here I'm considering a general instance of this problem to be:
Given an alphabet \$ Σ \$, a set \$ W \$ of 4-letter words over that alphabet, and an integer \$ n ≤ \left|W\right| \$, find the subset \$ W^* ⊆ W \$ with size \$ \left|W^*\right| = n \$ having the smallest variance of letter counts among all subsets of that size.
Here's a proof that this is NP-hard, by a reduction from EXACT COVER BY 3-SETS (that is, EXACT COVER restricted to sets of size 3, also known as "X3C"). An instance of X3C consists of a finite set \$ X \$ with \$ \left|X\right| \$ divisible by 3, and a collection \$ S \$ of 3-element subsets of \$ X \$; the question is whether there is a subcollection of \$ S \$ that covers each element of \$ X \$ exactly once.
This translates into the word problem as follows. Let \$ Σ = X ∪ S \$; let \$ W \$ be the set of four-letter words \$ abcd \$ such that \$ \{ a, b, c \} = d ∈ S \$; and let \$ n = { \left|X\right| \over 3 } \$. Then solve the word problem to find \$ W^* \$ with the smallest variance of letter counts. There is an exact cover if and only if the variance of the letter counts in \$ W^* \$ is zero. (Because there can only be one instance of any letter from \$ S \$, and so if the variance is zero, there must be exactly one instance of each letter from \$ X \$, and since there are \$ { \left|X\right| \over 3 } \$ words, each element of \$ X \$ is covered exactly once.)
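To make the reduction concrete, here's a small sketch of the construction in Python. The helper name and the representation of letters (arbitrary hashable tokens rather than characters, with one fresh token per set in \$ S \$) are my own choices, not part of the proof:

    def x3c_to_word_problem(X, S):
        """Build a word-problem instance (alphabet, words, n) from an X3C
        instance, where X is a set and S is a list of 3-element subsets
        of X (with mutually comparable elements, e.g. strings).
        """
        set_letters = [("set", i) for i in range(len(S))]  # one fresh letter per set
        alphabet = set(X) | set(set_letters)               # Σ = X ∪ S
        # One word abcd per set d = {a, b, c} in S; words are 4-tuples of letters.
        words = [tuple(sorted(s)) + (set_letters[i],)
                 for i, s in enumerate(S)]
        return alphabet, words, len(X) // 3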
Having code at the top level of a module makes it hard to test from the interactive interpreter, because whenever you reload the module, the code runs. It's best to guard top-level code with `if __name__ == '__main__':`.
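For example, a minimal sketch of the pattern (the body of `main` is just a placeholder):

    def main():
        print("running as a script")  # placeholder for the real work

    if __name__ == '__main__':
        main()  # runs when executed directly, but not on import or reload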
The comment for `sampl` doesn't explain the most important points about the behaviour of the function, namely: what arguments does it take? and what does it return?

The number 100 is arbitrary, so it ought to be a parameter.
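For example, a docstring along these lines would answer both questions (and turns the 100 into a default parameter value); the signature is my guess, since the original isn't shown:

    def sampl(words, n=100):
        """Return a random sample of n words (default 100) from the
        sequence words, chosen to have an even distribution of letters.
        """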
When finding the minimum or maximum, it's best to avoid a starting point like this, and instead to rearrange the code into a generator that can then be passed to `min`. That way you can avoid arbitrary starting points like 1000000.
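For example, a sketch with made-up data showing the two shapes of the code:

    import random

    data = [random.randrange(100) for _ in range(1000)]

    # With an arbitrary starting point:
    best, best_score = None, 1000000
    for _ in range(100):
        s = random.sample(data, 10)
        if sum(s) < best_score:
            best, best_score = s, sum(s)

    # Rearranged as a generator passed to min: no starting point needed.
    best = min((random.sample(data, 10) for _ in range(100)), key=sum)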
Writing `for w in range(r):` is misleading, because `w` is not actually used (it's immediately overwritten by the random sample). It's conventional to write `_` for unused loop variables.

The code uses NumPy to do the statistics on the sample. But there are at most 400 letters in the sample, so this doesn't really gain much: NumPy only shows a benefit for large volumes of data. I think that using the standard Python functions `statistics.pstdev`, `max` and `min` would be adequate for this amount of data.

Since the standard deviation is only used for comparison, you'd get the same results if you used the variance instead, and this would have the advantage of avoiding the square root.
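A quick sketch of why the comparison is unaffected (the numbers are toy data of mine):

    from statistics import pstdev, pvariance

    a = [3, 1, 4, 1, 5]
    b = [2, 2, 2, 2, 7]

    # sqrt is monotonic, so variance and standard deviation always agree
    # about which collection of counts is more spread out.
    assert (pvariance(a) < pvariance(b)) == (pstdev(a) < pstdev(b))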
Instead of:

    counts = [v for k, v in Counter("".join(w)).items()]

use `itertools.chain.from_iterable` and write:

    counts = list(Counter(chain.from_iterable(ww)).values())

This avoids concatenating the words. [Improved by Veedrac in comments.]

Putting all this together:

    from collections import Counter
    from itertools import chain
    import random
    from statistics import pvariance

    def letter_counts(words):
        """Generate the letter counts of the words."""
        return Counter(chain.from_iterable(words)).values()

    def score_variance(words):
        """Score words according to variance of letter counts."""
        return pvariance(letter_counts(words))

    def score_range(words):
        """Score words according to range of letter counts."""
        counts = list(letter_counts(words))
        return max(counts) - min(counts)

    def best_sample(words, score, n=100, r=100):
        """Generate r (default 100) random samples of n (default 100)
        elements from words, and return the sample with the smallest score.
        """
        return min((random.sample(words, n) for _ in range(r)), key=score)
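A quick way to exercise the result, assuming the functions above are in scope (the word list is made up for illustration):

    import random

    words = ["".join(random.sample("abcdefghij", 4)) for _ in range(500)]
    sample = best_sample(words, score_variance, n=50, r=200)
    print(score_variance(sample), score_range(sample))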