In the general case, the problem of finding the sample of words with the "most even" distribution of letters is NP-hard. Here I'm considering a general instance of this problem to be:
Given an alphabet \$ Σ \$, a set \$ W \$ of 4-letter words over that alphabet, and an integer \$ n ≤ \left|W\right| \$, find the subset \$ W^* ⊆ W \$ with size \$ \left|W^*\right| = n \$ having the smallest variance of letter counts among all subsets of that size.
Here's a proof that this is NP-hard, by a reduction from EXACT COVER BY 3-SETS (that is, EXACT COVER restricted to sets of size 3, also known as "X3C"). An instance of X3C consists of a finite set \$ X \$ with \$ \left|X\right| \$ divisible by 3, and a collection \$ S \$ of 3-element subsets of \$ X \$; the question is whether there is a subcollection of \$ S \$ that covers each element of \$ X \$ exactly once.
This translates into the word problem as follows. Let \$ Σ = X ∪ S \$; let \$ W \$ be the set of four-letter words \$ abcd \$ such that \$ \{ a, b, c \} = d ∈ S \$; and let \$ n = { \left|X\right| \over 3 } \$. Then solve the word problem to find \$ W^* \$ with the smallest variance of letter counts. There is an exact cover if and only if the variance of the letter counts in \$ W^* \$ is zero. (Because there can only be one instance of any letter from \$ S \$, and so if the variance is zero, there must be exactly one instance of each letter from \$ X \$, and since there are \$ { \left|X\right| \over 3 } \$ words, each element of \$ X \$ is covered exactly once.)
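To make the reduction concrete, here's a small sketch of the construction in Python. The helper name and the representation of letters (arbitrary hashable tokens rather than characters, with one fresh token per set in \$ S \$) are my own choices, not part of the proof:

    def x3c_to_word_problem(X, S):
        """Build a word-problem instance (alphabet, words, n) from an X3C
        instance, where X is a set and S is a list of 3-element subsets
        of X (with mutually comparable elements, e.g. strings).
        """
        set_letters = [("set", i) for i in range(len(S))]  # one fresh letter per set
        alphabet = set(X) | set(set_letters)               # Σ = X ∪ S
        # One word abcd per set d = {a, b, c} in S; words are 4-tuples of letters.
        words = [tuple(sorted(s)) + (set_letters[i],)
                 for i, s in enumerate(S)]
        return alphabet, words, len(X) // 3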
Having code at the top level of a module makes it hard to test from the interactive interpreter, because whenever you reload the module, the code runs. It's best to guard top-level code with `if __name__ == '__main__':`.
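For example, a minimal sketch of the pattern (the body of `main` is just a placeholder):

    def main():
        print("running as a script")  # placeholder for the real work

    if __name__ == '__main__':
        main()  # runs when executed directly, but not on import or reload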
The comment for `sampl` doesn't explain the most important points about the behaviour of the function, namely: what arguments does it take? and what does it return?

The number 100 is arbitrary, so it ought to be a parameter.
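For example, a docstring along these lines would answer both questions (and turns the 100 into a default parameter value); the signature is my guess, since the original isn't shown:

    def sampl(words, n=100):
        """Return a random sample of n words (default 100) from the
        sequence words, chosen to have an even distribution of letters.
        """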
When finding the minimum or maximum, it's best to avoid a starting point like this, and instead to rearrange the code into a generator that can then be passed to `min`. That way you can avoid arbitrary starting points like 1000000.
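For example, a sketch with made-up data showing the two shapes of the code:

    import random

    data = [random.randrange(100) for _ in range(1000)]

    # With an arbitrary starting point:
    best, best_score = None, 1000000
    for _ in range(100):
        s = random.sample(data, 10)
        if sum(s) < best_score:
            best, best_score = s, sum(s)

    # Rearranged as a generator passed to min: no starting point needed.
    best = min((random.sample(data, 10) for _ in range(100)), key=sum)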
Writing `for w in range(r):` is misleading, because `w` is not actually used (it's immediately overwritten by the random sample). It's conventional to write `_` for unused loop variables.

The code uses NumPy to do the statistics on the sample. But there are at most 400 letters in the sample, so this doesn't really gain much: NumPy only shows a benefit for large volumes of data. I think that using the standard Python functions `statistics.pstdev`, `max` and `min` would be adequate for this amount of data.

Since the standard deviation is only used for comparison, you'd get the same results if you used the variance instead, and this would have the advantage of avoiding the square root.
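A quick sketch of why the comparison is unaffected (the numbers are toy data of mine):

    from statistics import pstdev, pvariance

    a = [3, 1, 4, 1, 5]
    b = [2, 2, 2, 2, 7]

    # sqrt is monotonic, so variance and standard deviation always agree
    # about which collection of counts is more spread out.
    assert (pvariance(a) < pvariance(b)) == (pstdev(a) < pstdev(b))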
Instead of:

    counts = [v for k, v in Counter("".join(w)).items()]

use `itertools.chain.from_iterable` and write:

    counts = list(Counter(chain.from_iterable(ww)).values())

This avoids concatenating the words. [Improved by Veedrac in comments.]

Putting all this together:

    from collections import Counter
    from itertools import chain
    import random
    from statistics import pvariance

    def letter_counts(words):
        """Generate the letter counts of the words."""
        return Counter(chain.from_iterable(words)).values()

    def score_variance(words):
        """Score words according to variance of letter counts."""
        return pvariance(letter_counts(words))

    def score_range(words):
        """Score words according to range of letter counts."""
        counts = list(letter_counts(words))
        return max(counts) - min(counts)

    def best_sample(words, score, n=100, r=100):
        """Generate r (default 100) random samples of n (default 100)
        elements from words, and return the sample with the smallest score.
        """
        return min((random.sample(words, n) for _ in range(r)), key=score)
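A quick way to exercise the result, assuming the functions above are in scope (the word list is made up for illustration):

    import random

    words = ["".join(random.sample("abcdefghij", 4)) for _ in range(500)]
    sample = best_sample(words, score_variance, n=50, r=200)
    print(score_variance(sample), score_range(sample))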