Python 2 using pypy and pp: n = 15 in 3 minutes
Also just a simple brute force. Interesting to see that I get nearly the same speed as kuroi neko with C++. My code can reach n = 12 in about 5 minutes, and I only run it on one virtual core.
edit: Reduce search space by a factor of n
I noticed that a cyclically shifted vector A* of A produces the same probabilities (the same counts) as the original vector A when I iterate over all B. E.g. the vector (1, 1, 0, 1, 0, 0) has the same probabilities as each of the vectors (1, 0, 1, 0, 0, 1), (0, 1, 0, 0, 1, 1), (1, 0, 0, 1, 1, 0), (0, 0, 1, 1, 0, 1) and (0, 1, 1, 0, 1, 0) when choosing a random B. Therefore I don't have to iterate over each of these 6 vectors, but only over one of them, and replace count[i] += 1 with count[i] += cycle_number.
This reduces the complexity from Theta(6^n) to Theta(6^n / n). Therefore for n = 13 it's about 13 times as fast as my previous version. It calculates n = 13 in about 2 minutes 20 seconds. For n = 14 it is still a little bit too slow; it takes about 13 minutes.
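The bookkeeping for this looks roughly like the following sketch (cyclic_class and representatives are my names, not the repo's):

    from itertools import product

    def cyclic_class(A):
        """All cyclic rotations of the tuple A."""
        n = len(A)
        return set(tuple(A[(i + j) % n] for j in range(n)) for i in range(n))

    def representatives(n):
        """One vector A per cyclic class, together with the class size."""
        visited, todo = set(), {}
        for A in product((0, 1), repeat=n):
            if A not in visited:
                cls = cyclic_class(A)
                visited.update(cls)
                todo[A] = len(cls)  # later: count[i] += len(cls) instead of += 1
        return todo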
edit 2: Multi-core programming
Not really happy with the next improvement. I decided to also try to execute my program on multiple cores. On my 2 + 2 cores I can now calculate n = 14 in about 7 minutes. Only a factor-of-2 improvement.
The code is available in this github repo: Link. The multi-core programming makes it a little bit ugly.
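The general pattern looks roughly like the sketch below; count_chunk and split_todo are hypothetical stand-ins for the actual worker and partitioning code in the repo, which does considerably more bookkeeping:

    import pp

    def count_chunk(todo_chunk, n):
        # hypothetical worker: runs the B enumeration for its share of the A
        # vectors and returns a partial count list of length n
        return [0] * n  # placeholder body

    def split_todo(todo, k):
        # deal the (A, cycled_count) pairs round-robin into k chunks
        items = list(todo.items())
        return [dict(items[i::k]) for i in range(k)]

    job_server = pp.Server()   # autodetects the number of CPUs
    jobs = [job_server.submit(count_chunk, (chunk, n), modules=("itertools",))
            for chunk in split_todo(todo, job_server.get_ncpus())]

    count = [0] * n
    for job in jobs:
        partial = job()        # blocks until the worker has finished
        count = [c + p for c, p in zip(count, partial)]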
edit 3: Reducing search space for A vectors and B vectors
I noticed the same mirror symmetry for the vectors A as kuroi neko did. I'm still not sure why this works (or whether it works for every n).
Reducing the search space for the B vectors is a bit cleverer. I replaced the generation of the vectors (itertools.product) with my own function. Basically, I start with an empty list and put it on a stack. Until the stack is empty, I remove a list; if it does not yet have length n, I generate 3 other lists (by appending -1, 0 and 1) and push them onto the stack. If a list does have length n, I can evaluate the sums.
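Stripped of all pruning, that generation looks roughly like this (a minimal sketch; gen_B is my name, not the repo's):

    def gen_B(n):
        """Generate all vectors in {-1, 0, 1}^n with an explicit stack (DFS)."""
        stack = [[]]
        while stack:
            B = stack.pop()
            if len(B) < n:
                for v in (-1, 0, 1):
                    stack.append(B + [v])   # push the three extended lists
            else:
                yield B                     # B is complete, evaluate the sums here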
Now that I generate the vectors myself, I can filter them depending on whether the sum = 0 is still reachable or not. E.g. if my vector A is (1, 1, 1, 0, 0) and my vector B looks like (1, 1, ?, ?, ?), I know that I cannot fill the ? with values such that A*B = 0. So I don't have to iterate over any of the 27 vectors B of the form (1, 1, ?, ?, ?).
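The prune from that example can be checked in a few lines (the names here are mine, just to make it concrete):

    A = (1, 1, 1, 0, 0)
    prefix = (1, 1)                                   # B = (1, 1, ?, ?, ?)
    partial = sum(a * b for a, b in zip(A, prefix))   # current dot product: 2
    ones_left = sum(A[len(prefix):])                  # remaining 1s in A: 1
    print(abs(partial) <= ones_left)                  # False -> prune this branch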
We can improve on this if we handle the values for i = 1 separately. As noted in the question, the values for i = 1 form the sequence A081671. There are many ways to calculate those; I chose the simple recurrence a(n) = (4*(2*n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n. Since we can calculate i = 1 in basically no time, we can filter more vectors for B. E.g. A = (0, 1, 0, 1, 1) and B = (1, -1, ?, ?, ?). We can ignore the vectors where the first ? = 1, because A * cycled(B) > 0 for all of these vectors. I hope you can follow; it's probably not the best example.
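The recurrence is cheap to implement; a small version of the helper that the main code calls oeis_A081671 could look like this (the starting values a(0) = 1, a(1) = 4 are my assumption; they match the i = 1 entry of the output below):

    def oeis_A081671(n):
        """Values for i = 1 (sequence A081671), via the recurrence
        a(n) = (4*(2*n-1)*a(n-1) - 12*(n-1)*a(n-2)) / n with a(0) = 1, a(1) = 4."""
        if n < 2:
            return [1, 4][n]
        a, b = 1, 4                      # a(0), a(1)
        for k in range(2, n + 1):
            a, b = b, (4 * (2 * k - 1) * b - 12 * (k - 1) * a) // k
        return b

For example, oeis_A081671(15) gives 83940771168, the i = 1 count in the output below.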
With this I can calculate n = 15 in 6 minutes.
edit 4:
Quickly implemented kuroi neko's great idea that B and -B produce the same results. Speedup x2. The implementation is only a quick hack, though. n = 15 in 3 minutes.
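The idea boils down to only enumerating B vectors whose first non-zero entry is +1 and counting them twice; in the main code below the used_negative flag implements this during the enumeration. As a standalone check it could look like this sketch (is_canonical is my name):

    def is_canonical(B):
        """True if B should be enumerated: its first non-zero entry is +1.
        Each such B stands for itself and for -B (weight 2); the all-zero
        vector is its own negation and keeps weight 1."""
        for v in B:
            if v:
                return v == 1
        return True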
Code:
For the complete code visit Github. The following code is only a representation of the main features. I left out imports, multicore programming, printing the results, ...
    count = [0] * n
    count[0] = oeis_A081671(n)

    # generating all important vectors A
    visited = set(); todo = dict()
    for A in product((0, 1), repeat=n):
        if A not in visited:
            # generate all vectors which have the same probability:
            # mirrored and cycled vectors
            same_probability_set = set()
            for i in range(n):
                tmp = [A[(i+j) % n] for j in range(n)]
                same_probability_set.add(tuple(tmp))
                same_probability_set.add(tuple(tmp[::-1]))
            visited.update(same_probability_set)
            todo[A] = len(same_probability_set)

    # for each vector A, create all possible vectors B
    stack = []
    for A, cycled_count in todo.iteritems():
        ones = [sum(A[i:]) for i in range(n)] + [0]  # + [0], so that later ones[n] doesn't throw an exception
        stack.append(([0] * n, 0, 0, 0, False))

        while stack:
            B, index, sum1, sum2, used_negative = stack.pop()
            if index < n:
                # fill vector B[index] in all possible ways,
                # so that it's still possible to reach 0.
                if used_negative:
                    for v in (-1, 0, 1):
                        sum1_new = sum1 + v * A[index]
                        sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                        if abs(sum1_new) <= ones[index+1]:
                            if abs(sum2_new) <= ones[index] - A[n-1]:
                                C = B[:]
                                C[index] = v
                                stack.append((C, index + 1, sum1_new, sum2_new, True))
                else:
                    for v in (0, 1):
                        sum1_new = sum1 + v * A[index]
                        sum2_new = sum2 + v * A[index - 1 if index else n - 1]
                        if abs(sum1_new) <= ones[index+1]:
                            if abs(sum2_new) <= ones[index] - A[n-1]:
                                C = B[:]
                                C[index] = v
                                stack.append((C, index + 1, sum1_new, sum2_new, v == 1))
            else:
                # B is complete, calculate the sums
                count[1] += cycled_count  # we know that the sum = 0 for i = 1
                for i in range(2, n):
                    sum_prod = 0
                    for j in range(n-i):
                        sum_prod += A[j] * B[i+j]
                    for j in range(i):
                        sum_prod += A[n-i+j] * B[j]
                    if sum_prod:
                        break
                    else:
                        if used_negative:
                            count[i] += 2*cycled_count
                        else:
                            count[i] += cycled_count
Usage:
You have to install pypy (for Python 2!!!); the parallel python module isn't ported to Python 3. Then you have to install the parallel python module pp-1.6.4.zip: extract it, cd into the folder and call pypy setup.py install.
Then you can call my program with
pypy you-do-the-math.py 15
It will automatically determine the number of CPUs. There may be some error messages after the program finishes; just ignore them. n = 16 should be possible on your machine.
Output:
    Calculation for n = 15 took 2:50 minutes

     1  83940771168 / 470184984576  17.85%
     2  17379109692 / 470184984576   3.70%
     3   3805906050 / 470184984576   0.81%
     4    887959110 / 470184984576   0.19%
     5    223260870 / 470184984576   0.05%
     6     67664580 / 470184984576   0.01%
     7     30019950 / 470184984576   0.01%
     8     20720730 / 470184984576   0.00%
     9     18352740 / 470184984576   0.00%
    10     17730480 / 470184984576   0.00%
    11     17566920 / 470184984576   0.00%
    12     17521470 / 470184984576   0.00%
    13     17510280 / 470184984576   0.00%
    14     17507100 / 470184984576   0.00%
    15     17506680 / 470184984576   0.00%
Notes and ideas:
- I have an i7-4600M processor with 2 cores and 4 threads. It doesn't matter if I use 2 or 4 threads. The CPU usage is 50% with 2 threads and 100% with 4 threads, but it still takes the same amount of time. I don't know why. I checked that each thread only has to process half the amount of data when there are 4 threads, checked the results, ...
- I use a lot of lists. Python isn't very efficient at storing them, and I have to copy lots of lists, ... So I thought of using an integer instead. I could use the bits 00 (for 0) and 11 (for 1) in the vector A, and the bits 10 (for -1), 00 (for 0) and 01 (for 1) in the vector B. For the product of A and B, I would only have to calculate A & B and count the 01 and 10 blocks. Cycling can be done by shifting the vector and using masks, ... I actually implemented all of this; you can find it in some of my older commits on Github. But it turned out to be slower than with lists. I guess pypy really optimizes list operations.
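For reference, a minimal sketch of that 2-bit encoding (my names, not the repo's exact implementation):

    def encode_A(A):
        """Pack A into an integer: 0 -> 00, 1 -> 11, two bits per entry."""
        x = 0
        for a in reversed(A):
            x = (x << 2) | (0b11 if a else 0b00)
        return x

    def encode_B(B):
        """Pack B into an integer: 0 -> 00, 1 -> 01, -1 -> 10."""
        x = 0
        for b in reversed(B):
            x = (x << 2) | (0b01 if b == 1 else 0b10 if b == -1 else 0b00)
        return x

    def dot(a_bits, b_bits, n):
        """A & B keeps B's slot wherever A has a 1; 01 slots count +1, 10 slots count -1."""
        low_mask = int('01' * n, 2)    # selects the low bit of every slot
        high_mask = int('10' * n, 2)   # selects the high bit of every slot
        x = a_bits & b_bits
        return bin(x & low_mask).count('1') - bin(x & high_mask).count('1')

    def rotate(b_bits, n):
        """Cycle B by one position: the lowest slot wraps around to the top."""
        return (b_bits >> 2) | ((b_bits & 0b11) << (2 * (n - 1)))

    # e.g. dot(encode_A((1, 0, 1)), encode_B((1, -1, -1)), 3) == 0, like 1*1 + 0*(-1) + 1*(-1)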