Fast sampling of special binary strings

Question

We are going to define a simple little language. A word in this language is a binary string where the longest run of consecutive \$0\$s, is shorter than every (maximal) run of \$1\$s. So for example: \begin{align} 001110111110011 \end{align} which we will write as \begin{align} 0^21^301^50^21^2 \end{align} is not legal, because the following red segment is just as long as the blue segment: \begin{align} \color{red}{0^2}1^301^50^2\color{blue}{1^2} \end{align} The string: \begin{align} 0^31^401^80^31^5 \end{align} is valid, because the longest run of \$0\$s has length three and the shortest (maximal) run of \$1\$s has length four.

Strings of the form \$0^n\$ are somewhat ambiguous in the above description. Technically they should be allowed, since there are no runs of \$1\$s at all, but they seem to in some sense violate the spirit of the description. Whether they are included is up to you, it will not make a huge difference for the challenge.

Challenge

Given a non-negative integer \$n\$, uniformly sample from the words of length \$n\$ in the language. You algorithm must have sub-exponential average-case time complexity in terms of the input value.

You may choose a word using a built-in source of randomness, or you can accept as an additional input a source of randomness (e.g. a infinite list of 1s and 0s, or a stateful blackbox function).

This is code-golf so the goal is to minimize the size of your source code as measured in bytes.

Non-solutions

An easy would-be solution is to sample randomly on binary strings of the correct length until you get something in the language.

We will show that this does not work. This gives a uniform distribution but fails because the average-case time complexity is exponential. On average you will fail exponentially many times before you succeed.

To see this, notice that the string \$010\$ is never a legal substring in our language, regardless of context. So there are at most 7 possible length 3 segments that can extend a valid word at any point.

So the rate at which words in our language grow is \$O\left(\sqrt[3]{7}^n\right)\$ while all binary strings grow at a rate of \$2^n\$. \$2 = \sqrt[3]{8}>\sqrt[3]{7}\approx 1.913\$.

You might try to generate random strings which don't contain \$010\$ instead (this can be done uniformly and efficiently), and then reject the ones that don't work. However a general version of this argument can show that any finite set of forbidden substrings will either rule out some words you want to sample (i.e. make the sampling non-uniform) or it will require exponentially many samples (i.e. the base of the exponent is too large).

let N=1, is 1 the only valid solution, or is 0 also valid? Or should N at least be 3, and both 0 and 1 must be used? — tsh
– tsh, Commented Nov 25 at 2:46
@tsh Good question. I had thought of this at some point, but by the time I got around to writing things down I had forgotten. Since I don't think it makes a big difference and this is a hard challenge I've left it up to the solver to choose. — Wheat Wizard
– Wheat Wizard ♦, Commented Nov 25 at 4:25
Assuming my program is correct, I'm getting approximately \$ \left ( \frac 3 2 \right ) ^ n \$ solutions. You're saying it's possible to uniformly sample from that many solutions in sub-exponential average-case time complexity? — Neil
– Neil, Commented Nov 25 at 16:38
@Neil It should indeed be possible. You don't have to generate all the solutions to give a uniform sampling. The number of solutions has very little impact on the difficulty of the problem. For example all binary strings are certainly more numerous, but are very easy to sample uniformly. A harder version of this problem came up for me in a paper I am working on with a colleague and we have a solution that can be extended to this case. — Wheat Wizard
– Wheat Wizard ♦, Commented Nov 25 at 22:02
Turns out my program is incorrect anyway - it didn't generate 0111100. — Neil
– Neil, Commented Nov 26 at 0:06

DPenner1 · Accepted Answer · 2025-11-28 19:46:47Z

Python 3.8 (pre-release), 560 bytes

I think this is correct.

Admittedly, I was much more interested in figuring out an algorithm than the golfing bit, with emanresu A significantly golfing down my initial solution. This solution does not include the \$0^n\$ string.

from random import* c=lambda d:d and c(d[1:])*len(d)/d.count(d[0])or 1 g=lambda l,x:x and[v:=l[(i:=int(x/(p:=c(l)/len(l))))]]+g(l[:i]+l[i+1:],x-l.index(v)*p)or l f=lambda n,i=1:n//i and[l+[i]for l in f(n-i,i)]+f(n,i+1)or[[]][n:] v=[(1,0,[b:=int(input())])] for p in f(b)[:-1]: x=len(p);y=x//2;z=p[y:];w=p[:y] if p[y]-w[-1]:v+=[(0,z,w)]+-~x%2*[(1,z,w)] if x%2and p[y]-z[1]:v+=[(1,z[1:],p[:y+1])] s=randrange(sum(c(B)*c(C)for A,B,C in v)) for A,B,C in v: D=c(B);E=D*c(C) if s<E: e=g(B,s%D);h=g(C,s//D) while[print(end=str(A)*[e,h][A].pop())]:A=1-A s-=E

Try it online!

Credits for my initial solution (with sampling code): Partition code from an answer by xnor and permutation counting & indexing code adapted from this SO answer by Bartosz Marcinkowski.

Algorithm

For given \$n\$, we generate all integer partitions of \$n\$. The number of partitions is on the order of \$O\left(e^{\pi\sqrt{2n/3}}/n\right)\$ which crucially makes it sub-exponential for some (all?) definitions I saw (specifically one requiring any exponent to be \$o\left(n\right)\$, note the little-o). Each entry in a partition (part in mathematical terminology) represents the length of a run of zeroes or ones (the exponents in the challenge description). Note the partition function returns the parts in descending order. There's also a point of uncertainty here in that while the number of partitions is sub-exponential, I'm not 100% the specific function used to generate them is also sub-exponential.
Note that due to alternation of runs, there will either be equal numbers of runs consisting of ones as runs of zeroes, or a discrepancy of 1. So we iterate through each partition checking the halfway point (it ends up being finicky dealing with even/odd number of parts). Because the parts are ordered, we just need to ensure that the parts on either side of the halfway point do not equal for it to be a valid partition for this problem. The left-half will be runs of ones, the right-half runs of zeroes (for odd number of parts, taking into account whether the string as a whole would start with 0 or 1).
We maintain a list of 3-tuples in the following structure: (starting character, zero runs half-partition, one runs half-partition). So each valid even sized partition generates two entries in the list of tuples I'm maintaining, while odd sized partitions may have one or two entries. This allows us to loop through to calculate the total number of strings as well as the number of strings per partition/starting-character pair so that we can properly weight the selection of a random sample.
On randomly sampling a partition/starting-character pair, we still need to randomly select permutations of the runs. We cannot generate them as this would be factorial complexity. Luckily, permutations can be indexed in \$O\left(n^2\right)\$ time.

Complexity

Note: This describes my own calculation of the theoretical complexity of the idealized algorithm. There may remain errors in my calculation, and for code golf reasons there may be inefficiencies. Realistically, the only thing in the implementation that might push it exponential would be a sub-optimal partitions function.

The algorithmic complexity is dominated by calculating the number of permutations of the parts for each partition. Note that indexing a permutation occurs only once a partition is sampled and does not contribute to overall complexity (it's additive, not multiplicative). The average size of a partition is \$O(\sqrt{n}\log{n})\$, and per my shaky understading of this paper, that is broken down as an average of \$\sqrt{n}\$ distinct parts with \$\log n\$ duplicates. Calculating a partition's number of permutations is generally done by calculating total number (the dominating complexity), then dividing out duplicates, resulting in:

\begin{align} O(factorial\left(\sqrt{n}\log n)\right) + \sqrt{n} factorial\left(\log{n}\right)) \end{align}

Now if we take multiplication to be \$O\left(1\right)\$, factorial has linear complexity resulting in this being \$O\left(\sqrt{n}\log{n}\right)\$ and the algorithm as a whole being \$O\left(e^{\pi\sqrt{2n/3}}\log{n}\,/\sqrt{n}\right)\$. With multiplication complexity taken into account, the best known factorial algorithm is \$O\left(m\log^2{m}\right)\$. This results in \$O\left(\sqrt{n}\log^3{n}\right)\$ and the algorithm as a whole being:

\begin{align} O\left(e^{\pi\sqrt{2n/3}}\log^3{n}\,/\sqrt{n}\right) \end{align}

Nice! This is much easier and elegant than the algorithm I had envisioned for this challenge. (I think it is also slower, but elegance is better for code-golf anyway) — Wheat Wizard
– Wheat Wizard ♦, Commented Nov 27 at 4:02
@WheatWizard Thanks! I was wondering whether multiple algos were possible, it's always more fun when it is, so good to know! — DPenner1
– DPenner1, Commented Nov 27 at 4:05
I suspect there's a bit more to save, but here's 600 bytes. I changed the output format to an array of 1s/0s, and found some quite nice recursive algorithms for c and g (probably adding an O(n) factor or two on top), but aside from that it's just been logic simplifications — emanresu A
– emanresu A, Commented Nov 27 at 7:20
FYI (it doesn't really matter here but it's still good to know), arithmetic operations in python are O(1) except for multiplication which is always less than O(n^2) for arbitrary precision ints. — Themoonisacheese
– Themoonisacheese, Commented Nov 27 at 8:22
@emanresuA Wow, I knew it could be golfed, but I didn't expect a factor of 2 that fast! I've updated my answer now according to your golfs. — DPenner1
– DPenner1, Commented yesterday

l4m2 · Accepted Answer · 2025-11-27 18:31:56Z

JavaScript (Node.js), 259 bytes

f=(n,M,c,R)=>eval(`D=n;E=M;for(C=m=0;m++<n;)for(A={},i=M-m?f:0;i<n;++i)for(d of'01')for(j=0;j++<n;i+j==n&c!=w&e&&q/(C+=q)>Math.random()&&(D=j,E=m))A[s=[i+j,w=j<m,e=d|j==m|R]]=(A[s]||0)+(q=i?A[[i,!w,d]]||0:!+d);n?(+!(D<E)+'').repeat(D)+f(n-D,E,D<E,R|E==D):''`)

Try it online!

-1B emanresu A

Should be polynomial time by dynamic programming on (minimum 1 length*)(first color*, first length, minimum 1 length used) (* mean may be fixed because of previous decisions)

Unless I'm missing something with scoping/similar, you can save a byte by wrapping the whole body in eval, and another four by replacing the ||0s with ~~s: TIO — emanresu A
– emanresu A, Commented 2 days ago
@emanresuA Applied the eval. Not to use ~~s than ||0 cuz for large n, s will likely go above 2^32 — l4m2
– l4m2, Commented 2 days ago
I think this could use a proof sketch that it is a uniform sampling. — Wheat Wizard
– Wheat Wizard ♦, Commented yesterday

Stack Exchange Network

Fast sampling of special binary strings

Challenge

Non-solutions

2 Answers 2

Python 3.8 (pre-release), 560 bytes

Algorithm

Complexity

JavaScript (Node.js), 259 bytes

Linked

Hot Network Questions

Fast sampling of special binary strings

Challenge

Non-solutions

2 Answers 2

Python 3.8 (pre-release), 560 bytes

Algorithm

Complexity

JavaScript (Node.js), 259 bytes

Linked

Related

Hot Network Questions