Recursive Transformer - Non-Record Submission — 1.07424983 val_bpb (4h depth-recurrent hybrid transformer run) #1535
Conversation
…n), 14.2MB 7F+3C depth-recurrent hybrid transformer with loop-aware GPTQ. Includes 10-minute legal seeds (444/300/4) and 4-hour organizer-requested run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Recursive Transformer - Non-Record Submission — 1.07424983 val_bpb (4h depth-recurrent hybrid transformer run)

BPB: 1.1083 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12s, dim=512, layers=11, vocab=1024, code=125582 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
The Crawler: A 23-Day Architecture Research Campaign
This document is a chronological research corpus for rapid iteration on an architecture concept. It is a hybrid flat-plus-crawler transformer with a shared depth-recurrent bottleneck. It is not quite a Universal Transformer, but I ran it for four hours with the assumption that it was close enough.
The tables below are the fast read. I iterated on the concept of a transformer with recursive properties and enjoyed developing an original system.
The end state is straightforward. The legal 10-minute crawler runner is a depth-recurrent hybrid transformer: 7F+3C, loops=3, loop-aware GPTQ, prune+pyminify, at 1.13536063 BPB and 15,844,157 bytes. The long-run confirmation is 7F+3C, loops=3, loop-aware GPTQ, 4-hour full run, at 1.07424983 BPB and 14,184,849 bytes. The final four-hour run establishes stable wind-down behavior and handling of the recursive elements.

Core Tables
Table 1: Final Outcome And Major Checkpoints
Table 2: Naming Translation
K unique blocks applied N times:
- 4F+1C, 4 loops, DeltaNet heads, loop-aware GPTQ
- 5F+1C+5F
- 9F+1C
- 9F+2C
- 7F+3C
- 7F+3C over 4 hours

F means unique flat transformer blocks. C means shared crawler blocks. Unless stated otherwise, the late crawler line uses 3 loops, loop-aware GPTQ, and int6 export.

Table 3: Decisive Tests
- 4F+1C, 4 loops, DeltaNet heads, loop-aware GPTQ, late-start EMA
- 5F+1C+5F vs earlier BW arms
- 8F+3C, 7F+4C, 9F+2C, 9F+3C
- 7F+3C with prune and aggressive pyminify
- 7F+3C over 4 hours

Day-By-Day Research Journal
Day 1 — March 18, 2026: Proof of Concept
The Frugendorff was born on a DGX Spark with 300 training steps, alongside a number of other novel concepts in a rapid ablation sweep.
The idea was simple: instead of 9 unique transformer layers at 512 dimensions, use 3 shared blocks applied 3 times each at 864 dimensions. Same parameter count, wider layers, deeper effective network.
A 7.1% improvement with fewer total parameters. Gravity — auxiliary losses on early loops — hurt. The model learned to suppress early-loop contributions on its own, converging gravity weights to [0.13, 0.13, 0.70]. The mechanism was width, not recursion. That insight would take another week to prove rigorously.
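The parameter accounting behind that trade can be sketched in a few lines of numpy. This is an illustrative toy, assuming a single d×d matrix stands in for each block; the real blocks are full transformer layers, and the helper names are mine:

```python
import numpy as np

def block_params(dim, n_blocks):
    # one d x d weight matrix stands in for each block here; real blocks
    # carry attention + MLP weights, but the scaling ratio is similar
    return n_blocks * dim * dim

flat = block_params(512, 9)    # 9 unique layers at dim 512
shared = block_params(864, 3)  # 3 shared blocks, each applied 3 times

def recurrent_forward(x, blocks, loops):
    # effective depth = len(blocks) * loops, parameters = len(blocks)
    for _ in range(loops):
        for W in blocks:
            x = np.tanh(x @ W)  # toy stand-in for a transformer block
    return x

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((16, 16)) * 0.1 for _ in range(3)]
y = recurrent_forward(rng.standard_normal((4, 16)), blocks, loops=3)
print(shared < flat)  # True: wider shared blocks still use fewer weights
```

The point of the sketch is the inequality: 3 shared blocks at 864 dims cost fewer weights than 9 unique blocks at 512 dims while tripling effective depth.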
Days 2–4 — March 19–21, with Frug Squared landing on March 22: The Automated Sweep
An overnight Qwen-guided architecture search ran 141 configurations and found the sweet spots: 2×4 block layout, cadence 3 (F/N/N pattern), learning rate 2e-3, MLP multiplier 3×. The winner reached 2.3332 BPB — 12% better than baseline.
A second extended sweep of 227 runs pushed further: 4×2 at cadence 4 hit 2.155 BPB. Cadence 4 beat cadence 3 beat cadence 1 beat cadence 2. A clear ordering.
Discovery: Cadence Training. Running all loops every step creates a "bandsaw" loss oscillation — shared weights receive contradictory gradient signals from different loop positions on consecutive steps. Alternating fractal steps (all loops) with normalize steps (single clean pass) in an F/N/N/N pattern gives the shared weights recovery time. The 227-run sweep confirmed cadence 4 is optimal. The bandsaw disappears at cadence ≥ 3.
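The schedule itself is trivial to generate. A minimal sketch (the function and tuple format are my own, not the training harness's):

```python
def cadence_schedule(num_steps, cadence, loops=3):
    # cadence=4 gives the F/N/N/N pattern: one fractal step (all loops),
    # then cadence-1 normalize steps (a single clean pass) so the shared
    # weights recover between conflicting loop-position gradients
    plan = []
    for step in range(num_steps):
        fractal = (step % cadence == 0)
        plan.append(('F', loops) if fractal else ('N', 1))
    return plan

plan = cadence_schedule(8, cadence=4)
print([kind for kind, _ in plan])  # ['F', 'N', 'N', 'N', 'F', 'N', 'N', 'N']
```

Each tuple carries the loop count the trainer would use on that step: all loops on F steps, a single pass on N steps.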
Discovery: TTT Leverage Multiplier. Test-time training on shared weights updates all N loop iterations simultaneously — an N× leverage multiplier per gradient step. The v3+TTT variant peaked at 1.1901 BPB, a 0.032 improvement and roughly 3× the typical TTT gain. Aggressive TTT drifted after window 1400. (Note: TTT itself was never banned — only causality-violating implementations that score and train on the same tokens simultaneously. Legal score-first TTT, where chunks are evaluated before adaptation, remains permitted and is used by the #1 leaderboard submission. The Frugendorff's N× leverage is an architectural property, not a causality issue — its legality depends on whether the implementation followed score-first protocol.)
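The score-first ordering that keeps TTT legal is easy to state in code. A toy sketch with stand-in score/adapt functions (nothing here is the competition's scoring code; the "model" is just a running mean):

```python
def score_first_ttt(chunks, score, adapt, state):
    # legal order: each chunk is scored BEFORE the weights see it;
    # adaptation only ever uses chunks that have already been scored
    total, n = 0.0, 0
    for chunk in chunks:
        total += score(chunk, state)   # evaluate with current weights
        n += len(chunk)
        state = adapt(chunk, state)    # then update weights on that chunk
    return total / n, state

# toy demo: "loss" is distance from a running mean the adapter tracks
chunks = [[1.0, 1.0], [3.0, 3.0], [3.0, 3.0]]
score = lambda c, m: sum(abs(x - m) for x in c)
adapt = lambda c, m: 0.5 * m + 0.5 * (sum(c) / len(c))
avg_loss, final_state = score_first_ttt(chunks, score, adapt, 0.0)
```

Reversing the two lines inside the loop (adapt, then score) is exactly the causality-violating variant that was banned.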
The full-scale H100 runs followed:
Frugendorff Squared landed at 1.1478 BPB — 0.025 from the posted world lead. The research line looked promising. It was submitted as a PR to openai/parameter-golf, framed as a compression technique.
Days 5–6 — March 22–23: The Quantization Catastrophe
GPTQ destroys weight-shared models. Rounding error compounds multiplicatively across loop iterations:
More loops meant worse survival. The gap between pre-quant and post-quant performance was not 0.01 or 0.05 — it was 4+ BPB. Catastrophic. This single finding would drive the entire subsequent quantization research program: loop-aware GPTQ, QAT, int8 training.
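The compounding is easy to demonstrate in miniature. Below, uniform symmetric rounding is a crude stand-in for GPTQ, and a plain linear map stands in for a loop iteration; the real failure was far larger (4+ BPB), but the shape is the same:

```python
import numpy as np

def fake_quant(W, bits=6):
    # uniform symmetric rounding — a crude stand-in for GPTQ's rounding
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

rng = np.random.default_rng(0)
dim = 64
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
Wq = fake_quant(W)
x = rng.standard_normal(dim)

errs, xf, xq = [], x.copy(), x.copy()
for loop in range(4):
    xf, xq = xf @ W, xq @ Wq   # the SAME quantized weights reused per loop
    errs.append(np.linalg.norm(xf - xq) / np.linalg.norm(xf))
# relative error accumulates with every reuse of the rounded weights
```

A unique-weights stack pays the rounding tax once per layer; a weight-shared stack pays it again on every loop through the same rounded matrix.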
Day 7 — March 24, 2026: The H4 Bridge
Both scored 1.35, but the crawler variant used fewer megabytes. That kept interest in the line alive.
Days 8–10 — March 25–27: Signal Analysis and Death
A rigorous 7-test statistical analysis across 8 micro-crawler TSVs, 175 Frugendorff sweep configs, and production cadence ablations delivered a verdict that was hard to argue with:
Steps after crawler-steps are slightly worse than steps after normalize-steps. There is no momentum effect from iterative refinement. The Frugendorff's advantage was almost entirely explained by wider layers at fixed parameter count, not by the recursive refinement that was supposed to be the whole point.
An H-FRUG remediation sweep on RTX A6000 tested three approaches — KL distillation, loop bottleneck gate, reinvest to width — and all failed. Root cause: shared-weight gradient conflict. The same weights must simultaneously map early→mid and mid→late representations, and at short training budgets the optimizer cannot find a viable compromise.
The conclusion was written formally: "Frugendorff / Crawler research line: CLOSED." Personal Neural SOTA at time of closure: 1.1129 BPB (Rat Rod Green v1).
At almost the same moment, one continuation branch produced the most exciting unstable result in the whole crawler story: Medusa. It took the crawler into a
4F+1C topology with 4 loops, DELTA_NET_HEADS=4, loop-aware GPTQ, and late-start EMA.

Medusa mattered because it proved there was still real energy in the crawler line even after the formal death verdict. It also proved that raw breakout scores were not enough. The branch was too volatile to trust, and the failure mode was instructive: a DeltaNet state-dtype bug plus quantization unravel through 4 crawler loops. That combination is why DeltaNet remained fascinating but eventually became quarantined from the stable crawler line.
Days 11–12 — March 28–29: The Junkyard Era and Resurrection
At this point I knew I could get decent numbers from the recursion element, but it would unwind on me during compression. I had to find a way to stabilize the fractal compaction idea.
The ClownCar series evolved in a parallel submission-oriented track, combining the crawler architecture with an X-WING ngram oracle. ClownCar (base) established the 4F+1C×4L template at 1.1996 BPB. Raw crawler-track results: broken config 1.18119 (seed 1337), 1.18150 (seed 42); correct neural config 1.12299 (seed 1337). FX_Wing_Delta scored 1.18092 (seed 1337). ClownCar_II added DeltaNet heads with canonical kernel and achieved 1.0427 BPB sliding window — but exposed severe EMA degradation during warmdown (0.47 → 0.73 BPB). ClownCar_III tried trigram preprocessing with no improvement. ClownCar_VI stripped both EMA and GPTQ (
SKIP_EMA=1, SKIP_GPTQ=1), capturing live weights instead. ClownCar_VII re-enabled GPTQ with loop-aware 2-phase calibration. These experiments established patterns that became permanent: SKIP_EMA=1, LOOP_AWARE_GPTQ=1, and DELTA_NET_HEADS=0 (quarantined for creating causality violations via cross-loop state carry).

The Medusa cleanup line belongs here too. Medusa_VII was the causality-fixed crawler base that fed into the March 29 submission path. In other words, Medusa did not disappear when the unstable DeltaNet version failed. The useful parts survived: loop-aware GPTQ, no-EMA discipline, and a cleaned crawler path with DELTA_NET_HEADS=0.

Bandit_ClownCar_X + Cubric Ngram9 was submitted on March 29 as a production hybrid:
The architecture was 4F+1C×4L with a full X-WING ngram oracle (orders 2–9, 8M hash buckets per order, 3D Cubric with 54 adaptive multipliers, entropy-adaptive alpha 0.20–0.75).
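The entropy-adaptive alpha is the interesting mechanism in that hybrid: lean on the n-gram oracle when the neural model is uncertain. A simplified sketch (hash-bucket orders and the 3D Cubric multipliers omitted; all function names here are mine):

```python
import math

def entropy(p):
    # Shannon entropy in nats over a probability vector
    return -sum(q * math.log(q) for q in p if q > 0.0)

def adaptive_alpha(p_neural, lo=0.20, hi=0.75):
    # map normalized neural entropy in [0, 1] onto the alpha range:
    # confident neural predictions get little oracle weight, uncertain
    # ones get up to hi
    h = entropy(p_neural) / math.log(len(p_neural))
    return lo + (hi - lo) * h

def mix(p_neural, p_ngram):
    a = adaptive_alpha(p_neural)
    mixed = [(1 - a) * pn + a * pg for pn, pg in zip(p_neural, p_ngram)]
    z = sum(mixed)
    return [m / z for m in mixed]

confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
p_mixed = mix(confident, uniform)
print(adaptive_alpha(confident) < adaptive_alpha(uniform))  # True
```

The 0.20–0.75 bounds are the ones quoted above; how the real oracle computed entropy (and over which vocabulary) is not reproduced here.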
The Cobra harness was designed as a 9-config multi-armed bandit for base model quality (COMPLEMENT_ALPHA sweep, SWA_EVERY, weight decay, late-QAT thresholds). The BW series' rapid progress made it obsolete before it could be scheduled — the answers it would have found were already being discovered faster through targeted ablations.
The decisive architectural pivot happened here. Rather than treating recurrence as the whole model, the crawler was recast as a hybrid component within a flat transformer stack. That shift, more than any single hyperparameter, created the modern crawler lineage.
Days 12–13 — March 29–30: Foundation
Crawler_Ablations_v1 (1×H100, 600s/arm, seed=1337) was the first systematic sweep of crawler infrastructure:
Loop-aware GPTQ won by −0.040. EMA was actively harmful (+0.070 to +0.101) — it smooths weights in ways that destroy GPTQ calibration. EMA and GPTQ are antagonistic; combined is worse than either alone. Fullgraph compile saved 0.026 BPB simply by being faster (more steps in the same wallclock).
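The journal does not spell out the loop-aware calibration details, but the core idea is that a shared block sees a different input distribution at every loop position, so the quantizer should calibrate against all of them. A minimal sketch under that assumption (a 1-D scale search stands in for GPTQ proper; names are mine):

```python
import numpy as np

def collect_loop_calibration(x, W, loops):
    # pool the block's inputs from every loop position, not just the first
    inputs = []
    for _ in range(loops):
        inputs.append(x.copy())
        x = x @ W
    return np.concatenate(inputs, axis=0)

def calibrated_scale(inputs, W, bits=6):
    # pick the rounding scale minimizing output error on the pooled set;
    # a stand-in for GPTQ's per-column error compensation
    qmax = 2 ** (bits - 1) - 1
    best, best_err = None, np.inf
    for mult in np.linspace(0.6, 1.0, 9):
        scale = mult * np.abs(W).max() / qmax
        Wq = np.clip(np.round(W / scale), -qmax, qmax) * scale
        err = np.linalg.norm(inputs @ W - inputs @ Wq)
        if err < best_err:
            best, best_err = scale, err
    return best

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)) / np.sqrt(32)
x = rng.standard_normal((64, 32))
cal = collect_loop_calibration(x, W, loops=3)
scale = calibrated_scale(cal, W)
```

Calibrating on loop-0 activations alone would optimize for a distribution the block only sees a third of the time.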
Crawler_Leg_1 (1×H100, 600s/arm, seed=1337) was the 11-arm foundational parameter map:
MLP width was the largest lever: mlp=5.0 was −0.098 BPB AND faster (655ms vs 735ms). Fewer loops was better — each loop adds ~0.085 BPB quant gap. INST_DIM (tested 0, 16, 32, 64) had only 0.034 BPB total impact, nearly irrelevant. CRAWLER_QUANT_INT8=1 was mandatory (+0.198 BPB catastrophic if disabled). More crawler blocks destroyed quality at this scale.
A quick Bandit Wagon width/depth proxy (500 steps, seed=444) confirmed depth beats width at every tested point. Width expansions to 576 and 640 both trailed the depth arms.
Crawler_Leg_2 (8×H100, 350s/arm, seed=1337) confirmed CL1 winners at production scale:
Architecture wins compressed at scale: CL1's −0.098 became CL2's −0.007. But the direction held. Production config was locked: loops=3, mlp=5.0+, COMPILE_FULLGRAPH=1, LOOP_AWARE_GPTQ=1.
Day 14 — March 31: Scaling and Signal Stacking
This phase tested whether the newly locked 4F+1C base could support additional signals without losing its quantization discipline. Most proposed additions did not survive escalation.
BW3 (pyramid-512 choke + battery): 1.20684 BPB (+0.01964 vs Leg 3). Pyramid adds 1.57M cold params that do not converge in a 600-second budget. It did not promote.
BW4 (battery only, ROPE_SCALES=9,1,1): 1.18731 BPB (−0.00015 vs Leg 3). Promoted. Zero extra params, tighter quant gap.
BW5 (BW4 + COMPILE_FULLGRAPH=1): 1.18672 BPB seed 444, 1.18758 seed 300. Promoted on 2-seed mean (1.18715 vs 1.18743).
BW5_Cannon (scalar cannon): Gate signal (−0.00016) reversed at full run (+0.00020). Cross-run variance swamped the apparent gain, so the idea did not promote.
BW5_PyramidCannon: +0.03440 at gate. This was a decisive negative result; the 1.57M cold choke parameters compounded over time.
BW5_Pyramid (1GPU, 500 steps): −0.00987 proxy signal, but the 8GPU PyramidCannon showed it was proxy inflation. Deferred.
BW6_Skipgram (trigram hash, 4×GPU, 2000 steps, seed=444) was explicitly null:
Crawler recurrence already approximates trigram context. The only secondary effect was a compression side benefit: the trigram artifact shrank by ~140KB while quality remained slightly worse.
BW7 MegaGate (8-arm, 4×GPU, 2000 steps) was the day's most important run:
TAP shared clearly outperformed TAP per-loop. Anchor dim=32 clearly outperformed dim=64. SMEAR was null, while SharedFlat failed decisively. The full arm table matters: TAP-04 and TAP-02 were both real but weaker than TAP-03, and ANC-06 confirmed the anchor signal narrows sharply as the state grows wider. The day produced two clear winners and eliminated the remaining alternatives.
Day 15 — April 1: The Path to Nightcrawler
BW8 (TAP shared dim=32) and BW9 (Anchor dim=32) were designed as gate-only experiments using MegaGate evidence. Neither got its own full run — both were absorbed into the BW12 interaction matrix.
BW10 ran loop-aware GPTQ as a standalone production test on the BW5+TAP base:
Promoted. New champion. Proxy inflation was 1.3× (gate predicted −0.00486, production delivered −0.00380).
BW11 added a 5th flat layer (5F+1C vs 4F+1C):
Promoted. −0.01021 vs BW5, −0.00641 vs BW10. This was the Nightcrawler — a 5F+1C+5F bridge architecture. The champion lineage was now BW5 → BW10 → BW11/Nightcrawler. This was also the point where the architecture stopped being framed as "recursion for free depth" and started being framed as a compression/thinking engine: compress stored structure with a crawler bottleneck, then spend the saved budget on deeper flat scaffolding around it.
Days 15–16 — April 1–2: Interaction Gates and the Depth Sprint
BW12 (4×GPU, 2000 steps) tested interaction dynamics on the Nightcrawler baseline. Tap-off (−0.00199) and GPTQ (−0.00204) were both positive. Standard and loop-aware GPTQ tied exactly.
Full BW12 gate table:
BW13 tested the same axes on a tap-off baseline. Anchor dim=32 regressed (+0.00276). Anchor dim=64 regressed harder (+0.00292). GPTQ helped (−0.00193). GPTQ-lite was nearly as good (−0.00177) with less calibration cost. The key finding: anchor and tap-off are antagonistic.
Full BW13 gate table:
BW14 went for big swings:
Six-flat depth was the biggest single-arm gain at −0.007. Choke flat-128 regressed strongly, while choke flat-512 was effectively neutral. Depth, not routing, was the relevant lever.
This launched a rapid architecture sprint. BW15 consolidated the 41 arms from BW12–BW14 into a single decision matrix. BW16 swept flat depth 6–11F; the isolated 10F run confirmed that more depth could improve quality (
1.24295 BPB) but not within the legal artifact budget (17.83MB), while 9F emerged as the practical ceiling. Width was explored through micro crawler runs. The best micro crawler (run8_pd_cad2 at 1.13554) actually beats BWX 9F on quality but exceeds 16MB. Every width increase blew the size budget, confirming depth as the only viable scaling axis. BW17 tested cadence interactions on the 9F base at rapid scale — loops=2 showed a strong directional signal (−0.054) but was unconfirmed at production. BW18 and BW19 mapped the remaining knob space (49 architecture/cadence arms, 18 crawler balance arms) — the individual knobs were tested across BW7, BW12–BW14, and bridge ablations rather than as single monolithic campaigns. BW20 templated the brotli compression swap that became part of the production pipeline. The collective signal from this sprint converged on one clear answer.
BWX_Latest was the contender selection tournament. Using everything learned from BW12–BW16, the winner was clear: 9 flat layers, 1 crawler, tap-off, no anchor.
BWX 9F production run (8×H100, 600s, seed=444):
Config: 9F+1C, loops=3, ROPE=9,1,1, INST_DIM=32, TAP=0, ANCHOR=0, COMPILE_FULLGRAPH=1, SKIP_GPTQ=1.
This became the in-tree leader and remains the reference crawler submission in this repository. In 15 days, the program moved from formal closure of the original recursive line to a new hybrid architecture at
1.13867894 BPB, the best result the crawler lineage produced under the campaign's legal constraints.

Day 17 — April 3: Helix — The Boldest Idea
The Helix concept was radical, too radical: instead of the sequential encoder→crawler→decoder pipeline, fire the crawler alongside every flat layer with bidirectional cross-injection. The crawler fires 9 times (once per flat layer) instead of 2–3 in loop mode. Cross-injection via gated 32-dim projections (~65K new params), zero-initialized for warm start. Final merge:
x = x_flat + sigmoid(gate) * x_crawl.

Helix_DepthRecur (4×H100, 500 steps, seed=444) produced the most illuminating result of the entire Helix campaign:
Depth recurrence without Helix incurred a 0.095 BPB quantization penalty. Helix reduced that penalty to 0.004. But recurrence added nothing on top of Helix — S0 was barely different from R2. Helix was therefore confirmed at proxy scale, but only in its helix-only form.
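The merge rule above (x = x_flat + sigmoid(gate) * x_crawl through zero-initialized 32-dim projections) can be sketched directly. This is a numpy toy, not the training code; the class and variable names are mine:

```python
import numpy as np

dim, cross = 64, 32  # model width and cross-injection bottleneck

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HelixInject:
    # gated 32-dim cross-injection; the up-projection starts at zero so
    # the merge is an exact identity on x_flat at init (warm start)
    def __init__(self, rng):
        self.down = rng.standard_normal((dim, cross)) / np.sqrt(dim)
        self.up = np.zeros((cross, dim))  # zero-init: no injection yet
        self.gate = np.zeros(dim)         # per-channel learned gate

    def __call__(self, x_flat, x_crawl):
        inj = (x_crawl @ self.down) @ self.up  # 32-dim bottleneck (~65K params)
        return x_flat + sigmoid(self.gate) * inj

rng = np.random.default_rng(0)
layer = HelixInject(rng)
x_flat = rng.standard_normal((4, dim))
x_crawl = rng.standard_normal((4, dim))
out = layer(x_flat, x_crawl)
print(np.allclose(out, x_flat))  # True at init: warm start is exact
```

In the full Helix, one such injection fires alongside each of the 9 flat layers, in both directions.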
Helix_FlatServe proposed 5 modifications to flat layer behavior (residual scaling, noise injection, progressive delegation, crawler-aligned output projection, asymmetric skip connections). Deprioritized — the SplitHead results redirected all Helix resources toward cross-attention experiments instead.
DarkHorse attempted to port Helix onto an external codebase (PR #1296 by aryanbhosale, 1.0897 BPB). Their depth recurrence at layers 4,5 created a 0.095 BPB quant gap that Helix might have shielded. Both
train_gpt_base.py and train_gpt_helix.py were written, but after Helix_ab_3 showed that the SplitHead concept did not scale, the port was no longer worth the integration risk, and we got distracted and never ran it. I still kind of want to run it. The Helix line would be fun to work on. Bidirectional injection deserves affection.

Day 18 — April 4: Helix SplitHead — The Largest Signal and Its Failure
Helix_SplitHead ran a 30-arm micro ablation (4×H100, 200 steps, dim=256, seed=444). The concept: split the crawler's attention heads between self-attend and cross-attend, with position-agnostic keys on the cross side.
The headline: no helix → helix → best SplitHead gave a total swing of −0.078 BPB — the largest architectural signal in the entire crawler program. The split-ratio ladder was clean: H4 full cross landed 1.8549, H2 50% cross 1.8575, H1 25% cross 1.8586, and H3 75% cross 1.8597. Full cross-attention beat every split ratio. The crawler didn't want self-attention at all. Weight decay 0.12 was the strongest single hyperparameter. QK gain 5.0/6.0 both hurt — the crawler needs broad attention, not sharpened focus. 7F depth hurt all arms versus 5F.
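The head-routing idea is simple to sketch: assign a fraction of the heads to cross-attention and the rest to self-attention. A simplified numpy version (causal masking and the position-agnostic key construction are omitted; function names are mine):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_head_attn(q, kv_self, kv_cross, n_heads, cross_frac):
    # route the first round(cross_frac * n_heads) heads to cross-attention
    # and the remaining heads to ordinary self-attention
    T, d = q.shape
    hd = d // n_heads
    n_cross = round(cross_frac * n_heads)
    out = np.empty_like(q)
    for h in range(n_heads):
        sl = slice(h * hd, (h + 1) * hd)
        k, v = kv_cross if h < n_cross else kv_self
        att = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(hd))
        out[:, sl] = att @ v[:, sl]
    return out

rng = np.random.default_rng(0)
T, d, H = 8, 32, 4
q = rng.standard_normal((T, d))
kv_self = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
kv_cross = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))

full_cross = split_head_attn(q, kv_self, kv_cross, H, 1.0)  # the H4 arm
half_cross = split_head_attn(q, kv_self, kv_cross, H, 0.5)  # the H2 arm
```

In this framing the ladder result says cross_frac=1.0 wins: the crawler's heads were better spent entirely on the flat stream.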
BWXII_Helix_SplitHead attempted the same at production scale (9F, seed=300): 1.25031 int6_sw, 177.65ms/step, 11.04MB. It regressed versus BWX 9F (1.13868). The micro signal did not transfer.
This was the first lesson about scale and being cheap in this competition. It would not be the last.
Days 18–20 — April 4–6: Quant Fix and Ouroboros
BWXII_QuantFix explored quantization fixes at 1GPU and 2×GPU scale. The best 1GPU result was
T4_smart_wd012_Q2_gptq_loop_int8 at 1.27427 BPB with a quant gap of 0.0109 — smart_skip was the strongest intervention. The 2×GPU movie test showed R4_full_fix improved raw BPB (1.2420) but the quant gap widened to 0.0383. Quantization erased most gains.

Ouroboros Ablation (4×GPU, 600s, seed=300) tested three quantization-focused improvements on the BWX 9F base:
All three beat control. All trained faster (176–178ms vs 186ms). But would they compose?
Ouroboros_stacked (2000 steps, seed=444) answered definitively: no. The stacked result was 1.24514 BPB — dramatically worse than any individual arm's ~1.161 — and the artifact exceeded 16MB by 172KB. The three signals conflicted when composed.
Ouroboros_II designed a mixed-bit variant (attn=5, mlp=6, embed=8) with all three improvements plus brotli on loops=2. It was deprioritized after Ouroboros_stacked showed the signals conflict when composed; a fourth stacking variant was not a good use of compute.
Ouroboros_III tried the stack on the exact BWX 9F base at full production scale (8×H100, 600s, seed=444): 1.14462 BPB. +0.00594 worse than BWX 9F. The individually-positive signals did not compose. Stacking failed.
Helix_ab_3 (8×H100, 2000 steps, seed=444) scaled the SplitHead concept to full model size (4F, dim=512):
At scale the signal reversed decisively. The
−0.078 micro result became +0.140, meaning the largest architectural signal in the program failed to transfer twice — once in BWXII, once in ab_3.

Also designed during this period: BW17 cadence testing on DGX Spark (loops=2 showed a −0.054 directional signal). BWXI proposed a 5-signal stack (brotli + GPTQ + QK4 + loops=2 + warmdown), but Ouroboros III's +0.006 regression from only three stacked signals made that risk unjustifiable. BW21 NoisyQAT had the strongest isolated signal (−0.00296 from the Ouroboros ablation) but was deprioritized once the multi-crawler breakthrough appeared.

Days 21–22 — April 7–8: The Multi-Crawler Breakthrough
BW22_LoopDepth (8×GPU, 2000 steps gate, seed=444) swept loop counts on the 9F base:
Quality scaled with depth, but so did throughput cost: loop4 was ~9% slower, loop5 ~17%. A3 promoted as a quality-priority candidate.
Corpus Ablations v1 (4×GPU, 1500 steps, seed=444) was a 16-arm screen on the BWX 9F base. It produced the single most important finding of Phase 5:
Two crawler layers — −0.01234 BPB. The breakthrough signal. The late QAT lane only yielded directional outcomes: A01 TAP shared was mildly positive (
~−0.0010), A08 crawler-int8 was near-null (~−0.0006), A10 softclamp improved relative to legacy QAT A09 (~−0.0047), and A11 sigmoidste was unstable (+0.041). Exact BPBs for A01, A08, A10, and A11 are unavailable, so they are not entered as hard values here, but the old blanket claim that A09–A11 simply "crashed" was too coarse.

Day 23 — April 9: The Size Wall
Everything converged toward confirming the multi-crawler breakthrough. A Layer Relationship Grid (2×GPU, 1000 steps, seed=444) mapped the full 5×4 surface of flat layers × crawler layers:
8F+3C was the quality peak. Three crawler layers matched three loops — a potential symmetry law. Unfortunately, 8F+3C broke the bank by 180KB, and the unmodified 7F+3C exceeded the 16MB limit by only 70KB. We cooked on the 7F+3C, and that became the final 4-hour (14.2MB) run candidate and the best 10-minute (15.9MB) runner.
The complete grid, including sizes and step times, was:
An isolated 4×GPU confirmation of the grid winner landed 8F+3C at 1.34632 int6_sw and 15.00MB, validating the surface outside the batch sweep.
A Symmetry Ablation (4×GPU, 1000 steps, seed=444) tested higher orders directly:
Available evaluation tails preserved the same ordering: A0 finished at
3.29162042 int6 sliding-window, while A1 at least reached 3.35664470 int6 roundtrip before the log cut off. The margin was decisive. The 4th loop costs roughly 0.8s per step and produces worse quality. The separate 4×GPU symmetry gate for 4×4 logged raw_bpb 1.3696, int6 roundtrip 1.37830138, and artifact 16,152,138 bytes (over cap by 152KB); the sliding-window eval for 4×4 did not complete. C=3 is the practical ceiling.

Then the production runs:
BW_9F2C (8×H100, 600s, seed=444):
Trapper_Keeper_1 (8F+3C, 8×H100, 600s, seed=444, no FA3):
Both configurations beat the leader on quality. Both exceeded the 16MB artifact cap. TK1's pod lacked FA3 (157ms/step → only 3,811 steps); with FA3 it would have been ~110ms → ~5,450 steps → likely better quality. Brotli recompression might save 10–15% on size, making TK1 near-legal, but near-legal is still non-compliant.
The Ratio Sweep v2 attempted a 9-arm comparison of flat:crawler ratios but was interrupted by an OOM kill after only B00–B02:
B01 and B02 completed but are invalid — the test harness was rebuilt from scratch and missed 9 critical crawler env vars (LOOP_AWARE_GPTQ, MLP_LEAKY_SLOPE, CRAWLER_MLP_CHOKE_DIM, etc.), so any crawler-vs-flat comparison from that batch is non-interpretable. Only B00 (10F+0C, pure transformer — no crawler code affected) is possibly clean: 1.31961 int8 BPB at 15.43MB. The remaining 6 arms — including the critical 8F+2C and 7F+3C configurations — OOM killed before completing.
midnight_GPTQ tested the GPTQ bank fix on the neural-track Midnight 12L leader. The structural fix worked correctly (0 → 60 tensors quantized), but quality regressed by +0.00312 BPB. The model had apparently adapted around the broken GPTQ, so the fix did not promote.
Midnight_Black ran three times on 8×H100 (seed=444). Run 1 was misconfigured (wrong quant bits, sequential loader) and thrown out. Run 2a (cache=1): 1.10899 BPB, 16.44MB (over cap). Run 2b (cache=4): 1.10831 BPB, 15.74MB. Both worse than champion (1.10568). DOES NOT PROMOTE — the 3-variable stack failed (+0.00263 BPB).
Crawler_Katta ran on 8×H100 (2000 steps, seed=444). The Euler control hit 1.24486 int6_sw at 110ms/step. RK2 fast (2 loops) was 7.4% faster (102ms) but regressed +0.00186 BPB — the speed gain did not offset the quality loss. RK4 and hybrid solvers crashed due to an implementation bug in the forward pass. Euler remained the preferred solver. A quick test into Runge-Kutta variants, nothing to see here.
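For reference, the solver distinction is just how the crawler's residual update is stepped. A minimal sketch (numpy toy; the real f is the crawler block, and Heun's method stands in for the RK2 variant):

```python
import numpy as np

def euler_step(x, f):
    # standard residual update: one block evaluation per loop
    return x + f(x)

def rk2_step(x, f):
    # Heun-style update: two block evaluations per loop, so one RK2 loop
    # costs about as much crawler compute as two Euler loops
    k1 = f(x)
    k2 = f(x + k1)
    return x + 0.5 * (k1 + k2)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.05
f = lambda x: np.tanh(x @ W)   # toy stand-in for the crawler block
x0 = rng.standard_normal(16)

x_euler = euler_step(euler_step(x0, f), f)  # 2 Euler loops
x_rk2 = rk2_step(x0, f)                     # 1 RK2 loop, same evals
```

The Katta result says the extra evaluation structure bought nothing here: plain Euler residual loops remained the best use of the step budget.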
BW23_EcoConcept (QAT surrogate variants) got a DGX Spark smoke test (
4.73 BPB roundtrip), but the run was too short to support a sliding-window conclusion. Its core concept was also tested through corpus arms A09 (QAT legacy STE: +0.0085) and A10 (softclamp: +0.004). QAT surrogates did not help on this architecture. Crawler_Symmetry (C=LOOPS design law, testing 4×4 through 8×8) and TTT_Ablation (end-to-end test-time training) remained unfinished as the campaign's final days concentrated on fitting the multi-crawler winners within the size constraint.

Finish
The program ended with a stable crawler line built by rapid exploration, repeated falsification, and aggressive simplification. The original fully recursive thesis did not survive intact; the hybrid thesis did. What held up was a flat-depth scaffold around a small shared crawler bottleneck, with loops kept to three, EMA removed, loop-aware GPTQ treated as mandatory, and bytes treated as a first-class constraint rather than an afterthought. The legal
7F+3C runner closed the competition problem, and the 7F+3C 4-hour confirmation showed that the final architecture was not just a trick for the 10-minute harness. It was a stable compression-and-thinking system discovered under pressure through competition constraints and rapid iteration.

Appendices
The supporting records below follow the campaign in time. The dated score ledger comes first. The keep/kill and disposition ledgers are also ordered by campaign date rather than by abstract category. Records recovered from dated logs but lacking trustworthy original run-day placement are separated at the end.
Appendix A: Chronological Score Ledger
This appendix is the compact dated record of the major production, bridge, and late-sprint results.
Appendix B: Campaign Decision Ledger
This appendix preserves the keep/kill record in campaign order rather than collapsing it into a short summary.
Rejected, invalid, or non-promoted directions (campaign order):
- The unstable DeltaNet branch (Medusa)
- DeltaNet heads (DELTA_NET_HEADS=4)
- EMA (SKIP_EMA=0)
- Disabled crawler int8 (CRAWLER_QUANT_INT8=0)
- Flat weight sharing (FLAT_WEIGHT_SHARE=1)

Confirmed positive or durable findings (campaign order):
Quality passes blocked by the 16MB size cap:
Open leads (untested at production scale):
vault/preserved_ttt/

Appendix C: Experiment Disposition Ledger
This appendix records what ran, what was invalid, what was shelved, and what was absorbed into later work, in campaign order.
Medusa_VII line fed the March 29 submission path, but DeltaNet itself was later quarantined.

Appendix D: 2026-03-31 Proxy Ablation Battery (500-step, seed=444)
46 arms covering every crawler infrastructure knob. This is the foundational dataset that informed BW7 MegaGate and all subsequent architecture decisions. Best results first.
Choke Shape (BWC): BWC-04 choke=512 (1.42887 BEST), BWC-02 choke=128 (1.43674), BWC-03 choke=256 (1.44071), BWC-01 choke=32 (1.45004 WORST).
Tap Configuration (BWT): BWT-05 dim=32 per-loop deep (1.43004), BWT-01 dim=32 shared all (1.43227), BWT-03 dim=16 per-loop (1.43268), BWT-06 dim=32 per-loop shallow (1.43322), BWT-02 dim=32 per-loop all (1.44133), BWT-04 dim=64 per-loop all (1.44346).
Battery Schedule (BWB): BWB-01 1,2,4 gentle asc (1.43769), BWB-04 9,3,1 desc (1.44156), BWB-05 1,9,1 middle (1.44237), BWB-03 1,5,25 aggressive (1.44283), BWB-07 9,1,1 first wide (1.44355), BWB-02 1,3,9 moderate asc (1.44470), BWB-06 1,1,9 final wide (1.44797).
Pyramid + RoPE (BWCD): BWCD-02 rope 9,1,1 (1.43531), BWCD-01 rope 4,2,1 (1.43749), BWCD-00 rope 9,3,1 desc (1.43779), BWCD-03 rope 9,3,9 wide-med-wide (1.44248).
Choke + Battery (BWCB): BWCB-00 rope 1,2,4 (1.44850), BWCB-02 rope 1,5,25 (1.44864), BWCB-01 rope 1,3,9 (1.44874).
Choke Shape Variants (BWCS): BWCS-02 pyramid dim=512 (1.44724), BWCS-06 residual dim=128 (1.45260), BWCS-03 pyramid_res dim=128 (1.45419), BWCS-01 pyramid dim=128 (1.45711), BWCS-05 grouped dim=512 groups=4 (1.45748), BWCS-00 control flat (1.45761), BWCS-04 grouped dim=512 groups=8 (1.46247).
Cannon Variants (BWE/BWVC): BWE-02 channel 1.5K (1.43590), BWE-00 control (1.44166), BWVC-00 control (1.44236), BWVC-01 scalar 3 params (1.44261), BWVC-02 channel 1.5K (1.44296), BWE-01 scalar 3 params (1.44337), BWVC-03 rmsnorm 1.5K (1.44428).
Other: BWS-01 loop smear (1.44628), CTRL-00 all disabled (1.44185).
XSA Coverage: BWXSA-02 XSA=15 100% (1.51431), BWXSA-01 XSA=13 87% (1.51982), baseline XSA=11 73% (1.52365).
MLP Slope: 0.75 (1.55637 best, −0.00065), 0.5 control (1.55702), worst (1.56116, +0.00413). All near-identical.
Depth: 4F+1C (1.52365), 5F+1C (1.54404), 6F+1C (1.56887). Depth beats width at every point.
Appendix E: 2026-04-03 BWXII QuantFix Full Sweep
Weight decay sweep on 1×GPU with GPTQ layer variants:
Movie test: R1_fire_embed 1.28121 (gap 0.0225), R4_full_fix 1.28033 (gap 0.0383). Full_fix improved raw BPB but widened quant gap.
Appendix F: 2026-03-18 to 2026-03-22 Cadence Characterization (H1/H2 series)
H1_cad0_FULLSCALE (no cadence, every step fractal): 1.16029 — best quality when throughput cost is ignored.
H2_2f4cx2_cad4: 4.30781 — DIVERGED. Too much recursion.
Appendix G: Source-Dated Historical Recoveries
These runs were recovered from dated session logs, records, and archive inventories. The source date is known in every row below. What still varies is how tightly each run can be placed on the main crawler spine: some are direct Frugendorff antecedents, some are bridge experiments, and some are adjacent side branches.
Frugendorff antecedents and adjacent recovered records:
- fa168c83-cdfb-4c09-9032-2761810286fe.txt
- d31921e8-8344-417d-9a93-f54180d5a21d.txt
- c1539c60-b928-4a45-a69b-aa7209f61ff4.txt
- e6bc18b0-0818-460e-9321-5a864a919e3e.txt
- edb680e7-5a4c-4552-9c22-7b9886b6f7fa.txt
- records/track_10min_16mb/2026-03-23_Frugendorff_Squared_6x2_640d_MLP4/submission.json (A-WING GREEN_3 — Width 640)

H4 bridge runs recovered from dated logs:
- H4A_gsv7_control_20260324_201010.txt
- H4B_gsv7_crawler_bank_20260324_201746.txt
- H4_A_6flat_20260324_234630.txt
- H4_B_5f1cx2_btn_20260324_235155.txt

Research journal compiled 2026-04-10. Source: parameter-golf-lab (crawler/, junkyard/, records/, legs/), sota_nueral (legs/, records/), git history, pod logs.
TO WRAP EVERYTHING UP — THANKS FOR READING, FELLOW HUMANS!
Hope: to work on AI research or visualizations with an amazing team. This is 35% of my output in this competition, and I have a very competitive neural line as well. I would love to do this for a living, since my old job of sculpting 3D assets and making games does not really exist like it used to. This research is not "research"; it is an honest and open work journal in the hope that I can continue to work on radical ideas in the AI space.