
Record: TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1 — val_bpb 1.07636 (3-seed mean)#1555

Open
andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:record/megakernel-improved-paresid-tapin-mm1

Conversation

@andrewbaggio1

Summary

val_bpb = 1.07636 (3-seed mean, std 0.0006) | ~15.97 MB | 8xH100 SXM

| Seed | Sliding BPB | TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|---------|-----------------|------------------|
| 42   | 1.07856     | 1.07703 | 2.78208         | 15,961,726       |
| 1337 | 1.07727     | 1.07586 | 2.77907         | 15,964,616       |
| 2024 | 1.07833     | 1.07619 | 2.77990         | 15,970,213       |
| Mean | 1.07805     | 1.07636 | 2.78035         |                  |

Merged SOTA (PR #1493): 2.78932 nats. Delta: -0.00897 nats, clearing the 0.005-nat threshold by ~79%.

Novel Contributions

  1. TMA Megakernel — Triton Hopper TMA fused MLP forward kernel. +10.5% throughput, ~200 extra steps in 600s. Claims megakernel bounty.
  2. Tap-In min_match=1 (Unigram Matching) — First submission to lower Tap-In match threshold to 1 token. Fires at 21% of positions (vs 1.7% at min_match=3). Derived from local CPU loss analysis showing the model is uncertain on repeating tokens.
  3. Improved Parallel Residuals — Ported from @msisovic's PR #1529 ("Record: ParallelResiduals", 1.0753 BPB / 2.7777 nats; -0.0025 BPB / -0.0064 nats vs PR #1523).
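The unigram-matching idea in contribution 2 can be sketched in a few lines. This is a hypothetical illustration, not the PR's code: `tap_in_mix` and `lam` are invented names, and the real Tap-In maintains a linked list rather than a plain dict. The point is that a min_match=1 lookup finds the most recent earlier occurrence of the current token in the strict prefix and mixes a one-hot prediction of its successor into the model's distribution, with mixture weights that sum to 1 by construction:

```python
def tap_in_mix(tokens, model_probs, lam=0.1):
    """Mix a unigram (min_match=1) Tap-In prediction into model_probs.

    model_probs[i] is the model's distribution over the token at
    position i + 1, given tokens[0..i]. Hypothetical sketch.
    """
    last_seen = {}  # token -> most recent position in the strict prefix
    mixed = []
    for i, tok in enumerate(tokens):
        p = list(model_probs[i])
        j = last_seen.get(tok)          # unigram match: same token, earlier
        if j is not None:               # j < i, so tokens[j + 1] is known
            successor = tokens[j + 1]   # token that followed the match
            # Convex mixture: (1 - lam) and lam weights keep the sum at 1.
            p = [(1 - lam) * q for q in p]
            p[successor] += lam
        mixed.append(p)
        last_seen[tok] = i              # record position i only after scoring it
    return mixed
```

With min_match=3 a match requires the last three tokens to recur, which per the PR happens at only 1.7% of positions; a single-token match fires at 21% of positions.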

Compliance (Track B — Issue #1017)

  • Causal dependence — sliding window + strict prefix Tap-In + prefix-only hash embedding
  • Full normalized distribution — Tap-In mixing sums to 1 by construction
  • Score before update — TTT chunks scored under no_grad before SGD; Tap-In linked list updated after scoring
  • Single left-to-right pass — each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram caches
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
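The "score before update" bullet is the load-bearing ordering constraint. As a minimal sketch (the names `score` and `sgd_step` are hypothetical, and the real scoring pass runs under `no_grad`), each chunk's loss is computed with the weights as they stood before that chunk's test-time update:

```python
def ttt_eval(chunks, score, sgd_step, weights):
    """Score each chunk under the current weights, then update.

    Hypothetical sketch of the score-before-update ordering; each
    token is scored exactly once, in a single left-to-right pass.
    """
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1. Score with the frozen weights; this is what counts toward val_bpb.
        loss = score(weights, chunk)
        total_loss += loss * len(chunk)
        total_tokens += len(chunk)
        # 2. Only after scoring: apply the test-time SGD step on this chunk.
        weights = sgd_step(weights, chunk)
    return total_loss / total_tokens
```

The same ordering applies to the Tap-In state: the match structure is consulted for a position before that position is appended to it.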

Test plan

  • 3-seed validation (seeds 42, 1337, 2024)
  • All artifacts under 16MB
  • Train under 600s on all seeds (~587s)
  • Eval under 600s on all seeds (~385-457s)

Credits

@msisovic (improved parallel residuals #1529), @abaybektursun (Tap-In V4/V6 #1518/#1420, TTT #549), @clarkkev (SP8192 + SDClip #1394), @EthanYangTW (parameter banking #1523), @dexhunter (legal TTT #1413), @resouer (eval hash embedding #1460), @bigbag (QK-Gain tuning #1493)

🤖 Generated with Claude Code

andrewbaggio1 and others added 2 commits April 8, 2026 13:29

  • 8 Gated DeltaNet layers + 2 softmax attention layers. GDN is mathematically equivalent to E2E TTT-Linear with MSE loss. First competitive GDN hybrid in the 10-min budget. Targets bounty items: E2E TTT + State-space models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • …ch=1 — val_bpb 1.07636 (3-seed mean). 3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493. Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions), improved parallel residuals from openai#1529, parameter banking from openai#1523. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
