
Record: TMA Megakernel + Improved Parallel Residuals + Tap-In min_match=1 — val_bpb 1.07636 (3-seed mean)#1555

Open
andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:record/megakernel-improved-paresid-tapin-mm1

Conversation

@andrewbaggio1

Summary

val_bpb = 1.07636 (3-seed mean, std 0.0006) | ~15.97 MB | 8xH100 SXM

| Seed | Sliding BPB | TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|---------|-----------------|------------------|
| 42   | 1.07856     | 1.07703 | 2.78208         | 15,961,726       |
| 1337 | 1.07727     | 1.07586 | 2.77907         | 15,964,616       |
| 2024 | 1.07833     | 1.07619 | 2.77990         | 15,970,213       |
| Mean | 1.07805     | 1.07636 | 2.78035         |                  |

Merged SOTA (PR #1493): 2.78932 nats. Delta: -0.00897 nats, clearing the 0.005-nat threshold by ~79%.

Novel Contributions

  1. TMA Megakernel — Triton Hopper TMA fused MLP forward kernel. +10.5% throughput, ~200 extra steps in 600s. Claims megakernel bounty.
  2. Tap-In min_match=1 (Unigram Matching) — First submission to lower Tap-In match threshold to 1 token. Fires at 21% of positions (vs 1.7% at min_match=3). Derived from local CPU loss analysis showing the model is uncertain on repeating tokens.
  3. Improved Parallel Residuals — Ported from @msisovic's PR #1529 ("Record: ParallelResiduals", 1.0753 BPB / 2.7777 nats; -0.0025 BPB / -0.0064 nats vs PR #1523).
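The unigram-matching idea in contribution 2 can be sketched in a few lines. This is a hypothetical illustration, not the PR's code: `tap_in_mix` and `lam` are invented names, and the real Tap-In maintains a linked list rather than a plain dict. The point is that a min_match=1 lookup finds the most recent earlier occurrence of the current token in the strict prefix and mixes a one-hot prediction of its successor into the model's distribution, with mixture weights that sum to 1 by construction:

```python
def tap_in_mix(tokens, model_probs, lam=0.1):
    """Mix a unigram (min_match=1) Tap-In prediction into model_probs.

    model_probs[i] is the model's distribution over the token at
    position i + 1, given tokens[0..i]. Hypothetical sketch.
    """
    last_seen = {}  # token -> most recent position in the strict prefix
    mixed = []
    for i, tok in enumerate(tokens):
        p = list(model_probs[i])
        j = last_seen.get(tok)          # unigram match: same token, earlier
        if j is not None:               # j < i, so tokens[j + 1] is known
            successor = tokens[j + 1]   # token that followed the match
            # Convex mixture: (1 - lam) and lam weights keep the sum at 1.
            p = [(1 - lam) * q for q in p]
            p[successor] += lam
        mixed.append(p)
        last_seen[tok] = i              # record position i only after scoring it
    return mixed
```

With min_match=3 a match requires the last three tokens to recur, which per the PR happens at only 1.7% of positions; a single-token match fires at 21% of positions.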

Compliance (Track B — Issue #1017)

  • Causal dependence — sliding window + strict prefix Tap-In + prefix-only hash embedding
  • Full normalized distribution — Tap-In mixing sums to 1 by construction
  • Score before update — TTT chunks scored under no_grad before SGD; Tap-In linked list updated after scoring
  • Single left-to-right pass — each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram caches
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
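The "score before update" bullet is the load-bearing ordering constraint. As a minimal sketch (the names `score` and `sgd_step` are hypothetical, and the real scoring pass runs under `no_grad`), each chunk's loss is computed with the weights as they stood before that chunk's test-time update:

```python
def ttt_eval(chunks, score, sgd_step, weights):
    """Score each chunk under the current weights, then update.

    Hypothetical sketch of the score-before-update ordering; each
    token is scored exactly once, in a single left-to-right pass.
    """
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1. Score with the frozen weights; this is what counts toward val_bpb.
        loss = score(weights, chunk)
        total_loss += loss * len(chunk)
        total_tokens += len(chunk)
        # 2. Only after scoring: apply the test-time SGD step on this chunk.
        weights = sgd_step(weights, chunk)
    return total_loss / total_tokens
```

The same ordering applies to the Tap-In state: the match structure is consulted for a position before that position is appended to it.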

Test plan

  • 3-seed validation (seeds 42, 1337, 2024)
  • All artifacts under 16MB
  • Train under 600s on all seeds (~587s)
  • Eval under 600s on all seeds (~385-457s)

Credits

@msisovic (improved parallel residuals #1529), @abaybektursun (Tap-In V4/V6 #1518/#1420, TTT #549), @clarkkev (SP8192 + SDClip #1394), @EthanYangTW (parameter banking #1523), @dexhunter (legal TTT #1413), @resouer (eval hash embedding #1460), @bigbag (QK-Gain tuning #1493)

🤖 Generated with Claude Code

andrewbaggio1 and others added 2 commits April 8, 2026 13:29

  • 8 Gated DeltaNet layers + 2 softmax attention layers. GDN is mathematically equivalent to E2E TTT-Linear with MSE loss. First competitive GDN hybrid in the 10-min budget. Targets bounty items: E2E TTT + State-space models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • …ch=1 — val_bpb 1.07636 (3-seed mean). 3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493. Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions), improved parallel residuals from openai#1529, parameter banking from openai#1523. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
