
Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)#1560

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:dexhunter/varlen-ttt-warmdown75-chunk48

Conversation

@dexhunter
Contributor

Summary

val_bpb = 1.07406 (3-seed mean, std 0.00132) | 2.77441 nats | ~15.99 MB | 8xH100 SXM, 600s

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact (bytes) |
|------|-------|-------------|--------------|----------|----------|------------------|
| 42   | 4918  | 1.08400     | 1.07352      | -0.01048 | 213s     | 15,994,146       |
| 0    | 4900  | 1.08363     | 1.07310      | -0.01053 | 221s     | 15,997,570       |
| 1337 | 4908  | 1.08619     | 1.07556      | -0.01063 | 219s     | 15,988,610       |
| Mean | 4909  | 1.08461     | 1.07406      | -0.01055 | 218s     | 15,993,442       |
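The reported headline numbers can be reproduced directly from the per-seed Post-TTT column above; a quick check (using the sample standard deviation, which is what matches the reported std):

```python
import statistics

# Post-TTT BPB for seeds 42, 0, 1337 (values from the table above)
post_ttt = [1.07352, 1.07310, 1.07556]

mean = statistics.mean(post_ttt)   # 3-seed mean, ~1.07406
std = statistics.stdev(post_ttt)   # sample std (n-1), ~0.00132
```

Both round to the values in the summary line (1.07406, std 0.00132).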

Merged SOTA (PR #1493 @bigbag): 1.0810 BPB (2.78932 nats). Delta: -0.01491 nats (clears 0.005 bar by 3.0x).

Key Innovation

Warmdown fraction and TTT chunk size tuning on PR #1530's VarLen + Triton fused MLP + doc-TTT stack:

  • warmdown_frac = 0.75 (vs 0.72 default) — extends cosine decay, letting the model settle into a lower-loss basin before quantization
  • TTT_CHUNK_SIZE = 48 (vs 32 default) — larger chunks provide more context per TTT gradient step, improving LoRA adaptation
  • Muon momentum 0.97 — shorter memory horizon tracks the loss surface better during extended warmdown
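For readers unfamiliar with the warmdown knob, the schedule change can be sketched as follows. This is a minimal illustration of a constant-then-cosine-decay LR schedule where the final `warmdown_frac` of training decays; the function name and signature are illustrative, not the repo's actual API:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_frac: float = 0.75) -> float:
    """Hypothetical schedule: hold base_lr, then cosine-decay to 0
    over the final `warmdown_frac` of training steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    # Progress through the warmdown window, 0 -> 1
    t = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

Raising `warmdown_frac` from 0.72 to 0.75 starts the decay earlier, so the model spends a longer stretch at small learning rates before quantization.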

Rule Compliance (Issue #1017)

  • Condition 1 (Causality): VarLen attention with per-document cu_seqlens, strict causal masking
  • Condition 2 (Normalized): Standard softmax over full vocabulary
  • Condition 3 (Score before update): TTT chunks scored under torch.no_grad() BEFORE LoRA gradient update
  • Condition 4 (Single pass): Each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram cache
  • All artifacts < 16 MB, train < 600s, eval < 225s
  • Compile warmup uses random tokens (not val data)
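The score-before-update and single-pass conditions can be made concrete with a small sketch of the TTT evaluation loop. `score` and `update` below are stand-ins for model scoring (done under `torch.no_grad()` in the actual stack) and the LoRA gradient step; the names and structure are illustrative, not the repo's code:

```python
def ttt_eval(tokens, score, update, chunk_size=48):
    """Sketch of a rule-compliant TTT loop.

    Condition 3: every chunk is scored with the CURRENT adapter state,
    before any update derived from that chunk.
    Condition 4: each token is scored exactly once (single pass).
    """
    total_bits, n = 0.0, len(tokens)
    state = None  # adapter state prior to any test-time adaptation
    for i in range(0, n, chunk_size):
        chunk = tokens[i:i + chunk_size]
        total_bits += score(chunk, state)  # score first...
        state = update(chunk, state)       # ...then adapt; chunk never re-scored
    return total_bits / n                  # bits per token over one pass
```

With `chunk_size=48` each LoRA step sees more context than the default 32, which is the adaptation-quality argument made above.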

Test Plan

  • 3-seed verification (seeds 42, 0, 1337)
  • All artifacts under 16,000,000 bytes
  • Train under 600s on all seeds (~587s)
  • Eval under 225s on all seeds

Credits

Pavel Liashkov and others added 2 commits April 11, 2026 19:48

  • …Legal TTT — val_bpb 1.07785 (3-seed mean). 3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999. All artifacts under 16 MB, training under 600s, eval under 600s. Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03. Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • … (3-seed mean). PR openai#1530 v2 base + warmdown_frac=0.75 + TTT_CHUNK_SIZE=48 + Muon 0.97. 3-seed mean: 1.07406 (std 0.00132), 2.77441 nats. Delta vs merged SOTA (openai#1493): -0.01491 nats (clears 0.005 bar by 3.0x). All artifacts < 16 MB, train < 600s, eval < 225s.