
Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean)#1560

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:dexhunter/varlen-ttt-warmdown75-chunk48

Conversation

@dexhunter
Contributor

Summary

val_bpb = 1.07406 (3-seed mean, std 0.00132) | 2.77441 nats | ~15.99 MB | 8xH100 SXM, 600s

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT Gain | TTT Time | Artifact (bytes) |
|------|-------|-------------|--------------|----------|----------|------------------|
| 42   | 4918  | 1.08400     | 1.07352      | -0.01048 | 213s     | 15,994,146       |
| 0    | 4900  | 1.08363     | 1.07310      | -0.01053 | 221s     | 15,997,570       |
| 1337 | 4908  | 1.08619     | 1.07556      | -0.01063 | 219s     | 15,988,610       |
| Mean | 4909  | 1.08461     | 1.07406      | -0.01055 | 218s     | 15,993,442       |
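The reported headline numbers can be reproduced directly from the per-seed Post-TTT column above; a quick check (using the sample standard deviation, which is what matches the reported std):

```python
import statistics

# Post-TTT BPB for seeds 42, 0, 1337 (values from the table above)
post_ttt = [1.07352, 1.07310, 1.07556]

mean = statistics.mean(post_ttt)   # 3-seed mean, ~1.07406
std = statistics.stdev(post_ttt)   # sample std (n-1), ~0.00132
```

Both round to the values in the summary line (1.07406, std 0.00132).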

Merged SOTA (PR #1493 @bigbag): 1.0810 BPB (2.78932 nats). Delta: -0.01491 nats (clears 0.005 bar by 3.0x).

Key Innovation

Warmdown fraction and TTT chunk size tuning on PR #1530's VarLen + Triton fused MLP + doc-TTT stack:

  • warmdown_frac = 0.75 (vs 0.72 default) — extends cosine decay, letting the model settle into a lower-loss basin before quantization
  • TTT_CHUNK_SIZE = 48 (vs 32 default) — larger chunks provide more context per TTT gradient step, improving LoRA adaptation
  • Muon momentum 0.97 — shorter memory horizon tracks the loss surface better during extended warmdown
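For readers unfamiliar with the warmdown knob, the schedule change can be sketched as follows. This is a minimal illustration of a constant-then-cosine-decay LR schedule where the final `warmdown_frac` of training decays; the function name and signature are illustrative, not the repo's actual API:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_frac: float = 0.75) -> float:
    """Hypothetical schedule: hold base_lr, then cosine-decay to 0
    over the final `warmdown_frac` of training steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    # Progress through the warmdown window, 0 -> 1
    t = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

Raising `warmdown_frac` from 0.72 to 0.75 starts the decay earlier, so the model spends a longer stretch at small learning rates before quantization.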

Rule Compliance (Issue #1017)

  • Condition 1 (Causality): VarLen attention with per-document cu_seqlens, strict causal masking
  • Condition 2 (Normalized): Standard softmax over full vocabulary
  • Condition 3 (Score before update): TTT chunks scored under torch.no_grad() BEFORE LoRA gradient update
  • Condition 4 (Single pass): Each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram cache
  • All artifacts < 16 MB, train < 600s, eval < 225s
  • Compile warmup uses random tokens (not val data)
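The score-before-update and single-pass conditions can be made concrete with a small sketch of the TTT evaluation loop. `score` and `update` below are stand-ins for model scoring (done under `torch.no_grad()` in the actual stack) and the LoRA gradient step; the names and structure are illustrative, not the repo's code:

```python
def ttt_eval(tokens, score, update, chunk_size=48):
    """Sketch of a rule-compliant TTT loop.

    Condition 3: every chunk is scored with the CURRENT adapter state,
    before any update derived from that chunk.
    Condition 4: each token is scored exactly once (single pass).
    """
    total_bits, n = 0.0, len(tokens)
    state = None  # adapter state prior to any test-time adaptation
    for i in range(0, n, chunk_size):
        chunk = tokens[i:i + chunk_size]
        total_bits += score(chunk, state)  # score first...
        state = update(chunk, state)       # ...then adapt; chunk never re-scored
    return total_bits / n                  # bits per token over one pass
```

With `chunk_size=48` each LoRA step sees more context than the default 32, which is the adaptation-quality argument made above.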

Test Plan

  • 3-seed verification (seeds 42, 0, 1337)
  • All artifacts under 16,000,000 bytes
  • Train under 600s on all seeds (~587s)
  • Eval under 225s on all seeds

Credits

Pavel Liashkov and others added 2 commits April 11, 2026 19:48

  • …Legal TTT — val_bpb 1.07785 (3-seed mean). 3-seed mean: 1.07785 (std 0.00047), seeds 42/314/999. All artifacts under 16 MB, training under 600s, eval under 600s. Improved parallel residuals (cross-lane routing), Muon 0.97, MATRIX_LR=0.03. Score-first TTT (SGD 3ep), no SLOT, no pre-quant TTT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • … (3-seed mean). PR openai#1530 v2 base + warmdown_frac=0.75 + TTT_CHUNK_SIZE=48 + Muon 0.97. 3-seed mean: 1.07406 (std 0.00132), 2.77441 nats. Delta vs merged SOTA (openai#1493): -0.01491 nats (clears 0.005 bar by 3.0x). All artifacts < 16 MB, train < 600s, eval < 225s.