Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean)#1536
- 3-seed mean: 1.07747 BPB (std 0.00064) / 2.78321 nats
- ~15.99 MB artifact, 8×H100 SXM, 600s
- VarLen attention (within-document only), doc-independent LoRA TTT
- Parameter banking + triple depth recurrence + parallel residuals
- PyTorch MLP fallback (no Triton/CUTLASS dependency)
- Based on PR openai#1530, PR openai#1523, PR openai#1514
PR openai#1536 base code unmodified: 2756 lines, 112 KB. VarLen Flash Attention + doc-independent LoRA TTT. No Triton, no CUTLASS. Banking with PyTorch fallback MLP. Muon 0.97 already the default. Expected: ~1.0775 BPB.
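As a rough illustration of the within-document attention restriction, here is a plain-Python sketch of how per-document `cu_seqlens` (the cumulative-length offsets consumed by `flash_attn_varlen_func`-style kernels) can be derived from end-of-document markers. The `build_cu_seqlens` helper and the `eod_id` value are hypothetical, not taken from the PR:

```python
def build_cu_seqlens(tokens, eod_id):
    """Cumulative sequence lengths from end-of-document markers.

    Varlen attention kernels treat document i as tokens[cu[i]:cu[i+1]],
    so attention never crosses a document boundary.
    """
    cu = [0]
    for pos, t in enumerate(tokens):
        if t == eod_id:
            cu.append(pos + 1)
    if cu[-1] != len(tokens):  # trailing partial document
        cu.append(len(tokens))
    return cu

tokens = [5, 7, 0, 2, 0, 9, 9, 9]  # 0 = hypothetical end-of-document id
cu = build_cu_seqlens(tokens, eod_id=0)
assert cu == [0, 3, 5, 8]  # three documents: lengths 3, 2, 3
```

In the real code these offsets would be an int32 tensor passed as `cu_seqlens` alongside the packed token sequence; the sketch only shows the boundary arithmetic.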
The compile warmup was using actual validation tokens for torch.compile JIT warmup, which could be flagged under Issue openai#1017 as touching validation data before the official eval loop. Replaced with random tokens: compile only needs correct shapes, not meaningful data.
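A minimal sketch of the fix, with hypothetical shape constants (the real values come from the training config). Since tracing only cares about shapes, any token source works:

```python
import random

# Hypothetical shapes for illustration; the real run reads these from config.
BATCH_SIZE = 8
SEQ_LEN = 1024
VOCAB_SIZE = 50304

def make_warmup_batch(batch_size=BATCH_SIZE, seq_len=SEQ_LEN,
                      vocab_size=VOCAB_SIZE, seed=0):
    """Random token ids for compile warmup.

    torch.compile tracing only needs inputs of the right shape and dtype,
    so random ids avoid touching validation shards before the eval loop.
    """
    rng = random.Random(seed)
    return [[rng.randrange(vocab_size) for _ in range(seq_len)]
            for _ in range(batch_size)]

batch = make_warmup_batch()
assert len(batch) == BATCH_SIZE and len(batch[0]) == SEQ_LEN
assert all(0 <= t < VOCAB_SIZE for row in batch for t in row)
```

In the actual script this would produce a `torch.randint` tensor on the training device; the list-of-lists form above just makes the shape contract explicit.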
Withdrawing the record claim and closing this PR. The validation-token compile warmup issue is fixed, but the current doc-TTT path still sorts validation documents by length for batching. Under a strict reading of #1017, I am treating that validation-order change as review-risk, so I'm closing this rather than representing it as leaderboard-clean.
Rule Compliance (Issue #1017)
This submission has been audited against all four conditions of Issue #1017.
Additional checks:
- No val data during training (GPTQ calibration uses train shards).
- No pre-quant TTT.
- Compile warmup uses random tokens (not val data).
- Full val split evaluated (val_doc_fraction=1.0).

Doc batching note: documents are sorted by length for LoRA TTT batching efficiency. This is score-invariant, since LoRA state resets per batch with no cross-document dependency, and is standard practice in doc-independent TTT implementations (PRs #1530, #1529, #1541).
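The score-invariance claim can be illustrated with a toy stand-in for doc-independent evaluation (the `eval_docs` helper and its hash-like per-document computation are illustrative only, not the PR's TTT code): because state resets per document, each document's score depends only on its own tokens, so any batching order produces identical per-document results.

```python
def eval_docs(docs):
    """Toy stand-in for doc-independent TTT eval.

    State is re-initialized for every document, mimicking a per-batch
    LoRA reset with no cross-document dependency.
    """
    results = {}
    for doc_id, tokens in docs:
        state = 0  # fresh (LoRA-like) state per document
        for t in tokens:
            state = (state * 31 + t) % 1_000_003  # stand-in computation
        results[doc_id] = state
    return results

docs = [(0, [3, 1, 4]), (1, [1, 5]), (2, [9, 2, 6, 5])]
by_length = sorted(docs, key=lambda d: len(d[1]))  # length-sorted batching
assert eval_docs(docs) == eval_docs(by_length)  # order cannot change scores
```

Note that this invariance argument only holds while the reset is genuinely per-document; any state carried across documents would break it.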
The previous candidate-shaped W2 run timed out during quantized TTT while its running BPB stayed in the 1.076x band, so the next fix targets scheduling rather than the mechanism itself. This change moves TTT batching from full document-order batches to stable local-window length bucketing, which should recover most of the runtime benefit of size-aware batching without restoring the previous full-dataset sort. It also shortens the local loop-visit counter key to start clawing back the remaining byte overage.

Constraint: the lane must finish inside the existing 2700s evaluator timeout while staying closer to the post-openai#1536 review boundary than full global length sorting.
Rejected: keep pure doc-order batching. It preserved the clearest story but timed out before producing a final TTT result.
Rejected: restore full global length sorting. Fastest, but reintroduces the exact validation-order criticism that sank openai#1536 as a clean claim.
Confidence: medium
Scope-risk: narrow
Directive: if this still times out, the next iteration should tune local bucket width before touching the mechanism again.
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU runtime, final BPB, and final artifact size with local-window batching.
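A minimal sketch of what stable local-window length bucketing could look like; the `local_window_buckets` helper and the window size are hypothetical, not the PR's actual implementation. The idea is to sort by length only inside fixed-size windows of the original document order, so no document moves more than one window away from its original position:

```python
def local_window_buckets(doc_lengths, window=4):
    """Sort document indices by length within fixed-size windows.

    Unlike a full global sort, a document can only be reordered inside
    its own window, bounding the deviation from the original order.
    """
    order = []
    for start in range(0, len(doc_lengths), window):
        win = list(range(start, min(start + window, len(doc_lengths))))
        win.sort(key=lambda i: doc_lengths[i])  # stable sort within window
        order.extend(win)
    return order

lengths = [50, 10, 40, 20, 90, 30, 70, 60, 5]
order = local_window_buckets(lengths, window=4)
assert sorted(order) == list(range(len(lengths)))  # a permutation
assert all(abs(i - pos) < 4 for pos, i in enumerate(order))  # bounded movement
```

The trade-off this encodes: larger windows recover more of the padding savings of size-aware batching, smaller windows stay closer to pure document order.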
Summary
Non-record / idea submission.
I am no longer treating this PR as a leaderboard / record claim.
Reason: although the validation-token compile warmup issue has been fixed, the current doc-TTT implementation still sorts validation documents by length for batching. Under a strict reading of Issue #1017, I am treating that validation-order change as review-risk, so this PR should be read as a non-record / idea submission unless that ordering is removed.
3-Seed Results
val_bpb: 1.07747 mean (std 0.00064) across 3 seeds, i.e. 2.78321 nats.
Key Innovations
- `flash_attn_varlen_func` with per-document `cu_seqlens` — attention restricted to within-document boundaries, eliminating cross-document noise

Rule-Compliance Note
The validation-token compile warmup issue has been fixed by using random tokens for compile warmup.
The remaining concern is narrower: validation documents are currently sorted by length for doc-TTT batching. The main score-first TTT path is otherwise intended to follow Track B shape, but I am not treating this implementation as leaderboard-clean until validation order is preserved.
No SLOT, no pre-quant TTT, no n-gram caches.
Credits
PR #1530 @samacqua (varlen attention + doc-independent LoRA TTT), PR #1523 @EthanYangTW (parameter banking + triple recurrence), PR #1514 @dexhunter (Muon 0.97), PR #1493 @bigbag (parallel residuals)