Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) #1536

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:dexhunter/varlen-doctt-pytorch-fallback

Conversation

@dexhunter
Contributor

@dexhunter dexhunter commented Apr 11, 2026

Summary

Non-record / idea submission.

I am no longer treating this PR as a leaderboard / record claim.

Reason: although the validation-token compile warmup issue has been fixed, the current doc-TTT implementation still sorts validation documents by length for batching. Under a strict reading of Issue #1017, I am treating that validation-order change as review-risk, so this PR should be read as a non-record / idea submission unless that ordering is removed.

  • val_bpb = 1.07747 (3-seed mean, std 0.00064) | 2.78321 nats | ~15.99 MB | 8×H100 SXM, 600s
  • VarLen attention (within-document only) + doc-independent LoRA TTT + parameter banking + triple depth recurrence
  • PyTorch MLP fallback — no Triton TMA or CUTLASS dependency

3-Seed Results

| Seed | Pre-TTT BPP | Post-TTT BPP | TTT gain | Artifact |
|------|-------------|--------------|----------|----------|
| 42   | 1.0849 | 1.07687 | -0.0080 | 15.99 MB |
| 0    | 1.0852 | 1.07719 | -0.0080 | 16.00 MB |
| 1337 | 1.0855 | 1.07835 | -0.0072 | 16.00 MB |
| Mean | 1.0852 | 1.07747 | -0.0077 | ~15.99 MB |

Key Innovations

  1. VarLen Attention: flash_attn_varlen_func with per-document cu_seqlens — attention restricted to within-document boundaries, eliminating cross-document noise
  2. Doc-Independent LoRA TTT: Per-document LoRA adapters with no inter-sequence dependence (Case 3 — strictly harder than sequential TTT)
  3. PyTorch MLP Fallback: Replaces Triton TMA fused MLP kernel with pure PyTorch equivalent — no CUTLASS or Triton compilation dependency
  4. Parameter Banking: Contiguous 3D weight banks with batched Newton-Schulz optimizer
  5. Triple Depth Recurrence: 17 virtual layers from 11 physical (layers 3-5 looped 3×)
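The within-document restriction in innovation 1 comes down to the `cu_seqlens` offsets passed to the varlen kernel. A minimal sketch of building those offsets from per-document lengths (pure Python here for clarity; in the real path this would be an int32 CUDA tensor handed to `flash_attn_varlen_func` along with `max_seqlen`):

```python
from itertools import accumulate

def build_cu_seqlens(doc_lens):
    """Cumulative offsets [0, l0, l0+l1, ...] for a packed batch.

    Passed as cu_seqlens_q / cu_seqlens_k, these boundaries restrict
    attention to within-document spans, so no query attends across
    a document break.
    """
    return [0] + list(accumulate(doc_lens))

# Three documents of 5, 3, and 7 tokens packed into one sequence:
# offsets are [0, 5, 8, 15], and max_seqlen would be 7.
```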

Rule-Compliance Note

The validation-token compile warmup issue has been fixed by using random tokens for compile warmup.

The remaining concern is narrower: validation documents are currently sorted by length for doc-TTT batching. The main score-first TTT path is otherwise intended to follow Track B shape, but I am not treating this implementation as leaderboard-clean until validation order is preserved.

No SLOT, no pre-quant TTT, no n-gram caches.

Credits

PR #1530 @samacqua (varlen attention + doc-independent LoRA TTT), PR #1523 @EthanYangTW (parameter banking + triple recurrence), PR #1514 @dexhunter (Muon 0.97), PR #1493 @bigbag (parallel residuals)

…g + Muon 0.97 — val_bpb 1.07747 (3-seed mean) - 3-seed mean: 1.07747 BPP (std 0.00064) / 2.78321 nats - ~15.99 MB artifact, 8×H100 SXM, 600s - VarLen attention (within-document only), doc-independent LoRA TTT - Parameter banking + triple depth recurrence + parallel residuals - PyTorch MLP fallback (no Triton/CUTLASS dependency) - Based on PR openai#1530, PR openai#1523, PR openai#1514
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
…riton) PR openai#1536 base code unmodified. 2756 lines, 112KB. VarLen Flash Attention + Doc-Independent LoRA TTT. No Triton, no CUTLASS. Banking with PyTorch fallback MLP. Muon 0.97 already default. Expected: ~1.0775 BPP.
…l data The compile warmup was using actual validation tokens for torch.compile JIT warmup, which could be flagged under Issue openai#1017 as touching val data before the official eval loop. Replace with random tokens — compile only needs correct shapes, not meaningful data.
@dexhunter dexhunter changed the title Record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) Apr 11, 2026
@dexhunter
Contributor Author

Closing this as a record claim. The validation-token compile warmup issue is fixed, but the current doc-TTT path still sorts validation documents by length for batching. Under a strict reading of #1017, I am treating that validation-order change as review-risk, so I’m closing this rather than representing it as leaderboard-clean.

@dexhunter dexhunter closed this Apr 11, 2026
@dexhunter dexhunter reopened this Apr 11, 2026
@dexhunter
Contributor Author

Rule Compliance (Issue #1017)

This submission has been audited against all four conditions of Issue #1017:

  • Condition 1 (Causal dependence): All attention uses flash_attn_3_func(..., causal=True). Doc-independent LoRA TTT processes each document independently with no cross-document state. No n-gram caches, no hash embeddings, no C++ helper kernels.

  • Condition 2 (Full normalized distribution): Standard softmax via F.cross_entropy over the full vocabulary. BPP computed via sentencepiece byte table with leading-space correction.

  • Condition 3 (Score-before-update): Each TTT chunk is scored under torch.no_grad() and BPP accumulated before any LoRA gradient update. Last chunk is score-only (no training). LoRA resets between documents.

  • Condition 4 (Single left-to-right pass): Documents processed sequentially, each token scored exactly once. No rescoring, no two-pass evaluation.
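The score-before-update discipline in Condition 3 can be sketched as the following per-document loop. This is an illustrative reconstruction, not the PR's actual code: `ttt_score_document`, `make_optimizer`, and the chunk format are assumed names, and "LoRA" is stood in for by whatever parameters the optimizer holds.

```python
import torch

def ttt_score_document(model, chunks, make_optimizer):
    """Score each chunk under no_grad *before* any adapter update.

    chunks: list of (inputs, targets) pairs for one document.
    make_optimizer: builds a fresh optimizer per document, so adapter
    state resets between documents (no cross-document dependence).
    Returns the summed NLL in nats, accumulated before each update.
    """
    opt = make_optimizer(model)  # fresh state: resets per document
    total_nll = 0.0
    for i, (x, y) in enumerate(chunks):
        with torch.no_grad():    # score first, before any update
            logits = model(x)
            total_nll += torch.nn.functional.cross_entropy(
                logits.flatten(0, -2), y.flatten(), reduction="sum"
            ).item()
        if i < len(chunks) - 1:  # last chunk is score-only
            loss = torch.nn.functional.cross_entropy(
                model(x).flatten(0, -2), y.flatten()
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll
```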

Additional checks: No val data during training (GPTQ calibration uses train shards). No pre-quant TTT. Compile warmup uses random tokens (not val data). Full val split evaluated (val_doc_fraction=1.0).
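The headline 2.78321 nats and 1.07747 bpb are related by the usual nats-to-bits-per-byte conversion; a minimal sketch, assuming the byte count comes from the sentencepiece byte table mentioned above:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```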

Doc batching note: Documents are sorted by length for LoRA TTT batching efficiency. This is score-invariant, since LoRA state resets per batch and there is no cross-document dependency; the same approach is used in prior doc-independent TTT implementations (PRs #1530, #1529, #1541).
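The score-invariance claim can be illustrated with a small sketch: sort documents by length for batching, score each one independently, then map the per-document scores back to the original order. The names here are illustrative only; the point is that with no cross-document state, the sort is a pure permutation of independent computations.

```python
def score_in_length_order(docs, score_doc):
    """Score docs in length-sorted order, return scores in original order.

    Because score_doc has no cross-document state (adapters reset per
    batch), the result is identical to scoring in the original order.
    """
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    scored = {i: score_doc(docs[i]) for i in order}
    return [scored[i] for i in range(len(docs))]
```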

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 12, 2026
The previous candidate-shaped W2 run timed out during quantized TTT while its running BPB stayed in the 1.076x band, so the next fix targets scheduling rather than the mechanism itself. This changes TTT batching from full document-order batches to stable local-window length bucketing, which should recover most of the runtime benefit of size-aware batching without restoring the previous full-dataset sort. It also shortens the local loop-visit counter key to start clawing back the remaining byte overage.
Constraint: The lane must finish inside the existing 2700s evaluator timeout while staying closer to the post-openai#1536 review boundary than full global length sorting
Rejected: Keep pure doc-order batching | It preserved the clearest story but timed out before producing a final TTT result
Rejected: Restore full global length sorting | Fastest, but reintroduces the exact validation-order criticism that sank openai#1536 as a clean claim
Confidence: medium
Scope-risk: narrow
Directive: If this still times out, the next iteration should tune local bucket width before touching the mechanism again
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU runtime, final BPB, and final artifact size with local-window batching
