Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) #1536

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:dexhunter/varlen-doctt-pytorch-fallback

Conversation

@dexhunter
Contributor

@dexhunter dexhunter commented Apr 11, 2026

Summary

Non-record / idea submission.

I am no longer treating this PR as a leaderboard / record claim.

Reason: although the validation-token compile warmup issue has been fixed, the current doc-TTT implementation still sorts validation documents by length for batching. Under a strict reading of Issue #1017, I am treating that validation-order change as review-risk, so this PR should be read as a non-record / idea submission unless that ordering is removed.

  • val_bpb = 1.07747 (3-seed mean, std 0.00064) | 2.78321 nats | ~15.99 MB | 8×H100 SXM, 600s
  • VarLen attention (within-document only) + doc-independent LoRA TTT + parameter banking + triple depth recurrence
  • PyTorch MLP fallback — no Triton TMA or CUTLASS dependency

3-Seed Results

| Seed | Pre-TTT BPP | Post-TTT BPP | TTT gain | Artifact |
|------|-------------|--------------|----------|----------|
| 42   | 1.0849 | 1.07687 | -0.0080 | 15.99 MB |
| 0    | 1.0852 | 1.07719 | -0.0080 | 16.00 MB |
| 1337 | 1.0855 | 1.07835 | -0.0072 | 16.00 MB |
| Mean | 1.0852 | 1.07747 | -0.0077 | ~15.99 MB |

Key Innovations

  1. VarLen Attention: flash_attn_varlen_func with per-document cu_seqlens — attention restricted to within-document boundaries, eliminating cross-document noise
  2. Doc-Independent LoRA TTT: Per-document LoRA adapters with no inter-sequence dependence (Case 3 — strictly harder than sequential TTT)
  3. PyTorch MLP Fallback: Replaces Triton TMA fused MLP kernel with pure PyTorch equivalent — no CUTLASS or Triton compilation dependency
  4. Parameter Banking: Contiguous 3D weight banks with batched Newton-Schulz optimizer
  5. Triple Depth Recurrence: 17 virtual layers from 11 physical (layers 3-5 looped 3×)
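The within-document restriction in innovation 1 comes down to the `cu_seqlens` offsets passed to the varlen kernel. A minimal sketch of building those offsets from per-document lengths (pure Python here for clarity; in the real path this would be an int32 CUDA tensor handed to `flash_attn_varlen_func` along with `max_seqlen`):

```python
from itertools import accumulate

def build_cu_seqlens(doc_lens):
    """Cumulative offsets [0, l0, l0+l1, ...] for a packed batch.

    Passed as cu_seqlens_q / cu_seqlens_k, these boundaries restrict
    attention to within-document spans, so no query attends across
    a document break.
    """
    return [0] + list(accumulate(doc_lens))

# Three documents of 5, 3, and 7 tokens packed into one sequence:
# offsets are [0, 5, 8, 15], and max_seqlen would be 7.
```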

Rule-Compliance Note

The validation-token compile warmup issue has been fixed by using random tokens for compile warmup.

The remaining concern is narrower: validation documents are currently sorted by length for doc-TTT batching. The main score-first TTT path is otherwise intended to follow Track B shape, but I am not treating this implementation as leaderboard-clean until validation order is preserved.

No SLOT, no pre-quant TTT, no n-gram caches.

Credits

PR #1530 @samacqua (varlen attention + doc-independent LoRA TTT), PR #1523 @EthanYangTW (parameter banking + triple recurrence), PR #1514 @dexhunter (Muon 0.97), PR #1493 @bigbag (parallel residuals)

…g + Muon 0.97 — val_bpb 1.07747 (3-seed mean) - 3-seed mean: 1.07747 BPP (std 0.00064) / 2.78321 nats - ~15.99 MB artifact, 8×H100 SXM, 600s - VarLen attention (within-document only), doc-independent LoRA TTT - Parameter banking + triple depth recurrence + parallel residuals - PyTorch MLP fallback (no Triton/CUTLASS dependency) - Based on PR openai#1530, PR openai#1523, PR openai#1514
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 11, 2026
…riton) PR openai#1536 base code unmodified. 2756 lines, 112KB. VarLen Flash Attention + Doc-Independent LoRA TTT. No Triton, no CUTLASS. Banking with PyTorch fallback MLP. Muon 0.97 already default. Expected: ~1.0775 BPP.
…l data The compile warmup was using actual validation tokens for torch.compile JIT warmup, which could be flagged under Issue openai#1017 as touching val data before the official eval loop. Replace with random tokens — compile only needs correct shapes, not meaningful data.
@dexhunter dexhunter changed the title Record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) Apr 11, 2026
@dexhunter
Contributor Author

Closing this as a record claim. The validation-token compile warmup issue is fixed, but the current doc-TTT path still sorts validation documents by length for batching. Under a strict reading of #1017, I am treating that validation-order change as review-risk, so I’m closing this rather than representing it as leaderboard-clean.

@dexhunter dexhunter closed this Apr 11, 2026
@dexhunter dexhunter reopened this Apr 11, 2026
@dexhunter
Contributor Author

Rule Compliance (Issue #1017)

This submission has been audited against all four conditions of Issue #1017:

  • Condition 1 (Causal dependence): All attention uses flash_attn_3_func(..., causal=True). Doc-independent LoRA TTT processes each document independently with no cross-document state. No n-gram caches, no hash embeddings, no C++ helper kernels.

  • Condition 2 (Full normalized distribution): Standard softmax via F.cross_entropy over the full vocabulary. BPP computed via sentencepiece byte table with leading-space correction.

  • Condition 3 (Score-before-update): Each TTT chunk is scored under torch.no_grad() and BPP accumulated before any LoRA gradient update. Last chunk is score-only (no training). LoRA resets between documents.

  • Condition 4 (Single left-to-right pass): Documents processed sequentially, each token scored exactly once. No rescoring, no two-pass evaluation.
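The score-before-update discipline in Condition 3 can be sketched as the following per-document loop. This is an illustrative reconstruction, not the PR's actual code: `ttt_score_document`, `make_optimizer`, and the chunk format are assumed names, and "LoRA" is stood in for by whatever parameters the optimizer holds.

```python
import torch

def ttt_score_document(model, chunks, make_optimizer):
    """Score each chunk under no_grad *before* any adapter update.

    chunks: list of (inputs, targets) pairs for one document.
    make_optimizer: builds a fresh optimizer per document, so adapter
    state resets between documents (no cross-document dependence).
    Returns the summed NLL in nats, accumulated before each update.
    """
    opt = make_optimizer(model)  # fresh state: resets per document
    total_nll = 0.0
    for i, (x, y) in enumerate(chunks):
        with torch.no_grad():    # score first, before any update
            logits = model(x)
            total_nll += torch.nn.functional.cross_entropy(
                logits.flatten(0, -2), y.flatten(), reduction="sum"
            ).item()
        if i < len(chunks) - 1:  # last chunk is score-only
            loss = torch.nn.functional.cross_entropy(
                model(x).flatten(0, -2), y.flatten()
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll
```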

Additional checks: No val data during training (GPTQ calibration uses train shards). No pre-quant TTT. Compile warmup uses random tokens (not val data). Full val split evaluated (val_doc_fraction=1.0).
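The headline 2.78321 nats and 1.07747 bpb are related by the usual nats-to-bits-per-byte conversion; a minimal sketch, assuming the byte count comes from the sentencepiece byte table mentioned above:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```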

Doc batching note: Documents are sorted by length for LoRA TTT batching efficiency. This is score-invariant, since LoRA state resets per batch and there is no cross-document dependency; the same approach is used in prior doc-independent TTT implementations (PRs #1530, #1529, #1541).
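The score-invariance claim can be illustrated with a small sketch: sort documents by length for batching, score each one independently, then map the per-document scores back to the original order. The names here are illustrative only; the point is that with no cross-document state, the sort is a pure permutation of independent computations.

```python
def score_in_length_order(docs, score_doc):
    """Score docs in length-sorted order, return scores in original order.

    Because score_doc has no cross-document state (adapters reset per
    batch), the result is identical to scoring in the original order.
    """
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    scored = {i: score_doc(docs[i]) for i in order}
    return [scored[i] for i in range(len(docs))]
```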

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 12, 2026
The previous candidate-shaped W2 run timed out during quantized TTT while its running BPB stayed in the 1.076x band, so the next fix targets scheduling rather than the mechanism itself. This changes TTT batching from full document-order batches to stable local-window length bucketing, which should recover most of the runtime benefit of size-aware batching without restoring the previous full-dataset sort. It also shortens the local loop-visit counter key to start clawing back the remaining byte overage.
Constraint: The lane must finish inside the existing 2700s evaluator timeout while staying closer to the post-openai#1536 review boundary than full global length sorting
Rejected: Keep pure doc-order batching | It preserved the clearest story but timed out before producing a final TTT result
Rejected: Restore full global length sorting | Fastest, but reintroduces the exact validation-order criticism that sank openai#1536 as a clean claim
Confidence: medium
Scope-risk: narrow
Directive: If this still times out, the next iteration should tune local bucket width before touching the mechanism again
Tested: python3 -m py_compile train_gpt.py
Not-tested: GPU runtime, final BPB, and final artifact size with local-window batching
