
Record: SP8192 + VarLen Attention + LoRA TTT + Fused MLP — val_bpb 1.0777 (3-seed mean)#1540

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/sp8192-varlen-lora-ttt-fused

Conversation

@aryanbhosale (Contributor) commented Apr 11, 2026

Record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Fused MLP

val_bpb = 1.0777 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM

3-Seed Results

| Seed | LoRA TTT BPB | val_loss (nats) |
| ---- | ------------ | --------------- |
| 42   | 1.0775       | 2.7834          |
| 314  | 1.0776       | 2.7834          |
| 999  | 1.0780       | 2.7845          |
| **Mean** | **1.0777** | **2.7838**    |

Merged SOTA (PR #1493): 1.0810 bpb. Delta vs. SOTA: −0.0082 nats of val_loss, clearing the 0.005-nat record threshold by 0.0032.

Novel: Fused Triton MLP via importlib Wrapper

Integrates PR #1523's Triton TMA fused MLP into PR #1536's VarLen + LoRA TTT stack using an importlib-based code loader that satisfies Triton's inspect.getsourcelines() requirement for compressed wrappers: the decompressed code is written to a temp file and loaded as __main__ via importlib.util.spec_from_file_location.
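A minimal sketch of such a loader (the function name `load_decompressed_module` is illustrative, not the PR's actual helper). Because the decompressed source lands in a real file on disk, `inspect.getsourcelines()` can resolve it, which is what Triton's JIT needs:

```python
import base64
import importlib.util
import lzma
import os
import sys
import tempfile


def load_decompressed_module(compressed_b64: str, module_name: str = "__main__"):
    """Decompress base64+lzma source to a real temp file, then import it.

    Writing to a real file (rather than exec-ing a string) lets Triton's
    inspect.getsourcelines() find the kernel source.
    """
    source = lzma.decompress(base64.b64decode(compressed_b64)).decode()
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before exec for self-imports
    spec.loader.exec_module(module)
    return module
```

Loading under a name other than `__main__` is safer for testing; the PR's shim uses `__main__` so the payload behaves like a directly executed script.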

Full Stack

  1. VarLen Attention — within-document only (PR #1536, @dexhunter)
  2. Doc-Independent LoRA TTT — rank-96, score-first, no inter-doc dependence
  3. Fused Triton TMA MLP — +5% throughput (PR #1523, @abaybektursun)
  4. Triple Recurrence (L3-5) + Parallel Residuals (L7+)
  5. Muon 0.97 + QK-Gain 5.25 + SDClip + Brotli

Compliance (Track B)

Score-first LoRA TTT. No SLOT, no hash embed, no pre-quant TTT, no n-gram. All conditions from Issue #1017 satisfied.

Doc sorting note: _build_ttt_global_batches sorts documents by length for batching efficiency only. The LoRA adapter is reset to zero before every batch (reusable_lora.reset()), so there is no inter-batch state. Each document's BPB is computed independently of batch composition and processing order. The final val_bpb is a global sum of per-token losses divided by total bytes, which is order invariant. Same approach as PR #1530.
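A toy illustration of the order-invariance claim (illustrative numbers, not the run's data): with each document's per-token losses fixed, the global bpb is identical under the length sort or any shuffle, because it is one global sum divided by total bytes.

```python
import math
import random

# (per-token losses in nats, byte count) for three toy documents
docs = [([0.9, 1.1, 1.0], 3), ([1.3, 0.7], 2), ([1.0, 1.0, 1.2, 0.8], 4)]


def val_bpb(doc_list):
    """Global sum of per-token losses over total bytes, converted to bits."""
    total_loss = sum(sum(losses) for losses, _ in doc_list)
    total_bytes = sum(nbytes for _, nbytes in doc_list)
    return total_loss / (total_bytes * math.log(2))


baseline = val_bpb(docs)
by_length = sorted(docs, key=lambda d: d[1])   # the batching-efficiency sort
shuffled = random.sample(docs, len(docs))
assert abs(val_bpb(by_length) - baseline) < 1e-12
assert abs(val_bpb(shuffled) - baseline) < 1e-12
```

This invariance only holds because the adapter is reset per batch, so no document's losses depend on batch composition; that is exactly the property the reviewers below probe.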

Credits

PR #1536 @dexhunter, PR #1523 @abaybektursun, PR #1530 @samacqua, PR #1394 @clarkkev, PR #1493 @bigbag

@dexhunter (Contributor) commented Apr 11, 2026

I may be missing something, but there looks to be one remaining Issue #1017 review-risk here.

In the VarLen + doc-TTT path, _build_ttt_global_batches sorts validation documents by length before batching via:

global_doc_entries = sorted(doc_entries, key=lambda x: x[1][1])

My reading of Issue #1017 is that validation order should be preserved, and this seems like a reordering of the validation set before scoring.

Even if the LoRA state is reset per batch and the score may be invariant in practice, I think this is still worth clarifying because under a strict reading it is no longer the default validation order.

If I am missing something about why this remains guideline-clean, a short note in the PR would help reviewers a lot.

@aryanbhosale (Contributor, Author) commented


Good question! The sort is just for batching efficiency so documents of similar length get grouped together and there's less padding waste. It doesn't affect any document's score though.

The LoRA adapter gets reset to zero before every batch (reusable_lora.reset()), so there's no state carried between batches. Each document's BPB is computed completely independently of which other docs happen to be in the same batch and of the order the batches run in. The final val_bpb is just a global sum of per-token losses divided by total bytes, which is order invariant.

PR #1530 (samacqua) uses the same length-sorting approach for doc-independent LoRA TTT. I can add a clarifying comment in the code if that would help reviewers.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 11, 2026
…RA TTT doc-independent legal; BPB bug alert

  • PR openai#1541 (bigbag, 1.07785): Improved Parallel Residuals cross-lane + Muon 0.97 — open, hash-embed flag pending
  • PR openai#1540 (aryanbhosale, 1.0777): VarLen Attention + Doc-Independent LoRA TTT rank-96 (score-first, resets per batch) — appears legal
  • PR openai#1539 confirmed illegal (Pre-Quant AdamW TTT, same ruling as openai#771)
  • PR openai#1545 has a BPB double-counting bug: claimed score ~1.028, actual ~1.18
  • PR openai#758 effectively dead: TTT contradiction + unnormalized n-gram both flagged
  • Session 10 lessons: MATRIX_LR=0.03 pairs with Muon 0.97; doc-independent LoRA TTT is adoptable
  • No merged SOTA change (still 1.0810); target remains ≤1.0760

https://claude.ai/code/session_01LgqwEDyFnyHsBbyJiSFUjK
@MatoTeziTanka commented

Community Review — SP8192 + VarLen Attention + LoRA TTT + Fused MLP

Thanks @aryanbhosale — this is a substantial novel piece of work. I want to flag one import-path issue I hit during the smoke test, then a deeper compliance read on the LoRA TTT path.

What I found in the code (head SHA bf9e5b4f881336e500ebdbb6114797b63c81e54c, records/track_10min_16mb/2026-04-11_SP8192_VarLen_LoRATTT_FusedMLP/train_gpt.py, decoded from the import lzma as L,base64 as B,importlib.util as IU,os,sys,fcntl self-extracting shim — 115,799 bytes, 2,809 lines of actual source):

  • Hyperparameters: ttt_lora_rank=96, ttt_k_lora=True, ttt_mlp_lora=True, ttt_o_lora=True (L87-89, 93 on the decoded payload)
  • BatchedLinearLoRA at L1108 — wraps a base Linear with a LoRA (A@B low-rank delta) added at forward time
  • BatchedTTTLoRA at L1126 — batches multiple independent LoRA slots so each element of the eval batch can carry its own adapter state
  • forward_ttt at L946 — forward pass that routes through _block_with_lora (L1032), _block_with_lora_attn (L1069), _block_with_lora_mlp (L1100) to inject the per-batch LoRA delta into each block
  • eval_val_ttt_lora at L2121 — the TTT-adapted eval path
  • eval_val_sliding at L1948 — uses BOS-ID-delimited cu_seqlens for document-packed varlen attention (via flash_attn_varlen_func)
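The BatchedLinearLoRA wrapping described above can be sketched roughly as follows (pure-Python matmuls for self-containment; class and method names are my assumptions, not the decoded payload's API). The point the independence argument relies on: with A and B zero-initialized, the low-rank delta vanishes after reset(), so the forward pass equals the frozen base layer.

```python
def matmul(X, Y):
    """Naive matrix multiply over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def madd(X, Y):
    """Elementwise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


class LinearLoRA:
    """Frozen base Linear plus a rank-r LoRA delta: y = x @ W + (x @ A) @ B."""

    def __init__(self, W, rank):
        self.W = W          # frozen base weight, shape (d_in, d_out)
        self.rank = rank
        self.reset()

    def reset(self):
        # Zero-init A and B: the delta term is exactly zero, so evaluation
        # after reset() is identical to the base model.
        d_in, d_out = len(self.W), len(self.W[0])
        self.A = [[0.0] * self.rank for _ in range(d_in)]
        self.B = [[0.0] * d_out for _ in range(self.rank)]

    def forward(self, x):
        return madd(matmul(x, self.W), matmul(matmul(x, self.A), self.B))
```

The payload's batched variant additionally carries one (A, B) pair per eval-batch element, which is what makes per-slot isolation possible in principle.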

Import issue. The top of the decoded payload does from flash_attn_interface import (flash_attn_varlen_func, ...). My local flash_attn stub covers the plain flash_attn_func path but not flash_attn_varlen_func, so the CPU smoke test on CT2038 (proteus-engine, 2026-04-11) failed with:

IMPORT_FAIL error=ImportError("cannot import name 'flash_attn_varlen_func' from 'flash_attn_interface' (unknown location)") 

This is a CPU-stub limitation, not a PR defect — FA3 is available on H100s where this script is intended to run. But it does mean no CPU import/HP/forward verification from my side; the structural review below is static-code only.
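For context, the cu_seqlens argument that flash_attn_varlen_func consumes can be derived from BOS positions roughly as below (a hypothetical sketch matching the description of eval_val_sliding above, not the decoded code; the BOS id is illustrative). The varlen kernel then restricts attention to within each [start, end) span, which is what enforces the within-document-only property.

```python
BOS_ID = 7  # illustrative BOS token id, not the tokenizer's actual value


def build_cu_seqlens(tokens, bos_id=BOS_ID):
    """Cumulative sequence boundaries for a packed token buffer.

    Every BOS token opens a new document; the buffer start is treated as a
    boundary even if it is not a BOS, and the buffer end closes the last doc.
    """
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    return starts + [len(tokens)]


# Three packed docs of lengths 3, 2, 4:
packed = [BOS_ID, 1, 2, BOS_ID, 3, BOS_ID, 4, 5, 6]
# build_cu_seqlens(packed) -> [0, 3, 5, 9]
```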

Doc-independent TTT & LoRA compliance — the actual question.

Per-batch independent LoRA slots are an interesting answer to the Issue #1336 / #1017 causal-dependence question: if the LoRA adapter that scores token t is only ever optimized against tokens from other documents (or from an earlier chunk that has already been scored under a different adapter), the delta at the moment t is scored is a function of x_1..x_{t-1} only, and the pattern is legal.

Three things I wasn't able to fully verify from a single static read:

  1. Does eval_val_ttt_lora score chunk ci under an adapter trained on chunks 0..ci-1 only, or does the LoRA that's active when chunk ci is scored already have gradient updates from chunk ci itself? The per-batch slot plumbing makes this subtler than the single-global-delta case.
  2. Cross-document isolation: eval_val_sliding packs multiple docs into a single cu_seqlens batch, and the varlen attention kernel enforces that attention doesn't cross doc boundaries. Does the LoRA adapter respect the same boundary — i.e. is the LoRA gradient for doc A's tokens prevented from flowing into doc B's scoring? If the answer is "yes via per-slot separation" the pattern is clean; if the answer is "no, one adapter per batch mixes all docs" the delta at doc B's scored position depends on doc A's targets (and vice versa), which may or may not be legal depending on ordering.
  3. _build_ttt_global_batches at L2051 partitions documents across ranks; the batch ordering there determines whether the doc-independence claim holds globally.
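For question 1, the "clean" ordering would be a score-then-adapt loop like the following (hypothetical names; this sketches the property to verify, not the PR's code): chunk ci is scored under an adapter that has only seen chunks 0..ci-1, and the adapter is reset at each document start.

```python
def score_first_ttt(chunks, score, adapt, reset):
    """Score each chunk BEFORE any gradient step on it.

    Guarantees the adapter state used to score chunk ci is a function of
    chunks 0..ci-1 only, which is the causal-dependence property at issue.
    """
    reset()                    # adapter starts at zero for this document
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # score with state from earlier chunks
        adapt(chunk)                 # only now update on this chunk
    return losses
```

If eval_val_ttt_lora matches this ordering per slot (and slots never mix documents), the pattern is the legal one; if adapt runs before score for the same chunk, it is not.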


Verdict: LOOKS INTERESTING, NEEDS DEEPER AUDIT — I'm not ready to call this clean or flag it without the answers above, and the doc-independent framing is a plausible path to a legal TTT variant that the community SLOT cluster could borrow from.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending author clarification on the per-slot / per-doc scoring-before-adaptation ordering in eval_val_ttt_lora. This is materially different from the HOLD SLOT cluster because the LoRA slot separation is a real attempt at causal isolation rather than scored-region optimization. If the ordering checks out, this is merge-eligible and interesting research.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL due to flash_attn_varlen_func missing from my flash_attn stub — a known stub limitation, not a PR defect. Static code review performed against the decoded payload (115,799 bytes, 2,809 lines) of the lzma self-extracting train_gpt.py shim at SHA bf9e5b4f881336e500ebdbb6114797b63c81e54c. AI tooling: review drafted with Claude Code (Opus) using an internal review template; the batch-9 subagent quota was exhausted mid-review, so this review was authored in the main session with reduced static-audit depth — the flagged questions above are the places where I'd want a confirmation read rather than statements made from full verification.

