Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant) #1554

Open

dentity007 wants to merge 2 commits into openai:main from NathanMaine:research/universal-transformer-legal-ttt
Conversation

@dentity007

Legal resubmission of PR #1193 per @MatoTeziTanka review

Status: Non-record research submission

Hardware: NVIDIA DGX Spark GB10 (single GPU, aarch64)

val_bpb: TBD (Spark run in progress, will update this PR when complete)

Background

PR #1193 (original Universal Transformer submission) was flagged on 2026-04-11 by @MatoTeziTanka for using an illegal TTT pattern. His review was thorough: the ttt_adapt() function trained multi-epoch on val_tokens without score-first discipline, matching the same pattern that closed PR #1376 and the rest of the Pre-Quant TTT cluster.

His recommendation was clear: resubmit with the TTT function taking a training-data slice instead of val_tokens, per the PR #1416 / PR #1423 reference implementations. This PR does exactly that.

PR #1193 remains open with the honest no-TTT numbers (val_bpb 3.2483 from my clean Spark ablation) and a full acknowledgment of the flag. This PR is the proper legal version with TTT enabled.

What Changed

  1. TTT function signature. ttt_adapt() now takes train_slice_tokens instead of val_tokens. The parameter name documents the intent explicitly.

  2. Training slice source. Before the TTT call, the submission loads a fixed window from the tail of the last fineweb_train_*.bin shard. This slice was not used during main training (we train only on the prefix of each shard, up to `iterations` steps) and is never part of fineweb_val_*.bin. No val_tokens touch the TTT gradient path at any point.

  3. Evaluation unchanged. val_tokens are scored exactly once, in a single left-to-right pass, after all training (including TTT) has finished. TTT updates shift model weights but do not influence how val tokens are scored.
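As a sketch of the pattern described above — multi-epoch adaptation driven only by training-slice tokens — here is a minimal hypothetical `ttt_adapt()`. The function name matches the description in this PR, but the body, the toy model, and the hyperparameter handling are illustrative only, not the submission's actual code:

```python
import torch
import torch.nn as nn

def ttt_adapt(model, train_slice_tokens, epochs=3, lr=5e-4, seq_len=64):
    """Multi-epoch TTT on a training-data slice -- never val_tokens.
    Hypothetical sketch; the submission's real function differs in detail."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    # Reshape the flat token slice into (batch, seq_len) next-token pairs.
    n = (train_slice_tokens.numel() - 1) // seq_len * seq_len
    x = train_slice_tokens[:n].view(-1, seq_len)        # inputs
    y = train_slice_tokens[1:n + 1].view(-1, seq_len)   # next-token targets
    losses = []
    for _ in range(epochs):
        logits = model(x)                               # (B, T, vocab)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Toy demo: a stand-in "model" mapping token ids to logits.
toy = nn.Sequential(nn.Embedding(64, 16), nn.Linear(16, 64))
tokens = torch.randint(0, 64, (513,))                   # training-slice tokens
loss_curve = ttt_adapt(toy, tokens, epochs=3, seq_len=16)
```

The key property is visible in the signature: only `train_slice_tokens` ever reaches `loss.backward()`.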

Universal Transformer Architecture (unchanged from #1193)

  • Single shared transformer block looped N times
  • Per-iteration learnable parameters: attn_scale, mlp_scale, resid_mix, iteration_embed
  • 50 percent sparse-to-dense curriculum during training
  • Implements OpenAI's requested "Universal transformer" research direction from the README
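A minimal sketch of the looped shared block. Parameter names are borrowed from the bullet list above; the real per-iteration parameterization and the sparse-to-dense curriculum are not reproduced here:

```python
import torch
import torch.nn as nn

class UniversalBlock(nn.Module):
    """Universal Transformer sketch: ONE shared block applied n_iters times,
    with small per-iteration learnable parameters (illustrative only)."""
    def __init__(self, dim, n_heads, n_iters):
        super().__init__()
        # A single block whose weights are reused every iteration.
        self.block = nn.TransformerEncoderLayer(dim, n_heads, dim * 4,
                                                batch_first=True)
        self.n_iters = n_iters
        # Per-iteration learnables, analogous to resid_mix / iteration_embed.
        self.resid_mix = nn.Parameter(torch.full((n_iters,), 0.5))
        self.iteration_embed = nn.Parameter(torch.zeros(n_iters, dim))

    def forward(self, x):
        for i in range(self.n_iters):
            h = self.block(x + self.iteration_embed[i])
            m = torch.sigmoid(self.resid_mix[i])
            x = m * h + (1 - m) * x   # learnable residual mix per iteration
        return x

ut = UniversalBlock(dim=32, n_heads=4, n_iters=3)
out = ut(torch.randn(2, 8, 32))
```

Because the block is shared, the parameter count stays roughly constant in `n_iters`; only the tiny per-iteration scalars and embeddings grow.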

Legality Argument

Issue #402 and Issue #677 rulings define illegal TTT as any training pass that updates model state based on val_tokens the model has not already been tested on. This submission satisfies the rules because:

  1. The TTT gradient comes entirely from training-set tokens (a tail slice of the last fineweb_train shard)
  2. Those training tokens are never scored as part of val_bpb
  3. val_tokens are scored exactly once, after all training (including TTT) is complete
  4. No eval-time leakage of val targets into training loss

The argument is structurally identical to the reference PRs #1416 and #1423 cited by @MatoTeziTanka in his review of #1193.
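Loading the fixed window from the shard tail (point 1) can be sketched as follows. This assumes the shard is a flat file of raw uint16 token ids, nanoGPT-style; if the repo's shards carry a header, the slicing would need to skip it:

```python
import os
import tempfile

import numpy as np
import torch

def load_tail_slice(shard_path, n_tokens):
    """Load a fixed window from the TAIL of a training shard.
    Assumes raw uint16 token ids with no header (hypothetical format)."""
    data = np.memmap(shard_path, dtype=np.uint16, mode="r")
    tail = np.array(data[-n_tokens:])       # copy out of the memmap
    return torch.from_numpy(tail.astype(np.int64))

# Demo on a synthetic shard (real shards are fineweb_train_*.bin).
tmp = os.path.join(tempfile.mkdtemp(), "fake_train_shard.bin")
np.arange(1000, dtype=np.uint16).tofile(tmp)
tail_slice = load_tail_slice(tmp, 128)
```

Because main training consumes only each shard's prefix, a tail window like this is disjoint from both the trained-on tokens and fineweb_val_*.bin.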

Reproduction

```
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
VOCAB_SIZE=1024 NUM_ITERS=6 TORCH_COMPILE_DISABLE=1 ITERATIONS=200 \
  TTT_ENABLED=1 TTT_EPOCHS=3 TTT_TRAIN_SLICE_SEQS=128 TTT_LR=0.0005 \
  python3 records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py
```

Hardware Notes

DGX Spark GB10 is approximately 6x slower per step than 8xH100. No torch.compile (Triton/inductor unsupported on aarch64), no flash_attn_interface (using scaled_dot_product_attention fallback). Absolute BPB will be higher than a competition 8xH100 run due to the short 200-step training budget, but the legality story holds across hardware.
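The attention fallback mentioned above typically amounts to a try/except around the flash-attn import. A sketch (not the submission's actual code; the flash_attn_func call is written against the flash-attn 2.x signature and may need adjusting for the installed version):

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # unavailable on aarch64 Spark
    HAVE_FLASH = True
except ImportError:
    HAVE_FLASH = False

def attention(q, k, v):
    """Causal attention; q, k, v are (batch, heads, seq, head_dim)."""
    if HAVE_FLASH:
        # flash-attn expects (batch, seq, heads, head_dim)
        return flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                               v.transpose(1, 2), causal=True).transpose(1, 2)
    # Portable fallback built into PyTorch.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 2, 8, 16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention(q, k, v)
```

With `is_causal=True`, position 0 attends only to itself, so `out[:, :, 0]` equals `v[:, :, 0]` — a quick sanity check that the mask is applied.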

Review Credit

@MatoTeziTanka flagged the original #1193 TTT-on-val issue via The Agora community compliance tracker (https://matotezitanka.github.io/parameter-golf/). This resubmission implements the exact fix he recommended. His unpaid community review work is exactly the kind of standards-enforcement this repo needs given the maintainer bandwidth constraints.

…ce Variant)

Legal-compliant resubmission of PR openai#1193 per @MatoTeziTanka review on 2026-04-11.

Changes vs original PR openai#1193:
- ttt_adapt() function signature takes train_slice_tokens instead of val_tokens
- Call site loads TTT data from the tail of the last fineweb_train_*.bin shard (training data slice, never scored during eval)
- All val_tokens references in TTT removed; val is only scored in the final single-pass evaluation after training + TTT finish
- README documents the legality argument and references PR openai#1416 / openai#1423

Universal Transformer architecture itself is unchanged:
- Single shared block looped N times with per-iteration scale/shift/resid_mix
- 50% sparse-to-dense curriculum
- Implements OpenAI's requested 'Universal transformer' research direction

Final val_bpb pending DGX Spark run completion. Thanks to @MatoTeziTanka for the careful review via The Agora (https://matotezitanka.github.io/parameter-golf/).

Spark run completed on 2026-04-11. Key numbers:
- Model params: 4,546,568
- Pre-quant val_bpb (step 200): 3.2483
- Post-TTT int6 roundtrip val_bpb: 3.4446
- TTT source: fineweb_train_000079.bin (last training shard tail)
- TTT tokens: 131,073 (training data, NOT val_tokens)
- TTT config: 3 epochs AdamW lr=0.0005
- TTT loss curve: 6.15 -> 5.89 -> 5.79
- Artifact size: 1.35 MB (int6+brotli-11)

The legal TTT pattern works: TTT gradient came entirely from training data, val tokens were scored exactly once after all training finished. The explicit log line 'ttt:legal_slice source=fineweb_train_000079.bin tokens=131073 (train data slice, not val_tokens)' confirms the fix.
@dentity007
Author

Spark run completed. Final numbers pushed in commit 8896c0a.

Results

| Metric | Value |
| --- | --- |
| Model params | 4,546,568 |
| Pre-quant val_bpb (step 200) | 3.2483 |
| Post-TTT int6 roundtrip val_bpb | 3.4446 |
| TTT source | fineweb_train_000079.bin (tail of last training shard) |
| TTT tokens | 131,073 (training data, NOT val_tokens) |
| TTT config | 3 epochs AdamW, lr=0.0005 |
| TTT loss curve | 6.15 -> 5.89 -> 5.79 |
| TTT duration | 13.6 seconds |
| Artifact size | 1.35 MB (int6+brotli-11) |
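For readers unfamiliar with the "int6 roundtrip" metric: a minimal symmetric quantize/dequantize sketch, using a single per-tensor scale. The submission's actual scheme (and its brotli-11 packing) may differ, e.g. per-channel scales:

```python
import torch

def int6_roundtrip(w):
    """Quantize to signed int6 and dequantize back (per-tensor scale).
    Hypothetical sketch of the roundtrip behind 'int6 roundtrip val_bpb'."""
    qmax = 31                                   # int6 range is [-32, 31]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -32, 31)
    return q * scale                            # dequantized weights

w = torch.randn(256)
w_hat = int6_roundtrip(w)
max_err = (w - w_hat).abs().max().item()        # bounded by scale / 2
```

Scoring val_bpb on the dequantized weights is what makes the reported number reflect the compressed artifact rather than the full-precision checkpoint.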

Legality Verification

The training log explicitly confirms the fix:

```
ttt:legal_slice source=fineweb_train_000079.bin tokens=131073 (train data slice, not val_tokens)
ttt:start lr=0.0005 momentum=0.9 epochs=3 freeze_blocks=0
ttt_epoch:1/3 loss:6.1531 time:4.1s
ttt_epoch:2/3 loss:5.8870 time:8.5s
ttt_epoch:3/3 loss:5.7944 time:13.0s
ttt:done elapsed=13.6s
```

The source log line proves the TTT gradient came entirely from training data. val_tokens were scored exactly once, after all training (including TTT) finished.
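The "scored exactly once" discipline amounts to a single forward-only pass under `torch.inference_mode()`. A hypothetical sketch of such a scorer (the submission's own `eval_val_sliding()` presumably uses a sliding window; the bytes-per-token figure below is a placeholder, not the competition tokenizer's value):

```python
import math

import torch
import torch.nn as nn

@torch.inference_mode()                         # no gradients can flow here
def eval_val_bpb(model, val_tokens, seq_len, bytes_per_token):
    """One left-to-right scoring pass: val_tokens enter a forward pass only,
    never loss.backward(). Name and signature are hypothetical."""
    n = (val_tokens.numel() - 1) // seq_len * seq_len
    x = val_tokens[:n].view(-1, seq_len)
    y = val_tokens[1:n + 1].view(-1, seq_len)
    logits = model(x)
    nll = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    bits_per_token = nll.item() / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token     # bits per byte

toy = nn.Sequential(nn.Embedding(64, 16), nn.Linear(16, 64))
val_tokens = torch.randint(0, 64, (513,))
bpb = eval_val_bpb(toy, val_tokens, seq_len=16, bytes_per_token=4.0)
```

Running this once, after all training and TTT updates have finished, is the whole eval-side legality story.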

Notes on BPB Numbers

The 3.4446 number is high relative to competition runs because:

  1. DGX Spark GB10 is a single GPU, not 8xH100 (roughly 6x slower per step)
  2. Only 200 training steps instead of 5000+
  3. No torch.compile (Triton unsupported on aarch64)
  4. Universal Transformer with 6 shared-block iterations, 4.5M params

This is a research/non-record submission to document the architecture direction, not a competitive record attempt. The legality story holds across hardware, and the relative ordering between UT-1 (6 iters) and UT-2 (24 iters) in my broader ablation confirmed that more iterations do not help.

Full ablation data across all 7 OpenAI-requested research architectures: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka

Community Review — Non-record: Universal Transformer + Legal Pre-Quant TTT (Training-Slice Variant)

Compliance: LOOKS CLEAN — TTT trains on a held-out training-data slice, never touches val_tokens

What I found in the code (head SHA from PR #1554, file records/track_non_record_16mb/2026-04-11_UniversalTransformer_LegalTTT/train_gpt.py):

The ttt_adapt() function (line 1198) takes train_slice_tokens — a tensor loaded from the tail of the last fineweb_train_*.bin shard (lines 1778–1787). This slice:

  • Was not used during main training (training consumes only the prefix of each shard, up to `iterations` steps)
  • Is never part of fineweb_val_*.bin
  • Is never scored as part of val_bpb

The TTT loop (lines 1244–1274) runs multi-epoch AdamW on this training slice. Validation tokens are scored exactly once, after all training (including TTT) is complete, at line 1862 via eval_val_sliding() under torch.inference_mode().

This is cleanly legal. The TTT gradient path never touches val_tokens. The val_bpb number comes from a single forward-only scoring pass on data the model was never adapted on. The Issue #402 / #677 rules are satisfied because val_tokens are never the subject of loss.backward() at any point.

One citation correction: the PR description and docstring reference PRs #1416 and #1423 as "legal Pre-Quant TTT" implementations. I need to correct my own earlier review on #1193 — I cited those same PRs as legal references, and I was wrong. At their current heads, #1416 and #1423 contain the ILLEGAL flat-epoch ttt_adapt_adamw on val_tokens (despite folder names saying "LegalTTT"). The confirmed legal TTT reference is PR #1413 (dexhunter), which uses the score-first-per-chunk pattern. I posted a correction on Issue #677 on 2026-04-11.

That said — your implementation doesn't need the score-first-per-chunk pattern at all, because you're not training on val_tokens in the first place. Training on a held-out training slice and then scoring val_tokens is just... regular training with a second fine-tuning phase. It's structurally the cleanest approach.

tl;dr: My #1193 review gave you the right implementation advice (move TTT off val_tokens) but cited the wrong precedent PRs. You got the code right anyway. Apologies for the misleading citation — I've since corrected it.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual non-record checks. No compliance flags. This is a clean legal resubmission of the Universal Transformer architecture from #1193.


Reviewed by @MatoTeziTanka via The Agora. Manual code review (not auto-classified). Special attention given because this is a resubmission responding to our prior compliance flag on #1193 — wanted to make sure we didn't lead the author down a bad path. We didn't: the implementation is correct, just the citation needs updating.
