
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315

Merged

cocohearts merged 1 commit into openai:main from
jfprincz:submission/11l-partialrope-lateqat-1.1248
Mar 23, 2026

Conversation

@jfprincz
Contributor

@jfprincz jfprincz commented Mar 21, 2026

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Two new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training. Zero new parameters. |
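
Both additions are zero-parameter, few-line changes. A minimal single-head NumPy sketch (illustrative only; function names, the `(seq, head_dim)` layout, and the rotate-first-16-dims split are assumptions, not the PR's actual code):

```python
import numpy as np

def apply_partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head's dimensions;
    leave the remaining dims untouched (position-free).
    x: (seq, head_dim) for a single head."""
    rot, rest = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

def ln_scale(normed, layer_idx):
    """Damp deeper layers: scale the RMSNorm output by 1/sqrt(layer_idx + 1)."""
    return normed / np.sqrt(layer_idx + 1)
```

Rotation is norm-preserving on the rotated dims, so the unrotated 48 dims are the only part of the head that attends purely on content.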

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
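
The EMA(0.997) carried over from PR #287 is a standard weight average; a pure-Python sketch (the actual implementation details, e.g. update cadence, are not specified here):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a list of scalar weights:
    ema <- decay * ema + (1 - decay) * param."""
    for i, (e, p) in enumerate(zip(ema_params, params)):
        ema_params[i] = decay * e + (1 - decay) * p
    return ema_params
```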

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
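
For reference, a stride-64 sliding-window eval scores each token with deep left context by advancing the window 64 tokens at a time and scoring only the newly covered tokens. A scheduling sketch (window length of 2048 assumed to match the training seq len; the repo's eval code may differ):

```python
def sliding_eval_windows(n_tokens, window=2048, stride=64):
    """Return (ctx_start, ctx_end, n_scored) triples: the first window
    scores all of its tokens; each later window advances by `stride`
    and scores only the last `stride` tokens, so those tokens see up to
    `window - stride` tokens of prior context."""
    wins = []
    end = min(window, n_tokens)
    wins.append((0, end, end))
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        wins.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return wins
```

Smaller strides cost proportionally more forward passes, which is why earlier submissions reported s256 and later ones s64.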

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range: 0.0005 | Submitted: seed 2025
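
The summary figures follow from the three seeds above (0.0005 is the max-minus-min range across seeds, not a statistical variance):

```python
seed_bpb = {2025: 1.1248, 42: 1.1250, 1337: 1.1253}
vals = list(seed_bpb.values())
mean = sum(vals) / len(vals)       # ~1.1250
spread = max(vals) - min(vals)     # ~0.0005
```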

Run command

```
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note on Late QAT

The submitted code includes a Late QAT flag (LATE_QAT=1) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that torch.compile constant-folds the CastedLinear._qat_enabled class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT had no effect on the results. The score is driven entirely by Partial RoPE and LN Scale.
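
The failure mode is that a Python class attribute read during tracing gets baked in as a constant, so flipping it later has no effect on the compiled graph. For reference, the STE int6 fake-quantization the flag was meant to enable looks roughly like this (a NumPy sketch with a hypothetical helper name, not the repo's `CastedLinear` code; forward only, since the straight-through estimator just passes gradients through the rounding):

```python
import numpy as np

def fake_quant_int6_ste(w):
    """Symmetric per-tensor int6 fake-quantization:
    map weights to the 63 levels q/scale with q in [-31, 31],
    then dequantize so downstream math sees quantized values."""
    qmax = 31                                     # 2**(6-1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

A flag that must survive `torch.compile` would need to be read inside the graph (e.g. a tensor buffer) rather than as a Python class attribute evaluated at trace time.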

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Previous script preserved as previous_train_gpt.py. Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315 train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Update run script to use PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
@jfprincz jfprincz force-pushed the submission/11l-partialrope-lateqat-1.1248 branch from dfb05a5 to 2951651 Compare March 21, 2026 21:01
@jfprincz jfprincz changed the title Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248) Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
Merged records from all experiment branches into one working branch. Updated CLAUDE.md with current competitive landscape and next priorities. Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace. Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request Mar 21, 2026
13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…e-lateqat-1.1248 Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
wfproc added a commit to wfproc/parameter-golf that referenced this pull request Mar 28, 2026
Research contribution: confirmed torch.compile constant-folds Late QAT in openai#315-derived code, tested tensor-scale STE fix, swept 7 untried techniques from recent papers. All negative on 1xH100. Includes anti-layer diagnostic, prune-then-quantize, and spectral SVD compression implementations as env var toggles.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
From audit subagent's gap analysis. Both are zero-param model-arch wins in multiple top SOTA records. Patch 19 — USE_PARTIAL_ROPE + PARTIAL_ROPE_DIMS=16: Rotate only the first 16 of 64 head dims via RoPE. Leave the remaining 48 dims unrotated (positional-bias-free). The model learns to use rotated dims for relative-position attention and unrotated dims for content-only attention. From PR openai#1019 (openai#1 SOTA at 1.11473 BPB) and PR openai#315 (1.1248 BPB). Implementation: modify apply_rotary_emb to optionally split x into rotated/plain halves. Idempotent via PARTIAL_ROPE_MARKER. Patch 20 — USE_LN_SCALE: Scale RMSNorm output by 1/sqrt(layer_idx+1) at each block. Damps the contribution of deeper layers. Zero params. From PRs openai#1019 + openai#315. Implementation: add layer_idx to Block.__init__, store as self._layer_idx, apply the 1/sqrt scaling in forward. The hard part is wiring layer_idx through GPT.__init__'s Block creation list. Idempotent via LN_SCALE_MARKER. NOTE: Patch 20 stores _layer_idx but the apply side (multiplying the normed output by the scale factor) is NOT yet wired. That's a follow-up in the next iteration. This patch is the prep work to make USE_LN_SCALE possible without breaking the current Block.forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new EL multi-seed experiments to confirm: - EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7) - EL6 with L5 weights (0.15/0.20/0.15) — new combination Removed 15 dead/falsified configs that wasted cycle 2 compute: EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0. Also captured EMA(0.997) canonical spec from 6 merged records (openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship because EMA only affects final val_bpb (not loop train_loss) and training-loop anchoring is risky without reading train_gpt.py. Queue now cycles in ~100 min (vs 185 min) leaving more compute for the EL family expansion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago). CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant. SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's eval-time, joins the H100 escalation bundle category. H100 escalation candidate updated: NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ) OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ) Need CS2 cycle 2+3 for n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
