
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315

Merged

cocohearts merged 1 commit into openai:main from
jfprincz:submission/11l-partialrope-lateqat-1.1248
Mar 23, 2026

Conversation

@jfprincz
Contributor

@jfprincz jfprincz commented Mar 21, 2026

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Two new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training. Zero new parameters. |
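
Both additions are zero-parameter, few-line changes. A minimal single-head NumPy sketch (illustrative only; function names, the `(seq, head_dim)` layout, and the rotate-first-16-dims split are assumptions, not the PR's actual code):

```python
import numpy as np

def apply_partial_rope(x, positions, rope_dims=16, base=10000.0):
    """Rotate only the first `rope_dims` of each head's dimensions;
    leave the remaining dims untouched (position-free).
    x: (seq, head_dim) for a single head."""
    rot, rest = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

def ln_scale(normed, layer_idx):
    """Damp deeper layers: scale the RMSNorm output by 1/sqrt(layer_idx + 1)."""
    return normed / np.sqrt(layer_idx + 1)
```

Rotation is norm-preserving on the rotated dims, so the unrotated 48 dims are the only part of the head that attends purely on content.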

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
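
The EMA(0.997) carried over from PR #287 is a standard weight average; a pure-Python sketch (the actual implementation details, e.g. update cadence, are not specified here):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a list of scalar weights:
    ema <- decay * ema + (1 - decay) * param."""
    for i, (e, p) in enumerate(zip(ema_params, params)):
        ema_params[i] = decay * e + (1 - decay) * p
    return ema_params
```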

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
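
For reference, a stride-64 sliding-window eval scores each token with deep left context by advancing the window 64 tokens at a time and scoring only the newly covered tokens. A scheduling sketch (window length of 2048 assumed to match the training seq len; the repo's eval code may differ):

```python
def sliding_eval_windows(n_tokens, window=2048, stride=64):
    """Return (ctx_start, ctx_end, n_scored) triples: the first window
    scores all of its tokens; each later window advances by `stride`
    and scores only the last `stride` tokens, so those tokens see up to
    `window - stride` tokens of prior context."""
    wins = []
    end = min(window, n_tokens)
    wins.append((0, end, end))
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        wins.append((max(0, new_end - window), new_end, new_end - end))
        end = new_end
    return wins
```

Smaller strides cost proportionally more forward passes, which is why earlier submissions reported s256 and later ones s64.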

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range: 0.0005 | Submitted: seed 2025
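
The summary figures follow from the three seeds above (0.0005 is the max-minus-min range across seeds, not a statistical variance):

```python
seed_bpb = {2025: 1.1248, 42: 1.1250, 1337: 1.1253}
vals = list(seed_bpb.values())
mean = sum(vals) / len(vals)       # ~1.1250
spread = max(vals) - min(vals)     # ~0.0005
```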

Run command

```
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note on Late QAT

The submitted code includes a Late QAT flag (LATE_QAT=1) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that torch.compile constant-folds the CastedLinear._qat_enabled class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT had no effect on the results. The score is driven entirely by Partial RoPE and LN Scale.
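
The failure mode is that a Python class attribute read during tracing gets baked in as a constant, so flipping it later has no effect on the compiled graph. For reference, the STE int6 fake-quantization the flag was meant to enable looks roughly like this (a NumPy sketch with a hypothetical helper name, not the repo's `CastedLinear` code; forward only, since the straight-through estimator just passes gradients through the rounding):

```python
import numpy as np

def fake_quant_int6_ste(w):
    """Symmetric per-tensor int6 fake-quantization:
    map weights to the 63 levels q/scale with q in [-31, 31],
    then dequantize so downstream math sees quantized values."""
    qmax = 31                                     # 2**(6-1) - 1
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

A flag that must survive `torch.compile` would need to be read inside the graph (e.g. a tensor buffer) rather than as a Python class attribute evaluated at trace time.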

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Previous script preserved as previous_train_gpt.py. Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315 train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Update run script to use PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
@jfprincz jfprincz force-pushed the submission/11l-partialrope-lateqat-1.1248 branch from dfb05a5 to 2951651 Compare March 21, 2026 21:01
@jfprincz jfprincz changed the title Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248) Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
Merged records from all experiment branches into one working branch. Updated CLAUDE.md with current competitive landscape and next priorities. Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace. Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request Mar 21, 2026
13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…e-lateqat-1.1248 Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
wfproc added a commit to wfproc/parameter-golf that referenced this pull request Mar 28, 2026
Research contribution: confirmed torch.compile constant-folds Late QAT in openai#315-derived code, tested tensor-scale STE fix, swept 7 untried techniques from recent papers. All negative on 1xH100. Includes anti-layer diagnostic, prune-then-quantize, and spectral SVD compression implementations as env var toggles.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
From audit subagent's gap analysis. Both are zero-param model-arch wins in multiple top SOTA records. Patch 19 — USE_PARTIAL_ROPE + PARTIAL_ROPE_DIMS=16: Rotate only the first 16 of 64 head dims via RoPE. Leave the remaining 48 dims unrotated (positional-bias-free). The model learns to use rotated dims for relative-position attention and unrotated dims for content-only attention. From PR openai#1019 (openai#1 SOTA at 1.11473 BPB) and PR openai#315 (1.1248 BPB). Implementation: modify apply_rotary_emb to optionally split x into rotated/plain halves. Idempotent via PARTIAL_ROPE_MARKER. Patch 20 — USE_LN_SCALE: Scale RMSNorm output by 1/sqrt(layer_idx+1) at each block. Damps the contribution of deeper layers. Zero params. From PRs openai#1019 + openai#315. Implementation: add layer_idx to Block.__init__, store as self._layer_idx, apply the 1/sqrt scaling in forward. The hard part is wiring layer_idx through GPT.__init__'s Block creation list. Idempotent via LN_SCALE_MARKER. NOTE: Patch 20 stores _layer_idx but the apply side (multiplying the normed output by the scale factor) is NOT yet wired. That's a follow-up in the next iteration. This patch is the prep work to make USE_LN_SCALE possible without breaking the current Block.forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new EL multi-seed experiments to confirm: - EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7) - EL6 with L5 weights (0.15/0.20/0.15) — new combination Removed 15 dead/falsified configs that wasted cycle 2 compute: EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0. Also captured EMA(0.997) canonical spec from 6 merged records (openai#287, openai#315, openai#414, openai#1019, openai#1099) — DEFERRED actual Patch 17 ship because EMA only affects final val_bpb (not loop train_loss) and training-loop anchoring is risky without reading train_gpt.py. Queue now cycles in ~100 min (vs 185 min) leaving more compute for the EL family expansion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago). CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port — too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant. SLOT (Score-First TTT) is the openai#2 missing (PR openai#549, ~100 LOC) but it's eval-time, joins the H100 escalation bundle category. H100 escalation candidate updated: NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ) OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ) Need CS2 cycle 2+3 for n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
