Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) #315
Merged
cocohearts merged 1 commit into openai:main on Mar 23, 2026
Conversation
> yes! great job, this is sort of where I went too
bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on a cheap 1xH100
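The dispatch order this commit describes can be sketched as a small chooser. The module names `flash_attn_interface` (FA3) and `flash_attn` (FA2) are assumptions about the usual FlashAttention distributions, not verified against this repo:

```python
# Hedged sketch of an FA3 > FA2 > SDPA backend chooser, per the commit above.
# Import paths for FA3/FA2 are assumptions about the flash-attn packages.
import importlib.util

def pick_attention_backend() -> str:
    """Prefer FlashAttention-3, then FlashAttention-2, else PyTorch SDPA."""
    if importlib.util.find_spec("flash_attn_interface") is not None:
        return "fa3"   # FA3: traceable, usable with fullgraph=True
    if importlib.util.find_spec("flash_attn") is not None:
        return "fa2"   # FA2: wrap in @torch.compiler.disable, fullgraph=False
    return "sdpa"      # always-available torch.nn.functional fallback
```

The attention module would branch on the returned string once at init, so the per-step forward pays no dispatch cost.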
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Previous script preserved as previous_train_gpt.py. Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner
Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315 train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB). Update run script to use PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
152334H reviewed Mar 21, 2026
records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/train_gpt.py
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
dfb05a5 to 2951651
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
Merged records from all experiment branches into one working branch. Updated CLAUDE.md with current competitive landscape and next priorities. Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace. Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request Mar 21, 2026
13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD, L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction). Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
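The EMA line above matches PyTorch's own multi-tensor EMA pattern (the same `_foreach_lerp_` call used by `torch.optim.swa_utils`). A minimal sketch, with illustrative identifiers rather than the submission's:

```python
# Sketch of EMA as described above: decay=0.997, applied every 4 steps
# with the fused multi-tensor lerp. lerp_(end, w): self <- self + w*(end - self),
# i.e. ema <- decay*ema + (1 - decay)*param.
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.997):
    torch._foreach_lerp_(list(ema_params), list(model_params), 1.0 - decay)

# illustrative use inside the training loop:
# if step % 4 == 0:
#     update_ema(ema_model.parameters(), model.parameters())
```

Applying the update only every 4th step keeps the per-step overhead near zero while the effective decay per applied update stays at 0.997.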
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT
Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…e-lateqat-1.1248 Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
wfproc added a commit to wfproc/parameter-golf that referenced this pull request Mar 28, 2026
Research contribution: confirmed torch.compile constant-folds Late QAT in openai#315-derived code, tested tensor-scale STE fix, swept 7 untried techniques from recent papers. All negative on 1xH100. Includes anti-layer diagnostic, prune-then-quantize, and spectral SVD compression implementations as env var toggles.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
From audit subagent's gap analysis. Both are zero-param model-arch wins in multiple top SOTA records.

Patch 19 (USE_PARTIAL_ROPE + PARTIAL_ROPE_DIMS=16): Rotate only the first 16 of 64 head dims via RoPE; leave the remaining 48 dims unrotated (positional-bias-free). The model learns to use rotated dims for relative-position attention and unrotated dims for content-only attention. From PR openai#1019 (openai#1 SOTA at 1.11473 BPB) and PR openai#315 (1.1248 BPB). Implementation: modify apply_rotary_emb to optionally split x into rotated/plain halves. Idempotent via PARTIAL_ROPE_MARKER.

Patch 20 (USE_LN_SCALE): Scale RMSNorm output by 1/sqrt(layer_idx+1) at each block, damping the contribution of deeper layers. Zero params. From PRs openai#1019 + openai#315. Implementation: add layer_idx to Block.__init__, store as self._layer_idx, apply the 1/sqrt scaling in forward. The hard part is wiring layer_idx through GPT.__init__'s Block creation list. Idempotent via LN_SCALE_MARKER.

NOTE: Patch 20 stores _layer_idx but the apply side (multiplying the normed output by the scale factor) is NOT yet wired; that's a follow-up in the next iteration. This patch is the prep work to make USE_LN_SCALE possible without breaking the current Block.forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
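The Patch 19 split can be sketched as a standalone helper. This assumes head_dim=64 and a split-halves rotary convention; the function name and signature are illustrative, not the repo's actual apply_rotary_emb:

```python
# Sketch of partial RoPE: rotate only the first rot_dims of each head's
# dims, pass the rest through untouched. Convention (split halves rather
# than interleaved pairs) is an assumption.
import torch

def apply_rotary_emb_partial(x, cos, sin, rot_dims=16):
    # x: (..., head_dim); cos/sin: broadcastable to (..., rot_dims // 2)
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)              # two halves of the rotated slice
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)  # remaining 48 dims: no position bias
```

With rot_dims equal to head_dim this degenerates to ordinary RoPE, which is why it drops into an existing apply_rotary_emb as an optional split.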
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
EL2 cycle-2 = 3.2742 (only +0.0008 above champion 3.2734) reversed the audit fire openai#1 verdict that EngramLite was falsified. Adding 4 new EL multi-seed experiments to confirm:
- EL3 (seed 1337), EL4 (seed 999), EL5 (seed 7)
- EL6 with L5 weights (0.15/0.20/0.15), a new combination

Removed 15 dead/falsified configs that wasted cycle-2 compute: EA*, BG*, NG*, TH*, MEGA, MTP0/2/3, MTP1_seed999, PR2/3, EL0.

Also captured the EMA(0.997) canonical spec from 6 merged records (openai#287, openai#315, openai#414, openai#1019, openai#1099); DEFERRED the actual Patch 17 ship because EMA only affects final val_bpb (not loop train_loss) and training-loop anchoring is risky without reading train_gpt.py.

Queue now cycles in ~100 min (vs 185 min), leaving more compute for the EL family expansion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ntified as top missing technique
Patches 15/16/21 + NEW Patch 20 USE_COPRIME_STRIDE all uncontested in 150+ open + 20 closed PRs (7 consecutive audits for the original 3, first confirmation for Patch 20 just shipped 3h ago).

CRITICAL FINDING: XSA (Cross-Sequence Attention) is in 4+ MERGED records (PR openai#1019, openai#287, openai#315, openai#265, latest openai#1099) and we have ZERO attention-mask variants. Most-validated missing technique. ~200 LOC moderate port, too big for a single research fire but worth a focused 30-45 min investigation if we can find a minimal variant.

SLOT (Score-First TTT) is the openai#2 missing technique (PR openai#549, ~100 LOC) but it's eval-time, so it joins the H100 escalation bundle category.

H100 escalation candidate updated:
- NEW: CHAMP_L4 + COPRIME_STRIDE + EL + (EMA + Tilt + INT6 GPTQ)
- OLD: CHAMP_L4 + EL + (EMA + Tilt + INT6 GPTQ)

Need CS2 cycle 2+3 for n=3 mean confirmation before escalating. PR openai#1430 still OPEN, 0 comments, no comp owner activity for 16h+. Spend ~$4.00/$36 (11.1%). Pod healthy at 7h 50min uptime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s
Progress from prior submissions
Two new techniques on top of PR #287's 11-layer stack.
Key additions over PR #287
- Partial RoPE (16/64): only the first 16 of 64 head dims receive rotary position encoding; the remaining 48 stay position-free for content-only attention.
- LN Scale: each block's normalized output is scaled by 1/sqrt(layer_idx+1), damping deeper layers at zero parameter cost.

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
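Of the two techniques named in the title, LN Scale is the cheaper to sketch: one scalar per block, computed from the layer index, no learned parameters. Everything below (function names, the hand-rolled RMSNorm, eps choice) is illustrative rather than the submission's train_gpt.py:

```python
# Sketch of LN Scale: RMSNorm followed by 1/sqrt(layer_idx + 1) damping.
import math
import torch

def rms_norm(x, eps=1e-6):
    # parameter-free RMSNorm over the last dim (illustrative eps)
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def ln_scale_norm(x, layer_idx):
    # layer 0 -> x1.0, layer 3 -> x0.5, layer 10 -> ~x0.30
    return rms_norm(x) / math.sqrt(layer_idx + 1)
```

In a Block, `layer_idx` would be passed to `__init__` and the scale precomputed once, so the forward pays a single elementwise multiply.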
Results
Reproducibility (3 seeds)
Mean: 1.1250 | Variance: 0.0005 | Submitted: seed 2025
Run command
Note on Late QAT
The submitted code includes a Late QAT flag (`LATE_QAT=1`) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that `torch.compile` constant-folds the `CastedLinear._qat_enabled` class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT had no effect on the results; the score is driven entirely by Partial RoPE and LN Scale.
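For illustration only (this is not the submission's code or its fix): one common way to keep such a branch live under `torch.compile` is to carry the toggle as a tensor, since tracing tends to specialize on Python bools and class attributes but treats tensor values as runtime data. A sketch with a hypothetical STE int6 helper:

```python
# Sketch (illustrative, not the submission's fix): a tensor-valued QAT
# toggle that torch.compile cannot bake into the trace as a constant.
import torch

def fake_quant_int6_ste(w):
    # straight-through int6 fake quantization; scale handling is illustrative
    scale = w.abs().amax().clamp_min(1e-8) / 31.0
    q = (w / scale).round().clamp(-32, 31) * scale
    return w + (q - w).detach()   # STE: quantized forward, identity backward

def maybe_quant(w, qat_enabled: torch.Tensor):
    # qat_enabled: 0-dim bool tensor (e.g. a registered buffer);
    # torch.where keeps both branches in the traced graph
    return torch.where(qat_enabled, fake_quant_int6_ste(w), w)
```

The cost of this pattern is that both branches execute every step; whether that trade-off is acceptable here is exactly the kind of thing the tensor-scale STE fix mentioned in the timeline above would have to measure.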