Non-Record: JEPA-NTP Auxiliary Losses (Negative Result) #1556

Open
sidhanth97 wants to merge 1 commit into openai:main from sidhanth97:submission/jepa-ntp-experiments

@sidhanth97

This PR adds a non-record submission under:

records/track_non_record_16mb/2026-04-11_JEPA_NTP_Auxiliary_Losses_Negative_Result

Negative result: JEPA-style auxiliary losses (spectral variance floor + cosine-MSE latent prediction from LeWM/LeWorldModel) do not improve next-token prediction in the parameter golf regime.

Results (1 epoch, 2xRTX PRO 6000 Blackwell, torch.compile enabled)

| Experiment | val_bpb (post-quant) | Throughput | Int8+zlib size |
|---|---|---|---|
| Baseline | 1.4326 | 2,119K tok/s | 9.90 MB |
| JEPA exp4 (spectral + cosine-MSE, layers 2-5) | 1.4352 (+0.003) | 1,703K tok/s | 9.90 MB |
| MQA + Value Embeds | 1.4439 (+0.011) | 2,201K tok/s | 9.68 MB |
| MQA + VE + 3x MLP | 1.4364 (+0.004) | 1,936K tok/s | 12.12 MB |
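
For readers unfamiliar with the "Int8+zlib" column, the following is a minimal sketch of how such a size could be measured. It assumes symmetric per-tensor int8 quantization, zlib level 9, and MB meaning 10^6 bytes; the PR does not state the actual scheme, and `int8_zlib_size_bytes` is a hypothetical helper, not a function from this submission.

```python
import zlib

import numpy as np


def int8_zlib_size_bytes(tensors, level=9):
    """Size in bytes of zlib-compressed, int8-quantized weights.

    Assumption: symmetric per-tensor quantization with one float32 scale
    stored per tensor. The submission's real scheme may differ.
    """
    payload = bytearray()
    for w in tensors:
        w = np.asarray(w, dtype=np.float32)
        max_abs = float(np.max(np.abs(w))) if w.size else 0.0
        # Map the largest magnitude to 127; use scale 1.0 for all-zero tensors.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        payload += np.float32(scale).tobytes()
        payload += q.tobytes()
    return len(zlib.compress(bytes(payload), level))


# Illustrative random weights, not the model from this PR.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((512, 512)).astype(np.float32) for _ in range(4)]
size = int8_zlib_size_bytes(weights)
raw_fp32_bytes = sum(w.nbytes for w in weights)
print(f"int8+zlib: {size / 1e6:.2f} MB (fp32 raw: {raw_fp32_bytes / 1e6:.2f} MB)")
```

Under this reading, auxiliary losses that add no inference-time parameters would leave the compressed size close to the baseline, which is consistent with the baseline and JEPA exp4 rows both reporting 9.90 MB.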

Key findings

  • Spectral floor loss mechanically prevents dimensional collapse (effective rank rises from 424 to 445 out of 512), but this doesn't translate to better language modeling at 17M params
  • Cosine-MSE predictor was essentially inert (loss values 0.001-0.003, negligible gradient contribution)
  • An initial apparent improvement was traced to a torch.compile confound (comparing compiled JEPA vs uncompiled baseline)
  • MQA at this scale is harmful -- 4 KV heads are too few to spare one
  • Includes full experimental framework with WandB diagnostics for reproducibility
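
The quantities named in the findings above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in `losses/` or `metrics/`: effective rank is taken as the exponential of the entropy of the normalized singular values, the spectral floor as a squared hinge on covariance eigenvalues below a threshold, and cosine-MSE as the squared distance between L2-normalized vectors. All function names and the `floor` parameter are hypothetical.

```python
import numpy as np


def effective_rank(h, eps=1e-12):
    """exp(entropy) of the normalized singular values of centered h, shape (n, d)."""
    h = h - h.mean(axis=0, keepdims=True)
    s = np.linalg.svd(h, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))


def spectral_floor_loss(h, floor=1.0):
    """Squared hinge on covariance eigenvalues that fall below `floor` (assumed form)."""
    h = h - h.mean(axis=0, keepdims=True)
    cov = h.T @ h / (h.shape[0] - 1)
    eig = np.linalg.eigvalsh(cov)
    return float(np.mean(np.maximum(0.0, floor - eig) ** 2))


def cosine_mse_loss(pred, target, eps=1e-8):
    """Mean squared distance between L2-normalized rows (assumed form).

    For unit vectors this equals 2 * (1 - cosine similarity).
    """
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum((p - t) ** 2, axis=-1)))


# Synthetic latents: one full-rank batch, one collapsed onto 8 directions.
rng = np.random.default_rng(0)
full = rng.standard_normal((4096, 64))
collapsed = rng.standard_normal((4096, 8)) @ rng.standard_normal((8, 64))

rank_full = effective_rank(full)
rank_collapsed = effective_rank(collapsed)
loss_full = spectral_floor_loss(full)
loss_collapsed = spectral_floor_loss(collapsed)
print(f"effective rank: full={rank_full:.1f}, collapsed={rank_collapsed:.1f}")
print(f"spectral floor loss: full={loss_full:.4f}, collapsed={loss_collapsed:.4f}")
```

In this sketch the floor penalty is large only when many covariance eigenvalues are near zero, so it pushes a collapsed representation toward higher effective rank. A predictor whose normalized outputs already sit close to their targets yields a cosine-MSE near zero and therefore little gradient, which is one way to read the "essentially inert" observation.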

Submission contents

  • README.md -- detailed results, methodology, and analysis
  • submission.json -- leaderboard metadata
  • train_jepa_ntp.py -- JEPA training script with spectral + cosine-MSE losses
  • train_modded.py -- MQA + Value Embeddings training script
  • config.py -- experiment configurations
  • losses/ -- spectral variance floor, cosine-MSE loss implementations
  • metrics/ -- effective rank, singular spectrum, latent curvature diagnostics

JEPA-style auxiliary losses (spectral variance floor + cosine-MSE latent prediction from LeWM) do not improve next-token prediction in the parameter golf regime. Baseline val_bpb 1.4326 beats all variants. Includes full experimental framework with losses, metrics, and WandB diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sidhanth97 changed the title from "Add non-record 16MB submission: JEPA-NTP Auxiliary Losses (Negative Result)" to "Non-Record: JEPA-NTP Auxiliary Losses (Negative Result)" on Apr 12, 2026