With technique: Slightly faster
Community Review — (Nonrecord) Async Prefetching

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

IMPORT_FAIL — AttributeError: 'NoneType' object has no attribute 'dict'

A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep: […]

Recommendation: Could you run […]? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet, because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — AttributeError: 'NoneType' object has no attribute 'dict'. Classification via […]

Retraction — this IMPORT_FAIL was a Python 3.10 […]
Asynchronous Prefetching — submission notes
1191 (with technique) vs 1137 (default) steps in 600s on local compute
Key changes
Same model, optimizer, data layout, and training math as the baseline. This is a general-purpose rework that could apply to most other approaches for a slight speed boost: overlap CPU data prep and host→device copies with GPU work so the GPU spends less time idle.
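The real `PrefetchingDistributedTokenLoader` is more involved (pinned buffers, distributed sharding); the overlap structure alone can be sketched with a bounded queue and a worker thread that prepares the next batch while the consumer (the training loop) is busy. All names here (`prepare_batch`, `PrefetchingLoader`, `depth`) are illustrative, not the script's API:

```python
import queue
import threading

def prepare_batch(step: int) -> list[int]:
    # Placeholder for the real CPU work: reading token shards and filling
    # a pinned CPU tensor for the upcoming step.
    return [step, step + 1]

class PrefetchingLoader:
    """Minimal double-buffered prefetcher: a background thread builds the
    next batch while the caller consumes the current one."""

    def __init__(self, num_steps: int, depth: int = 1):
        # Bounded queue: at most `depth` prepared batches in flight.
        self._q: queue.Queue = queue.Queue(maxsize=depth)
        self._thread = threading.Thread(
            target=self._worker, args=(num_steps,), daemon=True
        )
        self._thread.start()

    def _worker(self, num_steps: int) -> None:
        for step in range(num_steps):
            self._q.put(prepare_batch(step))  # blocks while the buffer is full
        self._q.put(None)  # sentinel: no more batches

    def __iter__(self):
        while (batch := self._q.get()) is not None:
            yield batch

# While the loop body runs (GPU compute in the real script), the worker
# thread is already preparing the following batch.
batches = list(PrefetchingLoader(num_steps=3))
```

The bounded `maxsize` is the important design choice: it caps memory held by prepared-but-unconsumed batches, so the worker stays exactly one step ahead instead of racing through the dataset.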
- Async prefetch in the improved script (`train_gpt_og_linux.py`): a prefetching loader (`PrefetchingDistributedTokenLoader`) builds the next pinned CPU batch while the GPU runs the current step. Primary win: CPU work overlaps GPU compute (not GPU-side double-buffering of H2D vs forward).
- Separate H2D copy stream (`TRAIN_COPY_STREAM`; off when timing diagnostics are on). Transfers use pinned memory; the training path still waits for that step's H2D before forward (`wait_stream`).
- `VAL_BYTECOUNT_DEVICE=cpu` moves BPB byte counting off the GPU relative to the original (set it to `cuda` to mirror the baseline's GPU LUT math).

Diagnostics

To measure how much time this actually saves, I added `TRAINING_TIMING_BREAKDOWN` (batch CPU vs H2D vs FWD/BWD/opt vs val; adds syncs). When enabled, lines log every `TRAINING_TIMING_EVERY` steps (default 200) and for the first 10 steps. Extra logs: train/val I/O mode, `val_stage_time_ms`, and the train vs val wall-time split. `VAL_BYTECOUNT_DEVICE` defaults to `cpu` in the improved script (not an extra flag you must set); use `cuda` if you want validation byte math on the GPU, like the original.

Optional

`VAL_PROGRESS_LOG_EVERY` (default 0): set to a positive value to log per-batch validation progress (`val_progress: ...`).

Defaults & toggles

Overlap features are on by default (`TRAIN_PREFETCH`, `TRAIN_COPY_STREAM`, `VAL_PREFETCH`, `VAL_COPY_STREAM`, etc.) and can be turned off via env vars if needed. `TRAINING_TIMING_BREAKDOWN` defaults to 0, so its output is not displayed. Prefetch/overlap are automatically disabled when `TRAINING_TIMING_BREAKDOWN=1` so the timings stay interpretable.

Idea
Prefetch training and validation batches asynchronously and overlap CPU ↔ GPU transfers with compute, to minimize pipeline bubbles under a fixed wall-clock budget.
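The transfer-overlap half of this idea reduces to a small pattern: enqueue the H2D copy on a side stream, then make the default stream wait on it before the batch is used. The helper below is a hypothetical sketch, not the script's actual function (the real script additionally gates this behind `TRAIN_COPY_STREAM`); on a machine without CUDA it degrades to a plain synchronous copy:

```python
import torch

def to_device_overlapped(batch_cpu: torch.Tensor,
                         device: torch.device,
                         copy_stream=None) -> torch.Tensor:
    """Copy a CPU batch to `device`, optionally on a dedicated copy stream.

    With `copy_stream` set (CUDA only), the copy can run concurrently with
    kernels still queued on the default stream; the default stream then
    waits on the copy stream before the batch is consumed (the
    `wait_stream` pattern). Pinned source memory is what makes the
    `non_blocking=True` copy truly asynchronous.
    """
    if copy_stream is not None and device.type == "cuda":
        with torch.cuda.stream(copy_stream):          # enqueue on side stream
            batch_dev = batch_cpu.to(device, non_blocking=True)
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        return batch_dev
    return batch_cpu.to(device)  # synchronous fallback (CPU, or overlap off)

# CUDA usage (sketch):
#   stream = torch.cuda.Stream()
#   x = to_device_overlapped(pinned_batch, torch.device("cuda"), stream)
```

Note the ordering guarantee: `wait_stream` only orders GPU work, so the training path still sees the completed copy before forward, matching the "still waits for that step's H2D before forward" behavior described above.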
This is an intuitive rework I came up with; it could help submissions with real research and architectural advancements place slightly higher.
Why this may be unimpactful in some cases
With `TRAINING_TIMING_BREAKDOWN=1`, early-step lines look like this (same hardware/config as above; `grad_accum_steps=8`; batch/forward/backward numbers are per-micro averages): […]

How to read this: `batch_cpu_ms` and `batch_h2d_ms` are ~0.3 ms per micro-step; `forward_ms` and `backward_ms` are ~30 ms and ~65 ms per micro-step. Scaled by 8 micro-steps, batch prep + H2D is on the order of ~5 ms per optimizer step, while forward + backward + optimizer is on the order of 800+ ms. So data movement is a tiny slice of the step, and overlapping it cannot move wall-clock time much when the GPU is already busy with compute for almost the whole step.

Caveat: on a much faster GPU (or a smaller model / larger batch, so steps are shorter), the same CPU + H2D work could become a larger fraction of the step, and prefetch or val overlap might show up more in profiles. The breakdown above is not universal; it only shows why the optimization can be a no-op when compute is the bottleneck.
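A back-of-envelope Amdahl-style bound makes the "tiny slice" point concrete. Plugging in the per-micro numbers above (~0.3 ms CPU prep + ~0.3 ms H2D vs ~30 ms forward and ~65 ms backward, over 8 micro-steps), even perfect overlap of all data movement caps the speedup well under 1%:

```python
# Per-micro-step timings taken from the breakdown above (milliseconds).
batch_cpu_ms, batch_h2d_ms = 0.3, 0.3
forward_ms, backward_ms = 30.0, 65.0
micros = 8  # grad_accum_steps

data_ms = (batch_cpu_ms + batch_h2d_ms) * micros   # ~4.8 ms per optimizer step
compute_ms = (forward_ms + backward_ms) * micros   # ~760 ms, before optimizer time

# Best case: every data millisecond hides behind compute (perfect overlap),
# so the step shrinks from data + compute down to compute alone.
step_serial_ms = data_ms + compute_ms
max_speedup = step_serial_ms / compute_ms

print(f"data share of step: {data_ms / step_serial_ms:.2%}")
print(f"max speedup from perfect overlap: {max_speedup:.4f}x")
```

This ignores optimizer time, which only makes the bound tighter; it also shows why the same technique can matter on a setup where `compute_ms` shrinks while `data_ms` does not.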