Single H100 10 min 16 MB < 1.24 bpb #1559
Open
adityasasidhar wants to merge 1 commit into openai:main
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Hey,
Single-H100 run -> final sliding-window val_bpb: 1.2498
This is my second PR, and it moves me closer to my actual goal of reaching the naive baseline bpb on a single H100.
Here's what I have found:
Also, I'm well under the size limit, with a submission size of 15.2 MB; in the future I'll use this headroom for extra improvements without computational overhead.
What Changed vs Base
`train_gpt.py`

Model and training setup
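The model changes below include attention working in `[batch, seq, heads, dim]` with RoPE applied to only `rope_dims` channels per head. A minimal NumPy sketch of that partial rotation (my own illustration, not the PR's code; `apply_partial_rope` is a hypothetical name):

```python
import numpy as np

def apply_partial_rope(x: np.ndarray, rope_dims: int = 32, base: float = 10000.0) -> np.ndarray:
    """Rotate only the first `rope_dims` channels of each head; `x` is
    [batch, seq, heads, head_dim], matching the internal layout described in this PR."""
    t = x.shape[1]
    rot, keep = x[..., :rope_dims], x[..., rope_dims:]  # rotated vs pass-through channels
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # [half] rotation frequencies
    angles = np.arange(t)[:, None] * inv_freq[None, :]  # [seq, half] position * frequency
    cos = np.cos(angles)[None, :, None, :]              # broadcast over batch and heads
    sin = np.sin(angles)[None, :, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, keep], axis=-1)     # unrotated tail untouched
```

Leaving `head_dim - rope_dims` channels unrotated keeps some position-independent capacity per head while still giving relative-position structure to the rotated part.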
- 9 layers to 8
- 1024 to 2048
- 2x to 3x
- 8 query heads and 4 KV heads
- `rope_dims=32`
- `qk_gain_init` changed from the base script's 1.5
- 20000 to 35000
- Added `xsa_last_n`, with XSA enabled on the final 2 layers in this run
- Attention works in `[batch, seq, heads, dim]` internally and uses the `flash_attn` 2 interface when available; future work needs FlashAttention 3
- RoPE rotates `rope_dims` channels and leaves the rest of each head unrotated
- `eps=1e-6`

Optimization and schedule
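Two of the schedule knobs below, `warmdown_last_frac` and `qat_last_frac`, gate behavior on the fraction of the wallclock budget consumed rather than on an iteration count. A hedged sketch of how that could look (function names and the 0.3 default are my illustration, not the PR's actual code):

```python
def lr_scale(elapsed_s: float, budget_s: float, warmdown_last_frac: float = 0.3) -> float:
    """Hold the learning rate at full scale, then ramp linearly to zero over
    the last `warmdown_last_frac` of the wallclock budget."""
    frac = min(elapsed_s / budget_s, 1.0)
    hold = 1.0 - warmdown_last_frac
    if frac <= hold:
        return 1.0
    return max(0.0, (1.0 - frac) / warmdown_last_frac)

def in_last_frac(elapsed_s: float, budget_s: float, last_frac: float) -> bool:
    """Same idea for qat_last_frac: True once training enters the final
    `last_frac` of the budget, e.g. to switch on fake quantization."""
    return elapsed_s / budget_s >= 1.0 - last_frac
```

Inside the loop this would be called as something like `lr_scale(time.time() - t0, 600.0)` for a 10-minute budget, which is what makes the schedule robust to iteration-speed changes.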
- 0.05 to 0.04
- 0.04 to 0.032
- Adam to AdamW
- `adam_wd=0.04`, `muon_wd=0.02`
- Changed Muon itself to apply decoupled weight decay
- Added `warmdown_last_frac` so warmdown can be driven by wallclock fraction instead of only `warmdown_iters`
- Added `qat_last_frac` so fake quantization is turned on only near the end of training

Evaluation
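The sliding-window validation described below can be sketched as window scheduling plus a nats-to-bpb conversion. This is my own illustration of the idea, not the PR's `eval_val_sliding(...)` implementation; `sliding_windows` and `bpb_from_nats` are hypothetical names, and `bytes_per_token` is assumed to be known for the validation set:

```python
import math

def sliding_windows(n_tokens: int, window: int, stride: int):
    """Yield (start, end, n_scored): score only the last n_scored tokens of
    each [start, end) slice, so every token is evaluated exactly once with
    up to `window` tokens of left context."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        yield max(0, end - window), end, end - pos
        pos = end

def bpb_from_nats(mean_nll_nats: float, bytes_per_token: float) -> float:
    """Convert mean next-token loss in nats/token to bits per byte."""
    return mean_nll_nats / (math.log(2) * bytes_per_token)
```

Reusing logits directly (the `forward_logits(...)` change) avoids recomputing the forward pass for the overlapping context portion of each window.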
- Added `eval_stride` and `eval_batch_seqs`
- Added `eval_val_sliding(...)` for sliding-window validation
- Added `forward_logits(...)` so sliding evaluation can reuse logits directly

Quantization and export
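As a sketch of the export scheme described below: per-row symmetric quantization to 6 bits, dense bit-packing, and a zlib roundtrip. The function names mirror the PR's (`pack_lowbit_tensor`, `compress_quant_payload`, etc.) but the bodies are my own illustration of the int6 path, not the actual implementation, and the int8/name-pattern routing is omitted:

```python
import zlib
import numpy as np

def quantize_per_row_int6(w: np.ndarray):
    """Symmetric per-row quantization of a [rows, cols] matrix to [-31, 31]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0                      # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack_lowbit_tensor(q: np.ndarray, bits: int = 6) -> bytes:
    """Pack small signed ints into a dense LSB-first bitstream."""
    offset = 1 << (bits - 1)                     # shift to unsigned
    out, acc, nacc = bytearray(), 0, 0
    for v in q.astype(np.int32).ravel():
        acc |= (int(v) + offset) << nacc
        nacc += bits
        while nacc >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nacc -= 8
    if nacc:
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_lowbit_tensor(buf: bytes, n: int, bits: int = 6) -> np.ndarray:
    """Inverse of pack_lowbit_tensor: recover n signed values."""
    offset, mask = 1 << (bits - 1), (1 << bits) - 1
    vals, acc, nacc = [], 0, 0
    for byte in buf:
        acc |= byte << nacc
        nacc += 8
        while nacc >= bits and len(vals) < n:
            vals.append((acc & mask) - offset)
            acc >>= bits
            nacc -= bits
    return np.array(vals, dtype=np.int8)

def compress_quant_payload(packed: bytes) -> bytes:
    return zlib.compress(packed, level=9)        # the PR currently uses zlib

def decompress_quant_payload(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```

Per-row scales bound the dequantization error to half a quantization step per row, and 6-bit packing alone already cuts the payload to 6/32 of fp32 before zlib sees it.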
- New export format `mixed_int6_int8_per_row_v1`
- `pack_lowbit_tensor(...)` / `unpack_lowbit_tensor(...)`
- Added `INT6_NAME_PATTERNS` and `INT8_QAT_NAME_PATTERNS` to control export and QAT targeting by parameter name
- `CastedLinear` weights
- `compress_quant_payload(...)` / `decompress_quant_payload(...)`, currently using zlib
- `final_int8_zlib_roundtrip...` to `final_mixed_quant_zlib_roundtrip...`

Logged artifact sizes and roundtrip metrics
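Reading the fused figures in the log below as 76709557 bytes compressing to 19529207 bytes, a 15201324-byte submission, and `val_loss=2.11022315` / `val_bpb=1.24979460` (these splits are my assumption, since the numbers ran together in the PR text), the reported ratios are internally consistent:

```python
import math

# Figures as I read them from the logged metrics; the splits are assumptions.
pre_zlib_bytes = 76709557
post_zlib_bytes = 19529207
submission_bytes = 15201324
val_loss, val_bpb = 2.11022315, 1.24979460

print(f"compression ratio: {pre_zlib_bytes / post_zlib_bytes:.2f}x")  # 3.93x, as logged
print(f"submission size:   {submission_bytes / 1e6:.1f} MB")          # 15.2 MB, as stated

# val_loss (nats/token) and val_bpb are mutually consistent with roughly
# 2.44 bytes per token on the validation set:
print(f"implied bytes/token: {val_loss / math.log(2) / val_bpb:.3f}")
```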
- 76650381 bytes
- 59176 bytes
- 76709557 bytes
- 15142148 bytes
- 19476680
- 19529207
- 3.93x
- 15201324 bytes (≈15.2 MB)
- `val_loss=2.11022315`, `val_bpb=1.24979460`
- 657418 ms

Apologies for no multi-seed runs: I'm a university student funding myself while I wait for a grant, and I usually experiment on cheap services that only offer A5000-class GPUs. I'll update with a multi-seed log as soon as I find some compute.