This repository was archived by the owner on Jul 7, 2023. It is now read-only.
Adam is slower than Adafactor #1920
Open
Description
Hi,
I found that training Transformers with Adam is three times slower than with Adafactor.
Here is the command I am using for Adam:
t2t-trainer \
  --data_dir=./t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="batch_size=1024,learning_rate_schedule=constant*linear_warmup*rsqrt_decay,learning_rate_constant=0.1,optimizer_adam_beta2=0.999" \
  --schedule=continuous_train_and_eval \
  --output_dir=./t2t/t2t_train/translate_ende_wmt32k_adam_lineB \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=5000

Here is the command I am using for Adafactor:
t2t-trainer \
  --data_dir=./t2t/t2t_data \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="optimizer_adafactor_factored=False,batch_size=1024,optimizer=Adafactor,learning_rate_schedule=constant*linear_warmup*rsqrt_decay,learning_rate_constant=0.1,optimizer_adafactor_multiply_by_parameter_scale=False" \
  --schedule=continuous_train_and_eval \
  --output_dir=./t2t/t2t_train/translate_ende_wmt32k_adafactor_lineN \
  --train_steps=300000 \
  --worker_gpu=10 \
  --eval_steps=5000

I found that training 100 steps takes about 240 seconds with Adam, while it takes only about 80 seconds with Adafactor.
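For reference, here is a minimal timing sketch (not my actual run) of how one could compare the two optimizers in isolation on a toy model, to separate raw optimizer cost from the rest of the t2t input pipeline. It assumes a TF 1.x graph-mode environment with tensor2tensor installed; the toy model, learning rates, and the Adafactor constructor argument are illustrative assumptions, not taken from the commands above.

# Minimal sketch: time 100 optimizer steps on a small dense model.
# Assumes TF 1.x and that tensor2tensor.utils.adafactor.AdafactorOptimizer
# is importable; the learning_rate argument is an assumption for illustration.
import time

import numpy as np
import tensorflow as tf
from tensor2tensor.utils import adafactor


def time_100_steps(make_optimizer):
    """Builds a tiny regression-style model and times 100 training steps."""
    tf.reset_default_graph()
    x = tf.constant(np.random.randn(1024, 512), dtype=tf.float32)
    w = tf.get_variable("w", shape=[512, 512], dtype=tf.float32)
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = make_optimizer().minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)  # warm-up step, excluded from the timing
        start = time.time()
        for _ in range(100):
            sess.run(train_op)
        return time.time() - start


print("Adam      (100 steps): %.1fs" % time_100_steps(
    lambda: tf.train.AdamOptimizer(learning_rate=1e-3)))
print("Adafactor (100 steps): %.1fs" % time_100_steps(
    lambda: adafactor.AdafactorOptimizer(learning_rate=1e-3)))

On a toy model like this the per-step cost of the two optimizers should be much closer than 3x, which is why the gap I see in the full Transformer run looks suspicious to me.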
Could anyone help take a look?
Thanks very much!