4
$\begingroup$

I'm training a speech recognition model using the NVIDIA NeMo framework. Even with the small FastConformer model and just a couple dozen iterations, the results are pretty good; for my data I would say they are quite amazing.

However, I have noticed something strange about the validation loss: it zigzags, which I would consider normal, except that each zig and zag spans several epochs. It looks like this:

[TensorBoard plot of validation loss]

Training is run with https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py, using the model config https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml with gradient accumulation steps increased to 16.
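For reference, a rough sketch of how that override could be applied to a local copy of the linked config before launching the script; the local file path and the exact key name (`trainer.accumulate_grad_batches`) are assumptions based on typical NeMo example configs:

```python
# Minimal sketch: load a local copy of the NeMo example config and bump
# gradient accumulation. The file path and key name are assumptions based
# on typical NeMo example configs, not a tested recipe.
from omegaconf import OmegaConf

cfg = OmegaConf.load("fast-conformer_ctc_bpe.yaml")  # local copy of the linked config

# Effective batch size = per-GPU batch size * number of GPUs * accumulate_grad_batches
cfg.trainer.accumulate_grad_batches = 16

print(OmegaConf.to_yaml(cfg.trainer))
```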

What could be the reason? Is this normal? Can this be avoided somehow?

$\endgroup$

2 Answers

5
$\begingroup$

The config you linked to uses optim/lr: 0.001 and optim/weight_decay: 1e-3.

It could be that the learning rate is too high, such that the model approaches a lower loss but then overshoots it. It's improving on average, but with an oscillatory component.

I'd try a smaller learning rate and compensate by training for longer. The validation curve should then become more stable and ideally reach performance as good as before.

Try disabling weight_decay to begin with. You can bring it back in to manage overfitting, if required.
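For example, a minimal sketch of those two overrides applied to a local copy of the linked YAML; the key names follow the config, but the file path and the chosen values are illustrative, not a tested recipe:

```python
# Minimal sketch: lower the learning rate and disable weight decay.
# Key names (model.optim.lr, model.optim.weight_decay) follow the linked
# fast-conformer_ctc_bpe.yaml; the path and values are illustrative.
from omegaconf import OmegaConf

cfg = OmegaConf.load("fast-conformer_ctc_bpe.yaml")

cfg.model.optim.lr = 1e-4           # e.g. 10x smaller than the default 1e-3
cfg.model.optim.weight_decay = 0.0  # disable to start; reintroduce if overfitting appears

print(OmegaConf.to_yaml(cfg.model.optim))
```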

$\endgroup$
  • $\begingroup$ Thanks, this together with the smaller batch size than recommended may be the reason (there is a comment in the config that the default parameters are for batch sizes of ~2k); maybe a smaller learning rate is another parameter to change besides gradient accumulation. $\endgroup$ Commented Feb 10 at 9:19
4
$\begingroup$

I agree with MuhammedYunus's answer. In addition to adjusting weight_decay, you can use multiple learning rates, i.e. a learning rate schedule.

With a schedule, you start with a high learning rate and then reduce it as the loss goes down.

This speeds up training and also helps avoid the oscillation of val_loss.
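As a standalone illustration of the idea (in NeMo the equivalent behaviour is configured under model.optim.sched in the YAML), here is a minimal PyTorch sketch that drops the learning rate once the validation loss stops improving; the model, optimizer, and loop are placeholders:

```python
# Minimal sketch of the "start high, then reduce" idea with a plain PyTorch
# scheduler. Model, data, and loop are placeholders; in NeMo the equivalent
# behaviour is configured under model.optim.sched in the YAML config.
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # halve LR after 2 stagnant epochs
)

for epoch in range(20):
    val_loss = torch.rand(1).item()   # placeholder validation loss
    scheduler.step(val_loss)          # reduce LR when val_loss plateaus
    print(epoch, optimizer.param_groups[0]["lr"])
```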

Cheers!

$\endgroup$
  • $\begingroup$ This is certainly relevant, but this particular model uses learning rate scheduling as well as warmup. $\endgroup$ Commented Feb 10 at 9:24
  • $\begingroup$ Thanks for your reply. This situation generally occurs when the gradient vector and momentum vector oscillate around the minimum, so your graph could be a result of that. $\endgroup$ Commented Feb 12 at 17:30
