4
$\begingroup$

I'm training a speech recognition model using the NVIDIA NeMo framework. Even with the small FastConformer model and just a couple dozen iterations, the results are pretty good; for my data I would say they are quite amazing.

However, I have noticed something strange about the validation loss: it zigzags, which I would consider normal, except that each zig and zag spans several epochs. It looks like this:

[TensorBoard plot of validation loss]

Training is run with https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py, using the model config https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml with gradient accumulation steps increased to 16.
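For reference, a rough sketch of how that override could be applied to a local copy of the linked config before launching the script; the local file path and the exact key name (`trainer.accumulate_grad_batches`) are assumptions based on typical NeMo example configs:

```python
# Minimal sketch: load a local copy of the NeMo example config and bump
# gradient accumulation. The file path and key name are assumptions based
# on typical NeMo example configs, not a tested recipe.
from omegaconf import OmegaConf

cfg = OmegaConf.load("fast-conformer_ctc_bpe.yaml")  # local copy of the linked config

# Effective batch size = per-GPU batch size * number of GPUs * accumulate_grad_batches
cfg.trainer.accumulate_grad_batches = 16

print(OmegaConf.to_yaml(cfg.trainer))
```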

What could be the reason? Is this normal? Can this be avoided somehow?

$\endgroup$

2 Answers

5
$\begingroup$

The config you linked to uses optim/lr: 0.001 and optim/weight_decay: 1e-3.

It could be that the learning rate is too high, such that the model approaches a lower loss but then overshoots it. It's improving on average, but with an oscillatory component.

I'd try a smaller learning rate and compensate by training for longer. The validation curve should then become more stable and ideally reach performance as good as before.

Try disabling weight_decay to begin with. You can bring it back in to manage overfitting, if required.
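For example, a minimal sketch of those two overrides applied to a local copy of the linked YAML; the key names follow the config, but the file path and the chosen values are illustrative, not a tested recipe:

```python
# Minimal sketch: lower the learning rate and disable weight decay.
# Key names (model.optim.lr, model.optim.weight_decay) follow the linked
# fast-conformer_ctc_bpe.yaml; the path and values are illustrative.
from omegaconf import OmegaConf

cfg = OmegaConf.load("fast-conformer_ctc_bpe.yaml")

cfg.model.optim.lr = 1e-4           # e.g. 10x smaller than the default 1e-3
cfg.model.optim.weight_decay = 0.0  # disable to start; reintroduce if overfitting appears

print(OmegaConf.to_yaml(cfg.model.optim))
```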

$\endgroup$
  • $\begingroup$ Thanks, this together with the smaller batch size than recommended may be the reason (there is a comment in the config that the default parameters are for batch sizes of ~2k); maybe a smaller learning rate is another parameter to change besides gradient accumulation. $\endgroup$ Commented Feb 10 at 9:19
4
$\begingroup$

I agree with MuhammedYunus's answer. In addition to adjusting weight_decay, you can use multiple learning rates, i.e. a learning rate schedule.

With a schedule, you start with a high learning rate and then reduce it as the loss goes down.

This speeds up training and also helps avoid the oscillation of val_loss.
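As a standalone illustration of the idea (in NeMo the equivalent behaviour is configured under model.optim.sched in the YAML), here is a minimal PyTorch sketch that drops the learning rate once the validation loss stops improving; the model, optimizer, and loop are placeholders:

```python
# Minimal sketch of the "start high, then reduce" idea with a plain PyTorch
# scheduler. Model, data, and loop are placeholders; in NeMo the equivalent
# behaviour is configured under model.optim.sched in the YAML config.
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # halve LR after 2 stagnant epochs
)

for epoch in range(20):
    val_loss = torch.rand(1).item()   # placeholder validation loss
    scheduler.step(val_loss)          # reduce LR when val_loss plateaus
    print(epoch, optimizer.param_groups[0]["lr"])
```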

Cheers!

$\endgroup$
  • $\begingroup$ This is certainly relevant, but this particular model uses learning rate scheduling as well as warmup. $\endgroup$ Commented Feb 10 at 9:24
  • $\begingroup$ Thanks for your reply. This situation generally occurs when the gradient vector and momentum vector oscillate around the minimum, so your graph could be a result of that. $\endgroup$ Commented Feb 12 at 17:30
