Issue with Loss Divergence in SSL Pretraining of NeMo FastConformer Model #11242
satheesh-k-4986 asked this question in Q&A · Unanswered
I'm currently pretraining a FastConformer SSL model using NVIDIA's NeMo framework on 62k hours of data. Training starts off well, but after roughly 150k steps the loss begins to increase rather than decrease, and the model diverges instead of converging. I've adjusted learning rates and experimented with different schedulers, but haven't seen any improvement.
Is there any guidance or best practices for handling loss divergence during SSL pretraining in NeMo? Are there specific parameters or configurations known to stabilize FastConformer SSL models over longer training runs?
I'm using the default config: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/fastconformer/fast-conformer.yaml
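To illustrate, these are the kinds of optimizer/scheduler/trainer overrides I've been varying on top of that config. The values here are examples rather than my exact settings, and the field layout assumes the usual NeMo `model.optim` / `trainer` structure:

```yaml
# Illustrative overrides only -- example values, not my exact run.
model:
  optim:
    name: adamw
    lr: 2.0                  # peak LR under Noam-style scaling; I've tried lowering this
    betas: [0.9, 0.98]
    weight_decay: 1e-3
    sched:
      name: NoamAnnealing
      warmup_steps: 25000    # a longer warmup was one of the scheduler changes I tried
      min_lr: 1e-6

trainer:
  gradient_clip_val: 1.0     # gradient clipping as a stabilizer for long runs
  precision: bf16-mixed      # also tried different precision modes
```

None of these combinations have prevented the divergence so far.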
Any insights or similar experiences would be appreciated!