Issue with Loss Divergence in SSL Pretraining of NeMo FastConformer Model #11242
satheesh-k-4986 asked this question in Q&A · Unanswered
I'm currently pretraining a FastConformer SSL model using NVIDIA's NeMo framework on 62k hours of data. Training starts off well, but after roughly 150k steps the loss begins to increase rather than decrease, and the model diverges instead of converging. I've adjusted learning rates and experimented with different schedulers, but haven't seen any improvement.
Is there any guidance or best practices for handling loss divergence during SSL pretraining in NeMo? Are there specific parameters or configurations known to stabilize FastConformer SSL models over longer training runs?
I'm using the default config: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/fastconformer/fast-conformer.yaml
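To illustrate, these are the kinds of optimizer/scheduler/trainer overrides I've been varying on top of that config. The values here are examples rather than my exact settings, and the field layout assumes the usual NeMo `model.optim` / `trainer` structure:

```yaml
# Illustrative overrides only -- example values, not my exact run.
model:
  optim:
    name: adamw
    lr: 2.0                  # peak LR under Noam-style scaling; I've tried lowering this
    betas: [0.9, 0.98]
    weight_decay: 1e-3
    sched:
      name: NoamAnnealing
      warmup_steps: 25000    # a longer warmup was one of the scheduler changes I tried
      min_lr: 1e-6

trainer:
  gradient_clip_val: 1.0     # gradient clipping as a stabilizer for long runs
  precision: bf16-mixed      # also tried different precision modes
```

None of these combinations have prevented the divergence so far.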
Any insights or similar experiences would be appreciated!