
zipformer BF16 training recipe #1700

Merged: 12 commits into k2-fsa:master on Aug 23, 2024

Conversation

@marcoyang1998 (Collaborator) commented on Jul 24, 2024

This PR shows how to enable bf16 training of Zipformer. It's not intended for merging at this moment.

Advantages of using bf16 training:

  • Lower peak GPU memory usage than AMP fp16 (see the benchmarks below)
  • No loss scaling/GradScaler is needed, since bf16 has the same dynamic range as fp32

Disadvantages of bf16 training:

  • Limited hardware support (only GPUs from the Ampere architecture onwards, so V100 is not supported)
  • (Possibly) slightly worse results than AMP, because bf16 has fewer mantissa bits and hence lower numerical precision

TODO:

  • Performance benchmark with AMP on full LibriSpeech
  • Test exporting the model
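
For reference, a minimal PyTorch sketch of what "full bf16" training means in this context. This is not the recipe code from this PR; the tiny model, fake batch, and optimizer are stand-ins purely to illustrate the dtype handling.

```python
import torch
import torch.nn as nn

# bf16 requires an Ampere-or-newer GPU; a V100 will fail this check.
assert torch.cuda.is_bf16_supported(), "this GPU does not support bf16"

# Toy stand-in for the Zipformer encoder, only to show the casts.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 512)).cuda()
model = model.to(torch.bfloat16)   # "full bf16": the weights themselves are stored in bf16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

features = torch.randn(8, 100, 80, device="cuda", dtype=torch.bfloat16)  # bf16 inputs
loss = model(features).sum()       # the entire forward pass runs in bf16
loss.backward()                    # gradients are bf16; no GradScaler is involved
optimizer.step()
optimizer.zero_grad()
```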

@marcoyang1998 (Collaborator, Author) commented:

Performance benchmark on LibriSpeech train-clean-100; all models are trained with 2 A100 GPUs. WERs are obtained with greedy search.

| Training config | WER (test-clean / test-other) | Peak GPU memory | Time per epoch |
|---|---|---|---|
| AMP, max_duration=1000 | 6.08 / 15.59 | 20459 MB | 8 min |
| full bf16, max_duration=1000 | 6.11 / 15.72 | 15473 MB | 8 min |

The WERs are about the same (AMP is slightly better than bf16), but bf16 training consumes much less GPU memory. The training speed is the same here, but we will have to check again with a larger max-duration and a larger model size.

@marcoyang1998 (Collaborator, Author) commented:

Experiments on full LibriSpeech. The model is trained for 30 epochs using 4 A100 GPUs. WERs are obtained using modified_beam_search for decoding.

| Training config | WER (test-clean / test-other) | Peak GPU memory | Time per epoch |
|---|---|---|---|
| full bf16, max_duration=1000 | 2.4 / 5.31 | 15392 MB | ~39 min |

There is a notable performance gap (around 5% relative) between AMP (2.25/5.06) and full bf16 (2.4/5.31). We will investigate the cause and try to reduce the gap.

@marcoyang1998 (Collaborator, Author) commented:

Since AMP also supports bf16 as the autocast dtype, I tried this (with a minor modification to the original code). See this post for more instructions on how to use AMP. The results are shown below, followed by a rough sketch of the change.

| Training config | WER (test-clean / test-other) | Peak GPU memory | Time per epoch |
|---|---|---|---|
| full bf16, max_duration=1000 | 2.4 / 5.31 | 15392 MB | ~39 min |
| amp bf16, max_duration=1000 | 2.15 / 5.1 | 16169 MB | ~42 min |

So AMP+bf16 training achieves a better WER than full bf16, while peak GPU memory and time per epoch increase slightly.
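
For context, this is roughly what switching AMP from fp16 to bf16 looks like in PyTorch (assuming a recent PyTorch release): pass `dtype=torch.bfloat16` to autocast and drop the GradScaler. The toy model below is a stand-in, not the actual diff in this PR.

```python
import torch
import torch.nn as nn

# Toy stand-in model; under AMP the weights stay in fp32.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 512)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

features = torch.randn(8, 100, 80, device="cuda")  # fp32 inputs

# AMP + bf16: selected ops run in bf16, everything else (and the weights) stays fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(features).sum()

# bf16 keeps fp32's exponent range, so no GradScaler / loss scaling is needed
# (unlike AMP with fp16, where the scaler guards against gradient underflow).
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Keeping fp32 weights and optimizer states under AMP is presumably why amp+bf16 matches amp+fp16 accuracy while full bf16 lags behind.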

@marcoyang1998 (Collaborator, Author) commented:

Detailed experiment results:

| model | test-clean | test-other | comment |
|---|---|---|---|
| amp + fp16 | 2.27 | 5.1 | --epoch 30 --avg 9, greedy |
| amp + fp16 | 2.25 | 5.06 | --epoch 30 --avg 9, modified_beam_search |
| amp + fp16 | 2.23 | 4.96 | --epoch 40 --avg 9, greedy |
| amp + fp16 | 2.21 | 4.91 | --epoch 40 --avg 16 |
| amp + bf16 | 2.19 | 5.17 | --epoch 30 --avg 13, greedy |
| amp + bf16 | 2.15 | 5.1 | --epoch 30 --avg 13, modified_beam_search |
| amp + bf16 | 2.16 | 5.03 | --epoch 40 --avg 13, greedy |
| amp + bf16 | 2.15 | 4.94 | --epoch 40 --avg 13, modified_beam_search |
| full bf16 | 2.42 | 5.44 | --epoch 30 --avg 15, greedy |
| full bf16 | 2.4 | 5.31 | --epoch 30 --avg 15, modified_beam_search |
| full bf16 | 2.39 | 5.44 | --epoch 40 --avg 17, greedy |
| full bf16 | 2.31 | 5.35 | --epoch 40 --avg 17, modified_beam_search |

  • Accuracy: amp fp16 = amp bf16 > full bf16
  • Training speed: full bf16 > amp fp16 > amp bf16
  • Memory usage: amp fp16 = amp bf16 > full bf16

@marcoyang1998 (Collaborator, Author) commented:

The models have been uploaded to Hugging Face:

The amp bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-amp-bf16

The full bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-bf16

@marcoyang1998 (Collaborator, Author) commented:

I decided to support only amp+bf16 training in this PR, since full bf16 training loses too much accuracy.

@marcoyang1998 merged commit a6c02a4 into k2-fsa:master on Aug 23, 2024
278 checks passed