Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

galv · 2024-04-18T00:20:49Z

Commit 1:

Do feature normalization in parallel, rather than via a for loop.

At large batch sizes, this becomes a bottleneck, taking about 9 ms at
batch size 16, for example. See issue #8948.

Commit 2:

Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization.

With this change, we can now do stream capture to a cuda graph on the
preprocessor. This is bound to increase performance
significantly. Even at batch size 16, the GPU is idle about 50% of the
time because these kernels finish so fast.

No unit test. I simply ran:

python examples/asr/speech_to_text_eval.py  pretrained_name=nvidia/parakeet-ctc-1.1b  dataset_manifest=/home/dgalvez/scratch/data/test_other.json  batch_size=16  output_filename=test_other_decoded.jsonl  amp=true  amp_dtype=bfloat16  use_cer=false num_workers=1

Before my changes:

Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:25<00:00,  7.17it/s]
[NeMo I 2024-04-17 14:50:33 transcribe_speech:398] Finished transcribing from manifest file: /home/dgalvez/scratch/data/test_other.json
[NeMo I 2024-04-17 14:50:33 transcribe_speech:403] Writing transcriptions into file: test_other_decoded.jsonl
[NeMo I 2024-04-17 14:50:33 transcribe_speech:421] Finished writing predictions to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:50:33 transcribe_speech:439] Writing prediction and error rate of each sample to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:50:33 transcribe_speech:440] {'samples': 2939, 'tokens': 52343, 'wer': 0.037712779168179125, 'ins_rate': 0.004203045297365455, 'del_rate': 0.0022734654108476776, 'sub_rate': 0.031236268459965993}
[NeMo I 2024-04-17 14:50:33 speech_to_text_eval:131] Finished transcribing speech dataset. Computing ASR metrics..
[NeMo I 2024-04-17 14:50:33 speech_to_text_eval:210] Dataset WER/CER 3.77%/1.41%

After my changes:

Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:25<00:00,  7.18it/s]
[NeMo I 2024-04-17 14:57:37 transcribe_speech:398] Finished transcribing from manifest file: /home/dgalvez/scratch/data/test_other.json
[NeMo I 2024-04-17 14:57:38 transcribe_speech:403] Writing transcriptions into file: test_other_decoded.jsonl
[NeMo I 2024-04-17 14:57:38 transcribe_speech:421] Finished writing predictions to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:57:38 transcribe_speech:439] Writing prediction and error rate of each sample to test_other_decoded.jsonl!
[NeMo I 2024-04-17 14:57:38 transcribe_speech:440] {'samples': 2939, 'tokens': 52343, 'wer': 0.03761725541142082, 'ins_rate': 0.004222150048717116, 'del_rate': 0.0022352559081443555, 'sub_rate': 0.03115984945455935}
[NeMo I 2024-04-17 14:57:38 speech_to_text_eval:131] Finished transcribing speech dataset. Computing ASR metrics..
[NeMo I 2024-04-17 14:57:38 speech_to_text_eval:210] Dataset WER/CER 3.76%/1.41%

You can see that the WER went down by 0.01%, implying a small change took place. I don't think it is a concern, but worth noting nonetheless.

pzelasko

LGTM

Profiling Canary training showed that the old implementation is taking about 15% of the total forward step time, while the new one is barely noticeable. It's likely this will speed up ASR training across all architectures.

pzelasko · 2024-04-26T21:07:10Z

The CPU-only tests are failing on:

2024-04-25T23:47:10.4949722Z nemo/collections/asr/parts/preprocessing/features.py:466: in forward
2024-04-25T23:47:10.4950460Z     x, _, _ = normalize_batch(x, seq_len, normalize_type=self.normalize)
2024-04-25T23:47:10.4951858Z nemo/collections/asr/parts/preprocessing/features.py:67: in normalize_batch
2024-04-25T23:47:10.4952730Z     if not torch.cuda.is_current_stream_capturing() and torch.any(seq_len == 1).item():
2024-04-25T23:47:10.4953491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-04-25T23:47:10.4953904Z 
2024-04-25T23:47:10.4954053Z     def is_current_stream_capturing():
2024-04-25T23:47:10.4954490Z         r"""
2024-04-25T23:47:10.4955100Z         Returns True if CUDA graph capture is underway on the current CUDA stream, False otherwise.
2024-04-25T23:47:10.4955767Z     
2024-04-25T23:47:10.4956394Z         If a CUDA context does not exist on the current device, returns False without initializing the context.
2024-04-25T23:47:10.4957165Z         """
2024-04-25T23:47:10.4957517Z >       return _cuda_isCurrentStreamCapturing()
2024-04-25T23:47:10.4958284Z E       RuntimeError: CUDA error: no CUDA-capable device is detected
2024-04-25T23:47:10.4959247Z E       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-04-25T23:47:10.4960142Z E       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-04-25T23:47:10.4961287Z E       Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
2024-04-25T23:47:10.4961983Z 
2024-04-25T23:47:10.4962405Z /usr/local/lib/python3.10/dist-packages/torch/cuda/graphs.py:28: RuntimeError

@galv can you add a check to only call CUDA API if it is available? (sth like if torch.cuda.is_available(): don't remember the name exactly)

At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue NVIDIA#8948. Signed-off-by: Daniel Galvez <[email protected]>

doing "per_feature" normalization. With this change, we can now do stream capture to a cuda graph on the preprocessor. This is bound to increase performance significantly. Even at batch size 16, the GPU is idle about 50% of the time because these kernels finish so fast. Signed-off-by: Daniel Galvez <[email protected]>

for more information, see https://pre-commit.ci

Signed-off-by: Daniel Galvez <[email protected]>

for more information, see https://pre-commit.ci

pzelasko

LGTM

…hen doing per_feature normalization (#8964) * Do feature normalization in parallel, rather than via a for loop. At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue #8948. Signed-off-by: Daniel Galvez <[email protected]> * Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization. With this change, we can now do stream capture to a cuda graph on the preprocessor. This is bound to increase performance significantly. Even at batch size 16, the GPU is idle about 50% of the time because these kernels finish so fast. Signed-off-by: Daniel Galvez <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix crash in CPU mode. Signed-off-by: Daniel Galvez <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Daniel Galvez <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Ao Tang <[email protected]>

…raph when doing per_feature normalization (NVIDIA#8964) * Do feature normalization in parallel, rather than via a for loop. At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue NVIDIA#8948. Signed-off-by: Daniel Galvez <[email protected]> * Remove all instances of cudaStreamSynchronize() in the featurizer when doing "per_feature" normalization. With this change, we can now do stream capture to a cuda graph on the preprocessor. This is bound to increase performance significantly. Even at batch size 16, the GPU is idle about 50% of the time because these kernels finish so fast. Signed-off-by: Daniel Galvez <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix crash in CPU mode. Signed-off-by: Daniel Galvez <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Daniel Galvez <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

github-actions bot added the ASR label Apr 18, 2024

pzelasko previously approved these changes Apr 25, 2024

View reviewed changes

pzelasko mentioned this pull request Apr 25, 2024

GPU-based vectorized SpecAug #9041

Closed

8 tasks

galv dismissed pzelasko’s stale review via f2debc1 April 29, 2024 18:32

galv force-pushed the pr-pusher branch from af6af25 to f592cc0 Compare April 29, 2024 18:41

github-actions bot added NLP CI common Multi Modal labels Apr 29, 2024

galv and others added 5 commits April 29, 2024 11:42

Do feature normalization in parallel, rather than via a for loop.

ee1ccbf

At large batch sizes, this becomes a bottleneck, taking about 9 ms at batch size 16, for example. See issue NVIDIA#8948. Signed-off-by: Daniel Galvez <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e9f88bd

for more information, see https://pre-commit.ci

Fix crash in CPU mode.

034799f

Signed-off-by: Daniel Galvez <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

22b06d0

for more information, see https://pre-commit.ci

galv force-pushed the pr-pusher branch from f592cc0 to 22b06d0 Compare April 29, 2024 18:43

github-actions bot removed NLP CI common Multi Modal labels Apr 29, 2024

galv added the Run CICD label Apr 30, 2024

pzelasko approved these changes Apr 30, 2024

View reviewed changes

pzelasko merged commit a9dac0e into NVIDIA:main Apr 30, 2024
126 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

galv commented Apr 18, 2024

pzelasko left a comment •

edited

Loading

pzelasko commented Apr 26, 2024

pzelasko left a comment

Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization #8964

Conversation

galv commented Apr 18, 2024

pzelasko left a comment • edited Loading

Choose a reason for hiding this comment

pzelasko commented Apr 26, 2024

pzelasko left a comment

Choose a reason for hiding this comment

pzelasko left a comment •

edited

Loading