Commit

add a common issue of using tensorrt (#1701)
hnyu authored Sep 24, 2024
1 parent 76d6fe9 commit b189e3d
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions alf/utils/README.md
@@ -71,3 +71,12 @@ If you have to use a dynamic shape, you can choose to use the CUDA backend by se
```bash
ORT_ONNX_BACKEND_EXCLUDE_PROVIDERS=TensorrtExecutionProvider
```

## Common issues
Importing ``tensorrt_utils.py`` has known side effects on CUDA/GPU state. Make sure that this module is never
imported during training; import it only for inference, and only when necessary.

If it is imported during training, typical symptoms are:

1. GPU 0 consumes an abnormally large amount of extra memory, leading to CUDA out-of-memory errors.
2. In multi-GPU training, an error such as "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 17000" sometimes appears after actual training starts (rollout is fine).
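
One way to follow this advice is to keep the import out of module scope and add a guard at the start of training. The sketch below is illustrative only: the module path `alf.utils.tensorrt_utils` is assumed from this README's location, and the helper names `assert_tensorrt_utils_not_imported` and `load_tensorrt_utils` are hypothetical, not part of ALF.

```python
import importlib
import sys

# Assumed module path (inferred from this README living in alf/utils);
# adjust it to match your checkout.
TENSORRT_UTILS_MODULE = "alf.utils.tensorrt_utils"


def assert_tensorrt_utils_not_imported(module_name=TENSORRT_UTILS_MODULE):
    """Hypothetical guard: call at the start of training.

    Raises if the module has already been imported in this process,
    e.g. through a top-level import somewhere in the training code path.
    """
    if module_name in sys.modules:
        raise RuntimeError(
            f"{module_name} is already imported; it must not be loaded "
            "during training. Move the import into the inference code path.")


def load_tensorrt_utils(module_name=TENSORRT_UTILS_MODULE):
    """Hypothetical lazy loader: call only from inference code.

    The import happens when this function is called, not when the
    enclosing file is imported, so training jobs never trigger it.
    """
    return importlib.import_module(module_name)
```

A training entry point could call `assert_tensorrt_utils_not_imported()` before any GPU work starts, while inference code calls `load_tensorrt_utils()` only at the point where the TensorRT backend is actually needed. The guard only detects an import that has already happened in the current process; it does not undo any side effect.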
