
Additional GPU memory usage in the first GPU #114

Closed
chengxuz opened this issue Jan 28, 2022 · 5 comments · Fixed by #115
Labels: bug (Something isn't working), high priority

Comments

@chengxuz

When training one network on multiple GPUs, I find that the first GPU ends up with extra memory used by the processes running on the other GPUs. Is there a way to avoid this? It is a problem because the first GPU always uses more memory than the others, so the other GPUs have to leave memory unused to accommodate it.

@GuillaumeLeclerc
Collaborator

Hello,

This is definitely not normal behavior, and I am investigating a similar report from someone else. Are you sure you call ch.cuda.set_device appropriately in your code? If not, that is a known cause of what you are describing.
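For reference, the usual pattern is to bind each worker process to its own GPU before any other CUDA call; a minimal sketch, assuming a torchrun-style launcher that sets LOCAL_RANK (the helper name pin_to_gpu is hypothetical, and ffcv's examples alias torch as ch):

```python
import os
import torch as ch  # ffcv's examples import torch under the alias ch

def pin_to_gpu(local_rank: int) -> str:
    """Bind this process to one GPU before any other CUDA operation.

    Calling set_device first keeps each worker from implicitly
    initializing a CUDA context on cuda:0, which would otherwise
    show up as extra memory used on the first GPU.
    """
    if ch.cuda.is_available():
        ch.cuda.set_device(local_rank)
        return f"cuda:{local_rank}"
    return "cpu"  # fall back gracefully on machines without GPUs

# With torchrun, LOCAL_RANK identifies this worker's GPU on the node
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = pin_to_gpu(local_rank)
```

The key point is ordering: `set_device` must run before any tensor is moved to CUDA or any context-creating call is made, otherwise the default context on device 0 is already allocated.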

@GuillaumeLeclerc
Collaborator

@chengxuz Can you show me the output of nvidia-smi while this is running?

@chengxuz
Author

Here is the output of nvidia-smi. I have just confirmed that I call torch.cuda.set_device correctly in my code.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:1A:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:1B:00.0 Off |                  N/A |
| 23%   26C    P8     8W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:1C:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN Xp     Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   27C    P8    10W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN Xp     Off  | 00000000:1E:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN Xp     Off  | 00000000:3D:00.0 Off |                  N/A |
| 47%   75C    P2   160W / 250W |   8948MiB / 12196MiB |     35%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA TITAN Xp     Off  | 00000000:3E:00.0 Off |                  N/A |
| 49%   79C    P2   286W / 250W |   6549MiB / 12196MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA TITAN Xp     Off  | 00000000:3F:00.0 Off |                  N/A |
| 52%   83C    P2   306W / 250W |   6549MiB / 12196MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  NVIDIA TITAN Xp     Off  | 00000000:40:00.0 Off |                  N/A |
| 52%   83C    P2   182W / 250W |   6529MiB / 12196MiB |     83%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   9  NVIDIA TITAN Xp     Off  | 00000000:41:00.0 Off |                  N/A |
| 23%   31C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    5   N/A  N/A     31955      C   ...nda2/envs/ffcv/bin/python     6543MiB |
|    5   N/A  N/A     31956      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    5   N/A  N/A     31957      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    5   N/A  N/A     31958      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    6   N/A  N/A     31956      C   ...nda2/envs/ffcv/bin/python     6547MiB |
|    7   N/A  N/A     31957      C   ...nda2/envs/ffcv/bin/python     6547MiB |
|    8   N/A  N/A     31958      C   ...nda2/envs/ffcv/bin/python     6527MiB |
+-----------------------------------------------------------------------------+

@GuillaumeLeclerc
Collaborator

This is definitely not normal. I can reproduce it right now with 2 GPUs, although for me it's GPU 1 that has two processes associated with it.

It shouldn't take long to fix now that I can reproduce it. Thank you for confirming what I suspected!

@GuillaumeLeclerc
Collaborator

Hello! Thanks for the report. The fix should land in v0.0.4. I might deploy a release candidate tonight; otherwise, you can install directly from GitHub (branch v0.0.4).
