
CUDA out of memory with 24G GPU #1

Open
ax130885 opened this issue Aug 15, 2024 · 3 comments

Comments

@ax130885

Hi, sorry for the late reply.

In the ProMISe issue, it was mentioned that this project requires around 24GB of VRAM.
However, I still run into an out-of-memory error on an RTX 3090 (24GB)
while running the following command:

python src/train.py --data colon --data_dir "dataset/Task10_Colon" --save_name "my_train" --multiple_outputs --dynamic --use_box --refine

[11:35:04.209] Namespace(data='colon', save_dir='./implementation/colon/my_train', data_dir='dataset/Task10_Colon', num_workers=2, split='train', use_small_dataset=False, model_type='vit_b_ori', lr=4e-05, lr_scheduler='linear', warm_up=False, device='cuda:0', max_epoch=200, image_size=128, batch_size=1, checkpoint='best', checkpoint_sam='./checkpoint_sam/sam_vit_b_01ec64.pth', num_classes=2, tolerance=5, boundary_kernel_size=5, use_pretrain=False, pretrain_path='', resume=False, resume_best=False, ddp=False, gpu_ids=[0, 1], accumulation_steps=20, iter_nums=11, num_clicks=50, num_clicks_validation=10, use_box=True, dynamic_box=False, use_scribble=False, num_multiple_outputs=3, multiple_outputs=True, refine=True, no_detach=False, refine_test=False, dynamic=True, efficient_scribble=False, use_sam3d_turbo=False, save_predictions=False, save_csv=False, save_test_dir='./', save_name='my_train')
Unet_encoder features: (32, 32, 64, 128, 384, 32).
Unet_decoder features: (32, 32, 64, 128, 384, 32).
dataloaders are created, models are loaded, and others are set, spent 3.8 for rank -1
num_clicks 50 points_length: 69189 dynamic_size: 13
First batch:   fn: 1.0000, fp: 0.0000, label 0: tensor(0), label 1: tensor(13)
--- ===================================== ---
--- above before model, below after model ---
--- ===================================== ---
dice before refine 0.07599078863859177 and after 0.037667643278837204
num_clicks 50 points_length: 356190 dynamic_size: 11
First batch:   fn: 0.8992, fp: 4.2488, label 0: tensor(11), label 1: tensor(0)
--- ===================================== ---
...
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.69 GiB total capacity; 21.36 GiB already allocated; 42.62 MiB free; 21.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
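
The error message itself suggests setting max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce allocator fragmentation. A minimal sketch of that, assuming a bash-style shell (128 MiB is only an illustrative value, and this only helps with fragmentation, not with a genuine shortage of memory):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python src/train.py --data colon --data_dir "dataset/Task10_Colon" --save_name "my_train" --multiple_outputs --dynamic --use_box --refine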

How can I resolve this problem?

@HaoLi12345
Collaborator

Hi,

Reducing the iteration number would lower GPU usage, but it also lowers performance. I have tried 9 and 11, and 9 is slightly worse.

Modifying the network or the input size could be other options if you are not using a pretrained model.
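
For example, assuming iter_nums is exposed as a command-line flag of src/train.py (it appears as iter_nums=11 in the logged Namespace above), the run could be retried with a lower iteration count:

python src/train.py --data colon --data_dir "dataset/Task10_Colon" --save_name "my_train" --multiple_outputs --dynamic --use_box --refine --iter_nums 9

Similarly, if an --image_size flag exists, lowering it below the logged 128 would reduce memory at the cost of input resolution, which is only an option when training without a pretrained model.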

@ax130885
Author

Thank you. After changing iter_nums to 9, I can run the training program successfully.

@HaoLi12345
Collaborator

Glad to hear that.
