
ERROR:CUDA out of memory when the batch_size has already been set to 1 #15

Open
radish512 opened this issue Apr 11, 2024 · 2 comments

Comments

@radish512

radish512 commented Apr 11, 2024

I have deployed the model on two servers, each with 4090 GPUs. With syntext2 as the training dataset, the model completes one epoch of training normally. However, with syntext1 as the training dataset, the model runs out of memory after around 1000 iterations, and when the training data is expanded to totaltext : ic13 : ic15 : mlt : syntext1 : syntext in Pretrain.sh, it runs out of memory after around 150 iterations. The exact iteration at which the error occurs varies between training runs.
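Since the failing iteration varies from run to run, it may help to log allocator statistics once per iteration to see whether memory grows steadily (leak or fragmentation) or spikes on particular batches. This is only a minimal sketch: the `log_cuda_memory` helper and the idea of calling it from `train_one_epoch` in engine.py are assumptions for illustration, not existing ESTextSpotter code.

```python
import torch

def log_cuda_memory(step, device=0):
    # Hypothetical helper: report allocator statistics each iteration to tell
    # steady growth apart from a spike on a single batch.
    allocated = torch.cuda.memory_allocated(device) / 2**30    # GiB currently held by tensors
    peak = torch.cuda.max_memory_allocated(device) / 2**30     # GiB peak since the last reset
    print(f"iter {step}: allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)                 # fresh peak for the next iteration

# Hypothetical call site, inside the training loop:
#     log_cuda_memory(step)
```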

The detailed error message is as follows:

First time:

```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1131, in forward
    tgt = self.forward_ca(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1098, in forward_ca
    tgt2 = self.cross_attn(self.with_pos_embed(tgt, tgt_query_pos.unsqueeze(2)).transpose(0, 1).flatten(1,2),
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ops/modules/ms_deform_attn.py", line 96, in forward
    value = value.masked_fill(input_padding_mask[..., None], float(0))
RuntimeError: CUDA out of memory. Tried to allocate 102.00 MiB (GPU 1; 23.65 GiB total capacity; 20.90 GiB already allocated; 43.69 MiB free; 21.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 397826 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 397827) of binary: /home/xx/anaconda3/envs/ests/bin/python
```

Second time:

```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1136, in forward
    tgt = self.forward_sa(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1039, in forward_sa
    tgt2_inter = self.attn_inter(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1003, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 5101, in multi_head_attention_forward
    attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 4844, in _scaled_dot_product_attention
    attn = torch.bmm(q, k.transpose(-2, -1))
RuntimeError: CUDA out of memory. Tried to allocate 452.00 MiB (GPU 1; 23.65 GiB total capacity; 20.87 GiB already allocated; 27.69 MiB free; 21.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 365387 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 365388) of binary: /home/xx/anaconda3/envs/ests/bin/python
```
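Both tracebacks end with PyTorch's own hint about fragmentation (`max_split_size_mb` / `PYTORCH_CUDA_ALLOC_CONF`). Purely as a sketch of that hint, the option can be set before the first CUDA allocation, for example at the very top of main.py; the value 128 below is an arbitrary example, not a tuned recommendation for ESTextSpotter, and it only helps if the OOM is caused by fragmentation rather than a genuine memory shortfall.

```python
import os

# Allocator hint from the error message: cap the size of split cache blocks so
# fragmented free memory can still serve larger allocations. Must be set before
# the first CUDA allocation; 128 MiB is an arbitrary example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the caching allocator picks it up
```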

radish512 changed the title from "CUDA out of memory when the batch_size has already been set to 1" to "ERROR:CUDA out of memory when the batch_size has already been set to 1" on Apr 11, 2024
@radish512 (Author)

Dear author, could you please help resolve this issue?

@Apostatee

same
