
ERROR:CUDA out of memory when the batch_size has already been set to 1 #15

Open
radish512 opened this issue Apr 11, 2024 · 2 comments

Comments

@radish512

radish512 commented Apr 11, 2024

I have deployed the model on two servers, each with 4090 GPUs. With syntext2 as the training dataset, the model completes one epoch of training normally. However, with syntext1 as the training dataset, the model runs out of memory after around 1000 iterations, and when the training data is expanded to totaltext : ic13 : ic15 : mlt : syntext1 : syntext in Pretrain.sh, it runs out of memory after around 150 iterations. The exact iteration at which the error occurs varies between training runs.
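Since the failing iteration varies from run to run, it may help to log allocator statistics once per iteration to see whether memory grows steadily (leak or fragmentation) or spikes on particular batches. This is only a minimal sketch: the `log_cuda_memory` helper and the idea of calling it from `train_one_epoch` in engine.py are assumptions for illustration, not existing ESTextSpotter code.

```python
import torch

def log_cuda_memory(step, device=0):
    # Hypothetical helper: report allocator statistics each iteration to tell
    # steady growth apart from a spike on a single batch.
    allocated = torch.cuda.memory_allocated(device) / 2**30    # GiB currently held by tensors
    peak = torch.cuda.max_memory_allocated(device) / 2**30     # GiB peak since the last reset
    print(f"iter {step}: allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)                 # fresh peak for the next iteration

# Hypothetical call site, inside the training loop:
#     log_cuda_memory(step)
```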

The detailed error message is as follows:

First time:

```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1131, in forward
    tgt = self.forward_ca(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1098, in forward_ca
    tgt2 = self.cross_attn(self.with_pos_embed(tgt, tgt_query_pos.unsqueeze(2)).transpose(0, 1).flatten(1,2),
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ops/modules/ms_deform_attn.py", line 96, in forward
    value = value.masked_fill(input_padding_mask[..., None], float(0))
RuntimeError: CUDA out of memory. Tried to allocate 102.00 MiB (GPU 1; 23.65 GiB total capacity; 20.90 GiB already allocated; 43.69 MiB free; 21.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 397826 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 397827) of binary: /home/xx/anaconda3/envs/ests/bin/python
```

Second time:

```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1136, in forward
    tgt = self.forward_sa(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1039, in forward_sa
    tgt2_inter = self.attn_inter(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1003, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 5101, in multi_head_attention_forward
    attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 4844, in _scaled_dot_product_attention
    attn = torch.bmm(q, k.transpose(-2, -1))
RuntimeError: CUDA out of memory. Tried to allocate 452.00 MiB (GPU 1; 23.65 GiB total capacity; 20.87 GiB already allocated; 27.69 MiB free; 21.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 365387 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 365388) of binary: /home/xx/anaconda3/envs/ests/bin/python
```
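Both tracebacks end with PyTorch's own hint about fragmentation (`max_split_size_mb` / `PYTORCH_CUDA_ALLOC_CONF`). Purely as a sketch of that hint, the option can be set before the first CUDA allocation, for example at the very top of main.py; the value 128 below is an arbitrary example, not a tuned recommendation for ESTextSpotter, and it only helps if the OOM is caused by fragmentation rather than a genuine memory shortfall.

```python
import os

# Allocator hint from the error message: cap the size of split cache blocks so
# fragmented free memory can still serve larger allocations. Must be set before
# the first CUDA allocation; 128 MiB is an arbitrary example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the caching allocator picks it up
```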

radish512 changed the title from "CUDA out of memory when the batch_size has already been set to 1" to "ERROR:CUDA out of memory when the batch_size has already been set to 1" on Apr 11, 2024
@radish512 (Author)

Dear author, could you please help resolve this issue?

@Apostatee

same
