I have deployed the model on two servers, each with 4090 GPUs. When the training dataset is syntext2, the model completes one epoch of training normally. However, when the training dataset is syntext1, it runs out of GPU memory after around 1000 iterations, and when the dataset is expanded to totaltext : ic13 : ic15 : mlt : syntext1 : syntext in Pretrain.sh, it runs out of memory after around 150 iterations. The exact iteration at which the error occurs varies from run to run.
The detailed error message is as follows:
First time:
```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1131, in forward
    tgt = self.forward_ca(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1098, in forward_ca
    tgt2 = self.cross_attn(self.with_pos_embed(tgt, tgt_query_pos.unsqueeze(2)).transpose(0, 1).flatten(1,2),
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ops/modules/ms_deform_attn.py", line 96, in forward
    value = value.masked_fill(input_padding_mask[..., None], float(0))
RuntimeError: CUDA out of memory. Tried to allocate 102.00 MiB (GPU 1; 23.65 GiB total capacity; 20.90 GiB already allocated; 43.69 MiB free; 21.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 397826 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 397827) of binary: /home/xx/anaconda3/envs/ests/bin/python
```
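The RuntimeError itself suggests trying `max_split_size_mb` to reduce fragmentation. A minimal sketch of how that can be applied (the value 128 is an illustrative starting point, not a value from the ESTextSpotter repo; the variable must be set before the first CUDA allocation, e.g. at the top of main.py or exported in Pretrain.sh):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# initialized, so set it before `import torch` triggers any CUDA work.
# max_split_size_mb:128 is an illustrative assumption; tune per workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Equivalently, `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` in the launch script. Note this only helps when reserved memory far exceeds allocated memory; here the gap (21.46 GiB reserved vs 20.90 GiB allocated) is small, so it may only delay the OOM rather than prevent it.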
Second time:
```
Traceback (most recent call last):
  File "main.py", line 382, in <module>
    main(args)
  File "main.py", line 293, in main
    train_stats = train_one_epoch(
  File "/home/xx/ESTextSpotter-main/engine.py", line 50, in train_one_epoch
    outputs = model(samples, targets)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/ests.py", line 293, in forward
    hs, reference, rec_references, hs_enc, ref_enc, init_box_proposal = self.transformer(srcs, masks, input_query_bbox, poss,input_query_label,attn_mask, H, W)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 432, in forward
    hs, references, rec_references = self.decoder(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 786, in forward
    output = layer(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1136, in forward
    tgt = self.forward_sa(tgt, tgt_query_pos, text_pos_embed, tgt_query_sine_embed, \
  File "/home/xx/ESTextSpotter-main/models/ests/deformable_transformer.py", line 1039, in forward_sa
    tgt2_inter = self.attn_inter(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1003, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 5101, in multi_head_attention_forward
    attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
  File "/home/xx/anaconda3/envs/ests/lib/python3.8/site-packages/torch/nn/functional.py", line 4844, in _scaled_dot_product_attention
    attn = torch.bmm(q, k.transpose(-2, -1))
RuntimeError: CUDA out of memory. Tried to allocate 452.00 MiB (GPU 1; 23.65 GiB total capacity; 20.87 GiB already allocated; 27.69 MiB free; 21.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 365387 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 365388) of binary: /home/xx/anaconda3/envs/ests/bin/python
```
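Both failures happen inside decoder attention, where the score matrix from `torch.bmm(q, k.transpose(-2, -1))` grows quadratically with the per-image query length. If the query length scales with the number of text instances in the image, a few unusually dense images (syntext1 images tend to contain many more instances than syntext2) can spike the allocation even at batch_size 1, which would explain why the crash iteration varies between runs. A back-of-the-envelope sketch; the head count, fp32 assumption, and per-instance query count here are illustrative guesses, not values read from the ESTextSpotter config:

```python
def attn_scores_mib(batch: int, num_heads: int, q_len: int, k_len: int,
                    bytes_per_elem: int = 4) -> float:
    """Size in MiB of the attention score matrix produced by bmm(q, k^T)."""
    return batch * num_heads * q_len * k_len * bytes_per_elem / 2**20

# Hypothetical numbers: 8 heads, fp32, and a query length that grows with
# the number of text instances (e.g. 25 character sub-queries per instance).
for instances in (50, 200, 500):
    q_len = instances * 25
    print(f"{instances:4d} instances -> {attn_scores_mib(1, 8, q_len, q_len):8.1f} MiB")
```

Because the cost is quadratic in the query length, a 4x denser image needs 16x the memory for this one tensor, so capping the number of instances per image (or reducing the max input resolution) in the dense datasets may be more effective than allocator tuning.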
radish512 changed the title from "CUDA out of memory when the batch_size has already been set to 1" to "ERROR:CUDA out of memory when the batch_size has already been set to 1" on Apr 11, 2024.