An error occurred while I was training model #1

a726 · 2020-06-08T04:31:50Z

[06.08.20|11:09:56] Training epoch: 0
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    p.start()
  File "D:\zxy\st-gcn\processor\processor.py", line 113, in start
    self.train()
  File "D:\zxy\st-gcn\processor\recognition.py", line 84, in train
    for data, label in loader:
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 560, in __init__
    w.start()
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Can you help me?

XieLinMofromsomewhere · 2020-06-08T08:16:21Z

I got this error too.
The problem is "OverflowError: cannot serialize a bytes object larger than 4 GiB".
Maybe you need a better GPU.
Or you can try changing the code, for example the batch size or something else. I am doing this now.

a726 · 2020-06-08T09:27:53Z

What a coincidence, and thank you for replying. My GPU is an NVIDIA GeForce GTX 1080 Ti (there are 4 of them), and the computer has 64 GB of memory. I changed batch_size and num_epoch, but I still get the same error.


a726 · 2020-06-14T08:45:13Z

Hello, when I test the demo with the model from my own training, the following error occurs:

Traceback (most recent call last):
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torchlight-1.0-py3.6.egg\torchlight\io.py", line 82, in load_weights
    __doc__ = _io._TextIOBase.__doc__
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
        size mismatch for A: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for data_bn.weight: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.running_mean: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.running_var: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for edge_importance.0: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.1: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.2: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.3: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.4: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.5: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.6: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.7: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.8: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.9: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 31, in <module>
    p = Processor(sys.argv[2:])
  File "D:\zxy\st-gcn\processor\io.py", line 28, in __init__
    self.load_weights()
  File "D:\zxy\st-gcn\processor\io.py", line 75, in load_weights
    self.arg.ignore_weights)
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torchlight-1.0-py3.6.egg\torchlight\io.py", line 89, in load_weights
  File "C:\Users\admin\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
        size mismatch for A: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for data_bn.weight: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.running_mean: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for data_bn.running_var: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([54]).
        size mismatch for edge_importance.0: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.1: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.2: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.3: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.4: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.5: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.6: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.7: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.8: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).
        size mismatch for edge_importance.9: copying a param with shape torch.Size([3, 25, 25]) from checkpoint, the shape in current model is torch.Size([3, 18, 18]).

Do you know where the problem is and how I should fix it? Thank you very much!
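
Every mismatch in this message has the same pattern: 25 in the checkpoint against 18 in the current model, and 75 = 3 × 25 against 54 = 3 × 18 for data_bn. That suggests the weights were trained on a 25-joint skeleton graph while the demo builds an 18-joint graph, so the graph layout used for the demo probably has to match the one used for training. The sketch below is only an illustration of how to list such differences; `shape_mismatches` is a hypothetical helper, not part of st-gcn or torchlight, and the shapes are copied by hand from the error message.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return (name, checkpoint_shape, model_shape) for shared keys whose shapes differ."""
    shared = checkpoint_shapes.keys() & model_shapes.keys()
    return [
        (name, checkpoint_shapes[name], model_shapes[name])
        for name in sorted(shared)
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    ]


# Shapes taken from the error message. With PyTorch they could be built with
# {k: tuple(v.shape) for k, v in state_dict.items()} (assumption: both are state dicts).
checkpoint = {
    "A": (3, 25, 25),
    "data_bn.weight": (75,),
    "edge_importance.0": (3, 25, 25),
}
model = {
    "A": (3, 18, 18),
    "data_bn.weight": (54,),
    "edge_importance.0": (3, 18, 18),
}

for name, ckpt_shape, model_shape in shape_mismatches(checkpoint, model):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

If the last dimension of A differs like this, changing batch size or epochs will not help; the model has to be constructed with the same number of joints as the checkpoint.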


XieLinMofromsomewhere · 2020-06-24T05:51:48Z

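Have you solved it? I looked into it yesterday and found that the cause is that the training set is too large: 29 GB is read in at once, while my RAM is 16 GB, so there is not enough memory. More RAM may be needed.
You could try renaming the validation set (2 GB) to the training set's file name (the label file has to be renamed too), then run training again and see whether this error still occurs.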
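
A note on why the size matters: in the first traceback the failure happens in `ForkingPickler(file, protocol).dump(obj)`, which is where Windows multiprocessing pickles the worker process (and with it the dataset object) to start a DataLoader worker. If the dataset object holds the whole training array in memory, that pickle exceeds 4 GiB. One way around both the pickle limit and the RAM limit is to keep only the file path on the dataset and read samples on demand. The class below is a minimal stdlib-only sketch of that idea, assuming a raw float32 file; `LazyFloatDataset` and `demo` are hypothetical names, not st-gcn code. For `.npy` files, NumPy's `np.load(path, mmap_mode="r")` gives a comparable memory-mapped array.

```python
import mmap
import os
import pickle
import struct
import tempfile


class LazyFloatDataset:
    """Reads fixed-size float32 samples from a raw binary file on demand."""

    def __init__(self, path, sample_len):
        self.path = path
        self.sample_len = sample_len        # floats per sample
        self.sample_bytes = 4 * sample_len  # float32 is 4 bytes
        self._file = None
        self._map = None

    def __len__(self):
        return os.path.getsize(self.path) // self.sample_bytes

    def _ensure_open(self):
        # Opened lazily so that each worker process creates its own mapping.
        if self._map is None:
            self._file = open(self.path, "rb")
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        self._ensure_open()
        start = index * self.sample_bytes
        raw = self._map[start:start + self.sample_bytes]
        return struct.unpack(f"<{self.sample_len}f", raw)

    def __getstate__(self):
        # Open handles cannot be pickled; drop them and reopen on first access.
        state = self.__dict__.copy()
        state["_file"] = None
        state["_map"] = None
        return state

    def close(self):
        if self._map is not None:
            self._map.close()
            self._file.close()
            self._map = None
            self._file = None


def demo():
    """Write 10 tiny samples, read one back, and measure the pickled dataset size."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(10):
            f.write(struct.pack("<3f", i, i + 0.5, i + 1.0))
    dataset = LazyFloatDataset(path, sample_len=3)
    sample = dataset[4]
    pickled_size = len(pickle.dumps(dataset))
    dataset.close()
    os.remove(path)
    return sample, pickled_size
```

The pickled object contains only the path and two integers, so its size does not grow with the data file. Setting the DataLoader's `num_workers` to 0 also avoids the pickling step entirely, at the cost of loading data in the main process.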

Thomas-yx · 2020-07-14T12:35:57Z

@XieLinMofromsomewhere Hello! I also encountered this problem when training the model. My laptop has only 16 GB of RAM. How did you solve it? I am looking forward to your reply. Thank you!
