I recently installed spIsoNet and attempted to run the tutorial. PyTorch crashed immediately during training with these errors:
07-01 10:44:32, INFO voxel_size 1.309999942779541
07-01 10:44:33, INFO spIsoNet correction until resolution 3.5A!
Information beyond 3.5A remains unchanged
07-01 10:44:42, INFO Start preparing subvolumes!
07-01 10:44:48, INFO Done preparing subvolumes!
07-01 10:44:48, INFO Start training!
07-01 10:44:52, INFO Port number: 42933
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
0%| | 0/250 [00:00<?, ?batch/s]/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
return F.conv3d(
0%| | 0/250 [00:05<?, ?batch/s]
Traceback (most recent call last):
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/bin/spisonet.py", line 8, in <module>
sys.exit(main())
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
while not context.join():
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
fn(i, *args)
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 160, in ddp_train
loss.backward()
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
What version of torch is required? We have 2.3.1+cu118. This was run on a single P100 GPU.
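For reference, the hint at the end of the traceback (`torch.autograd.set_detect_anomaly(True)`) can be tried on a toy tensor first. The sketch below is only an illustration of that class of error and of anomaly mode; the helper name `reproduce_inplace_error` and the `exp`/`add_` example are assumptions made for illustration, not spIsoNet code, and it does not identify which operation in `network_n2n.py` triggers the failure. The import is guarded so the snippet still loads on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # let the sketch load on machines without PyTorch
    torch = None


def reproduce_inplace_error():
    """Trigger the same class of autograd error on a toy tensor (hypothetical helper)."""
    if torch is None:
        return "torch not installed"
    x = torch.ones(3, requires_grad=True)
    # Anomaly mode records forward-pass stack traces, so autograd can point
    # at the forward operation whose backward step failed.
    with torch.autograd.detect_anomaly():
        y = x.exp()   # exp() saves its output to compute the gradient later
        y.add_(1)     # in-place edit bumps the saved tensor's version counter
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return "no error raised"


if __name__ == "__main__":
    print(reproduce_inplace_error())
```

With anomaly mode enabled, PyTorch should also print a warning containing the forward-pass traceback of the operation involved, which is the information missing from the log above. Anomaly mode slows training noticeably, so it is only meant for debugging runs.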