pytorch crashed during tutorial #14

Open
davidhoover opened this issue Jul 1, 2024 · 2 comments

@davidhoover

I recently installed spIsoNet and attempted to run the tutorial. PyTorch crashed immediately during training with these errors:

07-01 10:44:32, INFO     voxel_size 1.309999942779541
07-01 10:44:33, INFO     spIsoNet correction until resolution 3.5A!
                     Information beyond 3.5A remains unchanged
07-01 10:44:42, INFO     Start preparing subvolumes!
07-01 10:44:48, INFO     Done preparing subvolumes!
07-01 10:44:48, INFO     Start training!
07-01 10:44:52, INFO     Port number: 42933
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
  0%|                                                                                                                              | 0/250 [00:00<?, ?batch/s]/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
  return F.conv3d(
  0%|                                                                                                                              | 0/250 [00:05<?, ?batch/s]
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 160, in ddp_train
    loss.backward()
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What version of torch is required? We have 2.3.1+cu118. This was run on a single P100 GPU.
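
The error message itself hints at the next debugging step: enabling autograd anomaly detection so a second traceback identifies the forward op whose output was later modified in place. A minimal sketch, assuming one is willing to patch the installed copy (placing it at the start of ddp_train in spIsoNet/models/network_n2n.py is only a suggestion; any point before loss.backward() works):

import torch

# Enable before the training loop; when the inplace-modification RuntimeError is
# raised again, autograd prints an additional traceback pointing at the forward
# op that created the offending tensor.
# Note: anomaly detection slows training noticeably, so enable it only for debugging.
torch.autograd.set_detect_anomaly(True)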

@davidhoover
Author

I figured out a constellation of versions that works. I first needed to install using this yml file:

name: spisonet
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults

dependencies:
  - python=3.10
  - pytorch
  - pytorch-cuda=11.8
  - numpy=1.26.4
  - setuptools=68.0.0
  - mkl=2024.0
  - pip
  - pip:
    - scikit-image
    - matplotlib
    - mrcfile
    - fire
    - tqdm
    - .

I've attached the list of packages in my conda environment in case anyone else runs into this problem.
n.txt
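
After creating the environment from the yml (e.g. conda env create -f <file>.yml), a quick sanity check of the resolved versions and GPU visibility before re-running the tutorial; this is a generic snippet, not spIsoNet-specific:

import numpy
import torch

# Report the versions the solver actually resolved and whether the GPU is usable.
print("torch      :", torch.__version__)        # built against pytorch-cuda=11.8 per the yml
print("numpy      :", numpy.__version__)        # pinned to 1.26.4 above
print("CUDA build :", torch.version.cuda)
print("GPU found  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device     :", torch.cuda.get_device_name(0))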

@procyontao
Collaborator

Hi,

Sorry for the late reply. Please see this issue: #13 (comment)

They used multiple GPUs, which avoids this error.

But I cannot always reproduce this with a single GPU. It is related to torch DistributedDataParallel and needs to be fixed.
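
For context, this class of inplace-modification error on a [64]-sized tensor under DistributedDataParallel is frequently associated with BatchNorm buffers being rewritten between forward and backward. The sketch below only illustrates a workaround sometimes suggested in PyTorch issue threads (broadcast_buffers=False at the DDP wrap); it mirrors the single-process mp.spawn/ddp_train pattern from the traceback and is neither spIsoNet's actual code nor a confirmed fix for this issue:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process DDP setup mirroring the mp.spawn pattern in the traceback.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "42933")    # port number taken from the log above
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Sequential(
    torch.nn.Conv3d(1, 64, 3, padding=1),
    torch.nn.BatchNorm3d(64),                    # a 64-element buffer, as in the error
    torch.nn.ReLU(),
).cuda()

# broadcast_buffers=False stops DDP from overwriting buffers in place at every
# forward pass; a commonly cited mitigation for this class of error, not a confirmed fix.
ddp_model = DDP(model, device_ids=[0], broadcast_buffers=False)

x = torch.randn(2, 1, 32, 32, 32, device="cuda")
loss = ddp_model(x).mean()
loss.backward()
dist.destroy_process_group()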
