pytorch crashed during tutorial #14

Open
davidhoover opened this issue Jul 1, 2024 · 2 comments

@davidhoover

I recently installed spIsoNet and attempted to run the tutorial. PyTorch crashed immediately during training with these errors:

07-01 10:44:32, INFO     voxel_size 1.309999942779541
07-01 10:44:33, INFO     spIsoNet correction until resolution 3.5A!
                     Information beyond 3.5A remains unchanged
07-01 10:44:42, INFO     Start preparing subvolumes!
07-01 10:44:48, INFO     Done preparing subvolumes!
07-01 10:44:48, INFO     Start training!
07-01 10:44:52, INFO     Port number: 42933
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
  0%|                                                                                                                              | 0/250 [00:00<?, ?batch/s]/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
  return F.conv3d(
  0%|                                                                                                                              | 0/250 [00:05<?, ?batch/s]
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 160, in ddp_train
    loss.backward()
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What version of torch is required? We have 2.3.1+cu118. This was run on a single P100 GPU.
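
The error message itself hints at the next debugging step: enabling autograd anomaly detection so a second traceback identifies the forward op whose output was later modified in place. A minimal sketch, assuming one is willing to patch the installed copy (placing it at the start of ddp_train in spIsoNet/models/network_n2n.py is only a suggestion; any point before loss.backward() works):

import torch

# Enable before the training loop; when the inplace-modification RuntimeError is
# raised again, autograd prints an additional traceback pointing at the forward
# op that created the offending tensor.
# Note: anomaly detection slows training noticeably, so enable it only for debugging.
torch.autograd.set_detect_anomaly(True)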

@davidhoover
Author

I figured out a constellation of versions that works. I first needed to install using this yml file:

name: spisonet
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults

dependencies:
  - python=3.10
  - pytorch
  - pytorch-cuda=11.8
  - numpy=1.26.4
  - setuptools=68.0.0
  - mkl=2024.0
  - pip
  - pip:
    - scikit-image
    - matplotlib
    - mrcfile
    - fire
    - tqdm
    - .

I've attached the list of packages in my conda environment in case anyone else runs into this problem.
n.txt
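
After creating the environment from the yml (e.g. conda env create -f <file>.yml), a quick sanity check of the resolved versions and GPU visibility before re-running the tutorial; this is a generic snippet, not spIsoNet-specific:

import numpy
import torch

# Report the versions the solver actually resolved and whether the GPU is usable.
print("torch      :", torch.__version__)        # built against pytorch-cuda=11.8 per the yml
print("numpy      :", numpy.__version__)        # pinned to 1.26.4 above
print("CUDA build :", torch.version.cuda)
print("GPU found  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device     :", torch.cuda.get_device_name(0))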

@procyontao
Collaborator

Hi,

Sorry for the late reply. Please see this issue: #13 (comment)

They used multiple GPUs, which avoids this error.

But I cannot always reproduce this with a single GPU. It is related to torch DistributedDataParallel and needs to be fixed.
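
For context, this class of inplace-modification error on a [64]-sized tensor under DistributedDataParallel is frequently associated with BatchNorm buffers being rewritten between forward and backward. The sketch below only illustrates a workaround sometimes suggested in PyTorch issue threads (broadcast_buffers=False at the DDP wrap); it mirrors the single-process mp.spawn/ddp_train pattern from the traceback and is neither spIsoNet's actual code nor a confirmed fix for this issue:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process DDP setup mirroring the mp.spawn pattern in the traceback.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "42933")    # port number taken from the log above
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Sequential(
    torch.nn.Conv3d(1, 64, 3, padding=1),
    torch.nn.BatchNorm3d(64),                    # a 64-element buffer, as in the error
    torch.nn.ReLU(),
).cuda()

# broadcast_buffers=False stops DDP from overwriting buffers in place at every
# forward pass; a commonly cited mitigation for this class of error, not a confirmed fix.
ddp_model = DDP(model, device_ids=[0], broadcast_buffers=False)

x = torch.randn(2, 1, 32, 32, 32, device="cuda")
loss = ddp_model(x).mean()
loss.backward()
dist.destroy_process_group()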
