
Default batch size set to zero? #4

Closed

olibclarke opened this issue Apr 16, 2024 · 0 comments

olibclarke commented Apr 16, 2024
Hi, when I try to run spisonet using the following command:

spisonet.py reconstruct J248_006_volume_map_half_A.mrc J248_006_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J248_006_volume_mask_fsc.mrc --limit_res 3.95 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 4

I get the following output, terminating with an error:

04-16 14:16:07, INFO     The isonet_maps folder already exists, outputs will write into this folder
04-16 14:16:08, INFO     voxel_size 1.125
04-16 14:16:11, WARNING  The isonet_maps/J248_006_volume_map_half_A_data folder already exists. The old isonet_maps/J248_006_volume_map_half_A_data folder will be moved to isonet_maps/J248_006_volume_map_half_A_data~
04-16 14:16:11, WARNING  The isonet_maps/J248_006_volume_map_half_B_data folder already exists. The old isonet_maps/J248_006_volume_map_half_B_data folder will be moved to isonet_maps/J248_006_volume_map_half_B_data~
04-16 14:16:11, INFO     spIsoNet correction until resolution 3.95A!
                     Information beyond 3.95A remains unchanged
04-16 14:16:21, INFO     Start preparing subvolumes!
04-16 14:16:54, INFO     Done preparing subvolumes!
04-16 14:16:54, INFO     Start training!
04-16 14:17:02, INFO     Port number: 44237
learning rate 0.0003
['isonet_maps/J248_006_volume_map_half_A_data', 'isonet_maps/J248_006_volume_map_half_B_data']
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 68, in ddp_train
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size_gpu, persistent_workers=True,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 356, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 267, in __init__
    raise ValueError(f"batch_size should be a positive integer value, but got batch_size={batch_size}")
ValueError: batch_size should be a positive integer value, but got batch_size=0

When I explicitly set the batch size to 4 (--batch_size 4), it still fails with the same error. Am I doing something wrong, or is this a bug of some kind? Happy to provide inputs if helpful.
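
For context, the error reads like the per-GPU batch size being floor-divided down to zero when the total batch is split across accumulation steps and GPUs. A minimal sketch of that arithmetic (illustrative only; the function and variable names are assumptions, not spIsoNet's actual code):

# Hypothetical sketch: how a per-GPU batch size can floor to zero.
# total_batch, acc_batches and n_gpus are illustrative names, not spIsoNet's internals.
def per_gpu_batch(total_batch: int, acc_batches: int, n_gpus: int) -> int:
    # Integer division across accumulation steps and GPUs; small totals floor to 0.
    return total_batch // acc_batches // n_gpus

print(per_gpu_batch(4, 4, 4))  # 0 -> DataLoader rejects batch_size=0
print(per_gpu_batch(8, 2, 4))  # 1 -> a valid per-GPU batch size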

EDIT:

Ah, it seems I was using the --acc_batches and --batch_size values recommended for a single GPU while selecting 4 GPUs. With the recommended parameters for 4 GPUs (batch size 8, acc_batches 1) it runs out of GPU RAM, even though all four GPUs have 11 GB; increasing acc_batches to 2 lets it run successfully.
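
For reference, plugging the working parameters back into the original command (batch size 8, acc_batches 2; same input files as above):

spisonet.py reconstruct J248_006_volume_map_half_A.mrc J248_006_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J248_006_volume_mask_fsc.mrc --limit_res 3.95 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --batch_size 8 --acc_batches 2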
