
Default batch size set to zero? #4

Closed

olibclarke opened this issue Apr 16, 2024 · 0 comments

olibclarke commented Apr 16, 2024
Hi, when I try to run spisonet using the following command:

spisonet.py reconstruct J248_006_volume_map_half_A.mrc J248_006_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J248_006_volume_mask_fsc.mrc --limit_res 3.95 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 4

I get the following output, terminating with an error:

04-16 14:16:07, INFO     The isonet_maps folder already exists, outputs will write into this folder
04-16 14:16:08, INFO     voxel_size 1.125
04-16 14:16:11, WARNING  The isonet_maps/J248_006_volume_map_half_A_data folder already exists. The old isonet_maps/J248_006_volume_map_half_A_data folder will be moved to isonet_maps/J248_006_volume_map_half_A_data~
04-16 14:16:11, WARNING  The isonet_maps/J248_006_volume_map_half_B_data folder already exists. The old isonet_maps/J248_006_volume_map_half_B_data folder will be moved to isonet_maps/J248_006_volume_map_half_B_data~
04-16 14:16:11, INFO     spIsoNet correction until resolution 3.95A!
                     Information beyond 3.95A remains unchanged
04-16 14:16:21, INFO     Start preparing subvolumes!
04-16 14:16:54, INFO     Done preparing subvolumes!
04-16 14:16:54, INFO     Start training!
04-16 14:17:02, INFO     Port number: 44237
learning rate 0.0003
['isonet_maps/J248_006_volume_map_half_A_data', 'isonet_maps/J248_006_volume_map_half_B_data']
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 68, in ddp_train
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size_gpu, persistent_workers=True,
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 356, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "/home/user/software/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 267, in __init__
    raise ValueError(f"batch_size should be a positive integer value, but got batch_size={batch_size}")
ValueError: batch_size should be a positive integer value, but got batch_size=0

When I explicitly set the batch size to 4 (--batch_size 4), it still fails with the same error. Am I doing something wrong, or is this a bug of some kind? Happy to provide inputs if helpful.
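
For context, the error reads like the per-GPU batch size being floor-divided down to zero when the total batch is split across accumulation steps and GPUs. A minimal sketch of that arithmetic (illustrative only; the function and variable names are assumptions, not spIsoNet's actual code):

# Hypothetical sketch: how a per-GPU batch size can floor to zero.
# total_batch, acc_batches and n_gpus are illustrative names, not spIsoNet's internals.
def per_gpu_batch(total_batch: int, acc_batches: int, n_gpus: int) -> int:
    # Integer division across accumulation steps and GPUs; small totals floor to 0.
    return total_batch // acc_batches // n_gpus

print(per_gpu_batch(4, 4, 4))  # 0 -> DataLoader rejects batch_size=0
print(per_gpu_batch(8, 2, 4))  # 1 -> a valid per-GPU batch size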

EDIT:

Ah, it seems I was using the --acc_batches and --batch_size values recommended for a single GPU while selecting 4 GPUs. With the recommended parameters for 4 GPUs (batch size 8, acc_batches 1) it runs out of GPU RAM, even though all four GPUs have 11 GB; increasing acc_batches to 2 lets it run successfully.
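
For reference, plugging the working parameters back into the original command (batch size 8, acc_batches 2; same input files as above):

spisonet.py reconstruct J248_006_volume_map_half_A.mrc J248_006_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J248_006_volume_mask_fsc.mrc --limit_res 3.95 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --batch_size 8 --acc_batches 2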
