torch.multiprocessing does not work for multiple GPUs #155

Open
osamarais opened this issue Feb 8, 2024 · 0 comments

I can successfully train on a single GPU with a batch size of 4, but am unable to train on 4 GPUs with a batch size of 16.

I get the following error message:

Lock file exists in build directory: '/gpfs/u/home/~/.cache/torch_extensions/nvdiffrast_plugin/lock'
tick 0     kimg 0.0      time 27m 55s      sec/tick 1665.6  sec/kimg 104099.05 maintenance 9.2   
==> start visualization
Traceback (most recent call last):
  File "train_3d.py", line 339, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_3d.py", line 333, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train_3d.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/gpfs/u/home/~/envs/get3d/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
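
Two things in the log above can be checked before re-running: the leftover `lock` file in the `nvdiffrast_plugin` build directory, and the free space on shared memory, since a SIGBUS in a spawned worker is sometimes associated with an exhausted `/dev/shm`. The following is a minimal diagnostic sketch, not a confirmed fix. The helper names `find_stale_lock` and `free_gib` are hypothetical, and the build directory location is an assumption taken from the path in the log (with `TORCH_EXTENSIONS_DIR` as an override).

```python
import os
import shutil
from pathlib import Path


def find_stale_lock(build_dir):
    """Return the path of a 'lock' file inside build_dir, or None if absent."""
    lock = Path(build_dir) / "lock"
    return lock if lock.exists() else None


def free_gib(path):
    """Return the free space of the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Assumption: extensions are cached under ~/.cache/torch_extensions,
    # as in the log, unless TORCH_EXTENSIONS_DIR points somewhere else.
    root = os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.expanduser("~/.cache/torch_extensions"),
    )
    build_dir = Path(root) / "nvdiffrast_plugin"
    print("lock file:", find_stale_lock(build_dir))

    # /dev/shm is the usual shared-memory mount on Linux; skip it elsewhere.
    if os.path.isdir("/dev/shm"):
        print("free /dev/shm (GiB): %.2f" % free_gib("/dev/shm"))
```

If a lock file is reported while no other process is building the plugin, it is probably left over from an interrupted build. Whether deleting it, or freeing shared memory, resolves the SIGBUS on 4 GPUs is untested here.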