CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=${PORT:-300} tools/train.py --config configs/foodnet/SETR_Naive_768x768_80k_base_RM.py --work-dir checkpoints/SETR_Naive_ReLeM --launcher pytorch #5

Open
Mark1Dong opened this issue Jul 26, 2021 · 3 comments

Comments

@Mark1Dong

I have another question about the training process. When I follow your guideline, I get the following error:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
File "tools/train.py", line 167, in <module>
main()
File "tools/train.py", line 98, in main
init_dist(args.launcher, **cfg.dist_params)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 20, in init_dist
_init_dist_pytorch(backend, **kwargs)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 34, in _init_dist_pytorch
dist.init_process_group(backend=backend, **kwargs)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Permission denied
Traceback (most recent call last):
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/dongxiaoxiao/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dongxiaoxiao/anaconda3/envs/open-mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=3', '--config', 'configs/foodnet/fpn_r50_512x1024_80k_RM.py', '--work-dir', 'checkpoints/FPN_r50_RM', '--launcher', 'pytorch']' returned non-zero exit status 1.

I also remember that about 20 days ago I saw someone say that "--launcher pytorch" may cause problems, but I am not sure. I hope for your reply. Thanks a lot!

@XiongweiWu
Collaborator

XiongweiWu commented Jul 30, 2021

@Mark1Dong Sorry for the late reply; I have been super busy recently. Could you first paste your environment (OS, GPU, etc.) so that I can check it in more detail?

@Mark1Dong
Author

Thanks for your reply. I have solved this problem. The main issue was the command-line format.

@Mark1Dong
Author

Also, for the port, '-300' is not suitable, so I changed the port to '-36900', and the problem was solved.
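
A likely explanation for the "Permission denied" raised by TCPStore: on Linux, ports below 1024 are normally privileged, so a non-root process cannot bind to port 300, which is the default produced by `${PORT:-300}` when PORT is unset. The sketch below is only an illustration of that check, not part of this repository. The helper names `resolve_port`, `is_unprivileged`, and `can_bind` are hypothetical, and the 1024 threshold is the usual Linux default, assumed here rather than read from the system.

```python
import os
import socket

# Assumption: the usual Linux default, where ports below 1024 need root to bind.
PRIVILEGED_PORT_LIMIT = 1024


def resolve_port(default, env=None):
    """Mimic the shell expansion ${PORT:-default}: use PORT unless it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get("PORT", "")
    return int(value) if value else default


def is_unprivileged(port):
    """Return True if the port is in the range a normal user can usually bind."""
    return PRIVILEGED_PORT_LIMIT <= port <= 65535


def can_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket to the port; return False on any OS-level error."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
    except OSError:
        return False
    return True
```

Calling `is_unprivileged(resolve_port(300))` before launching would flag the default port 300, while a value such as 36900 passes. `can_bind` additionally reports whether the port is free on the local machine at that moment.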
