Cannot run distributed training #46

ALR-alr · 2024-10-23T03:08:15Z

When I run training with the following command:

CUDA_VISIBLE_DEVICES=0,1 randport=$(shuf -i8000-9999 -n1)  # Generate a random port number
python -u main.py \
    --dist-url "tcp://127.0.0.1:${randport}" --dist-backend 'nccl' \
    --multiprocessing-distributed --world-size 1 --rank 0 \
    --dataset=cc3m --val-dataset=cc3m \
    --exp-name='gill_exp' --image-dir='data/' --log-base-dir='runs/' \
    --opt-version='/opt-6.7b' \
    --visual-model /checkpoints/clip-vit-large-patch14 \
    --precision='bf16' --print-freq=100

the connection to the server drops and reconnects, and the process exits with the following error message:

torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
