
How to run on a single linux server with multiple GPUs #20

Open
1999kevin opened this issue Apr 20, 2023 · 12 comments


@1999kevin

Nice job! I wonder how I can run the code on a single Linux server with multiple GPUs. I can run the code on the server with one GPU by not using mpiexec, but what if I want to use multiple GPUs, as with nn.DataParallel?

@stonecropa

@1999kevin Can you tell me how to use a single GPU to generate images from a pretrained model without the NCCL communication backend? Thank you.

@1999kevin

> @1999kevin Can you tell me how to use a single GPU to generate images from a pretrained model without the NCCL communication backend? Thank you.

Just delete the mpiexec part of the sampling command.

@stonecropa

@1999kevin But I don't find mpiexec in image_sample.py. Thanks.

@stonecropa

Can I have a look at the code after your changes? I would appreciate it if you could send it over, thanks.

@1999kevin

> Can I have a look at the code after your changes? I would appreciate it if you could send it over, thanks.

I'm still working on the training phase and am not so sure about the inference phase. I guess you can follow lines 48 and 51 in scripts/launch.sh to sample the images. If you want to use a single process, just use the command python image_sample.py ... directly.

@tyshiwo1

tyshiwo1 commented Apr 22, 2023

I added CUDA_VISIBLE_DEVICES=6,7 in front of the inference command to form CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and changed the code at ./cm/dist_util.py#L27 to:

    # Map each MPI rank to one of the GPUs listed in CUDA_VISIBLE_DEVICES,
    # so that every process ends up with exactly one visible device.
    if 'CUDA_VISIBLE_DEVICES' not in os.environ:
        # No restriction set: pin this rank to a physical GPU by index.
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE}"
    else:
        # A restriction such as 6,7 was given: pick this rank's entry from it.
        gpu_inds_list = os.environ["CUDA_VISIBLE_DEVICES"].split(',')
        idx = MPI.COMM_WORLD.Get_rank() % GPUS_PER_NODE
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_inds_list[idx]

Does it work?
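
For anyone who wants to sanity-check this rank-to-GPU mapping outside the repo, here is a minimal standalone sketch (not part of consistency_models; the script name is made up, mpi4py and torch are assumed to be installed, and GPUS_PER_NODE mirrors the constant in cm/dist_util.py):

    # check_gpu_mapping.py -- hypothetical helper, not part of the repo.
    # Run with: CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python check_gpu_mapping.py
    import os

    from mpi4py import MPI

    GPUS_PER_NODE = 8  # assumption: physical GPUs per machine

    rank = MPI.COMM_WORLD.Get_rank()
    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        # No restriction given: pin this rank to a physical GPU by index.
        os.environ["CUDA_VISIBLE_DEVICES"] = f"{rank % GPUS_PER_NODE}"
    else:
        # A restriction such as 6,7 was given: pick this rank's entry from it
        # (modulo its length, so extra ranks wrap around instead of crashing).
        gpu_inds = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_inds[rank % len(gpu_inds)]

    # Import torch only after CUDA_VISIBLE_DEVICES is final, so each process's
    # CUDA context sees exactly one device (always cuda:0 from torch's view).
    import torch

    print(
        f"rank {rank}: CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']}, "
        f"torch sees {torch.cuda.device_count()} device(s)"
    )

Each rank should print a different physical GPU index while reporting exactly one visible device.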

@1999kevin

> Does it work?

I will test it once I finish the current training.

@tyshiwo1

tyshiwo1 commented Apr 23, 2023

> > Does it work?
>
> I will test it once I finish the current training.

Btw, I found that training with only batch size 4 and image size 64 costs about 18 GB of memory per GPU. Is there something wrong with that?

@1999kevin

> Btw, I found that training with only batch size 4 and image size 64 costs about 18 GB of memory per GPU. Is there something wrong with that?

I also encountered similar problems in my test. I trained the model with batch size 2 and image size 256, which cost me 35 GB of memory.
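
If you want to see where that memory actually goes, one quick check is to print the CUDA allocator's peak after a few iterations. A minimal sketch (not from the repo; call it wherever is convenient on each rank):

    import torch

    def report_peak_memory(tag: str = "") -> None:
        # Peak tensor memory the caching allocator has handed out on this
        # process's GPU; nvidia-smi usually reports a larger, reserved number
        # because the allocator also caches freed blocks.
        peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f"{tag} peak allocated: {peak_gib:.2f} GiB")

If this peak is far below what nvidia-smi shows, the difference is mostly allocator cache and CUDA context overhead rather than the model and activations themselves.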

@stonecropa

> Btw, I found that training with only batch size 4 and image size 64 costs about 18 GB of memory per GPU. Is there something wrong with that?
>
> I also encountered similar problems in my test. I trained the model with batch size 2 and image size 256, which cost me 35 GB of memory.

Will the pretrained model also use such a large amount of GPU memory?

@1999kevin

> Will the pretrained model also use such a large amount of GPU memory?

I have not tested that case yet.

@1999kevin

> I added CUDA_VISIBLE_DEVICES=6,7 in front of the inference command to form CUDA_VISIBLE_DEVICES=6,7 mpiexec -n 2 python ./scripts/image_sample.py ..., and changed the code at ./cm/dist_util.py#L27

This change can definitely enable multi-GPU training. However, it may cause the error 'Expected q.stride(-1) == 1 to be true, but got false', as in issue #3. Changing the flash attention to the default attention resolves the error.
