Problems with DDP + hydra #393

ashleve · 2022-07-15T18:46:54Z

There have been numerous issues about using DDP with hydra:
#231 #289 #229 #226 #194 #352

Current state of things is well described here:
facebookresearch/hydra#2070

tl;dr:
You should be good when using current lightning-hydra-template with ddp_spawn:

# run ddp_spawn on 4 GPUs
python train.py trainer.strategy=ddp_spawn trainer.accelerator=gpu trainer.devices=4

# simulate ddp_spawn on CPU on 4 processes (for testing)
python train.py trainer.strategy=ddp_spawn trainer.accelerator=cpu trainer.devices=4

This works correctly with normal runs as well as multiruns as far as I'm aware.

(ddp_spawn works a bit slower than normal ddp and should be run with datamodule.num_workers=0 only)

Normal ddp computes correctly but generates multiple output directories.

I have not tested what happens when using SLURM.

For now, I don't see anything that can be done on the template part to fix this. This might change with future hydra releases.

Update (April 2023):
Nornal DDP seems to be working correctly with current lightning release (2.0.2). There are no longer multiple output directories.

The text was updated successfully, but these errors were encountered:

bwdeng20 · 2022-07-16T07:58:21Z

Thanks for the summary👍🏻. Looking forward to future fixs.

turian · 2022-09-12T14:17:18Z

@ashleve Can you explain why ddp_spawn is better than ddp? (I skimmed the issues but couldn't grok it.) I understand from the lightning docs that spawn is slower than ddp

ashleve · 2022-09-12T17:53:30Z

@turian ddp_spawn is not better but it's the only ddp mode that works correctly with hydra right now.

As I mentioned, normal DDP generates multiple unwanted files. This is due to the fact that ddp launches a new process for each GPU, which doesn't go well with the way hydra creates different output dir each time a program is launched.

The problem doesn't exist with ddp_spawn which uses a different strategy of launching new GPU processes.

turian · 2022-09-12T18:07:06Z

@ashleve Just curious because I am using hydra + DDP in a current project. How would I be able to detect if this issue is occurring for me? What evidence should I look for? Thank you for the tip

ashleve · 2022-09-12T18:11:40Z

@turian There will be more output directories as explained in facebookresearch/hydra#2070

Just to make this clear, normal ddp actually computes correctly in hydra single run mode, but you will have multiple output directories with .hydra files

turian · 2022-09-12T22:34:44Z

@ashleve woof that's gross. If you have a good fix, we might considering seeing if we can push it upstream to lightning.

turian · 2022-09-13T19:50:51Z

@ashleve Lightning team appears to be working on this issue?

Lightning-AI/pytorch-lightning#11617 (comment)

I've been lightly commenting in that PR

faroit · 2022-09-16T18:46:57Z

@turian @ashleve what worked for me as a workaround is making the experiment dirs static (especially for multiruns/sweeps)

e.g.

hydra:
  run:
    dir: experiments/${your_run_name}
  job:
    chdir: False
  sweep:
    dir: experiments/${your_run_name}
    subdir: ""

callbacks:
  cp:
    dirpath: "experiments/${your_run_name}/checkpoints"
    filename: "${your_run_name}_best_step{step:08d}"

you would loose the ability to have a separate directory for sweeper results, but you could override this specifically for optuna optimization sweeps if you like

Aceticia · 2022-09-22T17:52:28Z

It looks like the pr has been merged into lightning main!

AiEson · 2023-04-12T15:19:40Z

Has this issue been fixed? Can I use ddp for training instead of ddp_spawn？

libokj · 2023-04-29T16:23:50Z

@ashleve Was this fixed by the newest release to use PyTorch 2.0 and PyTorch Lightning 2.0? Thank you for your time.

ashleve · 2023-05-02T19:40:20Z

@AiEson @libokj It seems like the issue with ddp is indeed fixed.

I've checked on multi-gpu instance and at first glance, everything seems to be computed correctly with no redundant logging directories. Although issues with ddp are often hard to spot so let me know if you encounter some problems.

For reference, here are some of the commands I've checked:

python src/train.py trainer.accelerator=gpu trainer.strategy=ddp trainer.devices=2 
python src/train.py trainer.accelerator=gpu trainer.strategy=ddp trainer.devices=2  data.num_workers=8
python src/train.py trainer.accelerator=gpu trainer.strategy=ddp_spawn trainer.devices=2
python src/train.py trainer.accelerator=cpu trainer.strategy=ddp_spawn trainer.devices=2

I made the appropriate changes to ddp config #571

libokj · 2023-05-06T13:53:14Z

I really appreciate your update! Thank you again.

Tomakko · 2023-07-05T10:23:02Z

Hi,

when using ddp i still end up with two, sometimes three, directories per sweep under logs/train/multirun/.

I am executing python src/train.py -m trainer=ddp trainer.devices=4 data.batch_size=32,64 model.optimizer.lr=0.001,0.004.

My lightning version is 2.0.4.

Is there anything i am missing? Thanks!

yipliu · 2023-07-21T06:27:29Z

There are two files (train.log and train_ddp_process_1.log) and one folder (.hydra) are produced in the ROOT_DIR

libokj · 2023-09-09T10:35:15Z

run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with ',' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}

job_logging:
  handlers:
    file:
      filename: ${hydra:runtime.output_dir}/job.log

With my current configuration for hydra, multiple folders will be created for the same ddp job, one for each ddp process with a slight time difference, e.g. 2023-09-09_02-00-58_tags and 2023-09-09_02-01-04_tags. The first created folder will contain all the intended logs but the other redundant folders will contain a job.log for its corresponding ddp process. Any advice on this? @ashleve

hovnatan · 2024-02-28T17:09:50Z

@libokj what if you try

if __name__ == "__main__":
    if os.environ.get("LOCAL_RANK", 0) != 0:
        sys.argv.extend(
            [
                "hydra/hydra_logging=disabled",
                "hydra/job_logging=disabled",
           ]
        )
main()

wher main() is your hydra app

ted-chl · 2024-08-07T07:08:27Z

@hovnatan Is there any method for not creating folder for worker node? I could solve not generation log files above scripts, but still generate logging folder with a slight time difference.

ashleve added bug Something isn't working important High importance issue labels Jul 15, 2022

ashleve pinned this issue Jul 15, 2022

ashleve changed the title ~~Problems with DDP + multirun + SLURM experience~~ Problems with DDP / SLURM + hydra Jul 16, 2022

ashleve changed the title ~~Problems with DDP / SLURM + hydra~~ Problems with DDP + hydra Jul 16, 2022

ashleve mentioned this issue May 2, 2023

Set strategy to ddp in ddp config #571

Merged

5 tasks

ashleve unpinned this issue May 2, 2023

ashleve linked a pull request May 2, 2023 that will close this issue

Release 2.0.2 #573

Merged

5 tasks

ashleve closed this as completed in #573 May 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems with DDP + hydra #393

Problems with DDP + hydra #393

ashleve commented Jul 15, 2022 •

edited

Loading

bwdeng20 commented Jul 16, 2022

turian commented Sep 12, 2022

ashleve commented Sep 12, 2022

turian commented Sep 12, 2022

ashleve commented Sep 12, 2022 •

edited

Loading

turian commented Sep 12, 2022

turian commented Sep 13, 2022

faroit commented Sep 16, 2022 •

edited

Loading

Aceticia commented Sep 22, 2022

AiEson commented Apr 12, 2023 •

edited

Loading

libokj commented Apr 29, 2023 •

edited

Loading

ashleve commented May 2, 2023 •

edited

Loading

libokj commented May 6, 2023

Tomakko commented Jul 5, 2023 •

edited

Loading

yipliu commented Jul 21, 2023

libokj commented Sep 9, 2023

hovnatan commented Feb 28, 2024

ted-chl commented Aug 7, 2024

Problems with DDP + hydra #393

Problems with DDP + hydra #393

Comments

ashleve commented Jul 15, 2022 • edited Loading

bwdeng20 commented Jul 16, 2022

turian commented Sep 12, 2022

ashleve commented Sep 12, 2022

turian commented Sep 12, 2022

ashleve commented Sep 12, 2022 • edited Loading

turian commented Sep 12, 2022

turian commented Sep 13, 2022

faroit commented Sep 16, 2022 • edited Loading

Aceticia commented Sep 22, 2022

AiEson commented Apr 12, 2023 • edited Loading

libokj commented Apr 29, 2023 • edited Loading

ashleve commented May 2, 2023 • edited Loading

libokj commented May 6, 2023

Tomakko commented Jul 5, 2023 • edited Loading

yipliu commented Jul 21, 2023

libokj commented Sep 9, 2023

hovnatan commented Feb 28, 2024

ted-chl commented Aug 7, 2024

ashleve commented Jul 15, 2022 •

edited

Loading

ashleve commented Sep 12, 2022 •

edited

Loading

faroit commented Sep 16, 2022 •

edited

Loading

AiEson commented Apr 12, 2023 •

edited

Loading

libokj commented Apr 29, 2023 •

edited

Loading

ashleve commented May 2, 2023 •

edited

Loading

Tomakko commented Jul 5, 2023 •

edited

Loading