[Train] Reporting metrics/checkpoints from multiple workers #33360
Labels
P2: Important issue, but not time-critical
ray-team-created: Ray Team created
train: Ray Train Related Issue
Summary
This issue tracks potential improvements for reporting/checkpointing across multiple Ray Train workers for deep learning workloads.
Context
Currently, Ray Train requires every worker to report metrics/checkpoints at the same frequency, because the report call doubles as a synchronization barrier across workers. This adheres to the SPMD pattern, where each worker runs the same script. Documentation can be found here.
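For concreteness, here is a minimal sketch of the current contract, assuming the Ray 2.x `ray.air.session` API; the training step and checkpoint contents are illustrative placeholders, not taken from this issue.

```python
# Minimal sketch of the current contract, assuming the Ray 2.x
# `ray.air.session` API; the training step is a placeholder.
from ray.air import Checkpoint, session


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step

        # Every worker must make this call, at the same frequency:
        # `session.report` acts as a synchronization barrier across workers.
        session.report(
            {"loss": loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )
```

Guarding `session.report` behind a rank check (e.g. `if session.get_world_rank() == 0:`) violates this contract and hangs the run, which is the surprising behavior reported in #33042.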
However, this behavior has turned out to be unintuitive, confusing, or even contrary to user expectations (#33042). As a result, we should explore options for improving this experience.
Proposal
One potential option is to allow reporting only from the rank 0 worker, as sketched below.
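A hypothetical sketch of what user code could look like under this proposal; this is not current Ray Train behavior, and the API names are assumed from Ray 2.x.

```python
# Hypothetical sketch of the proposal, not current Ray Train behavior:
# only the rank 0 worker reports, and the call would no longer be a
# collective operation requiring all workers to participate.
from ray.air import Checkpoint, session


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step

        if session.get_world_rank() == 0:
            # Legal under the proposal; today this deadlocks because
            # `session.report` must be called by every worker.
            session.report(
                {"loss": loss, "epoch": epoch},
                checkpoint=Checkpoint.from_dict({"epoch": epoch}),
            )
```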
Related Issues
#31409
#31434