[Train] Reporting metrics/checkpoints from multiple workers #33360
Labels
P2: Important issue, but not time-critical
ray-team-created: Ray Team created
train: Ray Train Related Issue
Summary
This issue tracks potential improvements for reporting/checkpointing across multiple Ray Train workers for deep learning workloads.
Context
Currently, Ray Train requires every worker to report metrics/checkpoints at the same frequency, because the report call doubles as a synchronization barrier across workers. This adheres to the SPMD pattern, where each worker runs the same script. Documentation can be found here.
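For concreteness, here is a minimal sketch of the current contract, assuming the Ray 2.x `ray.air.session` API; the training step and checkpoint contents are illustrative placeholders, not taken from this issue.

```python
# Minimal sketch of the current contract, assuming the Ray 2.x
# `ray.air.session` API; the training step is a placeholder.
from ray.air import Checkpoint, session


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step

        # Every worker must make this call, at the same frequency:
        # `session.report` acts as a synchronization barrier across workers.
        session.report(
            {"loss": loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )
```

Guarding `session.report` behind a rank check (e.g. `if session.get_world_rank() == 0:`) violates this contract and hangs the run, which is the surprising behavior reported in #33042.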
However, this behavior has turned out to be unintuitive, confusing, or even contrary to user expectations (#33042). As a result, we should explore options for improving this experience.
Proposal
One potential option is to allow reporting only from the rank 0 worker, as sketched below.
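A hypothetical sketch of what user code could look like under this proposal; this is not current Ray Train behavior, and the API names are assumed from Ray 2.x.

```python
# Hypothetical sketch of the proposal, not current Ray Train behavior:
# only the rank 0 worker reports, and the call would no longer be a
# collective operation requiring all workers to participate.
from ray.air import Checkpoint, session


def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # placeholder for a real training step

        if session.get_world_rank() == 0:
            # Legal under the proposal; today this deadlocks because
            # `session.report` must be called by every worker.
            session.report(
                {"loss": loss, "epoch": epoch},
                checkpoint=Checkpoint.from_dict({"epoch": epoch}),
            )
```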
Related Issues
#31409
#31434