[Train] Reporting metrics/checkpoints from multiple workers #33360

Open
matthewdeng opened this issue Mar 16, 2023 · 1 comment
Labels
P2 (Important issue, but not time-critical), ray-team-created (Ray Team created), train (Ray Train Related Issue)

Comments

@matthewdeng
Contributor

Summary

This issue tracks potential improvements for reporting/checkpointing across multiple Ray Train workers for deep learning workloads.

Context

Currently, Ray Train requires every worker to report and checkpoint at the same frequency, because the report call doubles as a synchronization mechanism. This follows the single program, multiple data (SPMD) pattern, in which each worker runs the same script. Documentation can be found here.
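To make the constraint concrete, here is a toy model of lockstep reporting. It does not use Ray and is not Ray Train's implementation: `run_workers`, `report`, and `timeout_s` are hypothetical names, and a `threading.Barrier` stands in for whatever synchronization the real report call performs. It only illustrates why a worker that reports more often than its peers ends up waiting.

```python
import threading


def run_workers(reports_per_worker, timeout_s=1.0):
    """Run one thread per worker; return the ranks that stalled in report().

    reports_per_worker[rank] is how many times that worker calls report().
    """
    barrier = threading.Barrier(len(reports_per_worker))
    stalled = []
    lock = threading.Lock()

    def report(metrics):
        # Stand-in for a synchronizing report call (an assumption, not Ray's
        # code): block until every worker has reported for this round.
        barrier.wait(timeout=timeout_s)

    def train_loop(rank):
        try:
            for step in range(reports_per_worker[rank]):
                report({"rank": rank, "step": step})
        except threading.BrokenBarrierError:
            # A peer never showed up for this round, so the wait timed out.
            with lock:
                stalled.append(rank)

    threads = [
        threading.Thread(target=train_loop, args=(rank,))
        for rank in range(len(reports_per_worker))
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(stalled)


if __name__ == "__main__":
    print(run_workers([3, 3]))  # same frequency on both workers
    print(run_workers([3, 2]))  # worker 0 reports once more than worker 1
```

In this model, equal report counts finish cleanly, while the mismatched run leaves worker 0 waiting for a peer that has already exited its loop.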

However, this has turned out to be unintuitive, confusing, or even contradictory to user expectations (#33042). As a result, we should explore options for improving this experience.

Proposal

One potential option is to only allow reporting from the rank 0 worker.
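As a rough sketch of what rank-0-only reporting could look like from user code, the helper below forwards metrics only when the caller is rank 0. `report_from_rank_zero` and `report_fn` are hypothetical names for illustration. In a real training function the rank would presumably come from Ray Train's session or context API. The proposal does not say whether calls from other ranks would be ignored or rejected, so silently skipping them here is an assumption.

```python
from typing import Any, Callable, Dict


def report_from_rank_zero(
    world_rank: int,
    metrics: Dict[str, Any],
    report_fn: Callable[[Dict[str, Any]], None],
) -> bool:
    """Forward metrics to report_fn only on rank 0.

    Returns True if the metrics were reported, False if the call was skipped.
    Skipping non-zero ranks silently is an assumption of this sketch.
    """
    if world_rank != 0:
        return False
    report_fn(metrics)
    return True


if __name__ == "__main__":
    received = []
    # Simulate four workers running the same script (SPMD).
    for rank in range(4):
        report_from_rank_zero(rank, {"rank": rank, "loss": 0.5}, received.append)
    print(received)
```

With this shape, every worker can still run the same script, but only one stream of metrics reaches the reporting backend.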

Related Issues

#31409
#31434

@matthewdeng added the P2 (Important issue, but not time-critical), train (Ray Train Related Issue), and air labels on Mar 16, 2023
@Yard1
Member

Yard1 commented Mar 16, 2023

Also see #33360 and #33073

@Yard1 added the ray-team-created (Ray Team created) label on Mar 22, 2023
@anyscalesam removed the air label on Oct 28, 2023

3 participants