[RFC][Train] Allow for reporting results from multiple workers #31409
Comments
Thanks for putting this together! A few questions:
For aggregations, can they use torchmetrics?
Yeah, it's possible to use that right now. That being said, …
@Yard1 how would I use torchmetrics?
@Yard1 Also, what are the metrics that you want to aggregate from all workers individually? @bveeramani torchmetrics is distributed-training compatible... it will automatically aggregate across workers using allreduce.
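For illustration, a minimal sketch of that aggregation behavior, assuming torchmetrics is installed and a torch.distributed process group has already been initialized by the training framework; the values are made up:
```python
# Minimal sketch (assumptions: torchmetrics installed, torch.distributed
# already initialized by the launcher/framework; values are illustrative).
import torch
import torch.distributed as dist
import torchmetrics

loss_metric = torchmetrics.MeanMetric()

# Each worker feeds in its local value.
local_loss = torch.tensor(0.1 * (dist.get_rank() + 1))
loss_metric.update(local_loss)

# compute() synchronizes the metric state across workers (allreduce under
# the hood), so every rank ends up with the same aggregated mean.
mean_loss = loss_metric.compute()
```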
@richardliaw I was thinking profiling information could be useful? I don't have a special need myself; this is something we have been talking about on and off for a while. Some users were also interested in this feature, e.g. https://discuss.ray.io/t/how-can-i-synchronization-metrics-in-ray-train-valid-loop/8500 and https://discuss.ray.io/t/pytorch-distributedtrainable-tune-report-on-rank-0-only/5127/1
For both cases, it seems like we just need to provide best practices: telling users to do a sum/average/median across all workers with torchmetrics, and also to report the same things on all workers if necessary?
I'll add that as a proposal!
Sorry if I wasn't clear before. I don't think we need to discuss multiple options here because I don't see a very concrete use case yet for any of the other alternatives. Let me know if that makes sense.
That's fair. In any case, if we do not want to provide an API for this and instead rely on third-party tools like torchmetrics, we should update the documentation and provide an example, so that's still an action item.
Yep exactly. Can we perhaps update this issue to track the action item?
I'll make a separate issue for that, and we can defer this one until we have a concrete use case.
Closing this one since we have a separate issue for now. When we have a concrete use case, we can bring it up again!
Description
Current state
Currently, Ray Train only reports metrics from the first worker. This is fine in most cases, but for some applications, it may be desirable to report metrics from all workers and/or report aggregations, such as mean and std. We also require that functionality for some tests.
Note: Saving checkpoints from multiple workers is beyond the scope of this proposal.
Before Ray AIR, Ray Train supported reporting result aggregation through result preprocessors (#22099).
With the current structure of the `DataParallelTrainer`, the reporting code is fully contained within the `_report` method:
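The snippet from the original issue is not reproduced here; the logic can be paraphrased roughly as follows (a sketch, not the verbatim Ray source; the `training_iterator` is assumed to yield one result dict per worker):
```python
# Rough paraphrase of the current reporting behavior (not the verbatim Ray
# source): results are gathered from every worker each iteration, but only
# the first worker's dict is forwarded to Tune.
from ray import tune


def _report(self, training_iterator) -> None:
    for results in training_iterator:
        # `results` is a list with one metrics dict per worker, ordered by rank.
        first_worker_results = results[0]
        tune.report(**first_worker_results)
```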
As can be seen, it would be trivial to extend this functionality to an arbitrary number of workers or to arbitrary aggregation logic. Below are three proposals for how to allow users to do that in a lightweight manner.
Proposal 1: Promote `_report` to `DeveloperAPI` and encourage users to subclass
In this proposal, we encourage users to simply subclass `DataParallelTrainer`/`TorchTrainer` (and so on) and override the `_report` method with their own custom logic, e.g.:
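A sketch of what that could look like, assuming the `_report` signature paraphrased above; the subclass name and the mean aggregation are illustrative, not part of any actual API:
```python
# Sketch of Proposal 1 (illustrative): subclass the trainer and override the
# promoted reporting hook to aggregate metrics across all workers.
import statistics

from ray import tune
from ray.train.torch import TorchTrainer


class AggregatingTorchTrainer(TorchTrainer):
    def _report(self, training_iterator) -> None:
        for results in training_iterator:
            # Average every numeric metric across the per-worker result dicts
            # instead of only reporting results[0].
            aggregated = {
                key: statistics.mean(r[key] for r in results)
                for key, value in results[0].items()
                if isinstance(value, (int, float))
            }
            tune.report(**aggregated)
```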
Proposal 2: Add `results_processing_fn` argument to `DataParallelTrainer`
The class would be modified to include:
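A sketch of the proposed change, with the parameter name taken from the proposal and everything else (types, defaults, surrounding code) illustrative:
```python
# Sketch of Proposal 2 (illustrative, simplified stand-in for the real class):
# `results_processing_fn` maps the list of per-worker result dicts to the
# single dict that gets reported to Tune.
from typing import Callable, Dict, List, Optional

from ray import tune


class DataParallelTrainer:
    def __init__(
        self,
        *args,
        results_processing_fn: Optional[Callable[[List[Dict]], Dict]] = None,
        **kwargs,
    ):
        # Backwards-compatible default: report only the first worker's results.
        self._results_processing_fn = results_processing_fn or (
            lambda results: results[0]
        )

    def _report(self, training_iterator) -> None:
        for results in training_iterator:
            tune.report(**self._results_processing_fn(results))
```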
Proposal 3: Direct users to use third-party libraries like `torchmetrics`
For Torch, users can use `torchmetrics`, which has built-in support for DDP. Similar solutions may exist for TensorFlow. It is unclear how well that approach supports non-metric use cases, such as time measurement or profiling information (e.g. memory usage). On the other hand, it would only require us to update the documentation to mention this approach.
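A sketch of how this could look inside a Ray Train training loop, assuming the Ray AIR `session.report` API of the time; the metric, config, and values are illustrative:
```python
# Sketch of Proposal 3 (illustrative): aggregate with torchmetrics inside the
# training loop so every worker reports the same, already-synced value.
import torch
import torchmetrics
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    loss_metric = torchmetrics.MeanMetric()
    for _ in range(config["epochs"]):
        # ... real training step here; placeholder loss for the sketch ...
        local_loss = torch.rand(())
        loss_metric.update(local_loss)
        # compute() reduces across workers, so the reported value is identical
        # on every rank.
        session.report({"mean_loss": loss_metric.compute().item()})
        loss_metric.reset()


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
# result = trainer.fit()
```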
Conclusion
Either of Proposals 1 and 2 would be a lightweight way to allow users to modify the data reported to Tune. I do not have a personal preference for either, though I feel that Proposal 2 fits better with the rest of the API.
Proposal 3 requires only documentation changes and can be implemented independently (tracked in #31434).
Use case
No response