[Dashboard] Revisit Reporter Agent communication protocol to use proto instead of JSON #45191

Open
alexeykudinkin opened this issue May 8, 2024 · 2 comments
Labels: bug, core, dashboard, performance

Comments

@alexeykudinkin
Contributor

What happened + What you expected to happen

Currently, the Reporter Agent reports resource utilization as JSON, which quickly becomes a substantial overhead on the Dashboard process:

  1. The volume of data passed in scales with the number of workers and nodes
  2. The Dashboard is a single Python process (which now has to parse a lot of JSON)

#45048 partially alleviates the problem by avoiding blocking the event loop while parsing the JSON, but it doesn't resolve the underlying problem: JSON is an inefficient format for an application that passes around a large number of stats.

Instead, we should move this payload handling to proper Protobuf.
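To make the format concern concrete, here is a minimal sketch comparing the wire size of a JSON payload with a fixed-layout binary encoding. It uses the standard-library `struct` module purely as a stand-in for a binary format; it is not Protobuf, and the field names (`pid`, `cpu_percent`, `rss_bytes`, `num_fds`) are hypothetical rather than the Reporter Agent's actual schema.

```python
import json
import struct

# Hypothetical per-worker stats; field names are illustrative, not Ray's schema.
def make_worker_stats(num_workers):
    return [
        {"pid": 1000 + i, "cpu_percent": 12.5, "rss_bytes": 256 * 1024 * 1024, "num_fds": 64}
        for i in range(num_workers)
    ]

# Fixed little-endian layout with no padding:
# uint32 pid, float64 cpu_percent, uint64 rss_bytes, uint32 num_fds (24 bytes per record).
RECORD = struct.Struct("<IdQI")

def encode_binary(stats):
    return b"".join(
        RECORD.pack(s["pid"], s["cpu_percent"], s["rss_bytes"], s["num_fds"]) for s in stats
    )

def decode_binary(blob):
    return [
        {"pid": pid, "cpu_percent": cpu, "rss_bytes": rss, "num_fds": fds}
        for pid, cpu, rss, fds in RECORD.iter_unpack(blob)
    ]

stats = make_worker_stats(1000)
json_blob = json.dumps(stats).encode("utf-8")
binary_blob = encode_binary(stats)

# Both encodings must round-trip to the same data.
assert json.loads(json_blob) == decode_binary(binary_blob) == stats

print(f"JSON size:   {len(json_blob)} bytes")
print(f"binary size: {len(binary_blob)} bytes")
```

JSON repeats every field name in every record, while a schema-based format carries field identity in the schema. This sketch only illustrates payload size; actual decode cost would need to be measured with the real Protobuf runtime and the real payload.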

Versions / Dependencies

2.20

Reproduction script

  1. Launch a Ray cluster with 100 nodes and 1000 workers
  2. Profile Dashboard process
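Where a 100-node cluster is not available, the parsing cost alone can be approximated locally. The sketch below is an assumption-laden stand-in for the steps above, not a replacement for them: the report shape, `make_node_report`, and `parse_all` are hypothetical, and it only profiles `json.loads` on synthetic data with `cProfile`.

```python
import cProfile
import io
import json
import pstats

NUM_NODES = 100
WORKERS_PER_NODE = 10  # 100 nodes x 10 workers = 1000 workers in total

def make_node_report(node_index):
    # Hypothetical report shape; the real Reporter Agent payload has different fields.
    return json.dumps({
        "node_id": f"node-{node_index}",
        "workers": [
            {"pid": node_index * 100 + w, "cpu_percent": 1.5, "rss_bytes": 123456789}
            for w in range(WORKERS_PER_NODE)
        ],
    })

def parse_all(reports):
    # Stands in for the Dashboard parsing one report per node per reporting interval.
    return [json.loads(r) for r in reports]

reports = [make_node_report(n) for n in range(NUM_NODES)]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):  # simulate 50 reporting intervals
    parsed = parse_all(reports)
profiler.disable()

# Print the five entries with the highest cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

For the real Dashboard process, attaching a sampling profiler such as py-spy to the running process is likely a better fit, since it captures the whole request path rather than parsing in isolation.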

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@alexeykudinkin added the bug, triage, dashboard, and performance labels on May 8, 2024
@hongchaodeng added the core, P0, and triage labels and removed the triage and P0 labels on May 8, 2024
@rynewang
Contributor

rynewang commented May 8, 2024

Q: does this JSON parsing overhead really affect dashboard API latency to the extent of 10s of seconds? If parsing already happens in another thread, the GIL is preempted to handle the main loop every few hundred milliseconds, so we should see an increase in latency, but not to multiple seconds. Do we have latency data after #45048 is applied?
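One way to get a first data point on this question is to measure event-loop lag while JSON parsing runs in a worker thread. The following is a minimal sketch under stated assumptions: the payload is synthetic, `measure_loop_lag` and `parse_many` are hypothetical helpers, and it requires Python 3.9+ for `asyncio.to_thread`. It does not reproduce the Dashboard's actual code path.

```python
import asyncio
import json
import time

# Synthetic payload; size and shape are assumptions for illustration only.
PAYLOAD = json.dumps(
    [{"pid": i, "cpu_percent": 0.5, "rss_bytes": i * 4096} for i in range(20000)]
)

async def measure_loop_lag(stop, interval=0.01):
    """Record how late each short sleep wakes up while other work is running."""
    worst = 0.0
    while not stop.is_set():
        start = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - start - interval)
    return worst

def parse_many(times):
    # CPU-bound JSON parsing that competes with the event loop for the GIL.
    for _ in range(times):
        json.loads(PAYLOAD)

async def main():
    stop = asyncio.Event()
    lag_task = asyncio.create_task(measure_loop_lag(stop))
    # Offload parsing to a worker thread (the scenario the question above considers).
    await asyncio.to_thread(parse_many, 20)
    stop.set()
    return await lag_task

worst_lag = asyncio.run(main())
print(f"worst event-loop lag while parsing in a thread: {worst_lag * 1000:.1f} ms")
```

Because a single C-accelerated `json.loads` call is unlikely to release the GIL partway through, the observed lag probably tracks the duration of one parse, so it should grow with payload size. Numbers from a synthetic run like this are only indicative; latency data from a real cluster after #45048 would still be needed.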

@hongchaodeng
Member

> Q: does this JSON parsing overhead really affect dashboard API latency to the extent of 10s of seconds?

+1
Let's try to reproduce this first. Then we can profile it.

@hongchaodeng removed the triage label on May 9, 2024
3 participants