[Data] Additional Ray Data Dashboard Metrics #43628
Signed-off-by: Scott Lee <[email protected]>
Premerge failures look unrelated to this PR.
By the way, do we already have OpState.outqueue.memory_usage()?
python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py
Few micro nits, but lgtm!
@@ -208,10 +208,111 @@ const DATA_METRICS_CONFIG: MetricsSectionConfig[] = [
    title: "Rows Outputted",
    pathParams: "orgId=1&theme=light&panelId=11",
  },
  // Inputs-related metrics
  {
    title: "Input Blocks Received by Operator",
Nit: I assume the display order is based on panelId? Should these be ordered by panelId for organization, or is the order here important?
The panel id is actually unrelated to the display order; the order directly follows the order of elements in this .tsx file. The only restriction is that each panel id must be unique and must match the id of the corresponding Grafana panel.
@@ -119,6 +119,330 @@
        fill=0,
        stack=False,
    ),
    # Inputs-related metrics
    Panel(
        id=17,
Nit: Similar to above, should we order these by panel id?
Panel ids don't need to be in increasing order. I found that whenever we need to insert new metrics/panels, it breaks the ordering of the ids anyway, so either we always keep incrementing ids, or we just make sure that all ids are unique (this is already enforced by the dashboard code).
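The uniqueness constraint described above can be sketched as follows. This is only an illustration, not the dashboard's actual validation code: the `Panel` dataclass and the `find_duplicate_panel_ids` helper are hypothetical stand-ins, and the panel titles are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Panel:
    """Hypothetical stand-in for a Grafana panel definition."""

    id: int
    title: str


def find_duplicate_panel_ids(panels: List[Panel]) -> List[int]:
    """Return the panel ids that appear more than once, in sorted order."""
    counts = Counter(panel.id for panel in panels)
    return sorted(panel_id for panel_id, n in counts.items() if n > 1)


# Ids are unique but deliberately not increasing: the display order is
# the list order, not the id order. Titles here are illustrative only.
panels = [
    Panel(id=17, title="First panel shown"),
    Panel(id=11, title="Second panel shown"),
    Panel(id=3, title="Third panel shown"),
]

duplicates = find_duplicate_panel_ids(panels)
if duplicates:
    raise ValueError(f"Duplicate panel ids: {duplicates}")
```

With a check like this in place, a newly inserted panel can take any unused id without renumbering the existing ones.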
num_task_inputs_processed: int = field(
    default=0,
    metadata={
        "description": "Number of input blocks processed by tasks.",
Maybe mention "finished processing" to make it clearer.
Suggested change:
- "description": "Number of input blocks processed by tasks.",
+ "description": "Number of input blocks that operator tasks have finished processing.",
bytes_task_inputs_processed: int = field(
    default=0,
    metadata={
        "description": "Byte size of blocks processed by tasks.",
ditto
    metadata={
        "description": (
            "Number of rows in generated output blocks "
            "that are from finished tasks."
The original comment is wrong: the count is not only from finished tasks. It should just be "Number of rows generated by tasks."
        "metrics_group": "outputs",
    },
)
num_outputs_of_finished_tasks: int = field(
It's OK not to expose this metric and the next one; they are used to compute another property.
Removed from the Grafana and Ray Data dashboards.
num_tasks_have_outputs: int = field(
    default=0,
    metadata={
        "description": "Number of tasks with at least one output block.",
Suggested change:
- "description": "Number of tasks with at least one output block.",
+ "description": "Number of tasks that already have output.",
@@ -133,14 +282,12 @@ def extra_metrics(self) -> Dict[str, Any]:
        """Return a dict of extra metrics."""
        return self._extra_metrics

    def as_dict(self, metrics_only: bool = False):
    def as_dict(self):
        """Return a dict representation of the metrics."""
        result = []
        for f in fields(self):
            if f.metadata.get("export", True):
This "export" metadata key no longer seems to be used.
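For context, the field-metadata pattern under discussion can be sketched as below. This is a trimmed-down, hypothetical class rather than the real OpRuntimeMetrics: the `ExampleMetrics` name and the `_scratch` field are assumptions made for illustration, and the dict comprehension is a simplification of the loop shown in the diff.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class ExampleMetrics:
    """Hypothetical, trimmed-down metrics class (not the real OpRuntimeMetrics)."""

    num_task_inputs_processed: int = field(
        default=0,
        metadata={
            "description": "Number of input blocks that tasks have finished processing.",
            "metrics_group": "inputs",
        },
    )
    num_tasks_have_outputs: int = field(
        default=0,
        metadata={
            "description": "Number of tasks that already have output.",
            "metrics_group": "outputs",
        },
    )
    # Illustrative internal value that opts out of export via metadata.
    _scratch: int = field(default=0, metadata={"export": False})

    def as_dict(self) -> Dict[str, Any]:
        """Return exported fields as a dict; fields are exported by default."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.metadata.get("export", True)
        }


metrics = ExampleMetrics(num_task_inputs_processed=4, num_tasks_have_outputs=2)
exported = metrics.as_dict()
```

If no field ever sets "export" to False, the metadata check always passes and can be dropped, which is the point of the review comment.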
@@ -253,25 +398,42 @@ def average_bytes_change_per_task(self) -> Optional[float]:

        return self.average_bytes_outputs_per_task - self.average_bytes_inputs_per_task

    @property
    def estimated_object_store_usage(self) -> Optional[float]:
The object store usage of an op is actually calculated as obj_store_mem_pending_task_outputs + obj_store_mem_internal_outqueue + OpState.outqueue.memory_usage + sum(next_op.obj_store_mem_pending_task_inputs + next_op.obj_store_mem_internal_inqueue).
Currently it's calculated in ResourceManager, because OpState isn't accessible here.
It would also be useful to report this to the dashboard.
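The formula in this comment can be written out as a small sketch. The `OpMemory` dataclass and `object_store_usage` function are hypothetical; in particular, `outqueue_memory_usage` stands in for OpState.outqueue.memory_usage, which is not part of the metrics class, and this is not the actual ResourceManager code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class OpMemory:
    """Hypothetical per-operator memory snapshot; all values are in bytes."""

    obj_store_mem_pending_task_outputs: int = 0
    obj_store_mem_internal_outqueue: int = 0
    obj_store_mem_pending_task_inputs: int = 0
    obj_store_mem_internal_inqueue: int = 0
    # Stand-in for OpState.outqueue.memory_usage (an assumption for this sketch).
    outqueue_memory_usage: int = 0


def object_store_usage(op: OpMemory, next_ops: Iterable[OpMemory]) -> int:
    """Sum the op's own output-side memory and its downstream ops' input-side memory."""
    downstream = sum(
        n.obj_store_mem_pending_task_inputs + n.obj_store_mem_internal_inqueue
        for n in next_ops
    )
    return (
        op.obj_store_mem_pending_task_outputs
        + op.obj_store_mem_internal_outqueue
        + op.outqueue_memory_usage
        + downstream
    )


op = OpMemory(
    obj_store_mem_pending_task_outputs=100,
    obj_store_mem_internal_outqueue=50,
    outqueue_memory_usage=25,
)
next_op = OpMemory(
    obj_store_mem_pending_task_inputs=10,
    obj_store_mem_internal_inqueue=5,
)
usage = object_store_usage(op, [next_op])  # 100 + 50 + 25 + (10 + 5) = 190
```

The downstream terms are charged to the upstream op because those blocks were produced by it and have not yet been consumed.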
if op:
    execution_resources = self._resource_manager._op_usages[op]
    op_object_store_memory = execution_resources.object_store_memory
    op._metrics.obj_store_mem_used = op_object_store_memory
Actually, let's update this in ResourceManager.update_resources, so we won't forget to call this method.
Why are these changes needed?
- Adds the remaining metrics from the OpRuntimeMetrics class in new time series charts on the Grafana and Ray Data dashboards.
- Cleans up the OpRuntimeMetrics and StatsActor code, grouping related metrics by area and consolidating descriptions and comments.
- Visually groups each section of Ray Data metrics. See below for screenshots of each section.
- Programmatically generating Grafana panels from OpRuntimeMetrics fields is currently not possible, since we would need to add Ray Data as a dependency for Ray dashboards / Serve.
Overview:
Inputs:
Outputs:
Tasks:
Object Store Memory:
Iteration:
Related issue number
Closes #42437
Checks
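- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.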