-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK
to /api/component_activities
#26297
[dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK
to /api/component_activities
#26297
Conversation
…ponent_activities_hook
python/ray/_private/ray_constants.py
Outdated
# activity component type (str) to | ||
# ray.dashboard.modules.snapshot.snapshot_head.RayActivityResponse. | ||
# Example: "your.module.ray_cluster_activity_hook". | ||
RAY_CLUSTER_ACTIVITY_HOOK = "RAY_CLUSTER_ACTIVITY_HOOK" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move it to the dashboard constants?
) | ||
resp[component_type] = dataclasses.asdict(component_activity_output) | ||
except Exception as e: | ||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding this to reason
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chatted with Matt and Sofian, and we felt it is probably better to have the response of this endpoint return if there was an error getting the activity observation of any component. In that case, the reason
can include the error message. What do you think?
driver_activity_info = await self._get_job_activity_info(timeout=timeout) | ||
|
||
external_ray_cluster_activity_output = {} | ||
if ray_constants.RAY_CLUSTER_ACTIVITY_HOOK in os.environ: | ||
external_ray_cluster_activity_output = _load_class( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused how this would work. Isn't it supposed to "run a function" instead of just loading the output class? I believe it is not possible to dynamically set env var in the running process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are expecting the environment variable to be set before ray is initialized, and we not supporting the use case of the environment variable itself being updated. _load_class
loads either a class or a function from an external module and the function is run on line 130 with ()
"ray.dashboard.optional_utils.ClassMethodRouteTable", MockClassMethodRouteTable | ||
).start() | ||
mock_load_class = Mock(return_value=Mock(return_value=cluster_activity_hook_output)) | ||
os.environ[ray_constants.RAY_CLUSTER_ACTIVITY_HOOK] = "mock_module" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test with a function that dynamically changes the output instead of a static output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One more test request for e2e test
"driver": {"is_active": False, "reason": None, "timestamp": None}, | ||
"new_component": {"is_active": False, "reason": None, "timestamp": None}, | ||
} | ||
os.environ.pop(RAY_CLUSTER_ACTIVITY_HOOK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have a real e2e test here?
a = {}
def endpoint():
return {"is_active": len(a) > 0, "reason": None, "timestamp": None}
RAY_CLUSTER_ACTIVITY_HOOK=tests.test_snapshot.endpoint ray start --head
# Call API and make sure `is_active == False`
a = {"a": 1}
# call API and make sure `is_active==True`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to make this an e2e test. I wasn't sure initially how to import methods from the tests
folder, but this works from ray._private.test_utils
dashboard/consts.py
Outdated
# Hook that is invoked on the dashboard `/api/component_activities` endpoint. | ||
# It does not take any arguments and should return a dictionary mapping | ||
# activity component type (str) to | ||
# ray.dashboard.modules.snapshot.snapshot_head.RayActivityResponse. | ||
# Example: "your.module.ray_cluster_activity_hook". | ||
RAY_CLUSTER_ACTIVITY_HOOK = "RAY_CLUSTER_ACTIVITY_HOOK" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we mention that this path should correspond to a Callable
type.
external_activity_output = _load_class( | ||
os.environ[RAY_CLUSTER_ACTIVITY_HOOK] | ||
)() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
external_activity_output = _load_class( | |
os.environ[RAY_CLUSTER_ACTIVITY_HOOK] | |
)() | |
cluster_activity_callable = _load_class( | |
os.environ[RAY_CLUSTER_ACTIVITY_HOOK] | |
) | |
external_activity_output = cluster_activity_callable() |
This is probably not linted correctly, but maybe making it more explicit about what is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
The backwards incompatible change should be fine since this field was only introduced a week ago (and we haven't released Ray in the last week)
* master: (42 commits) [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408) [Doc] Fix docs feedback button (ray-project#26402) [core][1/2] Improve liveness check in GCS (ray-project#26405) [RLlib] Checkpoint and restore connectors. (ray-project#26253) [Workflow] Minor refactoring of workflow exceptions (ray-project#26398) [workflow] Workflow queue (ray-project#24697) [RLlib] Minor simplification of code. (ray-project#26312) [AIR] Update TensorflowPredictor to new API (ray-project#26215) [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304) [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314) [Doc] Fix rate-the-docs plugin (ray-project#26384) [Docs] [Serve] Has a consistent landing page style (ray-project#26029) [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297) [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805) [tune] Fix `SyncerCallback` having a size limit (ray-project#26371) [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351) [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330) [Doc] Fix actor example (ray-project#26381) Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366) [Datasets] Update docs for drop_columns and fix typos (ray-project#26317) ...
…ties` (ray-project#26297) Add external hook to /api/component_activities endpoint in dashboard snapshot router Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API ray-project#25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason. Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
/api/component_activities
endpoint in dashboard snapshot routeris_active
field ofRayActivityResponse
to take an enumRayActivityStatus
instead ofbool
. This is a backward incompatible change, but should be ok because [dashboard] Addcomponent_activities
API #25996 wasn't included in any branch cuts.RayActivityResponse
now supports informing when there was an error getting the activity observation and the reason.Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.