Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK to /api/component_activities #26297

Merged
merged 19 commits into from
Jul 8, 2022

Conversation

nikitavemuri
Copy link
Contributor

@nikitavemuri nikitavemuri commented Jul 5, 2022

Why are these changes needed?

  • Add external hook to /api/component_activities endpoint in dashboard snapshot router
  • Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API #25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

# activity component type (str) to
# ray.dashboard.modules.snapshot.snapshot_head.RayActivityResponse.
# Example: "your.module.ray_cluster_activity_hook".
RAY_CLUSTER_ACTIVITY_HOOK = "RAY_CLUSTER_ACTIVITY_HOOK"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move it to the dashboard constants?

)
resp[component_type] = dataclasses.asdict(component_activity_output)
except Exception as e:
logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding this to reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with Matt and Sofian, and we felt it is probably better to have the response of this endpoint return if there was an error getting the activity observation of any component. In that case, the reason can include the error message. What do you think?

driver_activity_info = await self._get_job_activity_info(timeout=timeout)

external_ray_cluster_activity_output = {}
if ray_constants.RAY_CLUSTER_ACTIVITY_HOOK in os.environ:
external_ray_cluster_activity_output = _load_class(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused how this would work. Isn't it supposed to "run a function" instead of just loading the output class? I believe it is not possible to dynamically set env var in the running process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are expecting the environment variable to be set before ray is initialized, and we not supporting the use case of the environment variable itself being updated. _load_class loads either a class or a function from an external module and the function is run on line 130 with ()

"ray.dashboard.optional_utils.ClassMethodRouteTable", MockClassMethodRouteTable
).start()
mock_load_class = Mock(return_value=Mock(return_value=cluster_activity_hook_output))
os.environ[ray_constants.RAY_CLUSTER_ACTIVITY_HOOK] = "mock_module"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test with a function that dynamically changes the output instead of a static output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One more test request for e2e test

"driver": {"is_active": False, "reason": None, "timestamp": None},
"new_component": {"is_active": False, "reason": None, "timestamp": None},
}
os.environ.pop(RAY_CLUSTER_ACTIVITY_HOOK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have a real e2e test here?

a = {}
def endpoint():
    return {"is_active": len(a) > 0, "reason": None, "timestamp": None}

RAY_CLUSTER_ACTIVITY_HOOK=tests.test_snapshot.endpoint ray start --head

# Call API and make sure `is_active == False`

a = {"a": 1}

# call API and make sure `is_active==True`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to make this an e2e test. I wasn't sure initially how to import methods from the tests folder, but this works from ray._private.test_utils

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 6, 2022
@nikitavemuri nikitavemuri removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 6, 2022
Comment on lines 39 to 44
# Hook that is invoked on the dashboard `/api/component_activities` endpoint.
# It does not take any arguments and should return a dictionary mapping
# activity component type (str) to
# ray.dashboard.modules.snapshot.snapshot_head.RayActivityResponse.
# Example: "your.module.ray_cluster_activity_hook".
RAY_CLUSTER_ACTIVITY_HOOK = "RAY_CLUSTER_ACTIVITY_HOOK"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention that this path should correspond to a Callable type.

Comment on lines 138 to 140
external_activity_output = _load_class(
os.environ[RAY_CLUSTER_ACTIVITY_HOOK]
)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
external_activity_output = _load_class(
os.environ[RAY_CLUSTER_ACTIVITY_HOOK]
)()
cluster_activity_callable = _load_class(
os.environ[RAY_CLUSTER_ACTIVITY_HOOK]
)
external_activity_output = cluster_activity_callable()

This is probably not linted correctly, but maybe making it more explicit about what is going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, updated

@nikitavemuri nikitavemuri requested a review from ijrsvt July 7, 2022 15:56
Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

The backwards incompatible change should be fine since this field was only introduced a week ago (and we haven't released Ray in the last week)

@rkooo567 rkooo567 merged commit 56716a1 into ray-project:master Jul 8, 2022
truelegion47 pushed a commit to truelegion47/ray that referenced this pull request Jul 9, 2022
* master: (42 commits)
  [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (ray-project#26408)
  [Doc] Fix docs feedback button (ray-project#26402)
  [core][1/2] Improve liveness check in GCS  (ray-project#26405)
  [RLlib] Checkpoint and restore connectors. (ray-project#26253)
  [Workflow] Minor refactoring of workflow exceptions (ray-project#26398)
  [workflow] Workflow queue (ray-project#24697)
  [RLlib] Minor simplification of code. (ray-project#26312)
  [AIR] Update TensorflowPredictor to new API (ray-project#26215)
  [RLlib] Make Dataset reader default reader and enable CRR to use dataset (ray-project#26304)
  [runtime_env] [doc] Remove outdated info about "isolated" environment (ray-project#26314)
  [Doc] Fix rate-the-docs plugin (ray-project#26384)
  [Docs] [Serve] Has a consistent landing page style (ray-project#26029)
  [dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (ray-project#26297)
  [tune] Use `Checkpoint.to_bytes()` for store_to_object (ray-project#25805)
  [tune] Fix `SyncerCallback` having a size limit (ray-project#26371)
  [air] Serialize additional files in dict checkpoints turned dir checkpoints (ray-project#26351)
  [Docs] Add "rate the docs" plugin for feedback on docs (ray-project#26330)
  [Doc] Fix actor example (ray-project#26381)
  Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (ray-project#26366)
  [Datasets] Update docs for drop_columns and fix typos (ray-project#26317)
  ...
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ties` (ray-project#26297)

Add external hook to /api/component_activities endpoint in dashboard snapshot router
Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API ray-project#25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants