[serve] Add replica info to metadata rest api #33292

zcin · 2023-03-14T18:23:17Z

Why are these changes needed?

This adds details about the live replicas of each deployment to the response of the new GET endpoint.

Sample replica detail from running a test application:

replica_id: app2_BasicDriver#KhlXQe
state: RUNNING
pid: 25853
actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe
actor_id: 2355af670b023966af79501501000000
node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf
node_ip: 192.168.0.141
start_time_s: 1678818083.039281
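As a small aid for reading the sample, the snippet below converts `start_time_s` to a readable UTC timestamp. It assumes the value is seconds since the Unix epoch, which the sample suggests but this PR does not state explicitly.

```python
from datetime import datetime, timezone

# Value copied from the sample replica detail above.
# Assumption: start_time_s is seconds since the Unix epoch.
start_time_s = 1678818083.039281

started_at = datetime.fromtimestamp(start_time_s, tz=timezone.utc)
print(started_at.isoformat())
```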

Details:

is_allocated on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info:
- pid
- actor_id
- node_id
- node_ip
The four fields listed above are retrieved from the replica actor and may be None before the actor is actually scheduled, so they are marked optional in the schema. (The remaining fields are filled in as soon as the controller creates the replica and starts tracking it.)

class ReplicaDetails(BaseModel, extra=Extra.forbid):
    replica_id: str
    state: ReplicaState
    pid: Optional[int]
    actor_name: str
    actor_id: Optional[str]
    node_id: Optional[str]
    node_ip: Optional[str]
    start_time_s: float
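The lifecycle described above can be sketched with a plain dataclass. This is an illustration only, not the Serve implementation: `ReplicaRecord`, `AllocationInfo`, and `record_allocation` are hypothetical names, the state is a plain string, and the standard library stands in for the pydantic model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical alias for the tuple described above:
# (pid, actor_id, node_id, node_ip).
AllocationInfo = Tuple[Optional[int], Optional[str], Optional[str], Optional[str]]


@dataclass
class ReplicaRecord:
    # Known as soon as the controller starts tracking the replica.
    replica_id: str
    state: str
    actor_name: str
    start_time_s: float
    # Unknown until the replica actor has been scheduled and reports back.
    pid: Optional[int] = None
    actor_id: Optional[str] = None
    node_id: Optional[str] = None
    node_ip: Optional[str] = None

    def record_allocation(self, info: AllocationInfo) -> None:
        """Copy the runtime-context tuple into the record."""
        self.pid, self.actor_id, self.node_id, self.node_ip = info


record = ReplicaRecord(
    replica_id="app2_BasicDriver#KhlXQe",
    state="STARTING",  # illustrative state name
    actor_name="SERVE_REPLICA::app2_BasicDriver#KhlXQe",
    start_time_s=1678818083.039281,
)
pending_pid = record.pid  # still None: the actor has not reported back yet

# Once the replica is allocated, the four optional fields are filled in.
record.record_allocation(
    (
        25853,
        "2355af670b023966af79501501000000",
        "3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf",
        "192.168.0.141",
    )
)
record.state = "RUNNING"
```

A client reading these details should therefore treat `pid`, `actor_id`, `node_id`, and `node_ip` as possibly missing for replicas that are still starting.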

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Cindy Zhang <[email protected]>

shrekris-anyscale

Nice work so far! I left a few comments.

python/ray/serve/_private/replica.py

python/ray/serve/schema.py

shrekris-anyscale · 2023-03-14T21:37:04Z

python/ray/serve/schema.py

+    node_ip: Optional[str] = Field(
+        description="IP address of the node that the replica actor is running on."
+    )
+    start_time_s: float = Field(


Is there a reason this field isn't always the time when the replica actor started? Could the replica store its start time locally and send it back to the controller after the controller recovers?

Yeah I'm not sure why we do this, but the start time is reset upon recovery here.

I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?

> Yeah I'm not sure why we do this, but the start time is reset upon recovery here.

Interesting, I'm not sure either. @edoakes do you know why we reset start time?

> I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?

Yep, that's what I mean too.

We reset start time because the start time is lost when the controller crashes (it's stored in memory).

We could fix this by instead storing the start_time on the replica and returning it with the metadata, but it seems like a P1 follow-up to me.
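The follow-up idea from this thread could look roughly like the sketch below. It is not Ray code: `Replica`, `Controller`, `get_start_time_s`, and `sync` are hypothetical stand-ins, and a fresh `Controller` object simulates a controller that crashed and lost its in-memory state.

```python
import time
from typing import Dict, Optional


class Replica:
    """Stand-in for a replica actor that remembers when it started."""

    def __init__(self, replica_id: str, now: Optional[float] = None) -> None:
        self.replica_id = replica_id
        # Stored on the replica, so it survives a controller crash.
        self._start_time_s = time.time() if now is None else now

    def get_start_time_s(self) -> float:
        return self._start_time_s


class Controller:
    """Stand-in for the controller; its state lives only in memory."""

    def __init__(self) -> None:
        self.start_times: Dict[str, float] = {}

    def sync(self, replica: Replica) -> None:
        # Fetch the start time from the replica instead of resetting it.
        self.start_times[replica.replica_id] = replica.get_start_time_s()


replica = Replica("app2_BasicDriver#KhlXQe", now=1678818083.039281)

first = Controller()
first.sync(replica)

# Simulate a crash: the new controller starts with empty memory,
# then recovers the original start time from the replica.
recovered = Controller()
recovered.sync(replica)
```

With this arrangement the recovered controller reports the same start time as before the crash, rather than the time of recovery.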

zcin · 2023-03-16T02:44:36Z

@shrekris-anyscale @sihanwang41 @edoakes Addressed comments, please take another look!

edoakes · 2023-03-16T15:11:02Z

dashboard/modules/serve/tests/test_serve_agent.py

-@pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.")
+# @pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.")


Is this no longer flaky? If so, please remove the commented-out line entirely.

Oops I commented that out for testing locally. I'm not sure if these are flaky or not, since they work on my laptop, but I'll leave it as is for now.

zcin added 2 commits March 14, 2023 11:15

add replica info

6ddf7ec

Signed-off-by: Cindy Zhang <[email protected]>

clean up

c3d8815

Signed-off-by: Cindy Zhang <[email protected]>

zcin marked this pull request as ready for review March 14, 2023 21:03

zcin requested review from sihanwang41, shrekris-anyscale and edoakes March 14, 2023 21:04

zcin assigned sihanwang41 and shrekris-anyscale Mar 14, 2023

shrekris-anyscale reviewed Mar 14, 2023

zcin added 2 commits March 14, 2023 16:22

Merge branch 'master' into replica-rest

6a43760

Signed-off-by: Cindy Zhang <[email protected]>

fix docstring

25b7b60

Signed-off-by: Cindy Zhang <[email protected]>

edoakes reviewed Mar 15, 2023

python/ray/serve/_private/deployment_state.py (outdated)

sihanwang41 reviewed Mar 15, 2023

python/ray/serve/_private/deployment_state.py (outdated)

python/ray/serve/schema.py (outdated)

python/ray/serve/schema.py (outdated)

zcin added 3 commits March 15, 2023 12:01

address comments

fe418e8

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into replica-rest

1e4c1f9

Signed-off-by: Cindy Zhang <[email protected]>

rename get_replica_details to list_replica_details

d60630c

Signed-off-by: Cindy Zhang <[email protected]>

zcin force-pushed the replica-rest branch from 719fb2a to d60630c March 15, 2023 19:05

zcin added 2 commits March 15, 2023 15:04

improvements

d14bd7c

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into replica-rest

27e7413

Signed-off-by: Cindy Zhang <[email protected]>

edoakes approved these changes Mar 16, 2023

edoakes reviewed Mar 16, 2023

zcin added 2 commits March 16, 2023 08:49

uncomment, typo

610ba16

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into replica-rest

01a2c5a

Signed-off-by: Cindy Zhang <[email protected]>

shrekris-anyscale approved these changes Mar 16, 2023

zcin added 2 commits March 16, 2023 14:46

typo, remove self._replica_details

080eb2f

Signed-off-by: Cindy Zhang <[email protected]>

Merge branch 'master' into replica-rest

40a7b2f

Signed-off-by: Cindy Zhang <[email protected]>

edoakes merged commit 039c6b8 into ray-project:master Mar 17, 2023

zcin deleted the replica-rest branch March 20, 2023 16:04
