-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Add replica info to metadata rest api #33292
Conversation
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far! I left a few comments.
node_ip: Optional[str] = Field( | ||
description="IP address of the node that the replica actor is running on." | ||
) | ||
start_time_s: float = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this field isn't always the time when the replica actor started? Could the replica store its start time locally and send it back to the controller after the controller recovers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not sure why we do this, but the start time is reset upon recovery here.
I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not sure why we do this, but the start time is reset upon recovery here.
Interesting– I'm not sure either. @edoakes do you know why we reset start time?
I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?
Yep, that's what I mean too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We reset start time because the start time is lost when the controller crashes (it's stored in memory).
We could fix this by instead storing the start_time on the replica and returning it w/ the metadata, but it seems like a P1 follow up to me.
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@shrekris-anyscale @sihanwang41 @edoakes Addressed comments, please take another look! |
@pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.") | ||
# @pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this no longer flaky? if so remove the commented-out line entirely please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops I commented that out for testing locally. I'm not sure if these are flaky or not, since they work on my laptop, but I'll leave it as is for now.
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint. Sample replica detail from running a test application: ``` replica_id: app2_BasicDriver#KhlXQe state: RUNNING pid: 25853 actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe actor_id: 2355af670b023966af79501501000000 node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf node_ip: 192.168.0.141 start_time_s: 1678818083.039281 ``` Details: * `is_allocated` on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info: * `pid` * `actor_id` * `node_id` * `node_ip` * The four fields listed above that are retrieved from the replica actor may be `None` before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller) ``` class ReplicaDetails(BaseModel, extra=Extra.forbid): replica_id: str state: ReplicaState pid: Optional[int] actor_name: str actor_id: Optional[str] node_id: Optional[str] node_ip: Optional[str] start_time_s: float ``` Signed-off-by: Jack He <[email protected]>
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint. Sample replica detail from running a test application: ``` replica_id: app2_BasicDriver#KhlXQe state: RUNNING pid: 25853 actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe actor_id: 2355af670b023966af79501501000000 node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf node_ip: 192.168.0.141 start_time_s: 1678818083.039281 ``` Details: * `is_allocated` on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info: * `pid` * `actor_id` * `node_id` * `node_ip` * The four fields listed above that are retrieved from the replica actor may be `None` before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller) ``` class ReplicaDetails(BaseModel, extra=Extra.forbid): replica_id: str state: ReplicaState pid: Optional[int] actor_name: str actor_id: Optional[str] node_id: Optional[str] node_ip: Optional[str] start_time_s: float ``` Signed-off-by: Edward Oakes <[email protected]>
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint. Sample replica detail from running a test application: ``` replica_id: app2_BasicDriver#KhlXQe state: RUNNING pid: 25853 actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe actor_id: 2355af670b023966af79501501000000 node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf node_ip: 192.168.0.141 start_time_s: 1678818083.039281 ``` Details: * `is_allocated` on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info: * `pid` * `actor_id` * `node_id` * `node_ip` * The four fields listed above that are retrieved from the replica actor may be `None` before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller) ``` class ReplicaDetails(BaseModel, extra=Extra.forbid): replica_id: str state: ReplicaState pid: Optional[int] actor_name: str actor_id: Optional[str] node_id: Optional[str] node_ip: Optional[str] start_time_s: float ``` Signed-off-by: chaowang <[email protected]>
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint. Sample replica detail from running a test application: ``` replica_id: app2_BasicDriver#KhlXQe state: RUNNING pid: 25853 actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe actor_id: 2355af670b023966af79501501000000 node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf node_ip: 192.168.0.141 start_time_s: 1678818083.039281 ``` Details: * `is_allocated` on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info: * `pid` * `actor_id` * `node_id` * `node_ip` * The four fields listed above that are retrieved from the replica actor may be `None` before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller) ``` class ReplicaDetails(BaseModel, extra=Extra.forbid): replica_id: str state: ReplicaState pid: Optional[int] actor_name: str actor_id: Optional[str] node_id: Optional[str] node_ip: Optional[str] start_time_s: float ``` Signed-off-by: elliottower <[email protected]>
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint. Sample replica detail from running a test application: ``` replica_id: app2_BasicDriver#KhlXQe state: RUNNING pid: 25853 actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe actor_id: 2355af670b023966af79501501000000 node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf node_ip: 192.168.0.141 start_time_s: 1678818083.039281 ``` Details: * `is_allocated` on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info: * `pid` * `actor_id` * `node_id` * `node_ip` * The four fields listed above that are retrieved from the replica actor may be `None` before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller) ``` class ReplicaDetails(BaseModel, extra=Extra.forbid): replica_id: str state: ReplicaState pid: Optional[int] actor_name: str actor_id: Optional[str] node_id: Optional[str] node_ip: Optional[str] start_time_s: float ``` Signed-off-by: Jack He <[email protected]>
Why are these changes needed?
This adds details about the live replicas for each deployment to be fetched from the new GET endpoint.
Sample replica detail from running a test application:
Details:
is_allocated
on each replica used to return just the node id for the controller to confirm the replica has been placed on a node and started. Now, it returns a tuple of runtime-context-related info:pid
actor_id
node_id
node_ip
None
before the actor is actually scheduled, so they are marked optional in the schema. (The rest of the fields are filled in immediately when the replica is created to be tracked in the controller)Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.