[serve] Add replica info to metadata rest api #33292

Merged: 13 commits into ray-project:master from replica-rest, Mar 17, 2023

zcin (Contributor) commented Mar 14, 2023

Why are these changes needed?

This adds details about each deployment's live replicas, which can be fetched from the new GET endpoint.

Sample replica detail from running a test application:

replica_id: app2_BasicDriver#KhlXQe
state: RUNNING
pid: 25853
actor_name: SERVE_REPLICA::app2_BasicDriver#KhlXQe
actor_id: 2355af670b023966af79501501000000
node_id: 3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf
node_ip: 192.168.0.141
start_time_s: 1678818083.039281

Details:

  • Previously, is_allocated on each replica returned only the node ID, which the controller used to confirm that the replica had been placed on a node and started. It now returns a tuple of runtime-context information:
    • pid
    • actor_id
    • node_id
    • node_ip
  • The four fields listed above are retrieved from the replica actor and may be None until the actor is actually scheduled, so they are marked optional in the schema. The remaining fields are filled in as soon as the controller creates the replica and starts tracking it:
class ReplicaDetails(BaseModel, extra=Extra.forbid):
    replica_id: str
    state: ReplicaState
    pid: Optional[int]
    actor_name: str
    actor_id: Optional[str]
    node_id: Optional[str]
    node_ip: Optional[str]
    start_time_s: float
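
Because four of these fields are optional, anything that consumes the replica details has to tolerate `None` until the actor is scheduled. Below is a minimal sketch of that behavior. It uses a standard-library dataclass in place of the pydantic model so that it runs without dependencies, and a plain string in place of `ReplicaState`. `ReplicaInfo`, `apply_allocation`, and `describe` are hypothetical names chosen for illustration; they are not part of Ray Serve.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReplicaInfo:
    """Hypothetical stand-in for ReplicaDetails (not the Ray Serve class)."""

    replica_id: str
    state: str
    actor_name: str
    start_time_s: float
    # Unknown until the replica actor has been scheduled and reports back.
    pid: Optional[int] = None
    actor_id: Optional[str] = None
    node_id: Optional[str] = None
    node_ip: Optional[str] = None


def apply_allocation(info: ReplicaInfo, allocation: Tuple[int, str, str, str]) -> None:
    """Fill the runtime fields from a (pid, actor_id, node_id, node_ip) tuple."""
    info.pid, info.actor_id, info.node_id, info.node_ip = allocation


def describe(info: ReplicaInfo) -> str:
    """Render a one-line summary that tolerates not-yet-scheduled replicas."""
    if info.node_ip is None or info.pid is None:
        return f"{info.replica_id}: {info.state} (not yet scheduled)"
    return f"{info.replica_id}: {info.state} on {info.node_ip} (pid {info.pid})"


# Values below are taken from the sample replica detail in this description.
info = ReplicaInfo(
    replica_id="app2_BasicDriver#KhlXQe",
    state="STARTING",
    actor_name="SERVE_REPLICA::app2_BasicDriver#KhlXQe",
    start_time_s=1678818083.039281,
)
print(describe(info))  # runtime fields are still None here

apply_allocation(
    info,
    (
        25853,
        "2355af670b023966af79501501000000",
        "3631e75fc5312752c54b567ee66491a1e58a0420f0abc5b1c44e70cf",
        "192.168.0.141",
    ),
)
info.state = "RUNNING"
print(describe(info))
```

Keeping the always-known fields required and defaulting only the four runtime fields to `None` mirrors the split described above: the controller can create the record immediately and fill in the rest once the replica reports back.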

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
zcin marked this pull request as ready for review March 14, 2023 21:03
shrekris-anyscale (Contributor) left a comment

Nice work so far! I left a few comments.

python/ray/serve/_private/replica.py (outdated, resolved)
python/ray/serve/schema.py (resolved)
node_ip: Optional[str] = Field(
    description="IP address of the node that the replica actor is running on."
)
start_time_s: float = Field(
Contributor commented:

Is there a reason this field isn't always the time when the replica actor started? Could the replica store its start time locally and send it back to the controller after the controller recovers?

zcin (Contributor Author) commented Mar 14, 2023:

Yeah I'm not sure why we do this, but the start time is reset upon recovery here.

I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?

Contributor commented:

> Yeah I'm not sure why we do this, but the start time is reset upon recovery here.

Interesting, I'm not sure either. @edoakes do you know why we reset start time?

> I'm not sure what you mean by locally - what I was considering was storing the start time on each replica actor, then fetching it either after starting the replica or after the controller recovers. Is this also what you were referring to?

Yep, that's what I mean too.

Contributor commented:

We reset start time because the start time is lost when the controller crashes (it's stored in memory).

We could fix this by instead storing the start_time on the replica and returning it w/ the metadata, but it seems like a P1 follow up to me.
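
The follow-up suggested here could look roughly like the sketch below: the replica records its own start time once, and a controller that has restarted asks each surviving replica for it instead of stamping a fresh time. `FakeReplica`, `FakeController`, `get_metadata`, `track`, and `recover` are hypothetical names, and plain Python objects stand in for Ray actors; this is an illustration of the idea, not the Ray Serve implementation.

```python
import time
from typing import Any, Dict, Iterable, Optional


class FakeReplica:
    """Hypothetical replica that remembers when it started."""

    def __init__(self, replica_id: str, start_time_s: Optional[float] = None) -> None:
        self.replica_id = replica_id
        # Recorded once, in the replica's own memory, so it survives a
        # controller crash as long as the replica itself stays alive.
        self.start_time_s = time.time() if start_time_s is None else start_time_s

    def get_metadata(self) -> Dict[str, Any]:
        """Return the start time together with the rest of the metadata."""
        return {"replica_id": self.replica_id, "start_time_s": self.start_time_s}


class FakeController:
    """Hypothetical controller whose in-memory state is lost on a crash."""

    def __init__(self) -> None:
        self.start_times: Dict[str, float] = {}

    def track(self, replica: FakeReplica) -> None:
        """Record the start time reported by the replica itself."""
        meta = replica.get_metadata()
        self.start_times[meta["replica_id"]] = meta["start_time_s"]

    def recover(self, replicas: Iterable[FakeReplica]) -> None:
        """Rebuild state after a restart by re-querying surviving replicas."""
        for replica in replicas:
            self.track(replica)


replica = FakeReplica("app2_BasicDriver#KhlXQe", start_time_s=1678818083.039281)
controller = FakeController()
controller.track(replica)

# Simulate a controller crash: a fresh controller starts with empty state,
# then recovers the original start time from the replica.
recovered = FakeController()
recovered.recover([replica])
print(recovered.start_times == controller.start_times)
```

The trade-off is one extra piece of state on every replica in exchange for a start time that stays stable across controller restarts.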

Signed-off-by: Cindy Zhang <[email protected]>
python/ray/serve/_private/deployment_state.py (outdated, resolved)
python/ray/serve/schema.py (outdated, resolved)
python/ray/serve/schema.py (outdated, resolved)
Signed-off-by: Cindy Zhang <[email protected]>
zcin (Contributor Author) commented Mar 16, 2023

@shrekris-anyscale @sihanwang41 @edoakes Addressed comments, please take another look!

- @pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.")
+ # @pytest.mark.skipif(sys.platform == "darwin", reason="Flaky on OSX.")
Contributor commented:

is this no longer flaky? if so remove the commented-out line entirely please

Contributor Author commented:

Oops I commented that out for testing locally. I'm not sure if these are flaky or not, since they work on my laptop, but I'll leave it as is for now.

edoakes merged commit 039c6b8 into ray-project:master on Mar 17, 2023
zcin deleted the replica-rest branch on March 20, 2023 16:04
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
Signed-off-by: Jack He <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Signed-off-by: Edward Oakes <[email protected]>
chaowanggg pushed a commit to chaowanggg/ray-dev that referenced this pull request Apr 4, 2023
Signed-off-by: chaowang <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Signed-off-by: Jack He <[email protected]>

4 participants