[rayservice] Add support for getting multi-app status #1136

Merged Jun 9, 2023 (7 commits)

@zcin zcin commented Jun 2, 2023

Why are these changes needed?

Add a serveConfigV2 field to the RayService CRD, which will be filled in by a follow-up PR.
serveConfigV2 corresponds to the Serve multi-application schema. If the user specifies the Serve config using serveConfigV2, we should read the status from the multi-app endpoint (GET /api/serve/applications/).

This PR only adds support for pulling the status from the multi-app endpoint. A follow-up PR will add support for submitting the Serve config to the multi-app endpoint.
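
A minimal sketch of what pulling the status from the multi-app endpoint could look like. The type names, the helpers parseMultiAppStatus and fetchMultiAppStatus, and the JSON field names (applications, status, message, deployments) are assumptions made for illustration; they are not the types or client functions added in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// deploymentStatus and appStatus are hypothetical types. The JSON field
// names are assumed, not taken from this PR.
type deploymentStatus struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
}

type appStatus struct {
	Status      string                      `json:"status"`
	Message     string                      `json:"message,omitempty"`
	Deployments map[string]deploymentStatus `json:"deployments"`
}

type multiAppResponse struct {
	Applications map[string]appStatus `json:"applications"`
}

// parseMultiAppStatus decodes a response body into per-application statuses.
func parseMultiAppStatus(body []byte) (map[string]appStatus, error) {
	var resp multiAppResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode multi-app status: %w", err)
	}
	return resp.Applications, nil
}

// fetchMultiAppStatus issues GET <dashboardURL>/api/serve/applications/
// and decodes the result.
func fetchMultiAppStatus(client *http.Client, dashboardURL string) (map[string]appStatus, error) {
	resp, err := client.Get(dashboardURL + "/api/serve/applications/")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseMultiAppStatus(body)
}

func main() {
	// Sample payload shaped after the assumed schema above.
	sample := []byte(`{"applications":{"default":{"status":"RUNNING","deployments":{"DAGDriver":{"status":"HEALTHY"}}}}}`)
	apps, err := parseMultiAppStatus(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(apps["default"].Status, apps["default"].Deployments["DAGDriver"].Status)
}
```

Keeping the decoding step separate from the HTTP call makes the parsing testable without a running dashboard.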

Example Service Status:

Status:
  Active Service Status:
    Application Statuses:
      Default:
        Health Last Update Time:  2023-05-26T20:15:01Z
        Last Update Time:         2023-05-26T20:15:01Z
        Serve Deployment Statuses:
          DAG Driver:
            Health Last Update Time:  2023-05-26T20:15:01Z
            Last Update Time:         2023-05-26T20:15:01Z
            Status:                   HEALTHY
          Fruit Market:
            Health Last Update Time:  2023-05-26T20:15:01Z
            Last Update Time:         2023-05-26T20:15:01Z
            Status:                   HEALTHY
          Mango Stand:
            Health Last Update Time:  2023-05-26T20:15:01Z
            Last Update Time:         2023-05-26T20:15:01Z
            Status:                   HEALTHY
          Orange Stand:
            Health Last Update Time:  2023-05-26T20:15:01Z
            Last Update Time:         2023-05-26T20:15:01Z
            Status:                   HEALTHY
          Pear Stand:
            Health Last Update Time:  2023-05-26T20:15:01Z
            Last Update Time:         2023-05-26T20:15:01Z
            Status:                   HEALTHY
        Status:                       RUNNING
    Dashboard Status:
      Health Last Update Time:  2023-05-26T20:15:01Z
      Is Healthy:               true
      Last Update Time:         2023-05-26T20:15:01Z
    Ray Cluster Name:           rayservice-sample-raycluster-2tqs5
    Ray Cluster Status:
      Head:
  Observed Generation:  1
  Pending Service Status:
    Dashboard Status:
    Ray Cluster Status:
      Head:
  Service Status:  Running

Other changes:

  • Removed lastupdatetime and healthlastupdatetime from the return types of the dashboard client functions. The dashboard client never touches those values; they are only filled in inside the rayservice controller methods. To make this possible, I separated the return types of the dashboard client functions from the rayservice status types. As a result, I also removed the time input from the test helper generateServeStatus, since it was (1) unrealistic, because the dashboard client would never fill in those values, and (2) ineffective, because the values are overwritten in the reconcile loop of the rayservice controller anyway.
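
The separation described above could be sketched as follows. All names here (clientDeploymentStatus, crDeploymentStatus, toCRStatuses, sampleTime) are hypothetical stand-ins, not the types in rayservice_types.go or the dashboard client; the point is only that the client type carries no timestamps and the controller stamps them during conversion.

```go
package main

import (
	"fmt"
	"time"
)

// clientDeploymentStatus is what a hypothetical dashboard client returns:
// only the fields the dashboard actually reports.
type clientDeploymentStatus struct {
	Name   string
	Status string
}

// crDeploymentStatus is a hypothetical status type stored on the custom
// resource. The timestamps are owned by the controller.
type crDeploymentStatus struct {
	Status               string
	LastUpdateTime       *time.Time
	HealthLastUpdateTime *time.Time
}

// sampleTime matches the timestamps in the example status above.
var sampleTime = time.Date(2023, time.May, 26, 20, 15, 1, 0, time.UTC)

// toCRStatuses converts client results into status entries and stamps them
// with the controller's notion of "now".
func toCRStatuses(in []clientDeploymentStatus, now time.Time) map[string]crDeploymentStatus {
	out := make(map[string]crDeploymentStatus, len(in))
	for _, d := range in {
		// Separate copies so the two pointers never alias each other.
		lastUpdate, healthUpdate := now, now
		out[d.Name] = crDeploymentStatus{
			Status:               d.Status,
			LastUpdateTime:       &lastUpdate,
			HealthLastUpdateTime: &healthUpdate,
		}
	}
	return out
}

func main() {
	fromClient := []clientDeploymentStatus{{Name: "MangoStand", Status: "HEALTHY"}}
	for name, s := range toCRStatuses(fromClient, sampleTime) {
		fmt.Println(name, s.Status, s.LastUpdateTime.Format(time.RFC3339))
	}
}
```

With this split, a test helper that fabricates client results has no timestamp to pass in, which is the motivation given for dropping the time input from generateServeStatus.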

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@zcin zcin force-pushed the multi-app-status branch 5 times, most recently from 1c75391 to 755c41c Compare June 2, 2023 07:04
@zcin zcin marked this pull request as ready for review June 2, 2023 22:32
@zcin zcin requested a review from kevin85421 June 2, 2023 22:32
@zcin zcin force-pushed the multi-app-status branch 2 times, most recently from d2372cd to 18195e6 Compare June 6, 2023 01:53

zcin commented Jun 6, 2023

@kevin85421 The tests should be passing, please take a look when you get the chance!

@kevin85421 kevin85421 left a comment

Others: Could you ensure that all variables in the tests are valid and avoid hardcoding the values? For example, both Status: "unhealthy" and deploymentStatus.Status != "HEALTHY" are present in the codebase.

@@ -648,68 +672,99 @@ func (r *RayServiceReconciler) updateServeDeployment(ctx context.Context, raySer
 // updates health timestamps, and checks if the RayCluster is overall healthy.
 // Its return values should be interpreted as
 // (Serve app healthy?, Serve app ready?, error if any)
-func (r *RayServiceReconciler) getAndCheckServeStatus(ctx context.Context, dashboardClient utils.RayDashboardClientInterface, rayServiceServeStatus *rayv1alpha1.RayServiceStatus, unhealthySecondThreshold *int32) (bool, bool, error) {
+func (r *RayServiceReconciler) getAndCheckServeStatus(ctx context.Context, dashboardClient utils.RayDashboardClientInterface, rayServiceServeStatus *rayv1alpha1.RayServiceStatus, serveConfigType utils.RayServeConfigType, unhealthySecondThreshold *int32) (bool, bool, error) {
 serviceUnhealthySecondThreshold := ServiceUnhealthySecondThreshold
Member

Although the variable names are self-explanatory, it would be helpful to have comments for them.

Contributor Author

added

Member

Where are the comments?

Contributor Author

I added comments for serveConfigType above the function

Member

Oh, I am talking about ServiceUnhealthySecondThreshold and serveConfigTypeForTesting.

Contributor Author

Oh I see! Added comments now for them too :)

}

// Check app status
if app.Status != rayv1alpha1.ApplicationStatusEnum.RUNNING {
Member

What is the definition of RUNNING for a Serve application?

Contributor Author

RUNNING = all deployments are HEALTHY

Member

Is it correct that isHealthy must be true if app.Status is not RUNNING?
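
To make the definition "RUNNING = all deployments are HEALTHY" concrete, here is a small sketch of the application-level check. The application type and both helpers are hypothetical, and the non-RUNNING status values in main are illustrative. The sketch covers only the readiness side; how isHealthy is derived when an application is not RUNNING is the open question in this thread and is deliberately not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// Status strings as they appear in the example status output above.
const (
	appRunning        = "RUNNING"
	deploymentHealthy = "HEALTHY"
)

// application is a hypothetical, simplified view of one Serve application.
type application struct {
	Status      string
	Deployments map[string]string // deployment name -> status
}

// allRunning reports whether every application is RUNNING.
// An empty map is treated as "all running" in this sketch.
func allRunning(apps map[string]application) bool {
	for _, app := range apps {
		if app.Status != appRunning {
			return false
		}
	}
	return true
}

// notHealthy returns the sorted "app/deployment" keys whose status is not HEALTHY.
func notHealthy(apps map[string]application) []string {
	var out []string
	for appName, app := range apps {
		for depName, status := range app.Deployments {
			if status != deploymentHealthy {
				out = append(out, appName+"/"+depName)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	apps := map[string]application{
		"default": {
			Status:      "DEPLOYING", // illustrative non-RUNNING value
			Deployments: map[string]string{"DAGDriver": "HEALTHY", "MangoStand": "UPDATING"},
		},
	}
	fmt.Println(allRunning(apps), notHealthy(apps))
}
```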

Signed-off-by: Cindy Zhang <[email protected]>
@zcin
Copy link
Contributor Author

zcin commented Jun 6, 2023

Others: Could you ensure that all variables in the tests are valid and avoid hardcoding the values? For example, both Status: "unhealthy" and deploymentStatus.Status != "HEALTHY" are present in the codebase.

@kevin85421 I've changed all of the hardcoded strings to the enums.
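
The mismatch the reviewer pointed out (a lower-case "unhealthy" literal in one place, a comparison against "HEALTHY" in another) is what a shared enum value prevents. A sketch of the pattern, modeled on the Enum.VALUE access style visible in the diff (rayv1alpha1.DeploymentStatusEnum.HEALTHY); the variable below is a local stand-in and its set of values is an assumption, not the definition from this PR.

```go
package main

import "fmt"

// DeploymentStatusEnum is a local stand-in that mimics the Enum.VALUE access
// pattern seen in the diff. The listed values are assumptions.
var DeploymentStatusEnum = struct {
	UPDATING  string
	HEALTHY   string
	UNHEALTHY string
}{
	UPDATING:  "UPDATING",
	HEALTHY:   "HEALTHY",
	UNHEALTHY: "UNHEALTHY",
}

// isHealthy compares against the enum value instead of a string literal.
func isHealthy(status string) bool {
	return status == DeploymentStatusEnum.HEALTHY
}

func main() {
	// A hard-coded lower-case literal silently fails to match the enum value.
	fmt.Println(DeploymentStatusEnum.UNHEALTHY == "unhealthy")
	fmt.Println(isHealthy(DeploymentStatusEnum.HEALTHY))
}
```

Using the enum in both production code and test fixtures means a typo becomes a compile error instead of a test that passes for the wrong reason.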

cindyz added 2 commits June 6, 2023 20:25
Signed-off-by: cindyz <[email protected]>
Signed-off-by: cindyz <[email protected]>

@kevin85421 kevin85421 left a comment

I left some minor comments. By the way, I have not reviewed the changes in apiserver and proto. We can either ask some KubeRay API Server folks to review them, or revert those changes and make them in a follow-up PR.

// If the user has set a value for `ServeConfigV2`, the config type is MULTI_APP
// Otherwise, the user should have set a value for `ServeConfig`, in which case the config type is SINGLE_APP
func (r *RayServiceReconciler) determineServeConfigType(ctx context.Context, rayServiceInstance *rayv1alpha1.RayService) utils.RayServeConfigType {
if rayServiceInstance.Spec.ServeConfigV2 == (rayv1alpha1.ServeConfigV2{}) {
Member

Is it safe to use == to compare two structs or should we use DeepEqual instead?
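
For background on this question: in Go, == on struct values compiles only when every field is of a comparable type, and it compares field by field (pointer fields by address). A struct that contains a slice, map, or function field cannot be compared with == at all and needs something like reflect.DeepEqual. The sketch below uses two hypothetical config types, not the actual ServeConfigV2 struct, whose fields are not shown in this conversation.

```go
package main

import (
	"fmt"
	"reflect"
)

// comparableConfig has only comparable fields, so == compiles.
type comparableConfig struct {
	Name string
	Port int
}

// sliceConfig contains a slice, so `c == (sliceConfig{})` would be a
// compile-time error.
type sliceConfig struct {
	Name string
	Apps []string
}

// isUnset uses == against the zero value; valid because all fields are comparable.
func isUnset(c comparableConfig) bool {
	return c == (comparableConfig{})
}

// isUnsetDeep uses reflect.DeepEqual, which also handles slices and maps.
// Note that DeepEqual treats a nil slice and an empty slice as different.
func isUnsetDeep(c sliceConfig) bool {
	return reflect.DeepEqual(c, sliceConfig{})
}

func main() {
	fmt.Println(isUnset(comparableConfig{}))
	fmt.Println(isUnsetDeep(sliceConfig{}))
	fmt.Println(isUnsetDeep(sliceConfig{Apps: []string{}}))
}
```

So == is safe as long as the struct stays comparable; if a slice or map field is added later, the comparison stops compiling rather than silently misbehaving.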

HealthLastUpdateTime: &timeNow,
}

if deployment.Status != rayv1alpha1.DeploymentStatusEnum.HEALTHY {
Member

If a replica is not healthy, it is unable to serve any requests. Is that correct?
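
The diff context shows a default (ServiceUnhealthySecondThreshold), an optional override (unhealthySecondThreshold *int32), and a HealthLastUpdateTime field. One plausible way these fit together is sketched below. This is an assumption about the intent, not the controller's code: defaultUnhealthySeconds is a placeholder value, and resolveThreshold, unhealthyTooLong, and sampleTime are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// defaultUnhealthySeconds is a placeholder; the real value of
// ServiceUnhealthySecondThreshold is not shown in this conversation.
const defaultUnhealthySeconds int32 = 60

// sampleTime matches the timestamps in the example status above.
var sampleTime = time.Date(2023, time.May, 26, 20, 15, 1, 0, time.UTC)

// resolveThreshold returns the user-supplied override if set, else the default.
func resolveThreshold(override *int32) int32 {
	if override != nil {
		return *override
	}
	return defaultUnhealthySeconds
}

// unhealthyTooLong reports whether a deployment that was last seen healthy at
// `since` has stayed non-HEALTHY for longer than the threshold at time `now`.
func unhealthyTooLong(since, now time.Time, thresholdSeconds int32) bool {
	return now.Sub(since).Seconds() > float64(thresholdSeconds)
}

func main() {
	custom := int32(30)
	threshold := resolveThreshold(&custom)
	now := sampleTime.Add(45 * time.Second)
	fmt.Println(threshold, unhealthyTooLong(sampleTime, now, threshold))
}
```

Under this reading, a single non-HEALTHY observation does not immediately mark the service unhealthy; only a sustained period beyond the threshold does.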


zcin commented Jun 6, 2023

I left some minor comments. By the way, I have not reviewed the changes in apiserver and proto. We can either ask some KubeRay API Server folks to review them, or revert those changes and make them in a follow-up PR.

Hmm, I don't think it's possible to revert the changes (correct me if I'm wrong), because the code in apiserver references the rayservice types. Since I changed the rayservice types, I believe the code would error.
Could you ping someone who can review these changes? Should be very minor.

cindyz added 2 commits June 6, 2023 21:28
Signed-off-by: cindyz <[email protected]>

@kevin85421 kevin85421 left a comment

LGTM (for changes not under apiserver/ and proto/)

cc @Jeffwan @scarlet25151 @blublinsky would you mind taking a look at the changes for apiserver/ and proto/? Thanks!

@kevin85421
Member

Tomorrow, I will review the changes made under the proto/ and apiserver/ directories. @zcin, could you please provide more details on how to test the modifications in these two directories? In addition, could you rebase with the master branch to resolve the conflicts? Thanks!


zcin commented Jun 9, 2023

Tomorrow, I will review the changes made under the proto/ and apiserver/ directories. @zcin, could you please provide more details on how to test the modifications in these two directories? In addition, could you rebase with the master branch to resolve the conflicts? Thanks!

@kevin85421 Resolved the conflicts! As for the changes under apiserver and proto, I only made the bare minimum changes to make the code compile because the structs changed in rayservice_types.go. I am not sure how to test it.

@blublinsky
Contributor

Changes to proto/ and apiserver/ LGTM


scarlet25151 commented Jun 9, 2023

proto/ and apiserver/ parts LGTM.

@kevin85421
Member

Thanks, @blublinsky and @scarlet25151, for the review!

@kevin85421
Member

CC: @architkulkarni, this pull request should not be included in the minor release v0.5.2. Is it acceptable to merge this pull request at this time?

@architkulkarni
Contributor

@kevin85421 Sounds good, fine with me.

@kevin85421 kevin85421 merged commit 0ae9fc1 into ray-project:master Jun 9, 2023
@zcin zcin mentioned this pull request Jun 17, 2023
2 tasks
@zcin zcin deleted the multi-app-status branch August 25, 2023 17:35
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023