
Fix config frequent update #1014

Merged

Conversation

sihanwang41
Contributor

@sihanwang41 sihanwang41 commented Apr 5, 2023

Why are these changes needed?

The current logic causes the ray-operator to keep sending requests when there are no deployment statuses. When runtime env preparation takes a long time, the ray-operator keeps redeploying the Ray Serve cluster.

With this change, after the ray-operator sends the config request, it waits until it eventually gets the status from Ray Serve.

We are collaborating with a customer to fix the runtime env issue, and have verified that RayService can be started with this change.
Config:
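The behavior change can be sketched as follows. The struct shapes and the function `shouldRedeploy` are illustrative, not the actual KubeRay identifiers: the point is that an empty status list while a config request is already in flight now means "wait", not "redeploy".

```go
package main

import "fmt"

// DeploymentStatus is an assumed shape for one entry returned by
// GET /api/serve/deployments/status.
type DeploymentStatus struct {
	Name   string
	Status string // e.g. "HEALTHY", "UPDATING"
}

// ServeDeploymentStatuses is an assumed wrapper for the full response.
type ServeDeploymentStatuses struct {
	DeploymentStatuses []DeploymentStatus
}

// shouldRedeploy decides whether the operator should resend the Serve
// config. Before this fix, an empty status list (runtime_env still being
// prepared) was treated as "not deployed", triggering a redeploy loop.
// After the fix, an empty list after the config was sent means the
// operator waits for Serve to start reporting statuses.
func shouldRedeploy(statuses ServeDeploymentStatuses, configSent bool) bool {
	if configSent && len(statuses.DeploymentStatuses) == 0 {
		// Runtime env preparation may still be running; do not redeploy.
		return false
	}
	return !configSent
}

func main() {
	empty := ServeDeploymentStatuses{}
	fmt.Println(shouldRedeploy(empty, true))  // config sent, statuses empty: wait
	fmt.Println(shouldRedeploy(empty, false)) // no config sent yet: deploy
}
```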
Config:

Change serveConfig in ray-operator/config/samples/ray_v1alpha1_rayservice.yaml to the following:

  serveConfig:
    importPath: test_grpc:entry
    runtimeEnv: |
      working_dir: "https://github.com/sihanwang41/test-placeholder/raw/main/grpc_test.zip"

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@sihanwang41 sihanwang41 marked this pull request as ready for review April 11, 2023 16:06
Contributor

@shrekris-anyscale shrekris-anyscale left a comment


This change needs to be unit tested. I'm approving it for now since some users need this fix to go through ASAP, and the current testing framework doesn't let us easily test timing interactions with the runtime_env. We should follow up with a PR that adds tests.

@kevin85421
Member

  1. Would you mind explaining the behavior of GET /api/serve/deployments/status?
  2. When will the deployment_statuses become non-empty (link)?
  3. Where does RayService controller update ServeStatuses?
  4. How to reproduce the issue? Can you reproduce this issue with v0.5.0?

When the runtime env preparation time is long, ray-operator will keep redeploying the ray serve cluster.

What's the relationship between runtime env preparation and the value of deployment_statuses?

@sihanwang41
Contributor Author

  1. Would you mind explaining the behavior of GET /api/serve/deployments/status?
  2. When will the deployment_statuses become non-empty (link)?
  3. Where does RayService controller update ServeStatuses?
  4. How to reproduce the issue? Can you reproduce this issue with v0.5.0?

When the runtime env preparation time is long, ray-operator will keep redeploying the ray serve cluster.

What's the relationship between runtime env preparation and the value of deployment_statuses?

When the runtime env takes a very long time (e.g., the user has a long list of pip packages to install), Serve will not provide deployment statuses since no Serve deployment has started yet.
Without the fix, the service operator keeps sending deploy requests to Ray Serve, so the deploy never succeeds. This is still an issue in v0.5.0.
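The "empty statuses while preparing" condition above can be sketched with a small helper that inspects the JSON body returned by `GET /api/serve/deployments/status`. The response struct here is an assumption based on the field name discussed in this thread, not the exact dashboard schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusResponse mirrors the assumed JSON shape of
// GET /api/serve/deployments/status.
type statusResponse struct {
	DeploymentStatuses []struct {
		Name   string `json:"name"`
		Status string `json:"status"`
	} `json:"deployment_statuses"`
}

// hasStatuses reports whether Serve has started reporting deployments.
// While the runtime_env is still being prepared the list is empty, and
// the operator should wait rather than resend the deploy request.
func hasStatuses(body []byte) (bool, error) {
	var out statusResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return false, err
	}
	return len(out.DeploymentStatuses) > 0, nil
}

func main() {
	preparing := []byte(`{"deployment_statuses": []}`)
	ready := []byte(`{"deployment_statuses": [{"name": "entry", "status": "HEALTHY"}]}`)
	ok1, _ := hasStatuses(preparing)
	ok2, _ := hasStatuses(ready)
	fmt.Println(ok1, ok2) // false true
}
```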

@kevin85421
Member

Synced with @sihanwang41 offline.

  1. Would you mind explaining the behavior of GET /api/serve/deployments/status?
  2. When will the deployment_statuses become non-empty (link)?
  3. Where does RayService controller update ServeStatuses?
  4. How to reproduce the issue? Can you reproduce this issue with v0.5.0?
  5. What's the relationship between runtime env preparation and the value of deployment_statuses?
  • Q1, Q2, Q5: deployment_statuses will be empty if the runtime environment is under preparation.

  • Q4: This issue can be reproduced with v0.5.0.

  • Q3: I read the code, and ServeStatuses appears to be updated along the following code path: Reconcile() -> updateStatusForActiveCluster() -> r.Status().Update() (in Reconcile()). This seems odd to me, but that is a separate issue.

Member

@kevin85421 kevin85421 left a comment


LGTM. Waiting for the test results of test_sample_rayservice_yamls.py.

@sihanwang41
Contributor Author

sihanwang41 commented Apr 13, 2023

LGTM. Waiting for the test results of test_sample_rayservice_yamls.py.

I ran test_sample_rayservice_yamls.py and hit the following error:

2023-04-12T23:26:21.989Z	ERROR	controllers.RayService	Fail to reconcileServe.	{"ServiceName": "default/rayservice-sample", "error": "Put \"http://rayservice-sample-raycluster-k99rg-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.153.82:52365: connect: connection refused"}

It is a consistent error even with the latest master, so I don't think it is caused by this PR.
Everything else looks normal.

@kevin85421
Member

I ran test_sample_rayservice_yamls.py and hit the following error:

2023-04-12T23:26:21.989Z	ERROR	controllers.RayService	Fail to reconcileServe.	{"ServiceName": "default/rayservice-sample", "error": "Put \"http://rayservice-sample-raycluster-k99rg-dashboard-svc.default.svc.cluster.local:52365/api/serve/deployments/\": dial tcp 10.96.153.82:52365: connect: connection refused"}

It is a consistent error even with the latest master, so I don't think it is caused by this PR. Everything else looks normal.

Synced with Sihan offline. This error is due to using an x86 image on a Mac M1. test_sample_rayservice_yamls.py passed 25 times consecutively on my devbox, so this PR does not introduce flakiness.

@kevin85421 kevin85421 merged commit deb29bd into ray-project:master Apr 13, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…rve application (ray-project#1014)

Ignore deployments status to decide whether to deploy serve application