[Serve][Doc] Add Failure Recovery Doc #19166

Merged (2 commits) on Oct 21, 2021

@simon-mo simon-mo commented Oct 6, 2021

Why are these changes needed?

Port over the gist and highlight its experimental status.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -207,6 +207,46 @@ Please refer to the Kubernetes documentation for more information.
.. _`NodePort`: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types


Failure Recovery
================
Ray Serve is resilient to any component failures within the Ray cluster out of the box.

Can you please specify how process and worker node failures are handled?

Failure Recovery
================
Ray Serve is resilient to any component failures within the Ray cluster out of the box.
However, when the Ray cluster goes down, you would need to recover the state by creating a new Ray cluster and re-deploying all Serve deployments into that cluster.

Specify that this means the head node specifically.

And that Ray is not currently HA, but it's on the long-term roadmap, so this is a "temporary limitation".

Comment on lines 245 to 246
While we have native support for on-disk and AWS S3 storage, there is no reason we cannot support more.
You can plug in your own implementation by using the ``custom://`` path and inheriting from the `KVStoreBase`_ class.

Add a call to action asking people to open a GitHub issue and/or contribute a backend for this.
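
For anyone thinking about contributing a backend, here is a rough sketch of what a custom key-value store behind the ``custom://`` path might look like. It is not taken from the PR. The ``KVStoreBase`` class below is a hypothetical stand-in defined locally so the snippet runs without Ray installed; the real base class in Ray Serve may declare different methods or signatures. ``InMemoryKVStore`` and its ``namespace`` argument are illustrative names, and how the ``custom://`` path locates the class is not shown here.

```python
import abc
from typing import Dict, Optional


class KVStoreBase(abc.ABC):
    """Hypothetical stand-in for Ray Serve's KVStoreBase (assumed interface)."""

    @abc.abstractmethod
    def put(self, key: str, val: bytes) -> bool:
        """Store val under key; return True on success."""

    @abc.abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the stored bytes, or None if the key is absent."""

    @abc.abstractmethod
    def delete(self, key: str) -> bool:
        """Remove key; return True if it existed."""


class InMemoryKVStore(KVStoreBase):
    """Illustrative backend that keeps checkpoints in a dict.

    A real backend would write to storage that outlives the Ray cluster
    (a database, an object store, and so on); a dict does not.
    """

    def __init__(self, namespace: str = "") -> None:
        self._namespace = namespace
        self._data: Dict[str, bytes] = {}

    def _storage_key(self, key: str) -> str:
        # Prefix keys so several Serve instances could share one backend.
        return f"{self._namespace}-{key}"

    def put(self, key: str, val: bytes) -> bool:
        if not isinstance(key, str):
            raise TypeError("key must be a string")
        if not isinstance(val, bytes):
            raise TypeError("val must be bytes")
        self._data[self._storage_key(key)] = val
        return True

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(self._storage_key(key))

    def delete(self, key: str) -> bool:
        return self._data.pop(self._storage_key(key), None) is not None
```

The dict keeps the example short, but it would not help with recovery on its own, since the point of the checkpoint store is to survive the cluster going down. Swapping the dict for calls to a durable service is where an actual contribution would differ.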

@edoakes edoakes added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 6, 2021
@simon-mo simon-mo requested a review from edoakes October 20, 2021 01:43
@simon-mo simon-mo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 20, 2021
@simon-mo simon-mo merged commit 32e648e into ray-project:master Oct 21, 2021
3 participants