
[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION #19431

Merged

Conversation

@iasoon (Member) commented Oct 17, 2021

Why are these changes needed?

As described in #19156, at this time we cannot tell whether replica startup is slow due to user code (in a deployment's __init__), or due to the cluster having insufficient resources to start a worker for the replica.
This change separates WrappedReplica initialization ("allocation") and Replica initialization itself ("initialization") into two distinct StartupStates, allowing us to provide more precise user feedback.
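For illustration, a minimal sketch of what the split might look like (the member values, and the states other than the two new ones, are assumptions rather than this PR's code):

    from enum import Enum

    class ReplicaStartupStatus(Enum):
        # Waiting for the cluster to allocate resources for the replica actor.
        PENDING_ALLOCATION = 1
        # Actor allocated; user code (__init__ / reconfigure) still running.
        PENDING_INITIALIZATION = 2
        SUCCEEDED = 3
        FAILED = 4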

@simon-mo would you mind having a quick look at this POC to check whether the approach aligns with what you had in mind? If so, I'll clean up the implementation a bit and add some tests (I tested manually with a small script that creates a deployment with a time.sleep in __init__).
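For reference, such a manual repro might look like this (a sketch against the Ray Serve API of that era; the 60s sleep is arbitrary):

    import time

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    class SlowToStart:
        def __init__(self):
            # Simulate slow user code; the controller should warn that the
            # replica is pending initialization after the warning period.
            time.sleep(60)

        def __call__(self, request):
            return "ready"

    SlowToStart.deploy()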

Related issue number

Closes #19156

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iasoon iasoon marked this pull request as draft October 17, 2021 14:03
@jiaodong jiaodong self-assigned this Oct 18, 2021
@jiaodong (Member) left a comment

Hi @iasoon, thanks for the contribution! It's great to have users who are interested in Ray Serve and willing to help improve it :D

python/ray/serve/backend_state.py (thread resolved)
python/ray/serve/backend_state.py (outdated; thread resolved)
logger.warning(
    f"Deployment '{self._name}' has "
    f"{len(pending_initialization)} replicas that have taken "
    f"more than {SLOW_STARTUP_WARNING_S}s to initialize")
Member commented:

Maybe expand this a bit further to indicate the possible causes and how it differs from pending_allocation replicas? Something like "running your deployment's constructor, such as loading model weights, or reconfiguring your deployment".

Contributor commented:

+1. We typically add what the user can do to alleviate this warning. Something like:
"This means your constructor or reconfigure methods are slow; please try measuring their runtime locally. You can ignore this message if you expect the initialization to take a long time."
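Combining both suggestions, the expanded message could read something like this (a sketch, not the wording that was merged):

    logger.warning(
        f"Deployment '{self._name}' has "
        f"{len(pending_initialization)} replicas that have taken "
        f"more than {SLOW_STARTUP_WARNING_S}s to initialize. This means "
        "your deployment's constructor or reconfigure method is slow "
        "(e.g., loading model weights). Try measuring its runtime locally; "
        "you can ignore this message if you expect initialization to take "
        "a long time.")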

@@ -106,7 +106,7 @@ def recover(self):

     def check_ready(self) -> ReplicaStartupStatus:
         ready = self.ready
-        self.ready = ReplicaStartupStatus.PENDING
+        self.ready = ReplicaStartupStatus.PENDING_INITIALIZATION
Member commented:

Can we also add tests to capture the PENDING_ALLOCATION -> PENDING_INITIALIZATION transition?

Member (Author) commented:

Do you have a suggestion on how this would best be tested? At the moment, ReplicaStartupStatus is kept internal to the BackendState, so as far as I can see there is no evident way to do this.
I tested this code on my machine by creating one deployment when no resources were available (to stay in PENDING_ALLOCATION) and one with a wait in __init__ (to stay in PENDING_INITIALIZATION), then waiting for the warning logs. Maybe I could replicate that setup in a test?

Contributor commented:

@edoakes @jiaodong any suggestions here?

Contributor commented:

Hmm, it will be a little hard to unit test, unfortunately. At a minimum, could you please add an integration test (e.g., in test_deploy.py) that tests these conditions (see the sketch after this list):

  • First case: try to schedule a replica that requires more resources than are available in the cluster (e.g., num_cpus=1000). Check to see that only the first warning is printed.
  • Second case: schedule a replica that hangs forever in its constructor; wait for the warning to be printed.
  • Ideally, also a third case that checks both: start with an unschedulable replica, add resources to the cluster to make it schedulable (e.g., add another node using cluster_utils; we do this in tests/test_cluster.py), then have it hang in the constructor. Check that both warnings are printed.
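A sketch of the first two cases (the _blocking=False argument, the stand-in constant value, and the log-assertion step are assumptions; the merged test's mechanics may differ):

    import time

    import ray
    from ray import serve

    # Stand-in for the constant referenced later in this thread.
    SLOW_STARTUP_WARNING_PERIOD_S = 1

    def test_slow_startup_warnings():
        ray.init(num_cpus=2)
        serve.start()

        # Case 1: unschedulable -- requires far more CPUs than the cluster
        # has, so the replica stays in PENDING_ALLOCATION.
        @serve.deployment(ray_actor_options={"num_cpus": 1000})
        class Unschedulable:
            def __call__(self, request):
                pass

        # Case 2: schedulable, but hangs in its constructor, so the replica
        # stays in PENDING_INITIALIZATION.
        @serve.deployment
        class HangsInInit:
            def __init__(self):
                time.sleep(3600)

            def __call__(self, request):
                pass

        Unschedulable.deploy(_blocking=False)
        HangsInInit.deploy(_blocking=False)

        # Give the controller time to emit both warnings, then assert on
        # the captured controller logs (capture mechanics elided here).
        time.sleep(SLOW_STARTUP_WARNING_PERIOD_S * 1.5)

        serve.shutdown()
        ray.shutdown()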

Member (Author) commented:

Alright, I added some tests corresponding to the cases you outlined @edoakes ! Let me know what you think!

python/ray/serve/replica.py (outdated; thread resolved)
# Is it fine that backend is None here?
# Should we add a check in all methods that use self.backend
# or, alternatively, create an async get_backend() method?
self.backend = None
Member commented:

I think it's worth double-checking the implications here, because we're making the replica constructor an async function, as opposed to only returning when RayServeReplica is constructed (it's a cheap instance, however). It might be OK since we don't consider a replica ready until reconfigure() has returned, so we won't direct traffic to it while there's no backend for handle_request(). cc @simon-mo, can you help confirm?

Contributor commented:

We should be fine just keeping the constructor the old way. A Ray actor runs its init in a blocking way; you can use ray.wait([ready_ref]) to ensure that the actor has properly started up. Is there any special reason for this?
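The pattern being referred to, roughly (a generic Ray sketch, not Serve-specific):

    import ray

    ray.init()

    @ray.remote
    class Replica:
        def __init__(self):
            # Potentially slow setup blocks the actor's startup here.
            pass

        def ready(self):
            # Any method call only returns once __init__ has completed.
            return True

    actor = Replica.remote()
    ready_ref = actor.ready.remote()
    # ray.wait returns early with an empty `done` list on timeout, letting
    # the caller distinguish "still starting up" from "ready".
    done, pending = ray.wait([ready_ref], timeout=5.0)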

@iasoon (Member, Author) commented Oct 20, 2021:

If I understand correctly, the issue we're running into is that when backend initialization is slow (due to a slow __init__ in user code, for example), RayServeWrappedReplica.__init__ will also be slow. We cannot call a remote method on an actor before its __init__ has completed, right? If that's the case, I think we'll need either this async backend initialization, or a different way to signal allocation success back to the BackendState.

Contributor commented:

Ah, I see your confusion here. We can approach this by only initializing the user backend when the controller explicitly tells the replica to initialize it. We don't need to do it in an async method; an async method actually won't help here, because a time-consuming init function will still block the event loop and delay responses.

So we can have something like:

class RayServeWrappedReplica:
    def __init__(self):
        ...  # init serve code
    def perform_user_init(self):
        ...  # init user code (need a better name here)
    def handle_request(self):
        self.check_backend_initialized()
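Fleshed out slightly, that idea might look like the following (a sketch with assumed names and arguments; not the merged code):

    class RayServeWrappedReplica:
        def __init__(self, user_class, init_args):
            # Cheap: the actor counts as allocated once this returns.
            self._user_class = user_class
            self._init_args = init_args
            self.backend = None

        def perform_user_init(self):
            # Expensive user code runs only when the controller asks for it.
            self.backend = self._user_class(*self._init_args)

        def handle_request(self, request):
            assert self.backend is not None, "Replica not initialized yet."
            return self.backend(request)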

Member (Author) commented:

I might still be a bit confused, but I think that's what the current implementation does (admittedly in a bit of a dirty way). I capture the init of the user code in a closure (start_replica) and call that before reconfiguring. I think the async is required to allow for async __init__ functions in the user deployment code, no?
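For context on why the closure needs to be async: if the user defines __init__ as a coroutine function, its result has to be awaited. A sketch of that pattern (the inspect check and helper name are illustrations, not the merged code):

    import inspect

    def make_start_replica(user_class, init_args):
        async def start_replica():
            # Bypass the normal constructor call so that an `async def
            # __init__` (which returns a coroutine) can be awaited.
            instance = user_class.__new__(user_class)
            result = instance.__init__(*init_args)
            if inspect.iscoroutine(result):
                await result
            return instance

        return start_replica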

Contributor commented:

Yup, I read the code again and it makes sense to me!

@simon-mo simon-mo self-assigned this Oct 25, 2021
@simon-mo (Contributor) left a comment

LGTM pending the tests and small nits.

@@ -108,8 +119,21 @@ async def handle_request(
         query = Query(request_args, request_kwargs, request_metadata)
         return await self.backend.handle_request(query)

+    async def poke(self):
Contributor commented:

Suggested change:
-    async def poke(self):
+    async def is_allocated(self):

@iasoon iasoon force-pushed the feature/19156-slow-startup-error-messages branch from a259d4c to b98c572 on October 28, 2021 20:26
@edoakes (Contributor) left a comment

@iasoon this looks really great! Just a nit and a suggestion for an integration test. LMK if you have any questions on how to write this.

Comment on lines 1120 to 1121:

    f"{len(pending_allocation)} replicas that have taken "
    f"more than {SLOW_STARTUP_WARNING_S}s to start up. This "
Contributor commented:

Suggested change:
-    f"{len(pending_allocation)} replicas that have taken "
-    f"more than {SLOW_STARTUP_WARNING_S}s to start up. This "
+    f"{len(pending_allocation)} replicas that have taken "
+    f"more than {SLOW_STARTUP_WARNING_S}s to be scheduled. This "

@edoakes edoakes added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send it back to the reviewer) on Oct 28, 2021
@iasoon iasoon marked this pull request as ready for review October 29, 2021 01:26
@iasoon (Member, Author) commented Oct 29, 2021

Tests are failing on Windows. Is this expected, and should the tests be marked as skipped on Windows?

@edoakes (Contributor) left a comment

@iasoon looks good, just pending adding the tests to run in CI.

Re: the Windows build, yes, please go ahead and skip it on Windows.
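The usual pytest idiom for that (the test name here is a placeholder):

    import sys

    import pytest

    @pytest.mark.skipif(sys.platform == "win32", reason="Failing on Windows.")
    def test_slow_startup_warnings():
        ...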

@@ -0,0 +1,44 @@
+import time
Contributor commented:

@iasoon because you're adding a new test file, you need to add an entry in ray/serve/BUILD in order for this to run in CI. Could you please either add an entry there (copy one of the existing ones; it should be straightforward) or move these tests to an existing test file?

Comment on lines 16 to 18:

    expected_warning = (f"Deployment '{D.name}' has "
                        f"1 replicas that have taken "
                        f"more than 30s to be scheduled.")
Contributor commented:

Nit: it would be nice to also check that the warning is only printed once per deployment, even if there are many replicas blocking (not necessary to add here to merge).

Member (Author) commented:

Good idea! I had a go at implementing it :)


# wait long enough for the warning to be printed
# with a small grace period
time.sleep(SLOW_STARTUP_WARNING_PERIOD_S * 1.5)
Contributor commented:

I'm a little worried this will be flaky in CI (CI is sometimes very, very slow), and also that the test will take a long time to run. Any way you can think of to make it deterministic?

Member (Author) commented:

Hmm, it's a bit tricky because this behaviour is purely time-based. Maybe we could reuse MockTimer from tests/test_backend_state.py for this?
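For reference, a MockTimer-style fake clock is just a controllable time source, along these lines (a generic sketch; the real MockTimer in tests/test_backend_state.py may differ):

    class MockTimer:
        def __init__(self, start_time=0.0):
            self._time = start_time

        def time(self):
            return self._time

        def advance(self, delta):
            # Tests call this instead of sleeping, so timing-based logic
            # can be driven deterministically.
            self._time += delta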

Member (Author) commented:

Update: AFAIK we can't use MockTimer, since the serve controller runs in a different process from the test, and to my knowledge you cannot mock in another process.

@edoakes (Contributor) commented Nov 1, 2021

@iasoon looks like tests all passed but you have some merge conflicts -- mind addressing those and re-pushing?

@edoakes (Contributor) commented Nov 2, 2021

Sorry @iasoon, we have an outage with our build system so tests are not running properly (hence the failures). Should be fixed soon...

@simon-mo (Contributor) commented Nov 2, 2021

Please pull master to resolve the build breakage! Thank you.

Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] [Serve] More helpful error message for long startup time
4 participants