
[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION #19431

Merged

Conversation

@iasoon (Member) commented Oct 17, 2021

Why are these changes needed?

As described in #19156, at this time we cannot tell whether replica startup is slow due to user code (in a deployment's __init__), or due to the cluster having insufficient resources to start a worker for the replica.
This change separates WrappedReplica initialization ("allocation") and Replica initialization itself ("initialization") into two distinct StartupStates, allowing us to provide more precise user feedback.
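For illustration, a minimal sketch of what the split might look like (the member values, and the states other than the two new ones, are assumptions rather than this PR's code):

    from enum import Enum

    class ReplicaStartupStatus(Enum):
        # Waiting for the cluster to allocate resources for the replica actor.
        PENDING_ALLOCATION = 1
        # Actor allocated; user code (__init__ / reconfigure) still running.
        PENDING_INITIALIZATION = 2
        SUCCEEDED = 3
        FAILED = 4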

@simon-mo would you mind having a quick look at this POC to check whether the approach aligns with what you had in mind? If so, I'll clean up the implementation a bit and add some tests (I tested manually with a small script that creates a deployment with a time.sleep in __init__).
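For reference, such a manual repro might look like this (a sketch against the Ray Serve API of that era; the 60s sleep is arbitrary):

    import time

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    class SlowToStart:
        def __init__(self):
            # Simulate slow user code; the controller should warn that the
            # replica is pending initialization after the warning period.
            time.sleep(60)

        def __call__(self, request):
            return "ready"

    SlowToStart.deploy()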

Related issue number

Closes #19156

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iasoon iasoon marked this pull request as draft October 17, 2021 14:03
@jiaodong jiaodong self-assigned this Oct 18, 2021
@jiaodong (Member) left a comment

Hi @iasoon, thanks for the contribution! It's great to have users who are interested in Ray Serve and willing to help improve it :D

python/ray/serve/backend_state.py (thread resolved)
python/ray/serve/backend_state.py (outdated; thread resolved)
logger.warning(
    f"Deployment '{self._name}' has "
    f"{len(pending_initialization)} replicas that have taken "
    f"more than {SLOW_STARTUP_WARNING_S}s to initialize")
Member commented:

Maybe expand this a bit further to indicate the possible causes and how it differs from pending_allocation replicas? Something like "running your deployment's constructor, such as loading model weights, or reconfiguring your deployment".

Contributor commented:

+1. We typically add what the user can do to alleviate this warning. Something like:
"This means your constructor or reconfigure methods are slow; please try measuring their runtime locally. You can ignore this message if you expect the initialization to take a long time."
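Combining both suggestions, the expanded message could read something like this (a sketch, not the wording that was merged):

    logger.warning(
        f"Deployment '{self._name}' has "
        f"{len(pending_initialization)} replicas that have taken "
        f"more than {SLOW_STARTUP_WARNING_S}s to initialize. This means "
        "your deployment's constructor or reconfigure method is slow "
        "(e.g., loading model weights). Try measuring its runtime locally; "
        "you can ignore this message if you expect initialization to take "
        "a long time.")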

@@ -106,7 +106,7 @@ def recover(self):

     def check_ready(self) -> ReplicaStartupStatus:
         ready = self.ready
-        self.ready = ReplicaStartupStatus.PENDING
+        self.ready = ReplicaStartupStatus.PENDING_INITIALIZATION
Member commented:

Can we also add tests to capture the PENDING_ALLOCATION -> PENDING_INITIALIZATION transition?

Member (Author) commented:

Do you have a suggestion on how this would best be tested? At the moment, ReplicaStartupStatus is kept internal to the BackendState, so as far as I can see there is no evident way to do this.
I tested this code on my machine by creating one deployment when no resources were available (to stay in PENDING_ALLOCATION) and one with a wait in __init__ (to stay in PENDING_INITIALIZATION), then waiting for the warning logs. Maybe I could replicate that setup in a test?

Contributor commented:

@edoakes @jiaodong any suggestions here?

Contributor commented:

Hmm, it will be a little hard to unit test, unfortunately. At a minimum, could you please add an integration test (e.g., in test_deploy.py) that tests these conditions (see the sketch after this list):

  • First case: try to schedule a replica that requires more resources than are available in the cluster (e.g., num_cpus=1000). Check to see that only the first warning is printed.
  • Second case: schedule a replica that hangs forever in its constructor; wait for the warning to be printed.
  • Ideally, also a third case that checks both: start with an unschedulable replica, add resources to the cluster to make it schedulable (e.g., add another node using cluster_utils; we do this in tests/test_cluster.py), then have it hang in the constructor. Check that both warnings are printed.
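A sketch of the first two cases (the _blocking=False argument, the stand-in constant value, and the log-assertion step are assumptions; the merged test's mechanics may differ):

    import time

    import ray
    from ray import serve

    # Stand-in for the constant referenced later in this thread.
    SLOW_STARTUP_WARNING_PERIOD_S = 1

    def test_slow_startup_warnings():
        ray.init(num_cpus=2)
        serve.start()

        # Case 1: unschedulable -- requires far more CPUs than the cluster
        # has, so the replica stays in PENDING_ALLOCATION.
        @serve.deployment(ray_actor_options={"num_cpus": 1000})
        class Unschedulable:
            def __call__(self, request):
                pass

        # Case 2: schedulable, but hangs in its constructor, so the replica
        # stays in PENDING_INITIALIZATION.
        @serve.deployment
        class HangsInInit:
            def __init__(self):
                time.sleep(3600)

            def __call__(self, request):
                pass

        Unschedulable.deploy(_blocking=False)
        HangsInInit.deploy(_blocking=False)

        # Give the controller time to emit both warnings, then assert on
        # the captured controller logs (capture mechanics elided here).
        time.sleep(SLOW_STARTUP_WARNING_PERIOD_S * 1.5)

        serve.shutdown()
        ray.shutdown()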

Member (Author) commented:

Alright, I added some tests corresponding to the cases you outlined @edoakes ! Let me know what you think!

python/ray/serve/replica.py (outdated; thread resolved)
# Is it fine that backend is None here?
# Should we add a check in all methods that use self.backend
# or, alternatively, create an async get_backend() method?
self.backend = None
Member commented:

I think it's worth double-checking the implications here, because we're making the replica constructor an async function, as opposed to only returning when RayServeReplica is constructed (it's a cheap instance, however). It might be OK since we don't consider a replica ready until reconfigure() has returned, so we won't direct traffic to it while there's no backend for handle_request(). cc @simon-mo, can you help confirm?

Contributor commented:

We should be fine just keeping the constructor the old way. A Ray actor runs its init in a blocking way; you can use ray.wait([ready_ref]) to ensure that the actor has properly started up. Is there any special reason for this?
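The pattern being referred to, roughly (a generic Ray sketch, not Serve-specific):

    import ray

    ray.init()

    @ray.remote
    class Replica:
        def __init__(self):
            # Potentially slow setup blocks the actor's startup here.
            pass

        def ready(self):
            # Any method call only returns once __init__ has completed.
            return True

    actor = Replica.remote()
    ready_ref = actor.ready.remote()
    # ray.wait returns early with an empty `done` list on timeout, letting
    # the caller distinguish "still starting up" from "ready".
    done, pending = ray.wait([ready_ref], timeout=5.0)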

@iasoon (Member, Author) commented Oct 20, 2021:

If I understand correctly, the issue we're running into is that when backend initialization is slow (due to a slow __init__ in user code, for example), RayServeWrappedReplica.__init__ will also be slow. We cannot call a remote method on an actor before its __init__ has completed, right? If that's the case, I think we'll need either this async backend initialization, or a different way to signal allocation success back to the BackendState.

Contributor commented:

Ah, I see your confusion here. We can approach this by only initializing the user backend when the controller explicitly tells the replica to initialize it. We don't need to do it in an async method; an async method actually won't help here, because a time-consuming init function will still block the event loop and delay responses.

So we can have something like:

class RayServeWrappedReplica:
    def __init__(self):
        ...  # init serve code
    def perform_user_init(self):
        ...  # init user code (need a better name here)
    def handle_request(self):
        self.check_backend_initialized()
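Fleshed out slightly, that idea might look like the following (a sketch with assumed names and arguments; not the merged code):

    class RayServeWrappedReplica:
        def __init__(self, user_class, init_args):
            # Cheap: the actor counts as allocated once this returns.
            self._user_class = user_class
            self._init_args = init_args
            self.backend = None

        def perform_user_init(self):
            # Expensive user code runs only when the controller asks for it.
            self.backend = self._user_class(*self._init_args)

        def handle_request(self, request):
            assert self.backend is not None, "Replica not initialized yet."
            return self.backend(request)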

Member (Author) commented:

I might still be a bit confused, but I think that's what the current implementation does (admittedly in a bit of a dirty way). I capture the init of the user code in a closure (start_replica) and call that before reconfiguring. I think the async is required to allow for async __init__ functions in the user deployment code, no?
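For context on why the closure needs to be async: if the user defines __init__ as a coroutine function, its result has to be awaited. A sketch of that pattern (the inspect check and helper name are illustrations, not the merged code):

    import inspect

    def make_start_replica(user_class, init_args):
        async def start_replica():
            # Bypass the normal constructor call so that an `async def
            # __init__` (which returns a coroutine) can be awaited.
            instance = user_class.__new__(user_class)
            result = instance.__init__(*init_args)
            if inspect.iscoroutine(result):
                await result
            return instance

        return start_replica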

Contributor commented:

Yup, I read the code again and it makes sense to me!

@simon-mo simon-mo self-assigned this Oct 25, 2021
@simon-mo (Contributor) left a comment

LGTM pending the tests and small nits.

@@ -108,8 +119,21 @@ async def handle_request(
         query = Query(request_args, request_kwargs, request_metadata)
         return await self.backend.handle_request(query)

+    async def poke(self):
Contributor commented:

Suggested change:
-    async def poke(self):
+    async def is_allocated(self):

@iasoon iasoon force-pushed the feature/19156-slow-startup-error-messages branch from a259d4c to b98c572 on October 28, 2021 20:26
@edoakes (Contributor) left a comment

@iasoon this looks really great! Just a nit and a suggestion for an integration test. LMK if you have any questions on how to write this.

Comment on lines 1120 to 1121:

    f"{len(pending_allocation)} replicas that have taken "
    f"more than {SLOW_STARTUP_WARNING_S}s to start up. This "
Contributor commented:

Suggested change:
-    f"{len(pending_allocation)} replicas that have taken "
-    f"more than {SLOW_STARTUP_WARNING_S}s to start up. This "
+    f"{len(pending_allocation)} replicas that have taken "
+    f"more than {SLOW_STARTUP_WARNING_S}s to be scheduled. This "

@edoakes edoakes added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send it back to the reviewer) on Oct 28, 2021
@iasoon iasoon marked this pull request as ready for review October 29, 2021 01:26
@iasoon (Member, Author) commented Oct 29, 2021

Tests are failing on Windows. Is this expected, and should the tests be marked as skipped on Windows?

@edoakes (Contributor) left a comment

@iasoon looks good, just pending adding the tests to run in CI.

Re: the Windows build, yes, please go ahead and skip it on Windows.
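The usual pytest idiom for that (the test name here is a placeholder):

    import sys

    import pytest

    @pytest.mark.skipif(sys.platform == "win32", reason="Failing on Windows.")
    def test_slow_startup_warnings():
        ...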

@@ -0,0 +1,44 @@
+import time
Contributor commented:

@iasoon because you're adding a new test file, you need to add an entry in ray/serve/BUILD in order for this to run in CI. Could you please either add an entry there (copy one of the existing ones; it should be straightforward) or move these tests to an existing test file?

Comment on lines 16 to 18:

    expected_warning = (f"Deployment '{D.name}' has "
                        f"1 replicas that have taken "
                        f"more than 30s to be scheduled.")
Contributor commented:

Nit: it would be nice to also check that the warning is only printed once per deployment, even if there are many replicas blocking (not necessary to add here to merge).

Member (Author) commented:

Good idea! I had a go at implementing it :)


# wait long enough for the warning to be printed
# with a small grace period
time.sleep(SLOW_STARTUP_WARNING_PERIOD_S * 1.5)
Contributor commented:

I'm a little worried this will be flaky in CI (CI is sometimes very, very slow), and also that the test will take a long time to run. Any way you can think of to make it deterministic?

Member (Author) commented:

Hmm, it's a bit tricky because this behaviour is purely time-based. Maybe we could reuse MockTimer from tests/test_backend_state.py for this?
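For reference, a MockTimer-style fake clock is just a controllable time source, along these lines (a generic sketch; the real MockTimer in tests/test_backend_state.py may differ):

    class MockTimer:
        def __init__(self, start_time=0.0):
            self._time = start_time

        def time(self):
            return self._time

        def advance(self, delta):
            # Tests call this instead of sleeping, so timing-based logic
            # can be driven deterministically.
            self._time += delta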

Member (Author) commented:

Update: AFAIK we can't use MockTimer, since the serve controller runs in a different process from the test, and to my knowledge you cannot mock in another process.

@edoakes (Contributor) commented Nov 1, 2021

@iasoon looks like tests all passed but you have some merge conflicts -- mind addressing those and re-pushing?

@edoakes (Contributor) commented Nov 2, 2021

Sorry @iasoon, we have an outage with our build system so tests are not running properly (hence the failures). Should be fixed soon...

@simon-mo (Contributor) commented Nov 2, 2021

Please pull master to resolve the build breakage! Thank you.

Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] [Serve] More helpful error message for long startup time
4 participants