[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION #19431
Conversation
Hi @iasoon, thanks for the contribution! It's great to have users who are interested in Ray Serve and willing to help improve it :D
python/ray/serve/backend_state.py
Outdated
logger.warning(
    f"Deployment '{self._name}' has "
    f"{len(pending_initialization)} replicas that have taken "
    f"more than {SLOW_STARTUP_WARNING_S}s to initialize")
Maybe expand this a bit further to indicate what the possible causes are and how it's different from pending_allocation replicas? Something like "running your deployment's constructor, such as loading model weights, or reconfiguring your deployment".
+1. We typically add a note on what the user can do to alleviate this warning. Something like:
"This means your constructor or reconfigure methods are slow; please try to measure their runtime locally. You can ignore this message if you expect initialization to take a long time."
@@ -106,7 +106,7 @@ def recover(self):

     def check_ready(self) -> ReplicaStartupStatus:
         ready = self.ready
-        self.ready = ReplicaStartupStatus.PENDING
+        self.ready = ReplicaStartupStatus.PENDING_INITIALIZATION
Can we also add tests to capture the PENDING_ALLOCATION -> PENDING_INITIALIZATION transition?
Do you have a suggestion on how this would best be tested? At the moment ReplicaStartupStatus is kept internal to the BackendState, so as far as I can see there is no evident way to do this.
I tested this code on my machine by creating a deployment when no resources were available (to stay in PENDING_ALLOCATION) and one with a wait in __init__ (to stay in PENDING_INITIALIZATION), then waiting for the warning logs. Maybe I could replicate that setup in a test?
Hmm, it will be a little hard to unit test, unfortunately. At a minimum, could you please add an integration test (e.g., in test_deploy.py) that tests the two conditions below? (A rough sketch of the problem deployments follows the list.)
- First case: try to schedule a replica that requires more resources than are available in the cluster (e.g., num_cpus=1000). Check that only the first warning is printed.
- Second case: schedule a replica that hangs forever in its constructor, and wait for the warning to be printed.
- Ideally, also a third case that checks both: start with an unschedulable replica, add resources to the cluster to make it schedulable (e.g., add another node using cluster_utils, as we do in tests/test_cluster.py), then have it hang in its constructor. Check that both warnings are printed.
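For concreteness, the two problem deployments could look roughly like this (illustrative only, not the exact test that was added; the real test also needs a way to observe the controller's warning logs):

import time

from ray import serve

@serve.deployment(ray_actor_options={"num_cpus": 1000})
class Unschedulable:
    # Stays in PENDING_ALLOCATION: no node can satisfy 1000 CPUs.
    pass

@serve.deployment
class HangingInit:
    def __init__(self):
        # Stays in PENDING_INITIALIZATION: the constructor never returns.
        time.sleep(10**9)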
Alright, I added some tests corresponding to the cases you outlined, @edoakes! Let me know what you think!
# Is it fine that backend is None here?
# Should we add a check in all methods that use self.backend
# or, alternatively, create an async get_backend() method?
self.backend = None
I think it's worth double-checking the implications here, because we're making the replica constructor an async function, as opposed to only returning once RayServeReplica is constructed (it's a cheap instance, however). It might be OK since we don't consider a replica ready until reconfigure() has returned, so we won't direct traffic to it while there's no backend for handle_request(). cc @simon-mo, can you help confirm?
We should be fine just keeping the constructor the old way. A Ray actor will run its init in a blocking way; you can use ray.wait([ready_ref]) to ensure that the actor has properly started up. Is there any special reason for this?
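For reference, this is roughly the pattern being referred to (a minimal sketch; the Replica class and ready method are illustrative, not Serve code):

import ray

ray.init()

@ray.remote
class Replica:
    def __init__(self):
        # A potentially slow user init runs here; other method calls queue behind it.
        pass

    def ready(self):
        return "ok"

replica = Replica.remote()
ready_ref = replica.ready.remote()
# Returns once __init__ and ready() have completed, or when the timeout expires.
done, not_done = ray.wait([ready_ref], timeout=30.0)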
If I understand correctly, the issue we're running into is that when backend initialization is slow (due to a slow init in user code, for example), RayServeWrappedReplica.__init__ will also be slow. We cannot call a remote method on an actor before its __init__ has completed, right? If that's the case, I think we'll either need this async backend initialization or a different way to signal allocation success back to BackendState.
Ah, I see your confusion here. We can approach this by only initializing the user backend when the controller explicitly tells the replica to initialize it. We don't need to do it in an async method; an async method actually won't help here, because a time-consuming init function will still block the event loop and delay responses.
So we can have something like:
class RayServeWrappedReplica:
    def __init__(self): ...           # init serve code
    def perform_user_init(self): ...  # init user code (needs a better name)
    def handle_request(self):
        self.check_backend_initialized()
I might still be a bit confused, but I think that's what the current implementation does (admittedly in a bit of a dirty way). I capture the init of the user code in a closure (start_replica) and call that before reconfiguring. I think the async is required to allow for async __init__ functions in the user deployment code, no?
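For illustration, a minimal sketch of the closure-based deferred init being described (start_replica is the name used in this PR; everything else here is an assumption, not the actual implementation):

import inspect

class RayServeWrappedReplica:
    def __init__(self, backend_def, init_args):
        # Keep the actor constructor cheap so allocation can be confirmed quickly;
        # the user's (possibly slow) init is only captured here, not executed.
        async def start_replica():
            backend = backend_def(*init_args)
            if inspect.isawaitable(backend):
                # Tolerate user constructors/factories that are coroutines.
                backend = await backend
            self.backend = backend

        self.backend = None
        self._start_replica = start_replica

    async def reconfigure(self, user_config):
        if self.backend is None:
            await self._start_replica()
        # ... apply user_config to self.backend ...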
Yup, I read the code again and it makes sense to me!
LGTM pending the tests and small nits.
python/ray/serve/replica.py
Outdated
@@ -108,8 +119,21 @@ async def handle_request(
        query = Query(request_args, request_kwargs, request_metadata)
        return await self.backend.handle_request(query)

    async def poke(self):
Suggested change:
-    async def poke(self):
+    async def is_allocated(self):
Force-pushed from a259d4c to b98c572.
@iasoon this looks really great! Just a nit and a suggestion for an integration test. LMK if you have any questions on how to write this.
python/ray/serve/backend_state.py
Outdated
f"{len(pending_allocation)} replicas that have taken " | ||
f"more than {SLOW_STARTUP_WARNING_S}s to start up. This " |
f"{len(pending_allocation)} replicas that have taken " | |
f"more than {SLOW_STARTUP_WARNING_S}s to start up. This " | |
f"{len(pending_allocation)} replicas that have taken " | |
f"more than {SLOW_STARTUP_WARNING_S}s to be scheduled. This " |
Tests are failing on Windows. Is this expected, and should the test be marked as skip on Windows?
@iasoon looks good, just pending adding the tests to run in CI.
Re: the Windows build, yes, please go ahead and skip it on Windows.
@@ -0,0 +1,44 @@
import time
@iasoon because you're adding a new test file, you need to add an entry in ray/serve/BUILD in order for this to run in CI. Could you please either add an entry there (copy one of the existing ones; it should be straightforward) or else move these tests to an existing test file?
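For reference, a new py_test entry in ray/serve/BUILD has roughly this shape (a Bazel/Starlark sketch copied from the pattern of existing entries; the name is a placeholder and the exact attributes may differ):

py_test(
    name = "test_my_new_file",  # should match tests/test_my_new_file.py (placeholder)
    size = "medium",
    srcs = serve_tests_srcs,
    tags = ["exclusive", "team:serve"],
    deps = [":serve_lib"],
)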
expected_warning = (f"Deployment '{D.name}' has " | ||
f"1 replicas that have taken " | ||
f"more than 30s to be scheduled.") |
nit: it would be nice to also have this check that the warning is only printed once per deployment, even if there are many replicas blocking (not necessary to add here before merging)
Good idea! I had a go at implementing it :)
# wait long enough for the warning to be printed
# with a small grace period
time.sleep(SLOW_STARTUP_WARNING_PERIOD_S * 1.5)
I'm a little worried this will be flaky in CI (CI is sometimes very very slow) and also that the test will take a long time to run. Any way you can think of to make it deterministic?
Hmm, it's a bit tricky because this behaviour is purely time-based. Maybe we could re-use MockTimer from tests/test_backend_state.py for this?
Update: AFAIK we can't use MockTimer, since the Serve controller runs in a different process from the test, and to my knowledge you cannot mock in another process.
@iasoon looks like tests all passed but you have some merge conflicts -- mind addressing those and re-pushing?
…w-startup-error-messages
Sorry @iasoon, we have an outage with our build system so tests are not running properly (hence the failures). Should be fixed soon...
Please pull master to resolve the build breakage! Thank you.
…w-startup-error-messages
Why are these changes needed?
As described in #19156, at this time we cannot tell whether replica startup is slow due to user code (in a deployment's __init__) or due to the cluster having insufficient resources to start a worker for the replica. This change separates WrappedReplica initialization ("allocation") and Replica initialization itself ("initialization") into two distinct startup states, allowing us to provide more precise user feedback.
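Roughly, the resulting split looks like this (simplified sketch, not the exact definition in backend_state.py):

from enum import Enum

class ReplicaStartupStatus(Enum):
    PENDING_ALLOCATION = 1      # waiting for the replica actor to be scheduled on a node
    PENDING_INITIALIZATION = 2  # actor allocated; running the deployment's __init__ / reconfigure
    SUCCEEDED = 3
    FAILED = 4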
@simon-mo would you mind having a quick look at this POC to check whether the approach aligns with what you had in mind? If so, I'll clean up the implementation a bit and add some tests (I tested manually with a small script that creates a deployment with a time.sleep in __init__).
Related issue number
Closes #19156
Checks
- I've run scripts/format.sh to lint the changes in this PR.