[serve][deploy refactor][2/X] Move lightweight update logic to DeploymentVersion #34430
Conversation
python/ray/serve/_private/version.py (Outdated)
"max_concurrent_queries": deployment_config.max_concurrent_queries, | ||
"graceful_shutdown_timeout_s": ( | ||
deployment_config.graceful_shutdown_timeout_s | ||
), | ||
"graceful_shutdown_wait_loop_s": ( | ||
deployment_config.graceful_shutdown_wait_loop_s | ||
), | ||
"health_check_period_s": deployment_config.health_check_period_s, | ||
"health_check_timeout_s": deployment_config.health_check_timeout_s, |
Are these the deployment config options that, if changed, we want to redeploy the replicas?
It would be great if we could encode the options that require lightweight/heavyweight updates directly in the code somehow, e.g. an annotation on the `deployment_config` struct definition. Can you think of a way to do that?
But to answer your question directly: if we are broadcasting these updates to the replicas using the existing logic for `user_config` (as I saw above), then none of these should require restarting the replicas.
> It would be great if we could encode the options that require lightweight/heavyweight updates directly in the code somehow, e.g. an annotation on the `deployment_config` struct definition. Can you think of a way to do that?
@edoakes I've added it as an extra arg in each pydantic field.
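For reference, a minimal sketch of that approach, assuming pydantic v1 field extras; the enum values and the field set here are illustrative, not the exact ones in the PR:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class DeploymentOptionUpdateType(str, Enum):
    # Changing the option forces replicas to restart.
    HEAVY_WEIGHT = "HeavyWeight"
    # Changing the option requires calling reconfigure on the replica actor.
    NEEDS_RECONFIGURE = "NeedsReconfigure"
    # Changing the option only updates controller-side state.
    LIGHT_WEIGHT = "LightWeight"


class ExampleDeploymentConfig(BaseModel):
    # Extra kwargs passed to Field() land in .field_info.extra (pydantic v1).
    num_replicas: int = Field(
        default=1, update_type=DeploymentOptionUpdateType.LIGHT_WEIGHT
    )
    user_config: Optional[dict] = Field(
        default=None, update_type=DeploymentOptionUpdateType.NEEDS_RECONFIGURE
    )


# Reading the annotation back, e.g. when deciding how to apply an update:
field = ExampleDeploymentConfig.__fields__["num_replicas"]
assert field.field_info.extra["update_type"] == DeploymentOptionUpdateType.LIGHT_WEIGHT
```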
@zcin this is fantastic! It also solves a long-standing issue we've had where some of the config options were not actually dynamically updated, because we weren't broadcasting the config!
```diff
@@ -21,7 +21,7 @@ def __init__(self, dag: RayServeDAGHandle):
         self.dag = dag

     async def __call__(self):
-        return await self.dag.remote()
+        return await (await self.dag.remote())
```
Any reason we have this change in this PR?
I'm pretty sure this code didn't work before 🤦‍♀️. Not sure how all the tests were passing in `test_deploy_config_update`, but some of the false positives were because the responses weren't cast to int, so I've updated the test to do that.
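For anyone following along, a sketch of why the double await is needed, assuming `dag.remote()` is a coroutine that resolves to an ObjectRef rather than to the final result:

```python
# Inside the async __call__ above:
obj_ref = await self.dag.remote()  # schedules the DAG run; yields an ObjectRef
result = await obj_ref             # resolves the ObjectRef to the actual value
return result
```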
ouch
"import_path": app_config.import_path, | ||
"runtime_env": app_config.runtime_env, | ||
}, | ||
sort_keys=True, |
Nice use of `sort_keys`.
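A quick illustration of why `sort_keys` matters here: without it, two semantically identical configs can serialize (and therefore hash) differently. The hashing call below is illustrative:

```python
import hashlib
import json

a = {"import_path": "app:graph", "runtime_env": {}}
b = {"runtime_env": {}, "import_path": "app:graph"}  # same keys, different order

# json.dumps preserves dict insertion order by default, so these would differ
# without sort_keys; with it, the serialization is deterministic.
sa = json.dumps(a, sort_keys=True)
sb = json.dumps(b, sort_keys=True)
assert sa == sb
assert hashlib.md5(sa.encode()).hexdigest() == hashlib.md5(sb.encode()).hexdigest()
```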
```diff
@@ -1180,10 +1219,14 @@ def deploy(self, deployment_info: DeploymentInfo) -> bool:
         existing_info = self._target_state.info
         if existing_info is not None:
             # Redeploying should not reset the deployment's start time.
-            deployment_info.start_time_ms = existing_info.start_time_ms
+            if not self._target_state.deleting:
```
Regarding `if not self._target_state.deleting`: what's the reason for adding this check? Is this an existing bug?
I'm not sure; I don't know how often this case will occur (probably not often?). This is specifically for when a deployment has been deleted (e.g. the deployment graph changed) and then redeployed (e.g. the deployment graph changed back).
The biggest change I made here is on line 1211: if all configs are the same but the current target state is deleting, the deployment isn't blocked from redeploying. I ran into this problem when debugging some unit tests; not sure why it pops up in my PR specifically.
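To make the scenario concrete, a hypothetical sketch of the guard being described (illustrative, not the actual controller code):

```python
# deploy() is called while the deployment's target state is "deleting", e.g.
# the graph was changed (deleting this deployment) and then changed back.
if existing_info is not None:
    if not self._target_state.deleting:
        # Only a live deployment should keep its original start time.
        deployment_info.start_time_ms = existing_info.start_time_ms

    # Even if the new target is identical to the current one, a deployment
    # that is mid-deletion must not be blocked from redeploying.
    if deployment_info == existing_info and not self._target_state.deleting:
        return False  # no-op redeploy
```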
```diff
-        self.version = DeploymentVersion(
-            self.version.code_version, user_config=user_config
+        user_config_changed = False
+        if (
```
nit: should we directly check `if self.deployment_config.user_config != deployment_config.user_config`?
When the replica is first initialized, deployment_config is None.
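That is, the comparison needs a None guard on first initialization; roughly (a sketch, names illustrative):

```python
# On first initialization there is no previous config to diff against.
user_config_changed = False
if (
    self.deployment_config is None
    or self.deployment_config.user_config != deployment_config.user_config
):
    user_config_changed = True
```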
```diff
-        self._hash = crc32(serialized_user_config + self.code_version.encode("utf-8"))
+        self.deployment_config: DeploymentConfig = deployment_config
+        self.ray_actor_options: Dict = ray_actor_options
+        self.compute_hashes()
```
I assume we don't allow anyone to update the attributes, since that would change the hash value. I suggest changing the `DeploymentVersion` class into a pydantic model and freezing it after initialization, or raising an error when anyone updates an attribute.
Would it become too heavyweight if we changed it to pydantic? I'm in favor of making the deployment version immutable, though.
Both sound good to me; "freezing the class" makes the code less error-prone.
Yeah, agreed. I'll probably look into this in a potential follow up PR.
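For that follow-up, one lightweight way to freeze the class without pulling in pydantic (a sketch; the attribute set is illustrative):

```python
class DeploymentVersion:
    def __init__(self, code_version, deployment_config, ray_actor_options):
        self.code_version = code_version
        self.deployment_config = deployment_config
        self.ray_actor_options = ray_actor_options
        self._frozen = True  # all writes after this point raise

    def __setattr__(self, name, value):
        # During __init__, _frozen is not yet set, so writes are allowed.
        if getattr(self, "_frozen", False):
            raise AttributeError(
                "DeploymentVersion is immutable; create a new instance "
                f"instead of setting {name!r}."
            )
        super().__setattr__(name, value)
```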
@zcin can you add an integration test that `graceful_shutdown_period_s` is correctly updated without restarting the replica? Ideally we'd have this for each of the options that can be set. You can do this by (see the sketch after this list):

1. Start with an extremely long `graceful_shutdown_period_s`.
2. Update it to be very short.
3. Check that the replica ID/actor ID doesn't change.
4. Send a request that hangs forever.
5. Shutdown or do an update, and check that it completes quickly.
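A rough sketch of such a test using the public Ray Serve API; the hanging endpoint, the pid-based replica-identity check, and the timings are illustrative assumptions, not the actual test added in this PR:

```python
import asyncio
import os
import threading
import time

import requests
from ray import serve
from starlette.requests import Request


@serve.deployment(graceful_shutdown_timeout_s=600)  # 1. Start extremely long.
class Hang:
    async def __call__(self, request: Request) -> str:
        if request.query_params.get("hang"):
            await asyncio.sleep(10_000)  # a request that effectively hangs forever
        return str(os.getpid())


serve.run(Hang.bind())
pid_before = requests.get("http://localhost:8000/").text

# 2. Update the timeout to be very short (should be a lightweight update).
serve.run(Hang.options(graceful_shutdown_timeout_s=1).bind())

# 3. The same pid implies the replica was not restarted.
assert requests.get("http://localhost:8000/").text == pid_before

# 4. Send a request that hangs forever, in the background.
threading.Thread(
    target=lambda: requests.get("http://localhost:8000/", params={"hang": "1"}),
    daemon=True,
).start()
time.sleep(2)  # give the request time to reach the replica

# 5. Shutdown should honor the new 1s timeout rather than the original 600s.
start = time.time()
serve.shutdown()
assert time.time() - start < 60
```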
@zcin serve tests failing.
@edoakes The failure is caused by Sihan's PR; he is working on a fix right now. The errors look unrelated to this PR.
```diff
@@ -1243,48 +1271,47 @@ def _stop_wrong_version_replicas(self, max_to_stop=math.inf) -> int:
         max_replicas=max_to_stop,
         ranking_function=rank_replicas_for_stopping,
     )
-    replicas_stopped = False
+    replicas_changed = False
```
nit: rename the `_stop_wrong_version_replicas` function.
Renamed to `_stop_or_update_outdated_version_replicas`.
```diff
     for replica in replicas_to_update:
         # If the code version is a mismatch, we stop the replica. A new one
         # with the correct version will be started later as part of the
+        replicas_changed = True
```
In the reconfigure use case, will this cause `_stop_wrong_version_replicas` to return True and potentially trigger a long poll broadcast that isn't necessary?
That's a good point; it's only necessary for replicas that have `max_concurrent_queries` updated. I've changed the code to only return True in that case.
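Roughly, the resulting logic (an illustrative sketch; method names like `requires_actor_restart` and `requires_long_poll_broadcast` follow the lightweight/heavyweight split discussed in this PR, not necessarily the exact code):

```python
replicas_changed = False
for replica in replicas_to_update:
    # Code version mismatch: stop the replica; a new one is started later.
    if replica.version.requires_actor_restart(target_version):
        replica.stop(graceful=True)
        replicas_changed = True
    # Only flag a change the router must hear about, i.e. when
    # max_concurrent_queries changed; other lightweight updates don't need
    # a long poll broadcast.
    elif replica.version.requires_long_poll_broadcast(target_version):
        replicas_changed = True
```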
Could you rebase master? Dashboard tests are fixed.
@edoakes Tests look good; dataset test failures are unrelated and also failing on master.
Why are these changes needed?

`DeploymentVersion` already has existing support for code version + other config options that affect the version (before, only user config was taken into account). We can leverage this to do lightweight config updates, so that:

- The config's `import_path` and `runtime_env` tell us the code version for the deployments.
- The logic for lightweight config updates is hidden in `DeploymentVersion`.

So instead of having `DeploymentVersion` only rely on user config, we want it to rely on other info from `DeploymentInfo`, so we can decide which config options should trigger a redeployment and which shouldn't. I've added annotations (`DeploymentOptionUpdateType`) to fields in `DeploymentConfig` to indicate which options fall into which category, as described below.

- Heavyweight options that will force replicas to restart: `version` and `ray_actor_options` (which is part of the replica config).
- Lightweight options that need to call reconfigure on the replica actor: `user_config` and `graceful_shutdown_wait_loop_s`.
- Lightweight options that need to update replicas in the replica state container, but won't need to call into the actual actor: `max_concurrent_queries`, `graceful_shutdown_timeout_s`, `health_check_period_s`, `health_check_timeout_s`.
- Purely lightweight options: `num_replicas` and `autoscaling_config`.

<img width="799" alt="Screen Shot 2023-04-20 at 6 50 13 PM" src="https://user-images.githubusercontent.com/15851518/233521791-537bf512-3f7a-4add-8a9a-701fde8ebeb1.png">
<img width="800" alt="Screen Shot 2023-04-20 at 6 50 31 PM" src="https://user-images.githubusercontent.com/15851518/233521825-b3a74e76-d863-4715-b66f-fa128319a461.png">

This fixes the issue of deployments not listed in the config being redeployed. See the newly added test `test_deployments_not_listed_in_config`.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.