
[serve][deploy refactor][2/X] Move lightweight update logic to DeploymentVersion #34430

Merged · 35 commits · Apr 26, 2023

Conversation

@zcin (Contributor) commented Apr 14, 2023:

Why are these changes needed?

DeploymentVersion already supports a code version plus other config options that affect the version (previously, only the user config was taken into account). We can leverage this to do lightweight config updates, so that:

  • The config's import_path and runtime_env tell us the code version for the deployments
  • The logic for lightweight config updates is hidden in DeploymentVersion

So instead of having DeploymentVersion rely only on the user config, we want it to rely on other info from DeploymentInfo, so we can decide which config options should trigger a redeployment and which shouldn't. I've added DeploymentOptionUpdateType annotations to fields in DeploymentConfig to indicate which options fall into which category, as described below (see the sketch after the list).

  • Heavyweight options that force replicas to restart: version and ray_actor_options (which is part of the replica config)
  • Lightweight options that require calling reconfigure on the replica actor: user_config and graceful_shutdown_wait_loop_s
  • Lightweight options that require updating replicas in the replica state container but don't need to call into the actual actor: max_concurrent_queries, graceful_shutdown_timeout_s, health_check_period_s, health_check_timeout_s
  • Purely lightweight options: num_replicas and autoscaling_config
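
As a rough illustration of what such an annotation can look like, here is a hedged sketch (not the exact implementation in this PR; the enum member names, field defaults, and the pydantic v1 mechanism of passing extra kwargs to Field are assumptions):

from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class DeploymentOptionUpdateType(str, Enum):
    # Member names are illustrative; they mirror the four categories listed above.
    HEAVYWEIGHT = "HeavyWeight"              # replicas must be restarted
    NEEDS_RECONFIGURE = "NeedsReconfigure"   # reconfigure() is called on the replica actor
    NEEDS_ACTOR_UPDATE = "NeedsActorUpdate"  # only the replica state container is updated
    LIGHTWEIGHT = "LightWeight"              # no replica-level change needed


class DeploymentConfig(BaseModel):
    num_replicas: int = Field(
        default=1, update_type=DeploymentOptionUpdateType.LIGHTWEIGHT
    )
    max_concurrent_queries: int = Field(
        default=100, update_type=DeploymentOptionUpdateType.NEEDS_ACTOR_UPDATE
    )
    user_config: Optional[dict] = Field(
        default=None, update_type=DeploymentOptionUpdateType.NEEDS_RECONFIGURE
    )


# Pydantic v1 stores unknown Field kwargs in field_info.extra, so the controller can
# read the category back when diffing an old config against a new one:
print(DeploymentConfig.__fields__["user_config"].field_info.extra["update_type"])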

[Two screenshots attached: Screen Shot 2023-04-20 at 6 50 13 PM, Screen Shot 2023-04-20 at 6 50 31 PM]

This fixes the issue where deployments not listed in the config were being redeployed. See the newly added test test_deployments_not_listed_in_config.

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zcin changed the title from "[serve] Move lightweight update logic to DeploymentVersion" to "[serve][deploy refactor][2/X] Move lightweight update logic to DeploymentVersion" on Apr 14, 2023
@zcin force-pushed the version-improvements branch 2 times, most recently from 2b4dde6 to 4b1047f on April 15, 2023 at 15:16
Signed-off-by: Cindy Zhang <[email protected]>
@zcin force-pushed the version-improvements branch 13 times, most recently from 8d8f9b5 to 2b5f193 on April 18, 2023 at 06:08
Signed-off-by: Cindy Zhang <[email protected]>
@zcin marked this pull request as ready for review on April 18, 2023 at 15:56
Comment on lines 83 to 91
"max_concurrent_queries": deployment_config.max_concurrent_queries,
"graceful_shutdown_timeout_s": (
deployment_config.graceful_shutdown_timeout_s
),
"graceful_shutdown_wait_loop_s": (
deployment_config.graceful_shutdown_wait_loop_s
),
"health_check_period_s": deployment_config.health_check_period_s,
"health_check_timeout_s": deployment_config.health_check_timeout_s,
Contributor Author (@zcin):

Are these the deployment config options that, if changed, we want to redeploy the replicas?

Contributor:

It would be great if we could encode the options that require lightweight/heavyweight updates directly in the code somehow. e.g., an annotation on the deployment_config struct definition. can you think of a way to do that?

Contributor:

but to answer your question directly, if we are broadcasting these updates to the replicas using the existing logic for user_config (as I saw above), then none of these should require restarting the replicas.

Contributor Author (@zcin):

It would be great if we could encode the options that require lightweight/heavyweight updates directly in the code somehow. e.g., an annotation on the deployment_config struct definition. can you think of a way to do that?

@edoakes I've added it as an extra arg in each pydantic field

@edoakes (Contributor) commented Apr 18, 2023:

@zcin this is fantastic! It also solves a long-standing issue we've had where some of the config options were not actually dynamically updated because we weren't broadcasting the config.

@@ -21,7 +21,7 @@ def __init__(self, dag: RayServeDAGHandle):
self.dag = dag

async def __call__(self):
return await self.dag.remote()
return await (await self.dag.remote())
Contributor:

Any reason we have this change as part of this PR?

Contributor Author (@zcin) commented Apr 21, 2023:

I'm pretty sure this code didn't work before 🤦‍♀️. Not sure how all the tests were passing in test_deploy_config_update, but some of the false positives were because the responses weren't cast to int, so I've updated the test to do that.
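
For context, a minimal sketch of the corrected driver pattern shown in the diff above: RayServeDAGHandle.remote() is itself awaitable and yields a reference that must also be awaited. The class body below is simplified from the diff; nothing here goes beyond what the diff already shows.

from ray import serve


@serve.deployment
class Driver:
    def __init__(self, dag):  # dag: RayServeDAGHandle, as in the diff above
        self.dag = dag

    async def __call__(self):
        ref = await self.dag.remote()  # first await: schedule the DAG call, get a reference
        return await ref               # second await: resolve the reference to the result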

Contributor:

ouch

"import_path": app_config.import_path,
"runtime_env": app_config.runtime_env,
},
sort_keys=True,
Contributor:

nice with sort_keys.
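
To illustrate why sort_keys matters here, a minimal sketch (a hypothetical helper, not the exact Serve code): the application-level code version is derived from a JSON dump of import_path and runtime_env, so the serialization has to be deterministic or identical configs would hash to different versions.

import json
from zlib import crc32


def compute_app_code_version(import_path: str, runtime_env: dict) -> str:
    # sort_keys makes the dump independent of dict ordering, so the same config
    # always yields the same version hash.
    serialized = json.dumps(
        {"import_path": import_path, "runtime_env": runtime_env},
        sort_keys=True,
    )
    return str(crc32(serialized.encode("utf-8")))


assert compute_app_code_version("app:graph", {"pip": ["requests"]}) == \
    compute_app_code_version("app:graph", {"pip": ["requests"]})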

@@ -1180,10 +1219,14 @@ def deploy(self, deployment_info: DeploymentInfo) -> bool:
existing_info = self._target_state.info
if existing_info is not None:
# Redeploying should not reset the deployment's start time.
deployment_info.start_time_ms = existing_info.start_time_ms
if not self._target_state.deleting:
Contributor:

Regarding if not self._target_state.deleting: what's the reason for adding this check? Is this an existing bug?

Contributor Author (@zcin):

I'm not sure; I don't know how often this case will occur (probably not often). This is specifically for when a deployment has been deleted (e.g., the deployment graph changed) and then redeployed (e.g., the deployment graph changed back).

The biggest change I made here is on line 1211: if all configs are the same but the current target state is deleting, the deployment isn't blocked from redeploying. I ran into this problem when debugging some unit tests; I'm not sure why it surfaces in this PR specifically.
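
A simplified, self-contained sketch of the guard being discussed (the FakeDeploymentInfo/FakeTargetState data model and the exact comparison fields are illustrative, not the real DeploymentState code): an unchanged config normally short-circuits the redeploy, unless the current target state is mid-delete.

from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDeploymentInfo:
    deployment_config: dict
    ray_actor_options: dict
    start_time_ms: int = 0


@dataclass
class FakeTargetState:
    info: Optional[FakeDeploymentInfo] = None
    deleting: bool = False


def deploy(target_state: FakeTargetState, new_info: FakeDeploymentInfo) -> bool:
    existing = target_state.info
    if existing is not None:
        # Redeploying should not reset the deployment's start time.
        new_info.start_time_ms = existing.start_time_ms

        if (
            not target_state.deleting
            and existing.deployment_config == new_info.deployment_config
            and existing.ray_actor_options == new_info.ray_actor_options
        ):
            # Nothing changed and we are not mid-delete: skip the redeploy.
            return False

    target_state.info = new_info
    target_state.deleting = False
    return True


# A deployment that is being deleted and is then re-added with an identical config
# must still be allowed to redeploy:
state = FakeTargetState(info=FakeDeploymentInfo({"num_replicas": 1}, {}), deleting=True)
assert deploy(state, FakeDeploymentInfo({"num_replicas": 1}, {})) is True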

self.version = DeploymentVersion(
self.version.code_version, user_config=user_config
user_config_changed = False
if (
Contributor:

nit: should we directly check if self.deployment_config.user_config != deployment_config.user_config?

Contributor Author (@zcin):

When the replica is first initialized, deployment_config is None.

self._hash = crc32(serialized_user_config + self.code_version.encode("utf-8"))
self.deployment_config: DeploymentConfig = deployment_config
self.ray_actor_options: Dict = ray_actor_options
self.compute_hashes()
Contributor:

I assume we don't allow anyone to update these attributes, since that would change the hash value. I suggest changing the DeploymentVersion class to a pydantic model and freezing it after initialization, or raising an error when anyone updates an attribute.

Contributor Author (@zcin):

Would it become too heavyweight if we changed it to pydantic? I'm in favor of making DeploymentVersion immutable, though.

Contributor:

Both sound good to me; "freezing the class" makes the code less error-prone.

Contributor Author (@zcin):

Yeah, agreed. I'll probably look into this in a potential follow up PR.
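
A rough sketch of the "freeze it after initialization" idea floated above, assuming pydantic v1's allow_mutation flag; the fields and hashing below are simplified and not the real DeploymentVersion.

from typing import Dict, Optional
from zlib import crc32

from pydantic import BaseModel


class FrozenDeploymentVersion(BaseModel):
    code_version: str
    ray_actor_options: Optional[Dict] = None

    class Config:
        allow_mutation = False  # pydantic v1: attribute assignment raises TypeError

    @property
    def hash(self) -> int:
        # Derived from fields that can no longer change, so it cannot silently go stale.
        return crc32(self.code_version.encode("utf-8"))


version = FrozenDeploymentVersion(code_version="abc123")
# version.code_version = "xyz"  # would raise because the model is frozen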

python/ray/serve/config.py (resolved comment thread)
@edoakes (Contributor) left a comment:

@zcin can you add an integration test that graceful_shutdown_period_s is correctly updated without restarting the replica? Ideally we'd have this for each of the options that can be set.

You can do this with the following steps (a rough test sketch follows the list):

  • Start with an extremely long graceful_shutdown_period_s.
  • Update it to be very short.
  • Check that the replica ID/actor ID doesn't change.
  • Send a request that hangs forever.
  • Shutdown or do an update, check that it completes quickly.
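
A hedged sketch of such a test (this is not the test added in the PR: it uses graceful_shutdown_timeout_s, one of the lightweight options listed in the description, and assumes ray.util.state.list_actors plus a "ServeReplica" class-name prefix for identifying replica actors):

import asyncio
import time

from ray import serve
from ray.util.state import list_actors


@serve.deployment(graceful_shutdown_timeout_s=10_000)  # start extremely long
class Hang:
    async def __call__(self):
        await asyncio.sleep(10_000)  # a request that effectively hangs forever


def replica_actor_ids() -> set:
    # Assumption: Serve replica actors have "ServeReplica" in their class name.
    return {
        a.actor_id
        for a in list_actors(filters=[("state", "=", "ALIVE")])
        if "ServeReplica" in a.class_name
    }


handle = serve.run(Hang.bind())
before = replica_actor_ids()

# Update only the lightweight option to a very short timeout.
serve.run(Hang.options(graceful_shutdown_timeout_s=1).bind())
after = replica_actor_ids()
assert before == after, "replica actors should not have been restarted"

# Send a request that hangs, then check that shutdown still completes quickly.
handle.remote()
start = time.time()
serve.shutdown()
assert time.time() - start < 30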

@edoakes (Contributor) commented Apr 25, 2023:

@zcin the serve tests are failing.

@zcin (Contributor, Author) commented Apr 25, 2023:

@edoakes The failure is caused by Sihan's PR; he is working on a fix right now. The errors look unrelated to this PR.
[Screenshot attached: Screen Shot 2023-04-25 at 2 23 42 PM]

@@ -1243,48 +1271,47 @@ def _stop_wrong_version_replicas(self, max_to_stop=math.inf) -> int:
max_replicas=max_to_stop,
ranking_function=rank_replicas_for_stopping,
)
replicas_stopped = False
replicas_changed = False
Contributor:

nit: rename _stop_wrong_version_replicas function.

Contributor Author (@zcin):

Renamed to _stop_or_update_outdated_version_replicas

for replica in replicas_to_update:
# If the code version is a mismatch, we stop the replica. A new one
# with the correct version will be started later as part of the
replicas_changed = True
@sihanwang41 (Contributor) commented Apr 25, 2023:

In the reconfigure use case, will this cause _stop_wrong_version_replicas to return True and potentially trigger an unnecessary long-poll broadcast?

Contributor Author (@zcin):

That's a good point; the broadcast is only necessary for replicas whose max_concurrent_queries was updated. I've changed the code to only return True in that case.

@sihanwang41 (Contributor) commented:

Could you rebase on master? The dashboard tests are fixed.

@zcin (Contributor, Author) commented Apr 26, 2023:

@edoakes Tests look good; the dataset test failures are unrelated and also failing on master.

@edoakes merged commit bc1c378 into ray-project:master on Apr 26, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023