Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Handle None in ReplicaConfig's resource_dict #23851

Merged
merged 20 commits into from
Apr 18, 2022

Conversation

shrekris-anyscale
Copy link
Contributor

Why are these changes needed?

At least two lines of code rely on none of ReplicaConfig's resource_dict values being None (line 1, line 2). However, #23619 made "memory" in the resource_dict default to None.

This causes flakiness whenever either of the two lines are executed. In particular, line 1 is called in the resource_requirements() function, which is only called in DeploymentState's _check_and_update_replicas() function whenever a replica is slow to start. This makes updates flaky when a replica starts slowly (example).

Line 2 runs for all deployments; however, it uses a generator function, so it only causes an error if both "num_cpus" and "num_gpus" are non-zero but "memory" is None. Since "num_gpus" has a default of 0, this is somewhat unlikely.

This change sets "memory"'s default to 0 in the resource_dict but keeps the default as None in ray_actor_options. It adds logic to both problematic lines to handle None in case of future settings updates. It also adds unit tests to prevent regressions.

Related issue number

Addresses flakiness in #23747.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
      • New unit tests are added to prevent regression.

@@ -264,9 +264,11 @@ def start(self, deployment_info: DeploymentInfo, version: DeploymentVersion):
)

self._actor_resources = deployment_info.replica_config.resource_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what's the value of keeping a "None" value for actor resource, and what if we always cast it to 0 upstream, and never need to deal with None value in serve codebase ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context for that is in #23619. Essentially, the None is passed to Ray, so deployments use the Ray actor memory default.

Internally, Serve treats the default memory as 0 (even though it's actually using Ray's default) to indicate that the deployment doesn't have any memory requirements. In reality, Ray doesn't allow actors to set memory to 0.

That's why in this change we set memory to 0 in the resource_dict, but we use None in ray_actor_options.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix.

It looks like you're currently both setting the default to 0 and handling the None case in the offending code. Do we need to be doing both here?

def test_resource_requirements_none(mock_deployment_state):
"""Ensure resource_requirements doesn't break if a requirement is None"""

from ray.serve.deployment_state import DeploymentReplica
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import at top of file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, moved the import.

@suquark
Copy link
Member

suquark commented Apr 13, 2022

My question: why not just skips the None options? I mean we should not even keep them in the directory. After #23127, all options not specified in @ray.remote or .options() will stay unspecified. .options() will no longer assign default values.

Also I cannot understand why we need to modify ray_actor_options in place, this is very bad for debugging and reproducing errors.

If these were tech debts to comfort flaws in Ray options processing, after #23127 we should cleanup the tech debts, instead of polishing the tech debts.

@shrekris-anyscale
Copy link
Contributor Author

why not just skips the None options?

@suquark What do you mean by skip the None options? Do you mean that we should stop validating them at the Serve level, or that we should leave them out of the resource_dict? As far as I can tell, this code only validates the ray_actor_options and provides default values in case the user explicitly specifies any option as None.

@suquark
Copy link
Member

suquark commented Apr 14, 2022

why not just skips the None options?

@suquark What do you mean by skip the None options? Do you mean that we should stop validating them at the Serve level, or that we should leave them out of the resource_dict? As far as I can tell, this code only validates the ray_actor_options and provides default values in case the user explicitly specifies any option as None.

Yes, but finally the options are passed to Ray actors and tasks, right? It is necessary to set num_cpus to 1 according to the code, but you can just leave out other arguments. If they do not exist in ray_actor_options, do not self.ray_actor_options.setdefault("memory", None). Also we should check the arguments here, but we can use

def validate_actor_options(options: Dict[str, Any], in_options: bool):
instead. The current check is incomplete, for example, you can input invalid options like CPU, GPU.

I my mind, the better way is to

  • check we does not include unwanted fields in ray_actor_options like in the current code
  • check ray_actor_options with
    def validate_actor_options(options: Dict[str, Any], in_options: bool):
  • convert ray_actor_options to resource_dict with a pure function. It should not modify ray_actor_options.
  • Ray actor/task will treat any missing fields as default value. You do not need to set them to None.

BTW, it is not necessary to do it now. I can cover it with a later PR.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a localized bugfix for the issue we observed in release tests.

@shrekris-anyscale I think we should follow up with @suquark's suggestion to improve this as we update the validation & schema logic for deployment graphs.

@edoakes edoakes merged commit fab33fa into ray-project:master Apr 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants