[serve] Handle `None` in `ReplicaConfig`'s `resource_dict` #23851
@@ -264,9 +264,11 @@ def start(self, deployment_info: DeploymentInfo, version: DeploymentVersion):
        )

        self._actor_resources = deployment_info.replica_config.resource_dict
question: what's the value of keeping a `None` value for an actor resource? What if we always cast it to 0 upstream, so we never need to deal with `None` values in the Serve codebase?
The context for that is in #23619. Essentially, the `None` is passed to Ray, so deployments use the Ray actor memory default. Internally, Serve treats the default memory as `0` (even though it's actually using Ray's default) to indicate that the deployment doesn't have any memory requirements. In reality, Ray doesn't allow actors to set memory to `0`.
That's why in this change we set `memory` to `0` in the `resource_dict`, but we use `None` in `ray_actor_options`.
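To make the two views concrete, here is a minimal sketch rather than the actual Serve code. The helpers `to_resource_dict` and `positive_requirements` are hypothetical names introduced only for illustration; the sketch assumes that a `None` option means "use Ray's default" and is recorded internally as `0`.

```python
from typing import Any, Dict, Optional


def to_resource_dict(ray_actor_options: Dict[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: build the internal resource view.

    A missing or None option is recorded as 0, meaning "no explicit
    requirement". The options dict itself is left untouched so that None
    can still be forwarded to Ray.
    """
    keys = ("num_cpus", "num_gpus", "memory")
    return {key: (ray_actor_options.get(key) or 0) for key in keys}


def positive_requirements(
    resource_dict: Dict[str, Optional[float]]
) -> Dict[str, float]:
    """Hypothetical helper: keep only resources that are actually requested.

    The explicit None check guards against a None slipping through later.
    """
    return {k: v for k, v in resource_dict.items() if v is not None and v > 0}


ray_actor_options = {"num_cpus": 1, "num_gpus": 0, "memory": None}
resource_dict = to_resource_dict(ray_actor_options)

print(ray_actor_options["memory"])  # still None, as passed on to Ray
print(resource_dict)
print(positive_requirements(resource_dict))
```

Under these assumptions the comparison code never sees `None` in the internal dictionary, while the options forwarded to Ray keep `None` so that Ray applies its own default.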
Thanks for the fix.
It looks like you're currently both setting the default to `0` and handling the `None` case in the offending code. Do we need to be doing both here?
def test_resource_requirements_none(mock_deployment_state):
    """Ensure resource_requirements doesn't break if a requirement is None"""

    from ray.serve.deployment_state import DeploymentReplica
import at top of file
Thanks, moved the import.
My question: why not just skip the Also I cannot understand why we need to modify If these were tech debt to cover for flaws in Ray options processing, after #23127 we should clean up the tech debt instead of polishing it.
@suquark What do you mean by skip the
Yes, but finally the options are passed to Ray actors and tasks, right? It is necessary to set ray/python/ray/_private/ray_option_utils.py Line 193 in f400c20
CPU , GPU .
In my mind, the better way is to
BTW, it is not necessary to do it now. I can cover it with a later PR.
LGTM as a localized bugfix for the issue we observed in release tests.
@shrekris-anyscale I think we should follow up with @suquark's suggestion to improve this as we update the validation & schema logic for deployment graphs.
Why are these changes needed?
At least two lines of code rely on none of `ReplicaConfig`'s `resource_dict` values being `None` (line 1, line 2). However, #23619 made `"memory"` in the `resource_dict` default to `None`.

This causes flakiness whenever either of the two lines is executed. In particular, line 1 is called in the `resource_requirements()` function, which is only called in `DeploymentState`'s `_check_and_update_replicas()` function whenever a replica is slow to start. This makes updates flaky when a replica starts slowly (example).

Line 2 runs for all deployments; however, it uses a generator function, so it only causes an error if both `"num_cpus"` and `"num_gpus"` are non-zero but `"memory"` is `None`. Since `"num_gpus"` has a default of 0, this is somewhat unlikely.

This change sets `"memory"`'s default to `0` in the `resource_dict` but keeps the default as `None` in `ray_actor_options`. It adds logic to both problematic lines to handle `None` in case of future settings updates. It also adds unit tests to prevent regressions.

Related issue number
Addresses flakiness in #23747.
Checks
I've run `scripts/format.sh` to lint the changes in this PR.
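The short-circuit behavior described under "Why are these changes needed?" can be reproduced with a small sketch. The predicate `all(v > 0 for v in ...)` is an assumption chosen to match the described symptoms, not the actual line from Ray Serve, and `all_positive` is a hypothetical name.

```python
from typing import Dict, Optional


def all_positive(resource_dict: Dict[str, Optional[float]]) -> bool:
    # all() stops consuming the generator at the first falsy result, so
    # later values (including a None) are never compared.
    return all(v > 0 for v in resource_dict.values())


# num_gpus == 0 ends the evaluation before "memory" is reached.
masked = {"num_cpus": 1, "num_gpus": 0, "memory": None}
masked_result = all_positive(masked)

# With both num_cpus and num_gpus non-zero, the None is finally compared.
exposed = {"num_cpus": 1, "num_gpus": 1, "memory": None}
try:
    all_positive(exposed)
    raised = False
except TypeError:
    raised = True

print(masked_result, raised)
```

Because dictionaries preserve insertion order, the `None` in `"memory"` is only reached when every earlier value passes the check, which is why the failure shows up so rarely with the default `"num_gpus"` of 0.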