[serve] remove max_concurrent_queries (#46427)

Remove `max_concurrent_queries`, which was deprecated in Ray 2.10 and replaced by `max_ongoing_requests`.
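A minimal before/after sketch of the rename (the `Echo` deployment is illustrative, not part of this commit):

```python
from ray import serve

# Before (deprecated since Ray 2.10, removed by this commit):
# @serve.deployment(max_concurrent_queries=10)

# After:
@serve.deployment(max_ongoing_requests=10)
class Echo:
    def __call__(self, request) -> str:
        return "hello"

app = Echo.bind()
```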

Signed-off-by: Cindy Zhang <[email protected]>
zcin authored Jul 16, 2024
1 parent aafc7a6 commit b610a0b
Showing 20 changed files with 45 additions and 169 deletions.
3 changes: 0 additions & 3 deletions doc/source/serve/advanced-guides/advanced-autoscaling.md
@@ -30,9 +30,6 @@ Always load test your workloads. For example, if the use case is latency sensiti
As an example, suppose you have two replicas of a synchronous deployment that has 100ms latency, serving a traffic load of 30 QPS. Then Serve assigns requests to replicas faster than the replicas can finish processing them; more and more requests queue up at the replica (these requests are "ongoing requests") as time progresses, and then the average number of ongoing requests at each replica steadily increases. Latency also increases because new requests have to wait for old requests to finish processing. If you set `target_ongoing_requests = 1`, Serve detects a higher than desired number of ongoing requests per replica, and adds more replicas. At 3 replicas, your system would be able to process 30 QPS with 1 ongoing request per replica on average.
:::
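As a rough sketch of that arithmetic (Little's law; the numbers mirror the example above, and `replicas_needed` is an illustrative helper, not a Serve API):

```python
import math

def replicas_needed(qps: float, latency_s: float, target_ongoing_requests: float) -> int:
    # Little's law: average concurrency = arrival rate * time in system.
    concurrency = qps * latency_s
    return math.ceil(concurrency / target_ongoing_requests)

# 30 QPS * 0.1 s = 3 concurrent requests on average -> 3 replicas at target 1.
assert replicas_needed(qps=30, latency_s=0.1, target_ongoing_requests=1) == 3
```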

-#### **max_concurrent_queries [default=5] (DEPRECATED)**
-This parameter is renamed to `max_ongoing_requests`. `max_concurrent_queries` will be removed in a future release.
-
#### **max_ongoing_requests [default=5]**
:::{note}
The default for `max_ongoing_requests` changed from 100 to 5 in Ray 2.32.0. You can continue to set it manually to override the default.
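For instance, a sketch of pinning the old default explicitly (the class is illustrative):

```python
from ray import serve

@serve.deployment(max_ongoing_requests=100)  # pin the pre-2.32 default explicitly
class HighThroughput:
    async def __call__(self, request):
        return "ok"
```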
4 changes: 2 additions & 2 deletions doc/source/serve/autoscaling-guide.md
@@ -46,7 +46,7 @@ You can set `num_replicas="auto"` and override its default values (shown above)
Let's dive into what each of these parameters does.

* **target_ongoing_requests** (replaces the deprecated `target_num_ongoing_requests_per_replica`) is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be).
-* **max_ongoing_requests** (replaces the deprecated `max_concurrent_queries`) is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it's relevant to all deployments, but it's important to set it relative to the target value if you turn on autoscaling for your deployment.
+* **max_ongoing_requests** is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it's relevant to all deployments, but it's important to set it relative to the target value if you turn on autoscaling for your deployment.
* **min_replicas** is the minimum number of replicas for the deployment. Set this to 0 if there are long periods of no traffic and some extra tail latency during upscale is acceptable. Otherwise, set this to what you think you need for low traffic.
* **max_replicas** is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.
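A sketch combining these knobs (values are illustrative, not tuned recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # hard per-replica cap; lives outside autoscaling_config
    autoscaling_config={
        "target_ongoing_requests": 2,  # autoscaler steers toward this average
        "min_replicas": 1,  # 0 is possible if extra cold-start latency is acceptable
        "max_replicas": 12,  # roughly 20% above the expected peak requirement
    },
)
class Model:
    async def __call__(self, request):
        return "ok"

app = Model.bind()
```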

@@ -104,4 +104,4 @@ The Ray Serve Autoscaler is an application-level autoscaler that sits on top of
Concretely, this means that the Ray Serve autoscaler asks Ray to start a number of replica actors based on the request demand.
If the Ray Autoscaler determines there aren't enough available resources (e.g. CPUs, GPUs, etc.) to place these actors, it responds by requesting more Ray nodes.
The underlying cloud provider then responds by adding more nodes.
-Similarly, when Ray Serve scales down and terminates replica Actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve Autoscaling, see [Ray Serve Autoscaling Architecture](serve-autoscaling-architecture).
+Similarly, when Ray Serve scales down and terminates replica Actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve Autoscaling, see [Ray Serve Autoscaling Architecture](serve-autoscaling-architecture).
2 changes: 1 addition & 1 deletion doc/source/serve/configure-serve-deployment.md
@@ -15,7 +15,7 @@ You can also refer to the [API reference](../serve/api/doc/ray.serve.deployment_
- `name` - Name uniquely identifying this deployment within the application. If not provided, the name of the class or function is used.
- `num_replicas` - Controls the number of replicas to run that handle requests to this deployment. This can be a positive integer, in which case the number of replicas stays constant, or `auto`, in which case the number of replicas will autoscale with a default configuration (see [Ray Serve Autoscaling](serve-autoscaling) for more). Defaults to 1.
- `ray_actor_options` - Options to pass to the Ray Actor decorator, such as resource requirements. Valid options are: `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env` For more details - [Resource management in Serve](serve-cpus-gpus)
-- `max_ongoing_requests` (replaces the deprecated `max_concurrent_queries`) - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100 (the default will change to 5 in an upcoming release). This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
+- `max_ongoing_requests` - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100 (the default will change to 5 in an upcoming release). This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
- `autoscaling_config` - Parameters to configure autoscaling behavior. If this is set, you can't set `num_replicas` to a number. For more details on configurable parameters for autoscaling, see [Ray Serve Autoscaling](serve-autoscaling).
- `user_config` - Config to pass to the reconfigure method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The user_config must be fully JSON-serializable. For more details, see [Serve User Config](serve-user-config).
- `health_check_period_s` - Duration between health check calls for the replica. Defaults to 10s. The health check is by default a no-op Actor call to the replica, but you can define your own health check using the "check_health" method in your deployment that raises an exception when unhealthy.
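A sketch touching several of these options at once (names and values are illustrative):

```python
from ray import serve

@serve.deployment(
    name="translator",  # unique within the application
    num_replicas=2,
    ray_actor_options={"num_cpus": 1},
    max_ongoing_requests=5,
    user_config={"language": "en"},
    health_check_period_s=10,
)
class Translator:
    def __init__(self):
        self.language = "en"

    def reconfigure(self, config: dict):
        # Receives `user_config`; can be updated without restarting replicas.
        self.language = config.get("language", "en")

    def __call__(self, request) -> str:
        return f"translating to {self.language}"

app = Translator.bind()
```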
2 changes: 1 addition & 1 deletion doc/source/serve/doc_code/load_shedding.py
@@ -8,7 +8,7 @@

@serve.deployment(
# Each replica will be sent 2 requests at a time.
-    max_concurrent_queries=2,
+    max_ongoing_requests=2,
# Each caller queues up to 2 requests at a time.
# (beyond those that are sent to replicas).
max_queued_requests=2,
4 changes: 2 additions & 2 deletions doc/source/serve/production-guide/best-practices.md
@@ -59,7 +59,7 @@ This controls the maximum number of requests that each {mod}`DeploymentHandle <r
Once the limit is reached, enqueueing any new requests immediately raises a {mod}`BackPressureError <ray.serve.exceptions.BackPressureError>`.
HTTP requests will return a `503` status code (service unavailable).
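A sketch of handling that error from a handle caller (the deployment and app names are hypothetical):

```python
from ray import serve
from ray.serve.exceptions import BackPressureError

handle = serve.get_deployment_handle("SlowDeployment", app_name="default")

try:
    # Raises BackPressureError once the handle's queue limit is exceeded.
    result = handle.remote().result()
except BackPressureError:
    result = None  # shed load: fail fast or retry with backoff
```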

-The following example defines a deployment that emulates slow request handling and has `max_concurrent_queries` and `max_queued_requests` configured.
+The following example defines a deployment that emulates slow request handling and has `max_ongoing_requests` and `max_queued_requests` configured.

```{literalinclude} ../doc_code/load_shedding.py
:start-after: __example_deployment_start__
@@ -68,7 +68,7 @@ The following example defines a deployment that emulates slow request handling a
```

To test the behavior, send HTTP requests in parallel to emulate multiple clients.
-Serve accepts `max_concurrent_queries` and `max_queued_requests` requests, and rejects further requests with a `503`, or service unavailable, status.
+Serve accepts `max_ongoing_requests` and `max_queued_requests` requests, and rejects further requests with a `503`, or service unavailable, status.

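A rough client-side sketch of that parallel test (assuming the app is served at localhost:8000; the literalincluded doc example below is the authoritative version):

```python
import concurrent.futures

import requests

def send(_: int) -> int:
    # 200 while ongoing + queued capacity remains; 503 once requests are shed.
    return requests.get("http://localhost:8000/", timeout=60).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    codes = list(pool.map(send, range(10)))

print(f"{codes.count(200)} accepted, {codes.count(503)} shed")
```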
```{literalinclude} ../doc_code/load_shedding.py
:start-after: __client_test_start__
25 changes: 0 additions & 25 deletions java/serve/src/main/java/io/ray/serve/config/DeploymentConfig.java
@@ -22,12 +22,6 @@ public class DeploymentConfig implements Serializable {
*/
private Integer numReplicas = 1;

-  /**
-   * [DEPRECATED] The maximum number of queries that can be sent to a replica of this deployment
-   * without receiving a response. Defaults to 100.
-   */
-  private Integer maxConcurrentQueries = 100;
-
/**
* The maximum number of requests that can be sent to a replica of this deployment without
* receiving a response. Defaults to 100.
@@ -81,22 +75,10 @@ public DeploymentConfig setNumReplicas(Integer numReplicas) {
return this;
}

-  public Integer getMaxConcurrentQueries() {
-    return maxConcurrentQueries;
-  }
-
public Integer getMaxOngoingRequests() {
return maxOngoingRequests;
}

-  public DeploymentConfig setMaxConcurrentQueries(Integer maxConcurrentQueries) {
-    if (maxConcurrentQueries != null) {
-      Preconditions.checkArgument(maxConcurrentQueries > 0, "max_concurrent_queries must be > 0");
-      this.maxConcurrentQueries = maxConcurrentQueries;
-    }
-    return this;
-  }
-
public DeploymentConfig setMaxOngoingRequests(Integer maxOngoingRequests) {
if (maxOngoingRequests != null) {
Preconditions.checkArgument(maxOngoingRequests > 0, "max_ongoing_requests must be > 0");
@@ -208,12 +190,6 @@ public void setPrevVersion(String prevVersion) {
}

public byte[] toProtoBytes() {
-    Integer maxOngoingRequests;
-    if (this.maxOngoingRequests == null) {
-      maxOngoingRequests = this.maxConcurrentQueries;
-    } else {
-      maxOngoingRequests = this.maxOngoingRequests;
-    }
io.ray.serve.generated.DeploymentConfig.Builder builder =
io.ray.serve.generated.DeploymentConfig.newBuilder()
.setNumReplicas(numReplicas)
@@ -262,7 +238,6 @@ public static DeploymentConfig fromProto(io.ray.serve.generated.DeploymentConfig
return deploymentConfig;
}
deploymentConfig.setNumReplicas(proto.getNumReplicas());
-    deploymentConfig.setMaxConcurrentQueries(proto.getMaxOngoingRequests());
deploymentConfig.setMaxOngoingRequests(proto.getMaxOngoingRequests());
deploymentConfig.setGracefulShutdownWaitLoopS(proto.getGracefulShutdownWaitLoopS());
deploymentConfig.setGracefulShutdownTimeoutS(proto.getGracefulShutdownTimeoutS());
@@ -100,7 +100,7 @@ public DeploymentCreator options() {
.setRoutePrefix(this.routePrefix)
.setRayActorOptions(this.replicaConfig.getRayActorOptions())
.setUserConfig(this.deploymentConfig.getUserConfig())
-        .setMaxConcurrentQueries(this.deploymentConfig.getMaxConcurrentQueries())
+        .setMaxOngoingRequests(this.deploymentConfig.getMaxOngoingRequests())
.setAutoscalingConfig(this.deploymentConfig.getAutoscalingConfig())
.setGracefulShutdownWaitLoopS(this.deploymentConfig.getGracefulShutdownWaitLoopS())
.setGracefulShutdownTimeoutS(this.deploymentConfig.getGracefulShutdownTimeoutS())
@@ -66,7 +66,7 @@ public class DeploymentCreator {
* The maximum number of queries that will be sent to a replica of this deployment without
* receiving a response. Defaults to 100.
*/
-  private Integer maxConcurrentQueries;
+  private Integer maxOngoingRequests;

private AutoscalingConfig autoscalingConfig;

@@ -105,7 +105,7 @@ public Deployment create(boolean check) {
DeploymentConfig deploymentConfig =
new DeploymentConfig()
.setNumReplicas(numReplicas != null ? numReplicas : 1)
-            .setMaxConcurrentQueries(maxConcurrentQueries)
+            .setMaxOngoingRequests(maxOngoingRequests)
.setUserConfig(userConfig)
.setAutoscalingConfig(autoscalingConfig)
.setGracefulShutdownWaitLoopS(gracefulShutdownWaitLoopS)
@@ -204,12 +204,12 @@ public DeploymentCreator setUserConfig(Object userConfig) {
return this;
}

-  public Integer getMaxConcurrentQueries() {
-    return maxConcurrentQueries;
+  public Integer getMaxOngoingRequests() {
+    return maxOngoingRequests;
  }

-  public DeploymentCreator setMaxConcurrentQueries(Integer maxConcurrentQueries) {
-    this.maxConcurrentQueries = maxConcurrentQueries;
+  public DeploymentCreator setMaxOngoingRequests(Integer maxOngoingRequests) {
+    this.maxOngoingRequests = maxOngoingRequests;
return this;
}

@@ -102,7 +102,7 @@ private ObjectRef<Object> tryAssignReplica(Query query) {
}
int randomIndex = RandomUtils.nextInt(0, handles.size());
BaseActorHandle replica =
-        handles.get(randomIndex); // TODO controll concurrency using maxConcurrentQueries
+        handles.get(randomIndex); // TODO control concurrency using maxOngoingRequests
LOGGER.debug("Assigned query {} to replica {}.", query.getMetadata().getRequestId(), replica);
if (replica instanceof PyActorHandle) {
Object[] args =
6 changes: 2 additions & 4 deletions python/ray/serve/_private/application_state.py
@@ -1108,10 +1108,8 @@ def override_deployment_info(

# Override options for each deployment listed in the config.
for options in deployment_override_options:
if "max_concurrent_queries" in options or "max_ongoing_requests" in options:
options["max_ongoing_requests"] = options.get(
"max_ongoing_requests"
) or options.get("max_concurrent_queries")
if "max_ongoing_requests" in options:
options["max_ongoing_requests"] = options.get("max_ongoing_requests")

deployment_name = options["name"]
info = deployment_infos[deployment_name]
28 changes: 6 additions & 22 deletions python/ray/serve/api.py
@@ -16,11 +16,7 @@
ReplicaConfig,
handle_num_replicas_auto,
)
-from ray.serve._private.constants import (
-    DEFAULT_MAX_ONGOING_REQUESTS,
-    SERVE_DEFAULT_APP_NAME,
-    SERVE_LOGGER_NAME,
-)
+from ray.serve._private.constants import SERVE_DEFAULT_APP_NAME, SERVE_LOGGER_NAME
from ray.serve._private.deployment_graph_build import build as pipeline_build
from ray.serve._private.deployment_graph_build import (
get_and_validate_ingress_deployment,
@@ -256,7 +252,6 @@ def deployment(
placement_group_strategy: Default[str] = DEFAULT.VALUE,
max_replicas_per_node: Default[int] = DEFAULT.VALUE,
user_config: Default[Optional[Any]] = DEFAULT.VALUE,
-    max_concurrent_queries: Default[int] = DEFAULT.VALUE,
max_ongoing_requests: Default[int] = DEFAULT.VALUE,
max_queued_requests: Default[int] = DEFAULT.VALUE,
autoscaling_config: Default[Union[Dict, AutoscalingConfig, None]] = DEFAULT.VALUE,
@@ -305,8 +300,6 @@ class MyDeployment:
user_config: Config to pass to the reconfigure method of the deployment. This
can be updated dynamically without restarting the replicas of the
deployment. The user_config must be fully JSON-serializable.
-        max_concurrent_queries: [DEPRECATED] Maximum number of queries that are sent to
-            a replica of this deployment without receiving a response. Defaults to 5.
max_ongoing_requests: Maximum number of requests that are sent to a
replica of this deployment without receiving a response. Defaults to 5.
max_queued_requests: [EXPERIMENTAL] Maximum number of requests to this
@@ -369,14 +362,11 @@ class MyDeployment:
if max_ongoing_requests is None:
raise ValueError("`max_ongoing_requests` must be non-null, got None.")
elif max_ongoing_requests is DEFAULT.VALUE:
-        if max_concurrent_queries is None:
-            logger.warning(
-                "The default value for `max_ongoing_requests` has changed from "
-                "100 to 5 in Ray 2.32.0."
-            )
-            max_ongoing_requests = DEFAULT_MAX_ONGOING_REQUESTS
-        else:
-            max_ongoing_requests = max_concurrent_queries
+        logger.warning(
+            "The default value for `max_ongoing_requests` has changed from "
+            "100 to 5 in Ray 2.32.0."
+        )
+
if num_replicas == "auto":
num_replicas = None
max_ongoing_requests, autoscaling_config = handle_num_replicas_auto(
@@ -422,12 +412,6 @@ class MyDeployment:
"`serve.run` instead."
)

-    if max_concurrent_queries is not DEFAULT.VALUE:
-        logger.warning(
-            "DeprecationWarning: `max_concurrent_queries` in `@serve.deployment` has "
-            "been deprecated and replaced by `max_ongoing_requests`."
-        )
-
if max_ongoing_requests is DEFAULT.VALUE:
logger.warning(
"The default value for `max_ongoing_requests` has changed from 100 to 5 in "
27 changes: 2 additions & 25 deletions python/ray/serve/deployment.py
@@ -11,7 +11,7 @@
ReplicaConfig,
handle_num_replicas_auto,
)
-from ray.serve._private.constants import DEFAULT_MAX_ONGOING_REQUESTS, SERVE_LOGGER_NAME
+from ray.serve._private.constants import SERVE_LOGGER_NAME
from ray.serve._private.usage import ServeUsageTag
from ray.serve._private.utils import DEFAULT, Default
from ray.serve.config import AutoscalingConfig
@@ -181,16 +181,6 @@ def user_config(self) -> Any:
"""Dynamic user-provided config options."""
return self._deployment_config.user_config

-    @property
-    def max_concurrent_queries(self) -> int:
-        """[DEPRECATED] Max number of requests a replica can handle at once."""
-
-        logger.warning(
-            "DeprecationWarning: `max_concurrent_queries` is deprecated, please use "
-            "`max_ongoing_requests` instead."
-        )
-        return self._deployment_config.max_ongoing_requests
-
@property
def max_ongoing_requests(self) -> int:
"""Max number of requests a replica can handle at once."""
@@ -326,7 +316,6 @@ def options(
placement_group_strategy: Default[str] = DEFAULT.VALUE,
max_replicas_per_node: Default[int] = DEFAULT.VALUE,
user_config: Default[Optional[Any]] = DEFAULT.VALUE,
-        max_concurrent_queries: Default[int] = DEFAULT.VALUE,
max_ongoing_requests: Default[int] = DEFAULT.VALUE,
max_queued_requests: Default[int] = DEFAULT.VALUE,
autoscaling_config: Default[
@@ -353,11 +342,6 @@ def options(
# `num_replicas="auto"`
if max_ongoing_requests is None:
raise ValueError("`max_ongoing_requests` must be non-null, got None.")
-        elif max_ongoing_requests is DEFAULT.VALUE:
-            if max_concurrent_queries is None:
-                max_ongoing_requests = DEFAULT_MAX_ONGOING_REQUESTS
-            else:
-                max_ongoing_requests = max_concurrent_queries
if num_replicas == "auto":
num_replicas = None
max_ongoing_requests, autoscaling_config = handle_num_replicas_auto(
@@ -413,12 +397,6 @@ def options(
"into `serve.run` instead."
)

-        if not _internal and max_concurrent_queries is not DEFAULT.VALUE:
-            logger.warning(
-                "DeprecationWarning: `max_concurrent_queries` in `@serve.deployment` "
-                "has been deprecated and replaced by `max_ongoing_requests`."
-            )
-
elif num_replicas not in [DEFAULT.VALUE, None]:
new_deployment_config.num_replicas = num_replicas

@@ -566,7 +544,6 @@ def deployment_to_schema(
"num_replicas": None
if d._deployment_config.autoscaling_config
else d.num_replicas,
"max_concurrent_queries": d.max_ongoing_requests,
"max_ongoing_requests": d.max_ongoing_requests,
"max_queued_requests": d.max_queued_requests,
"user_config": d.user_config,
@@ -635,7 +612,7 @@ def schema_to_deployment(s: DeploymentSchema) -> Deployment:
deployment_config = DeploymentConfig.from_default(
num_replicas=s.num_replicas,
user_config=s.user_config,
-        max_ongoing_requests=s.max_ongoing_requests or s.max_concurrent_queries,
+        max_ongoing_requests=s.max_ongoing_requests,
max_queued_requests=s.max_queued_requests,
autoscaling_config=s.autoscaling_config,
graceful_shutdown_wait_loop_s=s.graceful_shutdown_wait_loop_s,