[serve] remove max_concurrent_queries (#46427)

Remove `max_concurrent_queries`, which was deprecated in Ray 2.10 and replaced by `max_ongoing_requests`.
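A minimal before/after sketch of the rename (the `Echo` deployment is illustrative, not part of this commit):

```python
from ray import serve

# Before (deprecated since Ray 2.10, removed by this commit):
# @serve.deployment(max_concurrent_queries=10)

# After:
@serve.deployment(max_ongoing_requests=10)
class Echo:
    def __call__(self, request) -> str:
        return "hello"

app = Echo.bind()
```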

Signed-off-by: Cindy Zhang <[email protected]>
zcin authored Jul 16, 2024
1 parent aafc7a6 commit b610a0b
Showing 20 changed files with 45 additions and 169 deletions.
3 changes: 0 additions & 3 deletions doc/source/serve/advanced-guides/advanced-autoscaling.md
@@ -30,9 +30,6 @@ Always load test your workloads. For example, if the use case is latency sensiti
As an example, suppose you have two replicas of a synchronous deployment that has 100ms latency, serving a traffic load of 30 QPS. Then Serve assigns requests to replicas faster than the replicas can finish processing them; more and more requests queue up at the replica (these requests are "ongoing requests") as time progresses, and then the average number of ongoing requests at each replica steadily increases. Latency also increases because new requests have to wait for old requests to finish processing. If you set `target_ongoing_requests = 1`, Serve detects a higher than desired number of ongoing requests per replica, and adds more replicas. At 3 replicas, your system would be able to process 30 QPS with 1 ongoing request per replica on average.
:::
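As a rough sketch of that arithmetic (Little's law; the numbers mirror the example above, and `replicas_needed` is an illustrative helper, not a Serve API):

```python
import math

def replicas_needed(qps: float, latency_s: float, target_ongoing_requests: float) -> int:
    # Little's law: average concurrency = arrival rate * time in system.
    concurrency = qps * latency_s
    return math.ceil(concurrency / target_ongoing_requests)

# 30 QPS * 0.1 s = 3 concurrent requests on average -> 3 replicas at target 1.
assert replicas_needed(qps=30, latency_s=0.1, target_ongoing_requests=1) == 3
```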

-#### **max_concurrent_queries [default=5] (DEPRECATED)**
-This parameter is renamed to `max_ongoing_requests`. `max_concurrent_queries` will be removed in a future release.
-
#### **max_ongoing_requests [default=5]**
:::{note}
The default for `max_ongoing_requests` changed from 100 to 5 in Ray 2.32.0. You can continue to set it manually to override the default.
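For instance, a sketch of pinning the old default explicitly (the class is illustrative):

```python
from ray import serve

@serve.deployment(max_ongoing_requests=100)  # pin the pre-2.32 default explicitly
class HighThroughput:
    async def __call__(self, request):
        return "ok"
```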
4 changes: 2 additions & 2 deletions doc/source/serve/autoscaling-guide.md
@@ -46,7 +46,7 @@ You can set `num_replicas="auto"` and override its default values (shown above)
Let's dive into what each of these parameters does.

* **target_ongoing_requests** (replaces the deprecated `target_num_ongoing_requests_per_replica`) is the average number of ongoing requests per replica that the Serve autoscaler tries to ensure. You can adjust it based on your request processing length (the longer the requests, the smaller this number should be) as well as your latency objective (the shorter you want your latency to be, the smaller this number should be).
-* **max_ongoing_requests** (replaces the deprecated `max_concurrent_queries`) is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it's relevant to all deployments, but it's important to set it relative to the target value if you turn on autoscaling for your deployment.
+* **max_ongoing_requests** is the maximum number of ongoing requests allowed for a replica. Note this parameter is not part of the autoscaling config because it's relevant to all deployments, but it's important to set it relative to the target value if you turn on autoscaling for your deployment.
* **min_replicas** is the minimum number of replicas for the deployment. Set this to 0 if there are long periods of no traffic and some extra tail latency during upscale is acceptable. Otherwise, set this to what you think you need for low traffic.
* **max_replicas** is the maximum number of replicas for the deployment. Set this to ~20% higher than what you think you need for peak traffic.
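A sketch combining these knobs (values are illustrative, not tuned recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # hard per-replica cap; lives outside autoscaling_config
    autoscaling_config={
        "target_ongoing_requests": 2,  # autoscaler steers toward this average
        "min_replicas": 1,  # 0 is possible if extra cold-start latency is acceptable
        "max_replicas": 12,  # roughly 20% above the expected peak requirement
    },
)
class Model:
    async def __call__(self, request):
        return "ok"

app = Model.bind()
```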

@@ -104,4 +104,4 @@ The Ray Serve Autoscaler is an application-level autoscaler that sits on top of
Concretely, this means that the Ray Serve autoscaler asks Ray to start a number of replica actors based on the request demand.
If the Ray Autoscaler determines there aren't enough available resources (e.g. CPUs, GPUs, etc.) to place these actors, it responds by requesting more Ray nodes.
The underlying cloud provider then responds by adding more nodes.
-Similarly, when Ray Serve scales down and terminates replica Actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve Autoscaling, see [Ray Serve Autoscaling Architecture](serve-autoscaling-architecture).
+Similarly, when Ray Serve scales down and terminates replica Actors, it attempts to make as many nodes idle as possible so the Ray Autoscaler can remove them. To learn more about the architecture underlying Ray Serve Autoscaling, see [Ray Serve Autoscaling Architecture](serve-autoscaling-architecture).
2 changes: 1 addition & 1 deletion doc/source/serve/configure-serve-deployment.md
@@ -15,7 +15,7 @@ You can also refer to the [API reference](../serve/api/doc/ray.serve.deployment_
- `name` - Name uniquely identifying this deployment within the application. If not provided, the name of the class or function is used.
- `num_replicas` - Controls the number of replicas to run that handle requests to this deployment. This can be a positive integer, in which case the number of replicas stays constant, or `auto`, in which case the number of replicas will autoscale with a default configuration (see [Ray Serve Autoscaling](serve-autoscaling) for more). Defaults to 1.
- `ray_actor_options` - Options to pass to the Ray Actor decorator, such as resource requirements. Valid options are: `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env` For more details - [Resource management in Serve](serve-cpus-gpus)
-- `max_ongoing_requests` (replaces the deprecated `max_concurrent_queries`) - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100 (the default will change to 5 in an upcoming release). This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
+- `max_ongoing_requests` - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100 (the default will change to 5 in an upcoming release). This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
- `autoscaling_config` - Parameters to configure autoscaling behavior. If this is set, you can't set `num_replicas` to a number. For more details on configurable parameters for autoscaling, see [Ray Serve Autoscaling](serve-autoscaling).
- `user_config` - Config to pass to the reconfigure method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The user_config must be fully JSON-serializable. For more details, see [Serve User Config](serve-user-config).
- `health_check_period_s` - Duration between health check calls for the replica. Defaults to 10s. The health check is by default a no-op Actor call to the replica, but you can define your own health check using the "check_health" method in your deployment that raises an exception when unhealthy.
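A sketch touching several of these options at once (names and values are illustrative):

```python
from ray import serve

@serve.deployment(
    name="translator",  # unique within the application
    num_replicas=2,
    ray_actor_options={"num_cpus": 1},
    max_ongoing_requests=5,
    user_config={"language": "en"},
    health_check_period_s=10,
)
class Translator:
    def __init__(self):
        self.language = "en"

    def reconfigure(self, config: dict):
        # Receives `user_config`; can be updated without restarting replicas.
        self.language = config.get("language", "en")

    def __call__(self, request) -> str:
        return f"translating to {self.language}"

app = Translator.bind()
```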
2 changes: 1 addition & 1 deletion doc/source/serve/doc_code/load_shedding.py
@@ -8,7 +8,7 @@

@serve.deployment(
# Each replica will be sent 2 requests at a time.
-    max_concurrent_queries=2,
+    max_ongoing_requests=2,
# Each caller queues up to 2 requests at a time.
# (beyond those that are sent to replicas).
max_queued_requests=2,
4 changes: 2 additions & 2 deletions doc/source/serve/production-guide/best-practices.md
@@ -59,7 +59,7 @@ This controls the maximum number of requests that each {mod}`DeploymentHandle <r
Once the limit is reached, enqueueing any new requests immediately raises a {mod}`BackPressureError <ray.serve.exceptions.BackPressureError>`.
HTTP requests will return a `503` status code (service unavailable).
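A sketch of handling that error from a handle caller (the deployment and app names are hypothetical):

```python
from ray import serve
from ray.serve.exceptions import BackPressureError

handle = serve.get_deployment_handle("SlowDeployment", app_name="default")

try:
    # Raises BackPressureError once the handle's queue limit is exceeded.
    result = handle.remote().result()
except BackPressureError:
    result = None  # shed load: fail fast or retry with backoff
```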

-The following example defines a deployment that emulates slow request handling and has `max_concurrent_queries` and `max_queued_requests` configured.
+The following example defines a deployment that emulates slow request handling and has `max_ongoing_requests` and `max_queued_requests` configured.

```{literalinclude} ../doc_code/load_shedding.py
:start-after: __example_deployment_start__
@@ -68,7 +68,7 @@ The following example defines a deployment that emulates slow request handling a
```

To test the behavior, send HTTP requests in parallel to emulate multiple clients.
-Serve accepts `max_concurrent_queries` and `max_queued_requests` requests, and rejects further requests with a `503`, or service unavailable, status.
+Serve accepts `max_ongoing_requests` and `max_queued_requests` requests, and rejects further requests with a `503`, or service unavailable, status.

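A rough client-side sketch of that parallel test (assuming the app is served at localhost:8000; the literalincluded doc example below is the authoritative version):

```python
import concurrent.futures

import requests

def send(_: int) -> int:
    # 200 while ongoing + queued capacity remains; 503 once requests are shed.
    return requests.get("http://localhost:8000/", timeout=60).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    codes = list(pool.map(send, range(10)))

print(f"{codes.count(200)} accepted, {codes.count(503)} shed")
```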
```{literalinclude} ../doc_code/load_shedding.py
:start-after: __client_test_start__
25 changes: 0 additions & 25 deletions java/serve/src/main/java/io/ray/serve/config/DeploymentConfig.java
@@ -22,12 +22,6 @@ public class DeploymentConfig implements Serializable {
*/
private Integer numReplicas = 1;

-  /**
-   * [DEPRECATED] The maximum number of queries that can be sent to a replica of this deployment
-   * without receiving a response. Defaults to 100.
-   */
-  private Integer maxConcurrentQueries = 100;
-
/**
* The maximum number of requests that can be sent to a replica of this deployment without
* receiving a response. Defaults to 100.
@@ -81,22 +75,10 @@ public DeploymentConfig setNumReplicas(Integer numReplicas) {
return this;
}

-  public Integer getMaxConcurrentQueries() {
-    return maxConcurrentQueries;
-  }
-
public Integer getMaxOngoingRequests() {
return maxOngoingRequests;
}

-  public DeploymentConfig setMaxConcurrentQueries(Integer maxConcurrentQueries) {
-    if (maxConcurrentQueries != null) {
-      Preconditions.checkArgument(maxConcurrentQueries > 0, "max_concurrent_queries must be > 0");
-      this.maxConcurrentQueries = maxConcurrentQueries;
-    }
-    return this;
-  }
-
public DeploymentConfig setMaxOngoingRequests(Integer maxOngoingRequests) {
if (maxOngoingRequests != null) {
Preconditions.checkArgument(maxOngoingRequests > 0, "max_ongoing_requests must be > 0");
@@ -208,12 +190,6 @@ public void setPrevVersion(String prevVersion) {
}

public byte[] toProtoBytes() {
-    Integer maxOngoingRequests;
-    if (this.maxOngoingRequests == null) {
-      maxOngoingRequests = this.maxConcurrentQueries;
-    } else {
-      maxOngoingRequests = this.maxOngoingRequests;
-    }
io.ray.serve.generated.DeploymentConfig.Builder builder =
io.ray.serve.generated.DeploymentConfig.newBuilder()
.setNumReplicas(numReplicas)
@@ -262,7 +238,6 @@ public static DeploymentConfig fromProto(io.ray.serve.generated.DeploymentConfig
return deploymentConfig;
}
deploymentConfig.setNumReplicas(proto.getNumReplicas());
-    deploymentConfig.setMaxConcurrentQueries(proto.getMaxOngoingRequests());
deploymentConfig.setMaxOngoingRequests(proto.getMaxOngoingRequests());
deploymentConfig.setGracefulShutdownWaitLoopS(proto.getGracefulShutdownWaitLoopS());
deploymentConfig.setGracefulShutdownTimeoutS(proto.getGracefulShutdownTimeoutS());
@@ -100,7 +100,7 @@ public DeploymentCreator options() {
.setRoutePrefix(this.routePrefix)
.setRayActorOptions(this.replicaConfig.getRayActorOptions())
.setUserConfig(this.deploymentConfig.getUserConfig())
-        .setMaxConcurrentQueries(this.deploymentConfig.getMaxConcurrentQueries())
+        .setMaxOngoingRequests(this.deploymentConfig.getMaxOngoingRequests())
.setAutoscalingConfig(this.deploymentConfig.getAutoscalingConfig())
.setGracefulShutdownWaitLoopS(this.deploymentConfig.getGracefulShutdownWaitLoopS())
.setGracefulShutdownTimeoutS(this.deploymentConfig.getGracefulShutdownTimeoutS())
@@ -66,7 +66,7 @@ public class DeploymentCreator {
* The maximum number of queries that will be sent to a replica of this deployment without
* receiving a response. Defaults to 100.
*/
-  private Integer maxConcurrentQueries;
+  private Integer maxOngoingRequests;

private AutoscalingConfig autoscalingConfig;

@@ -105,7 +105,7 @@ public Deployment create(boolean check) {
DeploymentConfig deploymentConfig =
new DeploymentConfig()
.setNumReplicas(numReplicas != null ? numReplicas : 1)
-            .setMaxConcurrentQueries(maxConcurrentQueries)
+            .setMaxOngoingRequests(maxOngoingRequests)
.setUserConfig(userConfig)
.setAutoscalingConfig(autoscalingConfig)
.setGracefulShutdownWaitLoopS(gracefulShutdownWaitLoopS)
@@ -204,12 +204,12 @@ public DeploymentCreator setUserConfig(Object userConfig) {
return this;
}

-  public Integer getMaxConcurrentQueries() {
-    return maxConcurrentQueries;
+  public Integer getMaxOngoingRequests() {
+    return maxOngoingRequests;
  }

-  public DeploymentCreator setMaxConcurrentQueries(Integer maxConcurrentQueries) {
-    this.maxConcurrentQueries = maxConcurrentQueries;
+  public DeploymentCreator setMaxOngoingRequests(Integer maxOngoingRequests) {
+    this.maxOngoingRequests = maxOngoingRequests;
return this;
}

@@ -102,7 +102,7 @@ private ObjectRef<Object> tryAssignReplica(Query query) {
}
int randomIndex = RandomUtils.nextInt(0, handles.size());
BaseActorHandle replica =
-        handles.get(randomIndex); // TODO controll concurrency using maxConcurrentQueries
+        handles.get(randomIndex); // TODO control concurrency using maxOngoingRequests
LOGGER.debug("Assigned query {} to replica {}.", query.getMetadata().getRequestId(), replica);
if (replica instanceof PyActorHandle) {
Object[] args =
6 changes: 2 additions & 4 deletions python/ray/serve/_private/application_state.py
@@ -1108,10 +1108,8 @@ def override_deployment_info(

# Override options for each deployment listed in the config.
for options in deployment_override_options:
if "max_concurrent_queries" in options or "max_ongoing_requests" in options:
options["max_ongoing_requests"] = options.get(
"max_ongoing_requests"
) or options.get("max_concurrent_queries")
if "max_ongoing_requests" in options:
options["max_ongoing_requests"] = options.get("max_ongoing_requests")

deployment_name = options["name"]
info = deployment_infos[deployment_name]
28 changes: 6 additions & 22 deletions python/ray/serve/api.py
@@ -16,11 +16,7 @@
ReplicaConfig,
handle_num_replicas_auto,
)
-from ray.serve._private.constants import (
-    DEFAULT_MAX_ONGOING_REQUESTS,
-    SERVE_DEFAULT_APP_NAME,
-    SERVE_LOGGER_NAME,
-)
+from ray.serve._private.constants import SERVE_DEFAULT_APP_NAME, SERVE_LOGGER_NAME
from ray.serve._private.deployment_graph_build import build as pipeline_build
from ray.serve._private.deployment_graph_build import (
get_and_validate_ingress_deployment,
@@ -256,7 +252,6 @@ def deployment(
placement_group_strategy: Default[str] = DEFAULT.VALUE,
max_replicas_per_node: Default[int] = DEFAULT.VALUE,
user_config: Default[Optional[Any]] = DEFAULT.VALUE,
-    max_concurrent_queries: Default[int] = DEFAULT.VALUE,
max_ongoing_requests: Default[int] = DEFAULT.VALUE,
max_queued_requests: Default[int] = DEFAULT.VALUE,
autoscaling_config: Default[Union[Dict, AutoscalingConfig, None]] = DEFAULT.VALUE,
@@ -305,8 +300,6 @@ class MyDeployment:
user_config: Config to pass to the reconfigure method of the deployment. This
can be updated dynamically without restarting the replicas of the
deployment. The user_config must be fully JSON-serializable.
-        max_concurrent_queries: [DEPRECATED] Maximum number of queries that are sent to
-            a replica of this deployment without receiving a response. Defaults to 5.
max_ongoing_requests: Maximum number of requests that are sent to a
replica of this deployment without receiving a response. Defaults to 5.
max_queued_requests: [EXPERIMENTAL] Maximum number of requests to this
@@ -369,14 +362,11 @@ class MyDeployment:
if max_ongoing_requests is None:
raise ValueError("`max_ongoing_requests` must be non-null, got None.")
elif max_ongoing_requests is DEFAULT.VALUE:
-        if max_concurrent_queries is None:
-            logger.warning(
-                "The default value for `max_ongoing_requests` has changed from "
-                "100 to 5 in Ray 2.32.0."
-            )
-            max_ongoing_requests = DEFAULT_MAX_ONGOING_REQUESTS
-        else:
-            max_ongoing_requests = max_concurrent_queries
+        logger.warning(
+            "The default value for `max_ongoing_requests` has changed from "
+            "100 to 5 in Ray 2.32.0."
+        )
+
if num_replicas == "auto":
num_replicas = None
max_ongoing_requests, autoscaling_config = handle_num_replicas_auto(
@@ -422,12 +412,6 @@ class MyDeployment:
"`serve.run` instead."
)

-    if max_concurrent_queries is not DEFAULT.VALUE:
-        logger.warning(
-            "DeprecationWarning: `max_concurrent_queries` in `@serve.deployment` has "
-            "been deprecated and replaced by `max_ongoing_requests`."
-        )
-
if max_ongoing_requests is DEFAULT.VALUE:
logger.warning(
"The default value for `max_ongoing_requests` has changed from 100 to 5 in "
27 changes: 2 additions & 25 deletions python/ray/serve/deployment.py
@@ -11,7 +11,7 @@
ReplicaConfig,
handle_num_replicas_auto,
)
-from ray.serve._private.constants import DEFAULT_MAX_ONGOING_REQUESTS, SERVE_LOGGER_NAME
+from ray.serve._private.constants import SERVE_LOGGER_NAME
from ray.serve._private.usage import ServeUsageTag
from ray.serve._private.utils import DEFAULT, Default
from ray.serve.config import AutoscalingConfig
@@ -181,16 +181,6 @@ def user_config(self) -> Any:
"""Dynamic user-provided config options."""
return self._deployment_config.user_config

-    @property
-    def max_concurrent_queries(self) -> int:
-        """[DEPRECATED] Max number of requests a replica can handle at once."""
-
-        logger.warning(
-            "DeprecationWarning: `max_concurrent_queries` is deprecated, please use "
-            "`max_ongoing_requests` instead."
-        )
-        return self._deployment_config.max_ongoing_requests
-
@property
def max_ongoing_requests(self) -> int:
"""Max number of requests a replica can handle at once."""
@@ -326,7 +316,6 @@ def options(
placement_group_strategy: Default[str] = DEFAULT.VALUE,
max_replicas_per_node: Default[int] = DEFAULT.VALUE,
user_config: Default[Optional[Any]] = DEFAULT.VALUE,
-        max_concurrent_queries: Default[int] = DEFAULT.VALUE,
max_ongoing_requests: Default[int] = DEFAULT.VALUE,
max_queued_requests: Default[int] = DEFAULT.VALUE,
autoscaling_config: Default[
@@ -353,11 +342,6 @@ def options(
# `num_replicas="auto"`
if max_ongoing_requests is None:
raise ValueError("`max_ongoing_requests` must be non-null, got None.")
-        elif max_ongoing_requests is DEFAULT.VALUE:
-            if max_concurrent_queries is None:
-                max_ongoing_requests = DEFAULT_MAX_ONGOING_REQUESTS
-            else:
-                max_ongoing_requests = max_concurrent_queries
if num_replicas == "auto":
num_replicas = None
max_ongoing_requests, autoscaling_config = handle_num_replicas_auto(
@@ -413,12 +397,6 @@ def options(
"into `serve.run` instead."
)

-        if not _internal and max_concurrent_queries is not DEFAULT.VALUE:
-            logger.warning(
-                "DeprecationWarning: `max_concurrent_queries` in `@serve.deployment` "
-                "has been deprecated and replaced by `max_ongoing_requests`."
-            )
-
elif num_replicas not in [DEFAULT.VALUE, None]:
new_deployment_config.num_replicas = num_replicas

@@ -566,7 +544,6 @@ def deployment_to_schema(
"num_replicas": None
if d._deployment_config.autoscaling_config
else d.num_replicas,
"max_concurrent_queries": d.max_ongoing_requests,
"max_ongoing_requests": d.max_ongoing_requests,
"max_queued_requests": d.max_queued_requests,
"user_config": d.user_config,
@@ -635,7 +612,7 @@ def schema_to_deployment(s: DeploymentSchema) -> Deployment:
deployment_config = DeploymentConfig.from_default(
num_replicas=s.num_replicas,
user_config=s.user_config,
-        max_ongoing_requests=s.max_ongoing_requests or s.max_concurrent_queries,
+        max_ongoing_requests=s.max_ongoing_requests,
max_queued_requests=s.max_queued_requests,
autoscaling_config=s.autoscaling_config,
graceful_shutdown_wait_loop_s=s.graceful_shutdown_wait_loop_s,