[Serve] ray_serve_deployment_queued_queries doesn't handle client disconnects #37943

Closed
shrekris-anyscale opened this issue Jul 31, 2023 · 0 comments · Fixed by #37965
Labels
bug (Something that is supposed to be working; but isn't) · P1 (Issue that should be fixed within a few weeks) · serve (Ray Serve Related Issue)

Comments

@shrekris-anyscale
Contributor

What happened + What you expected to happen

The ray_serve_deployment_queued_queries metric tracks the number of ServeHandle requests that are queued for a particular deployment. It increments whenever a query starts being assigned, and it decrements once a query starts being processed.

However, when a client disconnects before a request is assigned to a replica, the request gets canceled, but the metric never decrements for that request. This causes the metric to increase over time as clients disconnect.

I expect the metric to decrement when a client disconnects and the request is canceled.
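
To illustrate the failure mode, here is a minimal sketch in plain asyncio. It is not Ray Serve's router code, and every name in it (QueuedCounter, assign_leaky, assign_safe, simulate) is hypothetical. It assumes the gauge is incremented before an awaitable "wait for a replica" step and decremented after it, and it models a client disconnect as a task cancellation. One way to keep such a gauge balanced is to decrement in a finally block; this is only a possible approach, not necessarily what the eventual fix does.

```python
import asyncio


class QueuedCounter:
    """Hypothetical stand-in for the queued-queries gauge."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self) -> None:
        self.value += 1

    def dec(self) -> None:
        self.value -= 1


async def assign_leaky(counter: QueuedCounter, replica_free: asyncio.Event) -> None:
    # Decrements only on the success path. If the task is cancelled while
    # waiting, the decrement is skipped and the gauge stays inflated.
    counter.inc()
    await replica_free.wait()
    counter.dec()


async def assign_safe(counter: QueuedCounter, replica_free: asyncio.Event) -> None:
    # The finally block runs on success, error, and cancellation alike.
    counter.inc()
    try:
        await replica_free.wait()
    finally:
        counter.dec()


async def simulate(assign) -> int:
    counter = QueuedCounter()
    replica_free = asyncio.Event()  # never set: no replica becomes available
    task = asyncio.create_task(assign(counter, replica_free))
    await asyncio.sleep(0)  # let the task start and increment the gauge
    task.cancel()  # models the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    return counter.value


leaky_result = asyncio.run(simulate(assign_leaky))
safe_result = asyncio.run(simulate(assign_safe))
print(leaky_result, safe_result)  # prints "1 0"
```

The leaky variant leaves the counter at 1 after the cancellation, which matches the upward drift described above; the try/finally variant returns it to 0.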

Versions / Dependencies

Ray on the latest master.

Reproduction script

# bad_metrics.py

import time
import logging
from ray import serve

logger = logging.getLogger("ray.serve")

@serve.deployment(
    max_concurrent_queries=100,  # 1
    graceful_shutdown_timeout_s=0.0001,
)
def wait(request, *args):
    logger.info("Started a request...")
    time.sleep(100000)
    logger.info("Finished a request...")

app = wait.bind()

Run this app on a Ray cluster that exposes Prometheus metrics (e.g. ray start --head --metrics-export-port=8080). Submit a few curl requests from different terminal windows, then kill the curl processes before the requests complete. The metric does not decrease.
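
For anyone who would rather script the manual steps than juggle terminal windows, here is a rough sketch using only the Python standard library. It rests on several assumptions that are not stated above: that Serve's HTTP proxy listens on localhost:8000, that the metrics endpoint is reachable at http://localhost:8080 (matching the --metrics-export-port value in the example command), that samples are exported without a trailing timestamp, and that 15 seconds is long enough for the next metrics export. The helper names (sum_metric, send_and_disconnect, main) are made up for illustration.

```python
import socket
import time
import urllib.request

METRIC = "ray_serve_deployment_queued_queries"


def sum_metric(prometheus_text: str, name: str) -> float:
    """Sum all samples of `name` in Prometheus text exposition format.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    total = 0.0
    for line in prometheus_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        # A sample line looks like: name{label="x"} 3.0
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name == name:
            total += float(line.rsplit(" ", 1)[1])
    return total


def send_and_disconnect(host: str, port: int) -> None:
    """Send an HTTP request, then drop the connection without reading a reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"GET / HTTP/1.1\r\nHost: " + host.encode() + b"\r\n\r\n")
        time.sleep(1)  # give the proxy a moment to pick up the request


def main() -> None:
    # Assumed endpoints; adjust to match your cluster.
    for _ in range(5):
        send_and_disconnect("localhost", 8000)
    time.sleep(15)  # assumed to cover at least one metrics export
    with urllib.request.urlopen("http://localhost:8080") as resp:
        text = resp.read().decode()
    print(METRIC, "=", sum_metric(text, METRIC))
```

Calling main() against a running cluster should print the summed gauge value; if the bug is present, the value is expected to stay above zero after the connections are dropped.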

Issue Severity

Medium: It is a significant difficulty but I can work around it.

shrekris-anyscale added the bug, P1, and serve labels on Jul 31, 2023
shrekris-anyscale self-assigned this on Jul 31, 2023
shrekris-anyscale changed the title from "[Serve] serve_deployment_queued_queries doesn't handle client disconnects" to "[Serve] ray_serve_deployment_queued_queries doesn't handle client disconnects" on Aug 1, 2023