[serve][streaming] Support streaming w/ model composition (using ServeHandle) #35777

Closed
edoakes opened this issue May 25, 2023 · 1 comment
Labels: enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), ray-team-created (Ray Team created), serve (Ray Serve Related Issue)

Comments

edoakes commented May 25, 2023

No description provided.

edoakes added the enhancement, P1, serve, and ray-team-created labels May 25, 2023
edoakes added a commit that referenced this issue May 26, 2023
…jectRefGenerator` (#35720)

Adds experimental support for using `StreamingResponse`s to stream intermediate results back to the client. This is currently gated behind a feature flag (you must set `RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=1`).

This is implemented by using the Ray `ObjectRefStreamingGenerator` interface. When the feature flag is on, the HTTP proxy will use `num_returns="streaming"` for _all_ calls to downstream replicas. The replica code has been modified to incrementally yield raw ASGI messages back to the HTTP proxy.
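
To illustrate the message flow described above, here is a minimal sketch that uses only the Python standard library. It is not the Ray Serve implementation: `replica_stream`, `proxy_forward`, and `run_demo` are hypothetical names, and a plain async generator stands in for the streaming generator returned by `num_returns="streaming"`. Only the message dictionaries follow the real ASGI HTTP response format.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, Iterable, List

Message = Dict[str, object]


async def replica_stream(chunks: Iterable[bytes]) -> AsyncIterator[Message]:
    """Hypothetical replica side: yield one raw ASGI message at a time."""
    yield {
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    }
    for chunk in chunks:
        # more_body=True signals that further body messages will follow.
        yield {"type": "http.response.body", "body": chunk, "more_body": True}
    # An empty body with more_body=False terminates the response.
    yield {"type": "http.response.body", "body": b"", "more_body": False}


async def proxy_forward(
    messages: AsyncIterator[Message],
    send: Callable[[Message], Awaitable[None]],
) -> None:
    """Hypothetical proxy side: forward each message as soon as it arrives."""
    async for message in messages:
        await send(message)


def run_demo(chunks: List[bytes]) -> List[Message]:
    """Drive the sketch with an in-memory `send` and return what was sent."""
    sent: List[Message] = []

    async def send(message: Message) -> None:
        sent.append(message)

    asyncio.run(proxy_forward(replica_stream(chunks), send))
    return sent
```

Calling `run_demo([b"a", b"b"])` returns the start message followed by one body message per chunk and a final terminating body message, which is the order in which a client would receive them.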

Known limitations & follow-ups (to be addressed before a non-experimental release):

- Minor performance regression due to an extra RPC in the streaming protocol (see the microbenchmark results posted on #35468). Most of this overhead should be possible to optimize away before streaming is turned on by default.
- Streaming is not yet possible using the `ServeHandle` interface: #35777
- `max_concurrent_queries` is not respected by the HTTP proxy when streaming is enabled; we do simple round-robin instead: #35778
- The timeout set in the HTTP proxy does not apply to streaming responses: #35779
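
The `max_concurrent_queries` limitation in the list above can be pictured with a small sketch. This is not Ray Serve's router code: `round_robin`, `pick_with_cap`, the replica names, and the in-flight bookkeeping are all hypothetical, and the cap-aware variant is only one possible way a per-replica limit could be honored.

```python
import itertools
from typing import Dict, Iterator, List, Optional


def round_robin(replicas: List[str]) -> Iterator[str]:
    """Cycle through replicas in order, ignoring how busy each one is."""
    return itertools.cycle(replicas)


def pick_with_cap(
    replicas: List[str], in_flight: Dict[str, int], cap: int
) -> Optional[str]:
    """Return the first replica below the cap, or None if all are saturated."""
    for replica in replicas:
        if in_flight.get(replica, 0) < cap:
            return replica
    return None
```

With simple round-robin, a replica that is already handling `cap` requests still receives the next one in turn, whereas the cap-aware variant skips it or makes the caller wait when it returns `None`.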
edoakes added a commit to edoakes/ray that referenced this issue May 26, 2023
…jectRefGenerator` (ray-project#35720)

Signed-off-by: Edward Oakes <[email protected]>
ArturNiederfahrenhorst pushed a commit that referenced this issue May 26, 2023
…jectRefGenerator` (#35720) (#35811)

Signed-off-by: Edward Oakes <[email protected]>
edoakes self-assigned this May 30, 2023
scv119 pushed a commit to scv119/ray that referenced this issue Jun 16, 2023
…jectRefGenerator` (ray-project#35720)

edoakes commented Jul 7, 2023

Closed by #37114

edoakes closed this as completed Jul 7, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
…jectRefGenerator` (ray-project#35720)

Signed-off-by: e428265 <[email protected]>