
Requests buffered at the receiving side generate unbounded latency #1409

Closed

markusthoemmes opened this issue Jun 29, 2018 · 16 comments

Labels: area/autoscale, area/networking, kind/bug, P1

@markusthoemmes (Contributor)

Expected Behavior

When a revision's ConcurrencyModel is set to SingleConcurrency and N clients call the revision at the same time, I expect Knative to scale to N pods and to route the requests already in the system to the newly created pods.

Actual Behavior

The first N requests are all sent to one container and worked on sequentially. The Nth request therefore sees, at minimum, the combined latency of all the previous requests. This adds unnecessary latency to the first N requests, which can become huge if each request itself takes a long time.
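
For illustration (borrowing the 10-second processing time used later in this thread): with single concurrency and 5 simultaneous requests that each take 10 seconds, the 5th request waits roughly 40 seconds behind the others and completes after about 50 seconds, even though a freshly scaled pod could have served it in 10 seconds plus cold-start time.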

Instead, the excess requests should be routed to the newly created pods (which are scaled up just fine as of today).

Steps to Reproduce the Problem

  1. Run an application in SingleConcurrency mode
  2. Hit the route with multiple concurrent requests
  3. Observe the latency of those requests

(The test case I described in #1126 (comment) makes this issue apparent in its output.)
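
For anyone without that test case at hand, here is a minimal load-generator sketch along the same lines (Go; the route URL and request count are placeholders, not values from #1126):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Placeholder route for a revision running with single concurrency.
	const route = "http://helloworld.default.example.com"
	const n = 10 // number of concurrent clients

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			t := time.Now()
			resp, err := http.Get(route)
			if err != nil {
				fmt.Printf("request %d failed: %v\n", i, err)
				return
			}
			resp.Body.Close()
			// With receive-side buffering these latencies grow roughly linearly
			// with queue position instead of staying near one request's duration.
			fmt.Printf("request %d: %v\n", i, time.Since(t))
		}(i)
	}
	wg.Wait()
	fmt.Printf("total: %v\n", time.Since(start))
}
```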

Additional Info

The problem above is caused by all requests being routed to the one pod that already exists (or that is created via 0 -> 1 scaling). Concurrency is measured at that pod, and only then does the autoscaler create new pods.
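
To make "buffered at the receiving side" concrete, here is a minimal sketch of the receive-side limiter (illustrative Go, not the actual queue-proxy code; the port and target address are made up): a proxy in front of a single-concurrency container admits one request at a time, and every other request simply waits in line.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// concurrencyLimit admits at most `limit` requests to the wrapped handler at
// once; everything beyond that blocks here, which is exactly the receive-side
// queuing that produces the compounding latency described above.
func concurrencyLimit(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sem <- struct{}{}        // wait for a free slot
		defer func() { <-sem }() // release it when the request finishes
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Illustrative target: the user container listening on localhost.
	target, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	log.Fatal(http.ListenAndServe(":8012", concurrencyLimit(1, proxy)))
}
```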

@google-prow-robot added the area/autoscale, area/networking, and kind/bug labels on Jun 29, 2018
@markusthoemmes (Contributor, Author) commented Jun 29, 2018

Some context from a Slack discussion regarding the topic:

mdemirhan [14 days ago]
Those numbers apply to each instance (pod) of the destination service and take effect on the client side, so e.g. httpMaxPendingRequests is the maximum outstanding requests from any individual client to one of the backends in that service per connection (each TCP connection is a separate circuit breaker, IIRC). However, as I understand your use case, the Istio ingress is the only "client" calling functions, so a circuit breaker restricting to one request per connection should mean one request per function pod, at the cost of establishing many connections (+connection establishment time as new backends spin up, and lots of Envoy config churn with super-ephemeral endpoints).

In general Istio cannot enforce "one request per backend" because that requires some notion of central coordination, and the sidecars are not coordinated by design. E.g. weighted traffic splitting is a commutative operation, so as long as each proxy in the mesh uses the same weights, in aggregate all traffic follows those weights. The same is not true of the number of requests to each backend. It might be possible to implement this style of behavior using Mixer's rate limiting/quota implementation (so Mixer can centrally coordinate the calls), but the cost would be increased latency (and quota as implemented today is only best effort, so it's very possible to get more calls than intended).

josephburnett [8 days ago]
I don't really want to queue requests on the receiving side, but we do need to enforce concurrency limits, especially when the limit is 1. And we don't have load-balancing guarantees from Istio that it will respect a concurrency limit per service instance.

@markusthoemmes (Contributor, Author) commented Jun 29, 2018

Three general ideas for how we could fix this issue:

  1. Queue at some more centralized place (like the ingress or a "centralized" set of queue-proxies), which has its own challenges in scalability.
  2. Keep queuing in the queue-proxy and let it know once new resources are available. It could then, in theory, reverse-proxy back to the service itself and go the Istio route again, if that makes sense?
  3. Aggressive retrying: immediately deny any request that doesn't fit into the container with a 503 and have Istio retries take care of it (see the sketch after this list). I wonder how much overhead that ping-pong would cause, though.
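
For contrast with the blocking sketch under "Additional Info" above, a rough sketch of option 3: instead of queuing, the proxy rejects overflow immediately with a 503 and relies on mesh retries to reach another backend (again illustrative Go, not actual Knative code).

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// rejectOverflow is the fast-fail variant of the limiter sketched above: when
// no slot is free it answers 503 immediately, so that Istio/Envoy retries can
// send the request to another backend instead of queuing it here.
func rejectOverflow(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: serve the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // full: fail fast and let the mesh retry elsewhere
			http.Error(w, "overloaded, please retry", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // illustrative user container
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	log.Fatal(http.ListenAndServe(":8012", rejectOverflow(1, proxy)))
}
```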

@markusthoemmes (Contributor, Author)

This is very relevant to #1846.

Keeping this open, though, as it captures the general requirement.

@mattmoor (Member)

I think that, in terms of mechanics, the approach I prefer most would be to have the KPA put the SKS into PROXY mode (as I commented in #1846).

I think the pertinent question then becomes: when should the activator be hooked into the request path?

From talking to @hohaichi about what's described here, queuing at the queue-proxy level (what he called "backlog") feels like a failure mode for the autoscaler. We'd still service the requests, but we'd be outside of what I'd consider the happy path.

My mental model for this involves having a particular "burst capacity" per-Revision (likely configured cluster-wide, maybe with an annotation for finer tuning), which is the amount of additional cushion each Revision should be able to accommodate in a burst.

Suppose we have a burst capacity of 100 concurrent requests and a concurrency target of 100; then I would need roughly one pod of surplus capacity, or a burst would result in "backlogging" and we would arguably be underprovisioned.

When we're underprovisioned for such a burst, we want the activator to proxy traffic so that the throttler can keep Revisions from being overloaded. Once the autoscaler catches up, we can disintermediate the data plane and allow traffic to flow directly again. However, when we are subject to bursts of up to our burst capacity, we should either:

  1. Have capacity assuming equal distribution across active pods.
  2. Go through the activator, which can buffer in a pod-agnostic fashion.
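
A back-of-envelope sketch of that decision as I read it (the function name and the exact formula are mine, not the KPA's actual code):

```go
package main

import "fmt"

// shouldProxy reports whether the activator should stay in the request path:
// if the revision's spare capacity cannot absorb the configured burst on top
// of the currently observed concurrency, keep buffering centrally.
func shouldProxy(readyPods int, targetPerPod, observed, burstCapacity float64) bool {
	spare := float64(readyPods)*targetPerPod - observed
	return spare < burstCapacity
}

func main() {
	// The numbers above: concurrency target 100, burst capacity 100,
	// and 100 requests currently in flight.
	fmt.Println(shouldProxy(1, 100, 100, 100)) // true: no surplus pod, proxy via the activator
	fmt.Println(shouldProxy(2, 100, 100, 100)) // false: one pod of surplus, traffic can flow directly
}
```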

Hopefully that makes sense (it's how I've been thinking about the problem), and I'd love to talk about this more now that we have a chance of solving it. 🎉

@hohaichi (Contributor)

I think it is a good idea to switch to PROXY mode as the first step in scaling up. It will help us not only to lower the latency, but also to measure cold requests separately from warm requests.

@vagababov (Contributor)

@duglin commented Jun 19, 2019

This comment is a follow-on to today's scaling call. Upon thinking about everything that was said, I'm still not sure whether my desired scenario can be supported today (via tweaks to config knobs) or not. So, let me describe my scenario here:

  • As a person sending in requests, I understand I may have to deal with cold-start times if all existing service instances are at full capacity, and that's ok. So for this discussion, to keep it simple, assume containerConcurrency=1.
  • What I'm not ok with is delays due to other requests being processed, and this is especially true in cases where the requests are not quick. For the purposes of this discussion, assume processing times much larger than cold starts, such as 10 seconds.
  • As @markusthoemmes said on today's call, end users aren't really concerned about what happens under the covers; they care more about things like observed latency. So, with that in mind, as a sender of requests I expect to see response times of 10 seconds + cold-start time, at most!
  • While there are, what I call, optimizations that can be done to try to reduce how often we spin up a new instance (e.g. tweak the "idle time" an instance waits after finishing a request before we kill it, or "guess" that processing times are consistent so that, if an instance is 8 seconds into processing a request, it might be faster to wait for it to finish than to spin up another instance), in the end a lot of those are still just "guesses". In order to guarantee a response time that isn't dependent on an existing request finishing quickly, as a person setting up my Kn (or KnService) I'd like to be able to tell Kn that I do not want "guesses" - I'm willing to pay for cold starts if that is needed based on the current state of all existing instances when a new request comes in.

If, as was suggested on today's call, there are certain config knobs I can set to get the above behavior, I'd like to open a docs PR describing how someone can achieve this goal, because I don't think we should expect people to piece together the various options "just right" to support this scenario. But in order for me to open that PR, I need someone to articulate which knobs and values people need to set.

If this scenario cannot be done today, then I'd like to ask that it be considered a "stop ship" issue for v1.0.

@markusthoemmes (Contributor, Author)

I'd like us to approach this from a more abstract standpoint.

I think you're saying that, given there is enough capacity in the cluster, I as a user want to be able to see at most cold-start latency + request latency.

Moreover, we should also consider thinking in percentiles rather than absolutes. 100th-percentile latency guarantees will be very hard to keep.

Regardless, the direction we're taking (threading the activator into the request path and using it as a more centralized buffer) has been on the agenda for a long while, and I think it's definitely a step in the right direction.

@markusthoemmes (Contributor, Author)

To add to the above: I think for many users it is not okay to deal with cold-start times, even when considering a certain increase in traffic.

As such, I think we should look at the "latency guarantee" dial not only in terms of buffering requests but also in terms of minimizing cold-start wait times. That's where burst capacity and/or overprovisioning comes into play.

@duglin commented Jun 19, 2019

I think you're saying that, given there is enough capacity in the cluster, I as a user want to be able to see at most cold-start latency + request latency.

yep

I think for many users it is not okay to deal with cold-start times, even when considering a certain increase in traffic.

might be true, but that's why I think there should be a config knob that lets someone decide which behavior they want, rather than Kn deciding it for everyone. I tend to think of this as a step-wise problem, and step one is to support the idea of "a new request comes in; are all instances busy? If so, spin up a new one" (see the sketch at the end of this comment). That should be the baseline. Then we can work on making the "busy" check smarter and doing the optimizations that have been discussed. But, based on today's call, I do not think we support this baseline today. That's what I was asking in my comment above, because if we do, I want to document it :-)

... also minimizing cold-start wait times.

totally agree. And my question was written with the assumption that our work on reducing cold-start times would continue regardless of what happens with this queuing issue.
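
For concreteness, the baseline as a sketch (hypothetical names, purely to pin down the semantics being asked for, not a proposal for the actual routing code):

```go
package main

import "fmt"

// dispatch models the baseline described above for containerConcurrency=1:
// a new request either goes to an idle instance or triggers a scale-up; it is
// never parked behind a request that is already being processed.
func dispatch(inFlight []int) (target int, scaleUp bool) {
	for i, n := range inFlight {
		if n == 0 {
			return i, false // an idle instance exists: use it
		}
	}
	return -1, true // everyone is busy: pay the cold start instead of queuing
}

func main() {
	idx, scale := dispatch([]int{1, 1, 1})
	fmt.Println(idx, scale) // -1 true: all three instances are busy, so spin up a fourth
}
```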

@vagababov (Contributor)

So for "certain" increase in traffic, we already have utilization target, which will deal with % of capacity allocated for bursts in traffic. Unfortunately, the % of capacity is independent of desired burst, but rather a function of current stable traffic.

The new knob permits you to have pod-start+serving time guarantee (% loadbalancing in case of very packed cluster, but I hope we all can agree is not the problem we're solving with current solution).

The spreadsheet at https://docs.google.com/spreadsheets/d/1F6t4xTsb6gnSOwsw1nkDNWPg9erKq_uBu9pVzQgH60E/edit#gid=0 lets you see how these two knobs interact and helps you model and decide what values you want (i.e. how much more you're willing to pay to get better guarantees that traffic is served quickly).
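
As a rough worked example of how the two knobs differ (my understanding; treat the numbers as illustrative): with a per-pod concurrency target of 10 and a utilization target of 70%, the autoscaler tries to keep about 7 requests in flight per pod, leaving roughly 3 requests of headroom per pod, an amount that grows with the stable traffic. A target burst capacity, by contrast, reserves a fixed absolute cushion (say, 200 requests) and keeps the activator in the request path until the revision has at least that much spare capacity, regardless of how small the stable traffic is.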

And I want to reiterate that this is not the final solution to the serverless problem, but a step in that bright direction.

HTH :-)

@vagababov (Contributor)

/close

@knative-prow-robot (Contributor)

@vagababov: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@duglin commented Jul 29, 2019

This one is closed; is it because people believe the problem is fixed? I'm still seeing it.

@vagababov (Contributor)

Did you set up the system with TBC (target burst capacity)?

@duglin commented Jul 29, 2019 via email
