
Requests buffered at the receiving side generate unbounded latency #1409

Closed

markusthoemmes opened this issue Jun 29, 2018 · 16 comments

Labels: area/autoscale, area/networking, kind/bug, P1

@markusthoemmes (Contributor)

Expected Behavior

When a revision's ConcurrencyModel is set to SingleConcurrency and N clients call the revision at the same time, I expect Knative to scale to N pods and to route the requests already in the system to the newly created pods.

Actual Behavior

The first N requests are all sent to one container and worked on sequentially. The Nth request therefore sees, at minimum, the combined latency of all the previous requests. This adds unnecessary latency to the first N requests, which can become huge if each request itself takes a long time.
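
For illustration (borrowing the 10-second processing time used later in this thread): with single concurrency and 5 simultaneous requests that each take 10 seconds, the 5th request waits roughly 40 seconds behind the others and completes after about 50 seconds, even though a freshly scaled pod could have served it in 10 seconds plus cold-start time.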

Instead, the excess requests should be routed to the newly created pods (which are scaled up just fine as of today).

Steps to Reproduce the Problem

  1. Run an application in SingleConcurrency mode
  2. Hit the route with multiple concurrent requests
  3. Observe the latency of those requests

(The test case I described in #1126 (comment) makes this issue apparent in its output.)
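
For anyone without that test case at hand, here is a minimal load-generator sketch along the same lines (Go; the route URL and request count are placeholders, not values from #1126):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Placeholder route for a revision running with single concurrency.
	const route = "http://helloworld.default.example.com"
	const n = 10 // number of concurrent clients

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			t := time.Now()
			resp, err := http.Get(route)
			if err != nil {
				fmt.Printf("request %d failed: %v\n", i, err)
				return
			}
			resp.Body.Close()
			// With receive-side buffering these latencies grow roughly linearly
			// with queue position instead of staying near one request's duration.
			fmt.Printf("request %d: %v\n", i, time.Since(t))
		}(i)
	}
	wg.Wait()
	fmt.Printf("total: %v\n", time.Since(start))
}
```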

Additional Info

The problem above is caused by all requests being routed to the one pod that already exists (or that is created via 0 -> 1 scaling). Concurrency is measured at that pod, and only then does the autoscaler create new pods.
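
To make "buffered at the receiving side" concrete, here is a minimal sketch of the receive-side limiter (illustrative Go, not the actual queue-proxy code; the port and target address are made up): a proxy in front of a single-concurrency container admits one request at a time, and every other request simply waits in line.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// concurrencyLimit admits at most `limit` requests to the wrapped handler at
// once; everything beyond that blocks here, which is exactly the receive-side
// queuing that produces the compounding latency described above.
func concurrencyLimit(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sem <- struct{}{}        // wait for a free slot
		defer func() { <-sem }() // release it when the request finishes
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Illustrative target: the user container listening on localhost.
	target, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	log.Fatal(http.ListenAndServe(":8012", concurrencyLimit(1, proxy)))
}
```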

@google-prow-robot added the area/autoscale, area/networking, and kind/bug labels on Jun 29, 2018
@markusthoemmes (Contributor, Author) commented Jun 29, 2018

Some context from a Slack discussion regarding the topic:

mdemirhan [14 days ago]
Those numbers apply to each instance (pod) of the destination service and take effect on the client side, so e.g. httpMaxPendingRequests is the maximum outstanding requests from any individual client to one of the backends in that service per connection (each TCP connection is a separate circuit breaker, IIRC). However, as I understand your use case, the Istio ingress is the only "client" calling functions, so a circuit breaker restricting to one request per connection should mean one request per function pod, at the cost of establishing many connections (+connection establishment time as new backends spin up, and lots of Envoy config churn with super-ephemeral endpoints).

In general Istio cannot enforce "one request per backend" because that requires some notion of central coordination, and the sidecars are not coordinated by design. E.g. weighted traffic splitting is a commutative operation, so as long as each proxy in the mesh uses the same weights, in aggregate all traffic follows those weights. The same is not true of the number of requests to each backend. It might be possible to implement this style of behavior using Mixer's rate limiting/quota implementation (so Mixer can centrally coordinate the calls), but the cost would be increased latency (and quota as implemented today is only best effort, so it's very possible to get more calls than intended).

josephburnett [8 days ago]
I don't really want to queue requests on the receiving side, but we do need to enforce concurrency limits, especially when the limit is 1. And we don't have load-balancing guarantees from Istio that it will respect a concurrency limit per service instance.

@markusthoemmes (Contributor, Author) commented Jun 29, 2018

Three general ideas for how we could fix this issue:

  1. Queue at some more centralized place (like the ingress or a "centralized" set of queue-proxies), which has its own challenges in scalability.
  2. Keep queuing in the queue-proxy and let it know once new resources are available. It could then, in theory, reverse-proxy back to the service itself and go the Istio route again, if that makes sense?
  3. Aggressive retrying: immediately deny any request that doesn't fit into the container with a 503 and have Istio retries take care of it (see the sketch after this list). I wonder how much overhead that ping-pong would cause, though.
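
For contrast with the blocking sketch under "Additional Info" above, a rough sketch of option 3: instead of queuing, the proxy rejects overflow immediately with a 503 and relies on mesh retries to reach another backend (again illustrative Go, not actual Knative code).

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// rejectOverflow is the fast-fail variant of the limiter sketched above: when
// no slot is free it answers 503 immediately, so that Istio/Envoy retries can
// send the request to another backend instead of queuing it here.
func rejectOverflow(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: serve the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // full: fail fast and let the mesh retry elsewhere
			http.Error(w, "overloaded, please retry", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	target, err := url.Parse("http://127.0.0.1:8080") // illustrative user container
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	log.Fatal(http.ListenAndServe(":8012", rejectOverflow(1, proxy)))
}
```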

@markusthoemmes (Contributor, Author)

This is very relevant to #1846.

Keeping this open, though, as it captures the general requirement.

@mattmoor (Member)

I think that, in terms of mechanics, the approach I prefer most would be to have the KPA put the SKS into PROXY mode (as I commented in #1846).

I think the pertinent question then becomes: when should the activator be hooked into the request path?

From talking to @hohaichi about what's described here, queuing at the queue-proxy level (what he called "backlog") feels like a failure mode for the autoscaler. We'd still service the requests, but we'd be outside of what I'd consider the happy path.

My mental model for this involves having a particular "burst capacity" per-Revision (likely configured cluster-wide, maybe with an annotation for finer tuning), which is the amount of additional cushion each Revision should be able to accommodate in a burst.

Suppose we have a burst capacity of 100 concurrent requests and a concurrency target of 100; then I would need roughly one pod of surplus capacity, or a burst would result in "backlogging" and we would arguably be underprovisioned.

When we're underprovisioned for such a burst, we want the activator to proxy traffic so that the throttler can keep Revisions from being overloaded. Once the autoscaler catches up, we can disintermediate the data plane and allow traffic to flow directly again. However, when we are subject to bursts of up to our burst capacity, we should either:

  1. Have capacity assuming equal distribution across active pods.
  2. Go through the activator, which can buffer in a pod-agnostic fashion.
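
A back-of-envelope sketch of that decision as I read it (the function name and the exact formula are mine, not the KPA's actual code):

```go
package main

import "fmt"

// shouldProxy reports whether the activator should stay in the request path:
// if the revision's spare capacity cannot absorb the configured burst on top
// of the currently observed concurrency, keep buffering centrally.
func shouldProxy(readyPods int, targetPerPod, observed, burstCapacity float64) bool {
	spare := float64(readyPods)*targetPerPod - observed
	return spare < burstCapacity
}

func main() {
	// The numbers above: concurrency target 100, burst capacity 100,
	// and 100 requests currently in flight.
	fmt.Println(shouldProxy(1, 100, 100, 100)) // true: no surplus pod, proxy via the activator
	fmt.Println(shouldProxy(2, 100, 100, 100)) // false: one pod of surplus, traffic can flow directly
}
```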

Hopefully that makes sense (it's how I've been thinking about the problem), and I'd love to talk about this more now that we have a chance of solving it. 🎉

@hohaichi (Contributor)

I think it is a good idea to switch to PROXY mode as the first step in scaling up. It will help us not only to lower the latency, but also to measure cold requests separately from warm requests.

@vagababov (Contributor)

@duglin commented Jun 19, 2019

This comment is a follow-on to today's scaling call. Upon thinking about everything that was said, I'm still not sure whether my desired scenario can be supported today (via tweaks to config knobs) or not. So, let me describe my scenario here:

  • As a person sending in requests, I understand I may have to deal with cold-start times if all existing service instances are at full capacity, and that's ok. So for this discussion, to keep it simple, assume containerConcurrency=1.
  • What I'm not ok with is delays due to other requests being processed, and this is especially true in cases where the requests are not quick. For the purposes of this discussion, assume processing times much larger than cold starts, such as 10 seconds.
  • As @markusthoemmes said on today's call, end users aren't really concerned about what happens under the covers; they care more about things like observed latency. So, with that in mind, as a sender of requests I expect to see response times of 10 seconds + cold-start time, at most!
  • While there are, what I call, optimizations that can be done to try to reduce how often we spin up a new instance (e.g. tweak the "idle time" an instance waits after finishing a request before we kill it, or "guess" that processing times are consistent so that, if an instance is 8 seconds into processing a request, it might be faster to wait for it to finish than to spin up another instance), in the end a lot of those are still just "guesses". In order to guarantee a response time that isn't dependent on an existing request finishing quickly, as a person setting up my Kn (or KnService) I'd like to be able to tell Kn that I do not want "guesses" - I'm willing to pay for cold starts if that is needed based on the current state of all existing instances when a new request comes in.

If, as was suggested on today's call, there are certain config knobs I can set to get the above behavior, I'd like to open a docs PR describing how someone can achieve this goal, because I don't think we should expect people to piece together the various options "just right" to support this scenario. But in order for me to open that PR, I need someone to articulate which knobs and values people need to set.

If this scenario cannot be done today, then I'd like to ask that it be considered a "stop ship" issue for v1.0.

@markusthoemmes (Contributor, Author)

I'd like us to approach this from a more abstract standpoint.

I think you're saying that, given there is enough capacity in the cluster, I as a user want to be able to see at most cold-start latency + request latency.

Moreover, we should also consider thinking in percentiles rather than absolutes. 100th-percentile latency guarantees will be very hard to keep.

Regardless, the direction we're taking (threading the activator into the request path and using it as a more centralized buffer) has been on the agenda for a long while, and I think it's definitely a step in the right direction.

@markusthoemmes (Contributor, Author)

To add to the above: I think for many users it is not okay to deal with cold-start times, even when considering a certain increase in traffic.

As such, I think we should look at the "latency guarantee" dial not only in terms of buffering requests but also in terms of minimizing cold-start wait times. That's where burst capacity and/or overprovisioning comes into play.

@duglin commented Jun 19, 2019

I think you're saying that, given there is enough capacity in the cluster, I as a user want to be able to see at most cold-start latency + request latency.

yep

I think for many users it is not okay to deal with cold-start times, even when considering a certain increase in traffic.

might be true, but that's why I think there should be a config knob that lets someone decide which behavior they want, rather than Kn deciding it for everyone. I tend to think of this as a step-wise problem, and step one is to support the idea of "a new request comes in; are all instances busy? If so, spin up a new one" (see the sketch at the end of this comment). That should be the baseline. Then we can work on making the "busy" check smarter and doing the optimizations that have been discussed. But, based on today's call, I do not think we support this baseline today. That's what I was asking in my comment above, because if we do, I want to document it :-)

... also minimizing cold-start wait times.

totally agree. And my question was written with the assumption that our work on reducing cold-start times would continue regardless of what happens with this queuing issue.
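
For concreteness, the baseline as a sketch (hypothetical names, purely to pin down the semantics being asked for, not a proposal for the actual routing code):

```go
package main

import "fmt"

// dispatch models the baseline described above for containerConcurrency=1:
// a new request either goes to an idle instance or triggers a scale-up; it is
// never parked behind a request that is already being processed.
func dispatch(inFlight []int) (target int, scaleUp bool) {
	for i, n := range inFlight {
		if n == 0 {
			return i, false // an idle instance exists: use it
		}
	}
	return -1, true // everyone is busy: pay the cold start instead of queuing
}

func main() {
	idx, scale := dispatch([]int{1, 1, 1})
	fmt.Println(idx, scale) // -1 true: all three instances are busy, so spin up a fourth
}
```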

@vagababov (Contributor)

So for "certain" increase in traffic, we already have utilization target, which will deal with % of capacity allocated for bursts in traffic. Unfortunately, the % of capacity is independent of desired burst, but rather a function of current stable traffic.

The new knob permits you to have pod-start+serving time guarantee (% loadbalancing in case of very packed cluster, but I hope we all can agree is not the problem we're solving with current solution).

The spreadsheet at https://docs.google.com/spreadsheets/d/1F6t4xTsb6gnSOwsw1nkDNWPg9erKq_uBu9pVzQgH60E/edit#gid=0 lets you see how these two knobs interact and helps you model and decide what values you want (i.e. how much more you're willing to pay to get better guarantees that traffic is served quickly).
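
As a rough worked example of how the two knobs differ (my understanding; treat the numbers as illustrative): with a per-pod concurrency target of 10 and a utilization target of 70%, the autoscaler tries to keep about 7 requests in flight per pod, leaving roughly 3 requests of headroom per pod, an amount that grows with the stable traffic. A target burst capacity, by contrast, reserves a fixed absolute cushion (say, 200 requests) and keeps the activator in the request path until the revision has at least that much spare capacity, regardless of how small the stable traffic is.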

And I want to reiterate that this is not the final solution to the serverless problem, but a step in that bright direction.

HTH :-)

@vagababov (Contributor)

/close

@knative-prow-robot (Contributor)

@vagababov: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@duglin commented Jul 29, 2019

This one is closed; is it because people believe the problem is fixed? I'm still seeing it.

@vagababov (Contributor)

Did you set up the system with TBC (target burst capacity)?

@duglin commented Jul 29, 2019 via email
