Requests buffered at the receiving side generate unbounded latency #1409
Comments
Some context from a Slack discussion regarding the topic:
3 general ideas on how we could go about to fix this issue:
This is very relevant to #1846. Keeping this open though, as it captures the general requirement.
I think that in terms of mechanics, the method I prefer most would be to have the KPA put the SKS into PROXY mode (as I commented in #1846). The pertinent question then becomes: when should the activator be hooked into the request path? Talking to @hohaichi about what's described here, queuing at the queue-proxy level (what he called "backlog") feels like a failure mode for the autoscaler. We'd still service the requests, but we're outside of what I'd consider the happy path.

My mental model for this involves having a particular "burst capacity" per Revision (likely configured cluster-wide, maybe with an annotation for finer tuning), which is the amount of additional cushion each Revision should be able to accommodate in a burst. Suppose we have a burst capacity of 100 concurrent requests and a concurrency target of 100: I would then need roughly 1 pod of surplus capacity to keep a burst from resulting in "backlogging"; with less than that we're arguably underprovisioned. When we're underprovisioned for such a burst, we want the activator to proxy traffic so that the throttler can keep us from overloading Revisions in case of a burst. When the autoscaler catches us up, we can disintermediate the data plane and allow traffic to flow directly again. However, when we are subject to bursts of up to our burst capacity we should either:

Hopefully that makes sense (it's how I've been thinking about the problem), and I'd love to talk about this more now that we have a chance of solving it. 🎉
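A minimal sketch of the headroom check described above, with purely illustrative names (`activatorShouldProxy`, `burstCapacity`, and the rest are not actual Knative fields): the Revision has enough cushion when its spare capacity covers the configured burst; otherwise the activator (and its throttler) should stay in the request path.

```go
package main

import "fmt"

// activatorShouldProxy is a sketch only: decide whether the activator
// should sit in the data path. All names are illustrative.
func activatorShouldProxy(readyPods int, targetConcurrency, observedConcurrency, burstCapacity float64) bool {
	totalCapacity := float64(readyPods) * targetConcurrency
	spare := totalCapacity - observedConcurrency
	// If the spare capacity cannot absorb the configured burst, keep the
	// activator proxying until the autoscaler catches up.
	return spare < burstCapacity
}

func main() {
	// Burst capacity 100, concurrency target 100: with 2 ready pods and
	// 100 in-flight requests there is exactly 1 pod of surplus, so the
	// data path can stay direct; with only 1 pod we are underprovisioned.
	fmt.Println(activatorShouldProxy(2, 100, 100, 100)) // false
	fmt.Println(activatorShouldProxy(1, 100, 100, 100)) // true
}
```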
I think it is a good idea to switch to PROXY mode as the first step in scaling up. It will help us not only to lower the latency, but also to measure cold requests separately from warm requests.
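A rough sketch of what that first step could look like, building on the `activatorShouldProxy` helper sketched above: flip the ServerlessService between its Serve and Proxy modes, and label traffic proxied during scale-up as "cold" so its latency can be reported separately from warm traffic. Everything except the Serve/Proxy mode names is hypothetical.

```go
// Sketch only; assumes the activatorShouldProxy helper from the previous
// snippet. The mode names mirror the SKS Serve/Proxy concept; the metric
// label is purely illustrative.
const (
	modeServe = "Serve" // traffic flows straight to the pods
	modeProxy = "Proxy" // traffic flows through the activator
)

func reconcileMode(readyPods int, target, observed, burst float64) (mode, requestClass string) {
	if activatorShouldProxy(readyPods, target, observed, burst) {
		// Requests handled while proxying had to wait for capacity: report
		// them as "cold" so they don't skew the warm-latency numbers.
		return modeProxy, "cold"
	}
	return modeServe, "warm"
}
```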
This comment is a follow-on to today's scaling call. Upon thinking about everything that was said, I'm still not sure whether my desired scenario can be supported today (via tweaks to config knobs) or not. So, let me describe my scenario here:
If, as was suggested on today's call, there are certain config knobs I can set to get the above behavior, I'd like to open a docs PR describing how someone can achieve this goal, because I don't think we should expect people to piece together the various options "just right" to support this scenario. But in order for me to open a PR, I need someone to articulate which knobs and values people need to set. If this scenario cannot be done today, then I'd like to ask that this be considered a "stop ship" issue for v1.0.
I'd like us to approach this from a more abstract standpoint. I think you're saying that, given there is enough capacity in the cluster, I as a user want to be able to see at most cold-start latency + request latency. Moreover, we should also consider thinking in percentiles rather than absolutes; 100th-percentile latency guarantees will be very hard to keep. Regardless, the direction we're taking (threading the activator into the path and using it as a more centralized buffer) has been on the agenda for a long while, and I think it's definitely a step in the right direction.
To add to the above: I think for many users it is not okay to deal with cold-start times, even when considering a certain increase in traffic. As such, I think we should look at the "latency guarantee" dial not only in terms of buffering requests but also in terms of minimizing cold-start wait times. That's where a burst capacity and/or overprovisioning comes into play.
yep
might be true, but that's why I think there should be a config knob that allows someone to decide which behavior they want, rather than Kn deciding it for everyone. I tend to think of this as a step-wise problem, and step one is to support the idea of "a new request comes in; are all instances busy? if so, spin up a new one". That should be the baseline. Then we can work on making the "busy" check smarter and doing the optimizations that have been discussed. But based on today's call, I do not think we support this baseline today; that's what I was asking in my comment above, because if we do I want to document it :-)
totally agree. And my question was written with the assumption that our work on reducing cold-start times would continue regardless of what happens with this queuing issue.
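A sketch of the baseline rule described a couple of comments up ("a new request comes in; are all instances busy? if so, spin up a new one"), with illustrative names and no claim about how the real autoscaler is wired:

```go
// desiredPods is a sketch of the "scale on busy" baseline: if every ready
// pod is already at its per-pod concurrency limit when a request arrives,
// ask for one more pod (the request still has to wait for it to come up).
func desiredPods(readyPods, inFlight, perPodLimit int) int {
	if perPodLimit <= 0 {
		// Unlimited concurrency per pod: "busy" is undefined, so this
		// baseline gives no scale-up signal.
		return readyPods
	}
	if inFlight >= readyPods*perPodLimit {
		return readyPods + 1 // all instances busy: spin up a new one
	}
	return readyPods
}
```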
So for "certain" increase in traffic, we already have utilization target, which will deal with % of capacity allocated for bursts in traffic. Unfortunately, the % of capacity is independent of desired burst, but rather a function of current stable traffic. The new knob permits you to have The https://docs.google.com/spreadsheets/d/1F6t4xTsb6gnSOwsw1nkDNWPg9erKq_uBu9pVzQgH60E/edit#gid=0 allows you to see how these two nodes affect and help you model and decide what values do you want there (i.e. how much more you're willing to pay to get better guarantees of faster traffic being served). And I want to re-iterate that this is not the final solution to the serverless problem, but a step in that bright direction. HTH :-) |
/close
@vagababov: Closing this issue. In response to this:
This one is closed; is that because people believe the problem is fixed? I'm still seeing it.
Did you set up the system with TBC?
Probably not. What is TBC? 😀
Expected Behavior
When setting a revision's ConcurrencyModel to SingleConcurrency but having N clients call the revision at the same time, I expect Knative to scale to N and route the requests already in the system to the newly created pods.
Actual Behavior
The first N requests will be sent to one container and worked on sequentially. The Nth request therefore has, at a minimum, a latency equal to the latencies of all previous requests combined. This causes unnecessary latency for the first N requests, which can be huge if each request itself takes a long time.
Instead, the excess requests should be routed to the newly created pods (which are scaled up just fine as of today).
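To make the cost of that buffering concrete (an illustration, not a measurement): if a single-concurrency pod drains its queue in FIFO order and request i takes t_i to process, the k-th buffered request sees roughly

$$L_k \approx \sum_{i=1}^{k} t_i, \qquad \text{so with } t_i = t \text{ for all } i:\quad L_N \approx N\,t$$

e.g. ten queued one-second requests leave the last caller waiting around ten seconds, even while freshly scaled pods sit idle.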
Steps to Reproduce the Problem
(The test case I described in #1126 (comment) will make this issue apparent in its output.)
Additional Info
The problem above is due to the fact that requests are routed to a pod that's already running (or being created via 0 -> 1 scaling). Concurrency is measured there, and only then are new pods created through autoscaling.