Don't double account for requests going through the activator. #3301

Closed
markusthoemmes opened this issue Feb 22, 2019 · 21 comments · Fixed by #3477
Labels: area/autoscale, area/networking, kind/bug
Milestone: Serving 0.5

@markusthoemmes
Contributor

In what area(s)?

/area autoscale
/area networking

What version of Knative?

HEAD

Description

Today, we have two components informing the autoscaler about the state of the system: the activator and the queue-proxy (one per pod). With activator throttling enabled, we can now have the situation where the activator and the queue-proxies simultaneously send valid concurrency data. The autoscaler needs some adjustments in its algorithms to account for that (see #3289), but we also need to make sure that we do not double account for requests.

If a request is proxied by the activator today, both the activator and the receiving queue-proxy will increase their respective concurrency counters to report that.

We discussed two ways of fixing that:

Proposal 1: Discount in the activator

We can discount a request in the activator as soon as it hits the proxyRequest method (i.e. it is handed off to Golang's ReverseProxy). However, that method has internal retries, so we cannot know when the request is really proxied. That can blur the autoscaler signal.
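
To make the trade-off concrete, here is a minimal sketch of proposal 1; the handler, reporter, and wiring are illustrative only, not the actual activator code:

```go
// Illustrative sketch of proposal 1; not the real activator implementation.
package activatorsketch

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// concurrencyReporter stands in for the activator's stat reporting.
type concurrencyReporter struct{ inFlight int64 }

func (r *concurrencyReporter) inc() { atomic.AddInt64(&r.inFlight, 1) }
func (r *concurrencyReporter) dec() { atomic.AddInt64(&r.inFlight, -1) }

type handler struct {
	reporter *concurrencyReporter
	target   *url.URL // address of the revision's queue-proxy
}

func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.reporter.inc() // counted while the request is pending in the activator

	// Proposal 1: stop counting the moment the request is handed to the
	// reverse proxy, so only the receiving queue-proxy accounts for it.
	// Caveat from above: internal retries mean "handed off" is not the same
	// as "actually being served", which blurs the autoscaler signal.
	h.reporter.dec()
	httputil.NewSingleHostReverseProxy(h.target).ServeHTTP(w, r)
}
```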

Proposal 2: Discount in the queue-proxy

The second proposal was to discount the request in the queue-proxy. The activator would set a header that causes the queue-proxy to not take that request into consideration when accounting for its observed concurrency. It's been pointed out that this is a potential attack vector, as arbitrary users could set this header as well, which would in turn render autoscaling of that revision inaccurate and could potentially choke the application.
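
For comparison, a rough sketch of what proposal 2 would look like on the queue-proxy side; the header name and middleware wiring here are made up for illustration:

```go
// Illustrative sketch of proposal 2; header name and wiring are hypothetical.
package queuesketch

import (
	"net/http"
	"sync/atomic"
)

// proxyHeader is a placeholder name for the marker the activator would set.
const proxyHeader = "K-Proxy-Request"

type stats struct{ concurrency int64 }

func (s *stats) inc() { atomic.AddInt64(&s.concurrency, 1) }
func (s *stats) dec() { atomic.AddInt64(&s.concurrency, -1) }

// withConcurrencyAccounting counts only requests the activator did not mark.
func withConcurrencyAccounting(next http.Handler, s *stats) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(proxyHeader) == "" {
			// Direct traffic: account for it here.
			s.inc()
			defer s.dec()
		}
		// Attack vector noted above: any client could set this header and
		// make its requests invisible to autoscaling.
		next.ServeHTTP(w, r)
	})
}
```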

@markusthoemmes added the kind/bug label Feb 22, 2019
@markusthoemmes
Contributor Author

/cc @k4leung4
/cc @mattmoor

@vvraskin
Contributor

Regarding the attack vector in Proposal 2: it seems it is possible to sanitise headers on the latest master of Istio (istio/istio#8340), though I haven't tested it myself. So we could restrict users from using certain system headers by pruning them if they are present. Maybe it is something we could benefit from after migrating to Istio 1.1?

@markusthoemmes
Contributor Author

Maybe, though we shouldn't wait for Istio 1.1 for this.

@markusthoemmes
Contributor Author

Talked to Kenny and he's going to take this on.

/assign @k4leung4

/milestone "Serving 0.5"

@knative-prow-robot
Contributor

@markusthoemmes: You must be a member of the knative/knative-milestone-maintainers github team to set the milestone.

In response to this:

Talked to Kenny and he's going to take this on.

/assign @k4leung4

/milestone "Serving 0.5"


@mattmoor
Member

mattmoor commented Mar 4, 2019

/milestone "Serving 0.5"

@knative-prow-robot
Contributor

@mattmoor: The provided milestone is not valid for this repository. Milestones in this repository: [Ice Box, Needs Triage, Serving "v1" (ready for production), Serving 0.5]

Use /milestone clear to clear the milestone.

In response to this:

/milestone "Serving 0.5"


@mattmoor added this to the Serving 0.5 milestone Mar 4, 2019
@k4leung4
Contributor

k4leung4 commented Mar 6, 2019

/milestone Serving 0.5

@k4leung4
Contributor

/assign @hohaichi
/assign @yanweiguo

As I won't be able to work on it any more.

@knative-prow-robot
Contributor

@k4leung4: GitHub didn't allow me to assign the following users: hohaichi.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @hohaichi
/assign @yanweiguo

As I won't be able to work on it any more.


@hohaichi
Contributor

/assign

@hohaichi
Contributor

Proposal 3: Let the autoscaler decide

  • Let the activator and queue-proxy report as they do now.
  • Let the autoscaler discount overlapping stats from the activator and the queue-proxy of the same key. In particular, given the same key:
    ◦ Let Q(t) be the concurrency reported by the queue-proxy at time t, and A(t) be the concurrency reported by the activator at time t.
    ◦ Then discount Q(t): Q(t) = max(0, Q(t) - A(t)).

Intuitively, this means the autoscaler does not count a pending request at the queue-proxy if it is already counted at the activator. It is straightforward in the cold-start periods, where all requests go through the activator, and in the steady periods, where no requests go through the activator. The grey area is when the revision has just come up and is receiving requests directly, yet there are still pending requests in the activator. In that case, we may discount a request x that comes directly to the queue-proxy against a request y that is pending at the activator but not at the queue-proxy. When we compute the total concurrency Q(t) + A(t), y is counted (in A(t)) and x is not counted. However, we can turn it around and consider x counted and y not counted, as the total is the same. With respect to the revision, this is correct, because x affects the revision, while y does not. What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.
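
A minimal sketch of that adjustment on the autoscaler side (names are illustrative, not the actual autoscaler code):

```go
// Illustrative sketch of the proposal-3 discount in the autoscaler.
package autoscalersketch

// keyStat is a simplified stand-in for the per-revision stats the autoscaler sees.
type keyStat struct {
	queueConcurrency     float64 // Q(t), summed over the revision's queue-proxies
	activatorConcurrency float64 // A(t), reported by the activator
}

// totalConcurrency applies Q(t) = max(0, Q(t) - A(t)) before adding A(t).
func totalConcurrency(s keyStat) float64 {
	adjusted := s.queueConcurrency - s.activatorConcurrency
	if adjusted < 0 {
		adjusted = 0
	}
	return adjusted + s.activatorConcurrency
}
```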

Taking a closer look, y being pending at activator but not at queue-proxy can mean only two things: either y has not yet arrived at queue-proxy, or y has already finished at queue-proxy.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window. With the current math, the queue-proxy contributes 6 data points to the total concurrency, so not counting y in one data point loses only 1/6 of the count.

If y has finished at queue-proxy, well, not counting y is actually more accurate w.r.t. the load on the revision currently and in the near future.

Compared to proposals 1 and 2:

Proposal 1 can drop requests during cold start, when they are in-flight between activator and queue-proxy. Proposal 3 can drop in-flight requests only when the revision is up and running, and an in-flight request is dropped only to count a request coming directly to the queue-proxy. Not counting requests during cold start is bad, because in-flight requests from activator may be the only requests coming to the revision. Not counting in-flight requests from activator when the revision is already handling traffic directly does not matter much, because no new traffic goes to activator.

Proposal 2 is arguably open to the attack mentioned above; proposal 3 is not. In addition, proposal 2 introduces a binding between the activator and the queue-proxy (e.g., the queue-proxy needs to understand the header the activator uses), while proposal 3 simply makes a local calculation.

@mattmoor @markusthoemmes what do you think?

@hohaichi
Contributor

I talked to Matt. We want to have the queue-proxy report metrics for requests going through the activator and for requests not going through the activator separately. Then the autoscaler can decide which metric(s) to use. For this particular issue, the autoscaler will pick only the metric for requests not going through the activator and get a result similar to proposal 2.
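
A sketch of what such a split report from the queue-proxy could look like (the field names here are invented, not the actual stat message):

```go
// Illustrative sketch of a split queue-proxy report; field names are made up.
package statsketch

// Stat is a simplified per-pod report sent to the autoscaler.
type Stat struct {
	PodName string

	// ProxiedConcurrency counts requests that arrived via the activator
	// (e.g. identified by a marker header set by the activator).
	ProxiedConcurrency float64

	// DirectConcurrency counts requests that reached the queue-proxy directly.
	DirectConcurrency float64
}
```

For this issue, the autoscaler would then add only DirectConcurrency to the activator's own report, avoiding the double count.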

@markusthoemmes
Contributor Author

Thanks for the very detailed writeup @hohaichi

A few thoughts:

What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window.

These are very important properties I think. I don't think the second sentence is true (depending on what "window" means here). I read it as "metric collection window", which is 1s in our case (as opposed to the aggregation window of 60s/6s respectively). As the activator now throttles requests based on the number of pods available it can actually take quite a while for this queue to drain.

It also does not directly relate to the concurrency that's reported by the queue-proxies. It's entirely possible to have a backlog of 1000 requests on the activator, only 10 of them being actively proxied at the moment. Discounting the whole backlog from the metrics reported by the queue-proxies will skew things heavily I fear.


@mattmoor's idea seems to be exactly in the spirit of 2, no? Just a question of which component discounts the metric, I guess. I'm trying to be wary of the complexity of our metrics pipeline. What I like about the current reporting is that it just sums up to a "concurrency in system" metric. Once we start splitting metrics or double reporting, we lose that "feature".

Can you clarify a bit further what the metrics would look like after the change?

@hohaichi
Contributor

Thanks for the very detailed writeup @hohaichi

A few thoughts:

What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window.

These are very important properties I think. I don't think the second sentence is true (depending on what "window" means here). I read it as "metric collection window", which is 1s in our case (as opposed to the aggregation window of 60s/6s respectively). As the activator now throttles requests based on the number of pods available it can actually take quite a while for this queue to drain.

It also does not directly relate to the concurrency that's reported by the queue-proxies. It's entirely possible to have a backlog of 1000 requests on the activator, only 10 of them being actively proxied at the moment. Discounting the whole backlog from the metrics reported by the queue-proxies will skew things heavily I fear.

Thank you for your thoughtful feedback, @markusthoemmes

The backlog case you pointed out is not as bad as you fear, if I understand you correctly. When we add the concurrency from the activator and the adjusted concurrency from the queue-proxy, the total becomes the max of the two (max(0, Q - A) + A = max(Q, A)). So there is theoretically a chance that the backlog never drains, when the number of requests going directly to the queue-proxy balances out the backlog. But in practice, we have neither a precise translation from concurrency to compute resources, nor the granularity to provision compute resources for the exact concurrency. There should always be headroom on the compute-resource side (or we have a bigger problem to address), and that headroom is what drains the backlog. By the way, I think there is a chance of the backlog never draining with any scaling algorithm based on concurrency, because requests are not all the same, and scaling based solely on concurrency implicitly assumes that they are.

Anyway, the new behavior should be: the pods are scaled up and the backlog drains quickly until new requests start coming directly to the revision. Then the backlog drains slowly, and the pods may not scale up enough to keep up with the previous drain rate. After the backlog completely drains, the pods continue to scale up with the traffic.

So the change in behavior is (1) the drain rate can be slower during the scale-up period, and (2) no scale-down after the drain finishes. Is it good or bad? I don't know. I think it depends on the apps.

@mattmoor's idea seems to be exactly in the spirit of 2, no? Just a question of which component discounts the metric, I guess. I'm trying to be wary of the complexity of our metrics pipeline. What I like about the current reporting is that it just sums up to a "concurrency in system" metric. Once we start splitting metrics or double reporting, we lose that "feature".

I think the effect is the same as 2, but the approach is very different. Proposal 2 has the queue-proxy report something other than what it measures; it needs knowledge of what the autoscaler does. It's cleaner to have concurrency consistently mean "number of pending requests" everywhere, and to keep the autoscaling logic contained in the autoscaler. This is part of the reason I proposed proposal 3: let the autoscaler decide.

Can you clarify a bit further what the metrics would look like after the change?
I'll draft a PR.

@hohaichi
Contributor

I'll draft a PR.

I drafted PR #3477. I am still working on tests, but I'd like to get early feedback on the approach, the choices of header and metric, etc.

@markusthoemmes @mattmoor @tcnghia your inputs are appreciated.

@markusthoemmes
Contributor Author

@hohaichi I think I see what you mean.

There is a strong correlation between the amount of workload that can be in flight at a time (essentially defined by the number of pods) and the drain rate from the activator (also basically defined by the number of pods). I can see that not being as bad as I imagined it to be.

Maybe we can be a bit more nuanced? We could report two numbers from the activator:

  1. Backlog Ab(i)
  2. Requests being proxied (having reached the proxyRequest function) Ac(i)

The queue-proxies would still only report the concurrency (Qc(i)) visible to them.

That'd slightly alter the calculation to overallConcurrency = Ab(i) + Qc(i) - Ac(i). (I think I just described proposal 1 🤔 )


Coming full-circle to proposal 1: we have ditched/are ditching the retries in proxyRequest (hidden in the Transport), which should make the window of unreported requests very short (milliseconds). I think that would still be more accurate than simply discounting all of the activator's reported metrics. As a bonus, we get the Backlog as a dedicated metric, which might be useful for the autoscaler anyway, as it can, in theory, treat that Backlog of not-yet-started requests differently from the requests already in flight. Today it simply adds it up as additional concurrency, which is the safest way to view it. It might not be the optimal one, though.
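
A rough sketch of the two-number activator report described here (again, the names are illustrative, not the actual stat message):

```go
// Illustrative sketch of a split activator report; names are made up.
package statsketch

// ActivatorStat is a simplified per-revision report the activator could send.
type ActivatorStat struct {
	RevisionKey string

	// Backlog counts requests held back by the activator's throttler, Ab(i).
	Backlog float64

	// Proxied counts requests that have reached proxyRequest and are being
	// proxied to a queue-proxy, Ac(i).
	Proxied float64
}
```

The autoscaler could then combine these with the queue-proxies' Qc(i) as described above, and treat the backlog differently from in-flight requests if that turns out to be useful.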

@hohaichi
Contributor

@markusthoemmes I think a backlog metric makes a lot of sense. In fact, I think concurrency should be refined into two metrics: backlog and pending (being processed). Pending gives us an idea of the pods' throughput, and we can dynamically adjust the target concurrency. Backlog can help to troubleshoot/scale Knative components. I also think that backlog should include not only requests in the activator, but also requests queued in the queue-proxy before they are forwarded to the user container. But then that gets us back to not double counting requests in the backlog. :)

By the way, do you know why we throttle at the activator? Should we let the requests be buffered in the queue-proxy, and have the queue-proxy throttle the requests it sends to the user container? The activator is shared, and I worry that it not draining as fast as it could may result in one app hurting another.

@markusthoemmes
Contributor Author

The reason we buffer in activator is: #1409

@hohaichi
Contributor

@markusthoemmes Thank you for the reference. Do you know if there's any reason other than latency? It seems like a poor trade-off, because we sacrifice reliability for latency. Furthermore, it solves the 0-N case, but does not solve the same problem in 1-N or x-N scaling.

Proposal 2 in #1409 (comment) sounds better to me.

@markusthoemmes
Contributor Author

@hohaichi you are correct. There are a few investigations in-flight as to how we can solve this in a more general way (see #3276 for example).

AFAIK it's purely for latency and for preventing 503s as the queue proxies will get overloaded fairly quickly.
