Don't double account for requests going through the activator. #3301

Closed
markusthoemmes opened this issue Feb 22, 2019 · 21 comments · Fixed by #3477
Labels: area/autoscale, area/networking, kind/bug
Milestone: Serving 0.5

@markusthoemmes
Contributor

In what area(s)?

/area autoscale
/area networking

What version of Knative?

HEAD

Description

Today, we have two components informing the autoscaler about the state of the system: the activator and the queue-proxy (one per pod). With activator throttling enabled, we can now have the situation where the activator and the queue-proxies simultaneously send valid concurrency data. The autoscaler needs some adjustments in its algorithms to account for that (see #3289), but we also need to make sure that we do not double account for requests.

If a request is proxied by the activator today, both the activator and the receiving queue-proxy will increase their respective concurrency counters to report that.

We discussed two ways of fixing that:

Proposal 1: Discount in the activator

We can discount a request in the activator as soon as it hits the proxyRequest method (i.e. it is handed off to Golang's ReverseProxy). However, that method has internal retries, so we cannot know when the request is really proxied. That can blur the autoscaler signal.
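
To make the trade-off concrete, here is a minimal sketch of proposal 1; the handler, reporter, and wiring are illustrative only, not the actual activator code:

```go
// Illustrative sketch of proposal 1; not the real activator implementation.
package activatorsketch

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// concurrencyReporter stands in for the activator's stat reporting.
type concurrencyReporter struct{ inFlight int64 }

func (r *concurrencyReporter) inc() { atomic.AddInt64(&r.inFlight, 1) }
func (r *concurrencyReporter) dec() { atomic.AddInt64(&r.inFlight, -1) }

type handler struct {
	reporter *concurrencyReporter
	target   *url.URL // address of the revision's queue-proxy
}

func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.reporter.inc() // counted while the request is pending in the activator

	// Proposal 1: stop counting the moment the request is handed to the
	// reverse proxy, so only the receiving queue-proxy accounts for it.
	// Caveat from above: internal retries mean "handed off" is not the same
	// as "actually being served", which blurs the autoscaler signal.
	h.reporter.dec()
	httputil.NewSingleHostReverseProxy(h.target).ServeHTTP(w, r)
}
```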

Proposal 2: Discount in the queue-proxy

The second proposal was to discount the request in the queue-proxy. The activator would set a header that causes the queue-proxy to not take that request into consideration when accounting for its observed concurrency. It's been pointed out that this is a potential attack vector, as arbitrary users could set this header as well, which would in turn render autoscaling of that revision inaccurate and could potentially choke the application.
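
For comparison, a rough sketch of what proposal 2 would look like on the queue-proxy side; the header name and middleware wiring here are made up for illustration:

```go
// Illustrative sketch of proposal 2; header name and wiring are hypothetical.
package queuesketch

import (
	"net/http"
	"sync/atomic"
)

// proxyHeader is a placeholder name for the marker the activator would set.
const proxyHeader = "K-Proxy-Request"

type stats struct{ concurrency int64 }

func (s *stats) inc() { atomic.AddInt64(&s.concurrency, 1) }
func (s *stats) dec() { atomic.AddInt64(&s.concurrency, -1) }

// withConcurrencyAccounting counts only requests the activator did not mark.
func withConcurrencyAccounting(next http.Handler, s *stats) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(proxyHeader) == "" {
			// Direct traffic: account for it here.
			s.inc()
			defer s.dec()
		}
		// Attack vector noted above: any client could set this header and
		// make its requests invisible to autoscaling.
		next.ServeHTTP(w, r)
	})
}
```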

@markusthoemmes added the kind/bug label Feb 22, 2019
@markusthoemmes
Contributor Author

/cc @k4leung4
/cc @mattmoor

@vvraskin
Contributor

Regarding the attack vector in Proposal 2: it seems it is possible to sanitise headers on the latest master of Istio (istio/istio#8340), though I haven't tested it myself. So we could restrict users from using certain system headers by pruning them if they are present. Maybe it is something we could benefit from after migrating to Istio 1.1?

@markusthoemmes
Contributor Author

Maybe, though we shouldn't wait for Istio 1.1 for this.

@markusthoemmes
Contributor Author

Talked to Kenny and he's going to take this on.

/assign @k4leung4

/milestone "Serving 0.5"

@knative-prow-robot
Contributor

@markusthoemmes: You must be a member of the knative/knative-milestone-maintainers github team to set the milestone.

In response to this:

Talked to Kenny and he's going to take this on.

/assign @k4leung4

/milestone "Serving 0.5"


@mattmoor
Member

mattmoor commented Mar 4, 2019

/milestone "Serving 0.5"

@knative-prow-robot
Contributor

@mattmoor: The provided milestone is not valid for this repository. Milestones in this repository: [Ice Box, Needs Triage, Serving "v1" (ready for production), Serving 0.5]

Use /milestone clear to clear the milestone.

In response to this:

/milestone "Serving 0.5"


@mattmoor added this to the Serving 0.5 milestone Mar 4, 2019
@k4leung4
Contributor

k4leung4 commented Mar 6, 2019

/milestone Serving 0.5

@k4leung4
Contributor

/assign @hohaichi
/assign @yanweiguo

As I won't be able to work on it any more.

@knative-prow-robot
Contributor

@k4leung4: GitHub didn't allow me to assign the following users: hohaichi.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @hohaichi
/assign @yanweiguo

As I won't be able to work on it any more.


@hohaichi
Contributor

/assign

@hohaichi
Contributor

Proposal 3: Let the autoscaler decide

  • Let the activator and queue-proxy report as they do now.
  • Let the autoscaler discount overlapping stats from the activator and the queue-proxy of the same key. In particular, given the same key:
    ◦ Let Q(t) be the concurrency reported by the queue-proxy at time t, and A(t) be the concurrency reported by the activator at time t.
    ◦ Then discount Q(t): Q(t) = max(0, Q(t) - A(t)).

Intuitively, this means the autoscaler does not count a pending request at the queue-proxy if it is already counted at the activator. It is straightforward in the cold-start periods, where all requests go through the activator, and in the steady periods, where no requests go through the activator. The grey area is when the revision has just come up and is receiving requests directly, yet there are still pending requests in the activator. In that case, we may discount a request x that comes directly to the queue-proxy against a request y that is pending at the activator but not at the queue-proxy. When we compute the total concurrency Q(t) + A(t), y is counted (in A(t)) and x is not counted. However, we can turn it around and consider x counted and y not counted, as the total is the same. With respect to the revision, this is correct, because x affects the revision, while y does not. What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.
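
A minimal sketch of that adjustment on the autoscaler side (names are illustrative, not the actual autoscaler code):

```go
// Illustrative sketch of the proposal-3 discount in the autoscaler.
package autoscalersketch

// keyStat is a simplified stand-in for the per-revision stats the autoscaler sees.
type keyStat struct {
	queueConcurrency     float64 // Q(t), summed over the revision's queue-proxies
	activatorConcurrency float64 // A(t), reported by the activator
}

// totalConcurrency applies Q(t) = max(0, Q(t) - A(t)) before adding A(t).
func totalConcurrency(s keyStat) float64 {
	adjusted := s.queueConcurrency - s.activatorConcurrency
	if adjusted < 0 {
		adjusted = 0
	}
	return adjusted + s.activatorConcurrency
}
```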

Taking a closer look, y being pending at activator but not at queue-proxy can mean only two things: either y has not yet arrived at queue-proxy, or y has already finished at queue-proxy.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window. With the current math, the queue-proxy contributes 6 data points to the total concurrency, so not counting y in one data point loses only 1/6 of the count.

If y has finished at queue-proxy, well, not counting y is actually more accurate w.r.t. the load on the revision currently and in the near future.

Compared to proposals 1 and 2:

Proposal 1 can drop requests during cold start, when they are in-flight between activator and queue-proxy. Proposal 3 can drop in-flight requests only when the revision is up and running, and an in-flight request is dropped only to count a request coming directly to the queue-proxy. Not counting requests during cold start is bad, because in-flight requests from activator may be the only requests coming to the revision. Not counting in-flight requests from activator when the revision is already handling traffic directly does not matter much, because no new traffic goes to activator.

Proposal 2 is arguably open to the attack mentioned above; proposal 3 is not. In addition, proposal 2 introduces a binding between the activator and the queue-proxy (e.g., the queue-proxy needs to understand the header the activator uses), while proposal 3 simply makes a local calculation.

@mattmoor @markusthoemmes what do you think?

@hohaichi
Contributor

I talked to Matt. We want to have the queue-proxy report metrics for requests going through the activator and for requests not going through the activator separately. Then the autoscaler can decide which metric(s) to use. For this particular issue, the autoscaler will pick only the metric for requests not going through the activator and get a result similar to proposal 2.
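
A sketch of what such a split report from the queue-proxy could look like (the field names here are invented, not the actual stat message):

```go
// Illustrative sketch of a split queue-proxy report; field names are made up.
package statsketch

// Stat is a simplified per-pod report sent to the autoscaler.
type Stat struct {
	PodName string

	// ProxiedConcurrency counts requests that arrived via the activator
	// (e.g. identified by a marker header set by the activator).
	ProxiedConcurrency float64

	// DirectConcurrency counts requests that reached the queue-proxy directly.
	DirectConcurrency float64
}
```

For this issue, the autoscaler would then add only DirectConcurrency to the activator's own report, avoiding the double count.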

@markusthoemmes
Contributor Author

Thanks for the very detailed writeup @hohaichi

A few thoughts:

What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window.

These are very important properties I think. I don't think the second sentence is true (depending on what "window" means here). I read it as "metric collection window", which is 1s in our case (as opposed to the aggregation window of 60s/6s respectively). As the activator now throttles requests based on the number of pods available it can actually take quite a while for this queue to drain.

It also does not directly relate to the concurrency that's reported by the queue-proxies. It's entirely possible to have a backlog of 1000 requests on the activator, only 10 of them being actively proxied at the moment. Discounting the whole backlog from the metrics reported by the queue-proxies will skew things heavily I fear.


@mattmoor's idea seems to be exactly in the spirit of 2, no? Just a question of which component discounts the metric, I guess. I'm trying to be wary of the complexity of our metrics pipeline. What I like about the current reporting is that it just sums up to a "concurrency in system" metric. Once we start splitting metrics or double reporting, we lose that "feature".

Can you clarify a bit further what the metrics would look like after the change?

@hohaichi
Contributor

Thanks for the very detailed writeup @hohaichi

A few thoughts:

What we may lose is only a look-ahead for requests that reached activator and are going to reach the revision.

If y has not arrived at queue-proxy, it will likely arrive at queue-proxy in the next window.

These are very important properties I think. I don't think the second sentence is true (depending on what "window" means here). I read it as "metric collection window", which is 1s in our case (as opposed to the aggregation window of 60s/6s respectively). As the activator now throttles requests based on the number of pods available it can actually take quite a while for this queue to drain.

It also does not directly relate to the concurrency that's reported by the queue-proxies. It's entirely possible to have a backlog of 1000 requests on the activator, only 10 of them being actively proxied at the moment. Discounting the whole backlog from the metrics reported by the queue-proxies will skew things heavily I fear.

Thank you for your thoughtful feedback, @markusthoemmes

The backlog case you pointed out is not as bad as you fear, if I understand you correctly. When we add the concurrency from the activator and the adjusted concurrency from the queue-proxy, the total becomes the max of the two (max(0, Q - A) + A = max(Q, A)). So there is theoretically a chance that the backlog never drains, when the number of requests going directly to the queue-proxy balances out the backlog. But in practice, we have neither a precise translation from concurrency to compute resources, nor the granularity to provision compute resources for the exact concurrency. There should always be headroom on the compute-resource side (or we have a bigger problem to address), and that headroom is what drains the backlog. By the way, I think there is a chance of the backlog never draining with any scaling algorithm based on concurrency, because requests are not all the same, and scaling based solely on concurrency implicitly assumes that they are.

Anyway, the new behavior should be: the pods are scaled up and the backlog drains quickly until new requests start coming directly to the revision. Then the backlog drains slowly, and the pods may not scale up enough to keep up with the previous drain rate. After the backlog completely drains, the pods continue to scale up with the traffic.

So the change in behavior is (1) the drain rate can be slower during the scale-up period, and (2) no scale-down after the drain finishes. Is it good or bad? I don't know. I think it depends on the apps.

@mattmoor's idea seems to be exactly in the spirit of 2, no? Just a question of which component discounts the metric, I guess. I'm trying to be wary of the complexity of our metrics pipeline. What I like about the current reporting is that it just sums up to a "concurrency in system" metric. Once we start splitting metrics or double reporting, we lose that "feature".

I think the effect is the same as 2, but the approach is very different. Proposal 2 has the queue-proxy report something other than what it measures; it needs knowledge of what the autoscaler does. It's cleaner to have concurrency consistently mean "number of pending requests" everywhere, and to keep the autoscaling logic contained in the autoscaler. This is part of the reason I proposed proposal 3: let the autoscaler decide.

Can you clarify a bit further what the metrics would look like after the change?
I'll draft a PR.

@hohaichi
Contributor

I'll draft a PR.

I drafted PR #3477. I am still working on tests, but I'd like to get early feedback on the approach, the choices of header and metric, etc.

@markusthoemmes @mattmoor @tcnghia your inputs are appreciated.

@markusthoemmes
Contributor Author

@hohaichi I think I see what you mean.

There is a strong correlation between the amount of workload that can be in flight at a time (essentially defined by the number of pods) and the drain rate from the activator (also basically defined by the number of pods). I can see that not being as bad as I imagined it to be.

Maybe we can be a bit more nuanced? We could report two numbers from the activator:

  1. Backlog Ab(i)
  2. Requests being proxied (having reached the proxyRequest function) Ac(i)

The queue-proxies would still only report the concurrency (Qc(i)) visible to them.

That'd slightly alter the calculation to overallConcurrency = Ab(i) + Qc(i) - Ac(i). (I think I just described proposal 1 🤔 )


Coming full-circle to proposal 1: we have ditched/are ditching the retries in proxyRequest (hidden in the Transport), which should make the window of unreported requests very short (milliseconds). I think that would still be more accurate than simply discounting all of the activator's reported metrics. As a bonus, we get the Backlog as a dedicated metric, which might be useful for the autoscaler anyway, as it can, in theory, treat that Backlog of not-yet-started requests differently from the requests already in flight. Today it simply adds it up as additional concurrency, which is the safest way to view it. It might not be the optimal one, though.
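
A rough sketch of the two-number activator report described here (again, the names are illustrative, not the actual stat message):

```go
// Illustrative sketch of a split activator report; names are made up.
package statsketch

// ActivatorStat is a simplified per-revision report the activator could send.
type ActivatorStat struct {
	RevisionKey string

	// Backlog counts requests held back by the activator's throttler, Ab(i).
	Backlog float64

	// Proxied counts requests that have reached proxyRequest and are being
	// proxied to a queue-proxy, Ac(i).
	Proxied float64
}
```

The autoscaler could then combine these with the queue-proxies' Qc(i) as described above, and treat the backlog differently from in-flight requests if that turns out to be useful.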

@hohaichi
Contributor

@markusthoemmes I think a backlog metric makes a lot of sense. In fact, I think concurrency should be refined into two metrics: backlog and pending (being processed). Pending gives us an idea of the pods' throughput, and we can dynamically adjust the target concurrency. Backlog can help to troubleshoot/scale Knative components. I also think that backlog should include not only requests in the activator, but also requests queued in the queue-proxy before they are forwarded to the user container. But then that gets us back to not double counting requests in the backlog. :)

By the way, do you know why we throttle at the activator? Should we let the requests be buffered in the queue-proxy, and have the queue-proxy throttle the requests it sends to the user container? The activator is shared, and I worry that it not draining as fast as it could may result in one app hurting another.

@markusthoemmes
Contributor Author

The reason we buffer in activator is: #1409

@hohaichi
Contributor

@markusthoemmes Thank you for the reference. Do you know if there's any reason other than latency? It seems like a poor trade-off, because we sacrifice reliability for latency. Furthermore, it solves the 0-N case, but does not solve the same problem in 1-N or x-N scaling.

Proposal 2 in #1409 (comment) sounds better to me.

@markusthoemmes
Contributor Author

@hohaichi you are correct. There are a few investigations in-flight as to how we can solve this in a more general way (see #3276 for example).

AFAIK it's purely for latency and for preventing 503s as the queue proxies will get overloaded fairly quickly.
