
Scrape queue-proxy metrics in autoscaler #3149

Merged: 49 commits into knative:master from yanweiguo:switch_to_pull on Feb 27, 2019

Conversation

@yanweiguo (Contributor) commented Feb 9, 2019

Fixes #2203.
Fixes #1927.

Proposed Changes

  • Remove the websocket usage in queue-proxy, which pushed metrics to the autoscaler.
  • Create a goroutine for ServiceScraper to scrape metrics from queue-proxy when a UniScaler is created (a sketch follows this list).
  • Use a K8s Informer in ServiceScraper to get the ready-pod count and estimate the average revision concurrency. Store this value in Stat as the new field AverageRevConcurrency.
  • Remove PodWeight from the autoscaling algorithm, as proposed in #2977 (Bucketize autoscaling metrics by timeframe, not by pod name). Use the same weight for all sample data.
  • Average over all data points. See the reasoning below.
  • Remove observedPods. This information is useless when the autoscaling algorithm is based on sample data; the hard dependency was already removed in #3055 (Use actual pods from K8S Informer for scaling rate).
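
For orientation, here is a minimal, hypothetical Go sketch of the per-revision scrape loop described above. The Scrape signature and the StatMessage channel mirror the diff shown later in this thread; the Scraper interface, runScraper, and the one-second tick are assumptions for illustration, not the PR's exact code.

```go
package autoscaler

import (
	"context"
	"time"
)

// StatMessage stands in for the autoscaler's stat wrapper in this sketch.
type StatMessage struct{}

// Scraper abstracts the ServiceScraper added by this PR; the Scrape
// signature matches the diff discussed later in this thread.
type Scraper interface {
	Scrape(ctx context.Context, statsCh chan<- *StatMessage)
}

// runScraper is a hypothetical per-revision loop: started when a
// UniScaler is created, stopped via ctx when the revision goes away.
func runScraper(ctx context.Context, s Scraper, statsCh chan<- *StatMessage) {
	ticker := time.NewTicker(time.Second) // scrape interval is an assumption
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // UniScaler deleted; stop scraping
		case <-ticker.C:
			s.Scrape(ctx, statsCh) // pull queue-proxy metrics, forward stats
		}
	}
}
```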

Release Note

1. Scrape queue-proxy metrics in autoscaler instead of pushing metrics from queue-proxy to autoscaler via websocket connection. Remove the websocket usage in queue-proxy.

@knative-prow-robot removed the lgtm label on Feb 22, 2019
@knative-prow-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdemirhan, yanweiguo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yanweiguo (Contributor, Author) commented:

The test TestAutoscaleUpCountPods is flaky. In most of my failed runs, the maximum average observed revision concurrency is between 29 and 30. It usually took 10 seconds to scale up to 2 pods and another 20 seconds to scale up to 3 pods. At the point when the traffic stops (30 seconds for each scale-up target), the expected average observed revision concurrency is 20*10/60 + 30*20/60 + 40*30/60 = 33.33.
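
Spelled out, that is the time-weighted average implied by the formula over the 60-second window (10 s at 20, 20 s at 30, 30 s at 40):

```latex
\frac{20 \cdot 10 + 30 \cdot 20 + 40 \cdot 30}{60} = \frac{2000}{60} \approx 33.33
```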

Since we sample, and for the reason mentioned in this comment, we can get metrics much lower than the pod-average level. This results in a lower average observed revision concurrency and causes the test failure, so I increased the traffic-sending window from 30 seconds to 35 seconds to reduce flakiness.

@yanweiguo (Contributor, Author) commented:

Merged #3289.

Thanks to #3289, the changes to knative/serving/pkg/autoscaler/autoscaler.go in this PR shrank a lot. But I'm sorry to reintroduce distinguishing the activator from customer pods.

@k4leung4 @markusthoemmes PTAL.

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-unit-tests

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-integration-tests

TestScaleToN/scale-50 is not related to this PR.

@markusthoemmes (Contributor) left a comment:

Let's discuss whether we really need the alternative activator path.

Review thread on pkg/autoscaler/autoscaler.go (outdated)
@markusthoemmes (Contributor) commented:

Do we need to think about the backwards-compatibility implications here? With the changes I proposed above, we could see double reporting (once through scraping, once through metric pushing) until apps are redeployed. Do we need to actively prevent that (for example, by only accepting pushed metrics from the activator), or do we rely on all applications being redeployed so they no longer send stats?
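
For illustration only, the "only accept pushed metrics from the activator" option could look roughly like this; the pod-name prefix check is purely an assumption, not a mechanism from this PR:

```go
package statserver

import "strings"

// acceptPushedStat is a hypothetical guard for the stat server's
// websocket path during the transition: keep pushed stats only from
// the activator, so scraped and pushed metrics for the same user
// pods are never double-counted.
func acceptPushedStat(podName string) bool {
	// Assumption: activator pods are identifiable by a name prefix.
	return strings.HasPrefix(podName, "activator")
}
```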

@greghaynes (Contributor) commented:

@markusthoemmes We had some discussion about that above (#3149 (comment), if that link works). We have a mechanism that redeploys users' apps as part of 0.3, so now that 0.4 is cut, users have an intermediate version in which both systems were supported.

@markusthoemmes (Contributor) commented:

@greghaynes Right, though this is not necessarily about supporting both; it's about having the two interfere with each other.

@knative-prow-robot added the size/L label (100-499 lines changed, ignoring generated files) and removed the size/XL label (500-999 lines) on Feb 25, 2019
@markusthoemmes (Contributor) left a comment:

We're getting there. I like the minimized changes to the autoscaler itself a lot; thanks for doing that!

I've got a few comments throughout. I'm happy to help you get this in ASAP. 🎉

Review threads:
  • test/e2e/autoscale_test.go (outdated)
  • pkg/autoscaler/autoscaler.go (outdated)
  • pkg/autoscaler/autoscaler_test.go
  • pkg/autoscaler/multiscaler_test.go (outdated)
  • pkg/autoscaler/multiscaler.go (outdated)
  • pkg/autoscaler/stats_scraper.go
  • pkg/autoscaler/stats_scraper.go (outdated, 3 threads)
One of the stats_scraper.go threads is on this hunk:

```diff
 	}, nil
 }

 // Scrape calls the destination service then sends the result
 // to the given stats channel.
-func (s *ServiceScraper) Scrape(statsCh chan<- *StatMessage) {
+func (s *ServiceScraper) Scrape(ctx context.Context, statsCh chan<- *StatMessage) {
+	logger := logging.FromContext(ctx)
```
Contributor:

I wonder if we should make Scrape return an error and log in the calling method, to avoid creating loggers (and passing a context just to create loggers).

@yanweiguo (Contributor, Author) replied:

I made this function follow the same format as the Scale function. If we want to log a "no pods" message for debugging, we can't return it as an error.

@vagababov (Contributor) replied:

We can have the logger as a ServiceScraper member, FWIW.
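
A sketch of that suggestion: hold the logger as a field so Scrape no longer needs a context just to build one. The names mirror the PR where they appear in the diff above; the field set and the Debug call are assumptions for illustration.

```go
package autoscaler

import "go.uber.org/zap"

// StatMessage stands in for the autoscaler's stat wrapper in this sketch.
type StatMessage struct{}

// ServiceScraper here holds the logger as a member, per the
// suggestion above; the exact field set is an assumption.
type ServiceScraper struct {
	logger *zap.SugaredLogger
}

// Scrape keeps the original context-free signature.
func (s *ServiceScraper) Scrape(statsCh chan<- *StatMessage) {
	s.logger.Debug("scraping queue-proxy metrics") // no ctx needed to log
	// ... scrape the pods and send a *StatMessage on statsCh
}
```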

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-integration-tests

@knative-metrics-robot commented:
The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 97.2% | 97.0% | -0.2 |
| pkg/autoscaler/multiscaler.go | 94.6% | 94.4% | -0.3 |
| pkg/autoscaler/stats_scraper.go | 83.7% | 89.3% | 5.6 |
| pkg/autoscaler/statserver/server.go | 79.2% | 80.8% | 1.5 |

@vagababov (Contributor) left a comment:

/gltm


@vagababov (Contributor) commented:

meh,
/lgtm

@knative-prow-robot added the lgtm label on Feb 27, 2019
@knative-prow-robot merged commit 0759ef8 into knative:master on Feb 27, 2019
@yanweiguo deleted the switch_to_pull branch on March 21, 2019
Labels: approved, lgtm, size/L