
Scrape queue-proxy metrics in autoscaler #3149

Merged: 49 commits into knative:master from yanweiguo:switch_to_pull on Feb 27, 2019

Conversation

@yanweiguo (Contributor) commented Feb 9, 2019

Fixes #2203.
Fixes #1927.

Proposed Changes

  • Remove the websocket usage in queue-proxy, which pushed metrics to the autoscaler.
  • Create a goroutine for ServiceScraper to scrape metrics from queue-proxy when a UniScaler is created (a sketch follows this list).
  • Use a K8s Informer in ServiceScraper to get the ready-pod count and estimate the average revision concurrency. Store this value in Stat as the new field AverageRevConcurrency.
  • Remove PodWeight from the autoscaling algorithm, as proposed in #2977 (Bucketize autoscaling metrics by timeframe, not by pod name). Use the same weight for all sample data.
  • Average over all data points. See the reasoning below.
  • Remove observedPods. This information is useless when the autoscaling algorithm is based on sample data; the hard dependency was already removed in #3055 (Use actual pods from K8S Informer for scaling rate).
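
For orientation, here is a minimal, hypothetical Go sketch of the per-revision scrape loop described above. The Scrape signature and the StatMessage channel mirror the diff shown later in this thread; the Scraper interface, runScraper, and the one-second tick are assumptions for illustration, not the PR's exact code.

```go
package autoscaler

import (
	"context"
	"time"
)

// StatMessage stands in for the autoscaler's stat wrapper in this sketch.
type StatMessage struct{}

// Scraper abstracts the ServiceScraper added by this PR; the Scrape
// signature matches the diff discussed later in this thread.
type Scraper interface {
	Scrape(ctx context.Context, statsCh chan<- *StatMessage)
}

// runScraper is a hypothetical per-revision loop: started when a
// UniScaler is created, stopped via ctx when the revision goes away.
func runScraper(ctx context.Context, s Scraper, statsCh chan<- *StatMessage) {
	ticker := time.NewTicker(time.Second) // scrape interval is an assumption
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // UniScaler deleted; stop scraping
		case <-ticker.C:
			s.Scrape(ctx, statsCh) // pull queue-proxy metrics, forward stats
		}
	}
}
```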

Release Note

1. Scrape queue-proxy metrics in autoscaler instead of pushing metrics from queue-proxy to autoscaler via websocket connection. Remove the websocket usage in queue-proxy.

@knative-prow-robot removed the lgtm label on Feb 22, 2019
@knative-prow-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mdemirhan, yanweiguo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yanweiguo (Contributor, Author) commented:

The test TestAutoscaleUpCountPods is flaky. In most of my failed runs, the maximum average observed revision concurrency is between 29 and 30. It usually took 10 seconds to scale up to 2 pods and another 20 seconds to scale up to 3 pods. At the point when the traffic stops (30 seconds for each scale-up target), the expected average observed revision concurrency is 20*10/60 + 30*20/60 + 40*30/60 = 33.33.
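
Spelled out, that is the time-weighted average implied by the formula over the 60-second window (10 s at 20, 20 s at 30, 30 s at 40):

```latex
\frac{20 \cdot 10 + 30 \cdot 20 + 40 \cdot 30}{60} = \frac{2000}{60} \approx 33.33
```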

Since we sample, and for the reason mentioned in this comment, we can get metrics much lower than the pod-average level. This results in a lower average observed revision concurrency and causes the test failure, so I increased the traffic-sending window from 30 seconds to 35 seconds to reduce flakiness.

@yanweiguo (Contributor, Author) commented:

Merged #3289.

Thanks to #3289, the changes to knative/serving/pkg/autoscaler/autoscaler.go in this PR shrank a lot. But I'm sorry to reintroduce distinguishing the activator from customer pods.

@k4leung4 @markusthoemmes PTAL.

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-unit-tests

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-integration-tests

TestScaleToN/scale-50 is not related to this PR.

@markusthoemmes (Contributor) left a comment:

Let's discuss whether we really need the alternative activator path.

Review thread on pkg/autoscaler/autoscaler.go (outdated)
@markusthoemmes (Contributor) commented:

Do we need to think about the backwards-compatibility implications here? With the changes I proposed above, we could see double reporting (once through scraping, once through metric pushing) until apps are redeployed. Do we need to actively prevent that (for example, by only accepting pushed metrics from the activator), or do we rely on all applications being redeployed so they no longer send stats?
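
For illustration only, the "only accept pushed metrics from the activator" option could look roughly like this; the pod-name prefix check is purely an assumption, not a mechanism from this PR:

```go
package statserver

import "strings"

// acceptPushedStat is a hypothetical guard for the stat server's
// websocket path during the transition: keep pushed stats only from
// the activator, so scraped and pushed metrics for the same user
// pods are never double-counted.
func acceptPushedStat(podName string) bool {
	// Assumption: activator pods are identifiable by a name prefix.
	return strings.HasPrefix(podName, "activator")
}
```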

@greghaynes (Contributor) commented:

@markusthoemmes We had some discussion about that above (#3149 (comment), if that link works). We have a mechanism that redeploys users' apps as part of 0.3, so now that 0.4 is cut, users have an intermediate version in which both systems were supported.

@markusthoemmes (Contributor) commented:

@greghaynes Right, though this is not necessarily about supporting both; it's about having the two interfere with each other.

@knative-prow-robot added the size/L label (100-499 lines changed, ignoring generated files) and removed the size/XL label (500-999 lines) on Feb 25, 2019
@markusthoemmes (Contributor) left a comment:

We're getting there. I like the minimized changes to the autoscaler itself a lot; thanks for doing that!

I've got a few comments throughout. I'm happy to help you get this in ASAP. 🎉

Review threads:
  • test/e2e/autoscale_test.go (outdated)
  • pkg/autoscaler/autoscaler.go (outdated)
  • pkg/autoscaler/autoscaler_test.go
  • pkg/autoscaler/multiscaler_test.go (outdated)
  • pkg/autoscaler/multiscaler.go (outdated)
  • pkg/autoscaler/stats_scraper.go
  • pkg/autoscaler/stats_scraper.go (outdated, 3 threads)
One of the stats_scraper.go threads is on this hunk:

```diff
 	}, nil
 }

 // Scrape calls the destination service then sends the result
 // to the given stats channel.
-func (s *ServiceScraper) Scrape(statsCh chan<- *StatMessage) {
+func (s *ServiceScraper) Scrape(ctx context.Context, statsCh chan<- *StatMessage) {
+	logger := logging.FromContext(ctx)
```
Contributor:

I wonder if we should make Scrape return an error and log in the calling method, to avoid creating loggers (and passing a context just to create loggers).

@yanweiguo (Contributor, Author) replied:

I made this function follow the same format as the Scale function. If we want to log a "no pods" message for debugging, we can't return it as an error.

@vagababov (Contributor) replied:

We can have the logger as a ServiceScraper member, FWIW.
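
A sketch of that suggestion: hold the logger as a field so Scrape no longer needs a context just to build one. The names mirror the PR where they appear in the diff above; the field set and the Debug call are assumptions for illustration.

```go
package autoscaler

import "go.uber.org/zap"

// StatMessage stands in for the autoscaler's stat wrapper in this sketch.
type StatMessage struct{}

// ServiceScraper here holds the logger as a member, per the
// suggestion above; the exact field set is an assumption.
type ServiceScraper struct {
	logger *zap.SugaredLogger
}

// Scrape keeps the original context-free signature.
func (s *ServiceScraper) Scrape(statsCh chan<- *StatMessage) {
	s.logger.Debug("scraping queue-proxy metrics") // no ctx needed to log
	// ... scrape the pods and send a *StatMessage on statsCh
}
```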

@yanweiguo (Contributor, Author) commented:

/test pull-knative-serving-integration-tests

@knative-metrics-robot commented:
The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 97.2% | 97.0% | -0.2 |
| pkg/autoscaler/multiscaler.go | 94.6% | 94.4% | -0.3 |
| pkg/autoscaler/stats_scraper.go | 83.7% | 89.3% | 5.6 |
| pkg/autoscaler/statserver/server.go | 79.2% | 80.8% | 1.5 |

@vagababov (Contributor) left a comment:

/gltm


@vagababov (Contributor) commented:

meh,
/lgtm

@knative-prow-robot added the lgtm label on Feb 27, 2019
@knative-prow-robot merged commit 0759ef8 into knative:master on Feb 27, 2019
@yanweiguo deleted the switch_to_pull branch on March 21, 2019
Labels: approved, lgtm, size/L