From d1756a7a6ac30c2d99df3a7c5978bde9f725a4b1 Mon Sep 17 00:00:00 2001
From: Christie Wilson
Date: Fri, 20 Nov 2020 15:22:23 -0500
Subject: [PATCH] Add a TEP to start measuring Pipelines performance

This PR starts a TEP to begin to measure Tekton Pipelines performance and
address https://github.com/tektoncd/pipeline/issues/540

This first iteration just tries to describe the problem vs suggesting the
solution. It DOES recommend measuring SLOs and SLIs as a goal, which is kind
of part of the solution, so if we think it's useful we could step back even
further, but I think this is a reasonable path forward; curious what other
folks think!
---
 ...-measuring-tekton-pipelines-performance.md | 104 ++++++++++++++++++
 teps/README.md                                |   1 +
 2 files changed, 105 insertions(+)
 create mode 100644 teps/0036-start-measuring-tekton-pipelines-performance.md

diff --git a/teps/0036-start-measuring-tekton-pipelines-performance.md b/teps/0036-start-measuring-tekton-pipelines-performance.md
new file mode 100644
index 000000000..8bac3be74
--- /dev/null
+++ b/teps/0036-start-measuring-tekton-pipelines-performance.md
@@ -0,0 +1,104 @@
+---
+status: proposed
+title: Start Measuring Tekton Pipelines Performance
+creation-date: '2020-11-20'
+last-updated: '2020-11-20'
+authors:
+- '@bobcatfish'
+---
+
+# TEP-0036: Start Measuring Tekton Pipelines Performance
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases (optional)](#use-cases-optional)
+- [Requirements](#requirements)
+- [References (optional)](#references-optional)
+
+## Summary
+
+Up until this point, we have left it to our users to report back to us on how well Tekton Pipelines performs. Relying
+on our users in this way means that we can’t easily inform users about what performance to expect, and also means that
+we usually only catch and fix performance issues after releases.
+
+This proposal is about finding an incremental step forward we can take towards being more proactive in responding to
+performance issues; it will also enable us to evaluate the performance impact of implementation decisions going
+forward.
+
+Issue: [tektoncd/pipeline#540](https://github.com/tektoncd/pipeline/issues/540)
+
+## Motivation
+
+* To be able to understand and communicate about Tekton performance on an ongoing basis
+* To be able to answer questions like:
+  * How much overhead does a Task add to the execution of a container image in a pod? (a rough sketch of one way this
+    could be measured follows this list)
+  * How much overhead does a Pipeline add to the execution of a Task?
+  * How much overhead does using a Tekton Bundle vs referencing a Task in my cluster add?
+  * If we switch out X component for Y component in the controller implementation, how does this impact performance?
+  * How much better or worse does release X perform than release Y?
+  * What characteristics should I look for when choosing what kind of machine to run the Tekton Pipelines controller on?
+* To be able to get ahead of issues such as [tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)
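+
+To make the first of those questions concrete, "overhead" could, for example, be defined as the wall-clock duration of
+a TaskRun minus the time its step containers were actually running. The following is a purely illustrative,
+non-normative sketch (the timestamps are made up; in practice they would come from the TaskRun's
+`metadata.creationTimestamp`, `status.completionTime` and the terminated state of its step containers), not a proposed
+implementation:
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// taskRunOverhead is one possible definition of the overhead a Task adds on top
+// of running a container image in a pod: total TaskRun wall-clock time minus
+// the time the step containers themselves were running.
+func taskRunOverhead(created, completed time.Time, stepRunTimes []time.Duration) time.Duration {
+    total := completed.Sub(created)
+    var inContainers time.Duration
+    for _, d := range stepRunTimes {
+        inContainers += d
+    }
+    return total - inContainers
+}
+
+func main() {
+    // Made-up example: a TaskRun that took 42s end to end, with a single step
+    // whose container ran for 30s, giving 12s of overhead.
+    created := time.Date(2020, 11, 20, 15, 0, 0, 0, time.UTC)
+    completed := created.Add(42 * time.Second)
+    fmt.Println(taskRunOverhead(created, completed, []time.Duration{30 * time.Second}))
+}
+```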
+
+### Goals
+
+Identify some (relatively) small step we can take towards starting to gather this kind of information, so we can build
+from there.
+
+* Identify Service Level Indicators (SLIs) for Tekton Pipelines (as few as possible to start with, maybe just one)
+* For each Service Level Indicator, define a target range (SLO) for some known setup
+* Set up the infrastructure required such that:
+  * Contributors and users can find the data they need on Tekton performance
+  * Performance is measured regularly at some interval (e.g. daily, weekly, per release)
+  * Reliable and actionable alerting is set up (if feasible) to notify maintainers when SLOs are violated
+
+Reference: [Definitions of SLIs and SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
+These are worded in terms of running observable services. Since Tekton Pipelines is providing a service that can be run
+by others (vs hosting a "Tekton Pipelines" instance we expect users to use), our situation is a bit different, but I
+think the same SLI and SLO concepts can be applied.
+
+### Non-Goals
+
+* Avoid trying to boil the performance and load-testing ocean all at once
+
+These are all goals we likely want to tackle in subsequent TEPs, so the groundwork we lay here shouldn’t preclude any
+of them:
+
+* [Benchmarking](https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go)
+* Load testing (unless we define our initial SLOs to include it?), i.e. how does the system perform under X load
+* Stress testing, i.e. where are the limits of the system’s performance
+* Soak testing, i.e. continuous load over a long period of time
+* Chaos testing, i.e. how does the system perform in the presence of errors
+* Other Tekton projects (e.g. Triggers, CLI, Dashboard, etc.)
+
+### Use Cases (optional)
+
+* As a maintainer of Tekton Pipelines, I can identify through some documented process (or dashboard or tool) where in
+  the commit history a performance regression was introduced
+* As a user of Tekton Pipelines (possibly named @skaegi) supporting users creating Pipelines with 50+ Tasks and many
+  uses of params and results ([tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)), I can
+  confidently upgrade without worrying about a performance degradation
+* As any user of Tekton Pipelines, I can upgrade without being afraid of performance regressions
+* As a maintainer of Tekton Pipelines, I can swap out one library for another in the controller code (e.g. a different
+  serialization library or an upgrade to knative/pkg) and understand how this impacts performance
+* As a maintainer of Tekton Pipelines, I do not have to be nervous that our users will be exposed to serious
+  performance regressions
+* As a possible user evaluating Tekton Pipelines, I can understand what performance to expect and choose the machines
+  to run it on accordingly
+
+## Requirements
+
+* Start with a bare minimum set of SLIs/SLOs and iterate from there
+* Access to the new infrastructure should be given to all build captains
+* All members of the community should be able to view metrics via public dashboards
+* Metrics should be aggregated carefully, with a preference toward distributions and percentiles vs averages (see
+  [the Aggregation section in this chapter on SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/));
+  a non-normative sketch of what this could look like follows this list
+* It should be clear who is expected to be observing and acting on these results (e.g. as part of the build captain
+  rotation? as an implementer of a feature?)
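+
+To make the aggregation requirement concrete, the sketch below computes a percentile over a set of per-TaskRun overhead
+samples and compares it against a target. It is purely illustrative: the sample values and the 10s target are made up,
+and choosing the real SLIs and SLOs is exactly the work this TEP proposes:
+
+```go
+package main
+
+import (
+    "fmt"
+    "math"
+    "sort"
+    "time"
+)
+
+// percentile returns the p-th percentile (0 < p <= 100) of samples using the
+// nearest-rank method.
+func percentile(samples []time.Duration, p float64) time.Duration {
+    sorted := append([]time.Duration(nil), samples...)
+    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
+    rank := int(math.Ceil(float64(len(sorted))*p/100.0)) - 1
+    if rank < 0 {
+        rank = 0
+    }
+    return sorted[rank]
+}
+
+func main() {
+    // Made-up overhead samples collected from some known, fixed test setup.
+    samples := []time.Duration{
+        3 * time.Second, 4 * time.Second, 4 * time.Second, 5 * time.Second,
+        5 * time.Second, 6 * time.Second, 7 * time.Second, 12 * time.Second,
+    }
+
+    p95 := percentile(samples, 95)
+
+    // Made-up SLO: "95% of TaskRuns in the reference setup add at most 10s of
+    // overhead". Whether this holds is what alerting would act on.
+    slo := 10 * time.Second
+    if p95 > slo {
+        fmt.Printf("SLO violated: p95 overhead %v > %v\n", p95, slo)
+    } else {
+        fmt.Printf("SLO met: p95 overhead %v <= %v\n", p95, slo)
+    }
+}
+```
+
+Note that in this made-up data the *average* overhead (5.75s) would look comfortably within the 10s target, while the
+p95 (12s) does not, which is why distributions and percentiles are preferred over averages here.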
+
+## References (optional)
+
+* Original Issue: [tektoncd/pipeline#540](https://github.com/tektoncd/pipeline/issues/540)
+* Recent performance issue: [tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)
diff --git a/teps/README.md b/teps/README.md
index 0aaa98357..4459fc206 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -145,3 +145,4 @@ This is the complete list of Tekton teps:
 |[TEP-0030](0030-workspace-paths.md) | workspace-paths | proposed | 2020-10-18 |
 |[TEP-0031](0031-tekton-bundles-cli.md) | tekton-bundles-cli | proposed | 2020-11-18 |
 |[TEP-0032](0032-tekton-notifications.md) | Tekton Notifications | proposed | 2020-11-18 |
+|[TEP-0036](0036-start-measuring-tekton-pipelines-performance.md) | Start Measuring Tekton Pipelines Performance | proposed | 2020-11-20 |