TEP-0036: Start measuring Pipelines performance #277

Merged
merged 1 commit into from
Jan 6, 2021
104 changes: 104 additions & 0 deletions teps/0036-start-measuring-tekton-pipelines-performance.md
@@ -0,0 +1,104 @@
---
status: proposed
title: Start Measuring Tekton Pipelines Performance
creation-date: '2020-11-20'
last-updated: '2020-11-20'
authors:
- '@bobcatfish'
---

# TEP-0036: Start Measuring Tekton Pipelines Performance

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Use Cases (optional)](#use-cases-optional)
- [Requirements](#requirements)
- [References (optional)](#references-optional)
<!-- /toc -->

## Summary

Up until this point, we have left it to our users to report back to us on how well Tekton Pipelines performs. Relying on
our users in this way means that we can’t easily inform users about what performance to expect and also means that we
usually only catch and fix performance issues after releases.

This proposal identifies an incremental step we can take towards responding to performance issues more proactively. It
will also enable us to evaluate the performance impact of implementation decisions going forward.

Issue: [tektoncd/pipeline#540](https://github.com/tektoncd/pipeline/issues/540)

## Motivation

* Be able to understand and communicate about Tekton performance on an ongoing basis
* Be able to answer questions like:
  * How much overhead does a Task add to the execution of a container image in a pod?
  * How much overhead does a Pipeline add to the execution of a Task?
  * How much overhead does using a Tekton Bundle vs referencing a Task in my cluster add?
  * If we switch out X component for Y component in the controller implementation, how does this impact performance?
  * How much better or worse does release X perform than release Y?
  * What characteristics should I look for when choosing what kind of machine to run the Tekton Pipelines controller on?
* Be able to get ahead of issues such as [tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)

### Goals

Identify some (relatively) small step we can take towards starting to gather this kind of information, so we can build
from there.

* Identify Service Level Indicators (SLIs) for Tekton Pipelines (as few as possible to start with, maybe just one)
* For each Service Level Indicator, define a target range (Service Level Objective, SLO) for some known setup
* Set up the infrastructure required such that:
  * Contributors and users can find the data they need on Tekton performance
  * Performance is measured regularly at some interval (e.g. daily, weekly, per release)
  * Reliable and actionable alerting is set up (if feasible) to notify maintainers when SLOs are violated

Reference: [Definitions of SLIs and SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
These are worded in terms of running observable services. Since Tekton Pipelines is providing a service that can be run
by others (vs hosting a "Tekton Pipelines" instance we expect users to use) our situation is a bit different, but I
think the same SLI and SLO concepts can be applied.

I agree that the concepts can be applied but I wonder if we should just call them Performance Indicators and Performance Objectives? Reason being that SLI + SLO implies (to my ear) SLA. And SLA is kinda more of a contractual thing that sometimes gets used by corporations to bash each other over the head or hold back payments or get money back or w/e.

What we're offering here isn't really contractual I don't think? So I wonder if just naming them something slightly different (while 100% still linking to the SRE book as reference material for meaning) will null some of that impression?

Conversely, let's say we do stick with the SLI/SLO language. What's the SLA? Or, in other words, what are the outcomes for the Tekton Pipelines project when a performance regression is identified? Do we hold back releases? Drop all other work to focus on it? Have a dedicated build captain role specific to performance? Perhaps this also needs to be a Goal / Requirement that we have to hash out in WGs?

Member

yup I agree, I like Performance Indicators and Objectives better, great suggestion @sbwsg.

Contributor Author

I'm tempted to stick with SLO and SLI b/c they are well known terms, and switching to performance indicators and objectives feels a bit like we're trying to invent something new when we don't have to.

@sbwsg I think you're totally right that SLO and SLI are often used in the same context as SLA but really I don't think they imply SLA


sgtm!

Member

sure, I am open to going with SLO and SLI, we can always change it if needed.


### Non-Goals

* Boiling the performance and load testing ocean all at once

These are all goals we likely want to tackle in subsequent TEPs, so the groundwork we lay here shouldn’t preclude any of
them:
* [Benchmarking](https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go)
* Load testing (unless we define our initial SLOs to include it?), i.e. how does the system perform under X load
* Stress testing, i.e. where are the limits of the system’s performance
* Soak testing, i.e. continuous load over a long period of time
* Chaos testing, i.e. how does the system perform in the presence of errors
* Other Tekton projects (e.g. Triggers, CLI, Dashboard, etc.)

### Use Cases (optional)

Suggestion: As a maintainer of Tekton Pipelines I can identify through some documented process (or dashboard or tool) where in the commit history a performance regression was introduced.

It sounds to me like (at least initially?) we don't plan to perform this performance measurement against every PR. It'd be nice if we built this stuff out with an eye to working backwards such that on a bleary-eyed pre-coffee Monday morning I can start my build captain rotation by following the runbook (or something) to figure out where a performance regression started rearing its head.


* As a maintainer of Tekton Pipelines I can identify through some documented process (or dashboard or tool) where in the
commit history a performance regression was introduced
* As a user of Tekton Pipelines (possibly named @skaegi) supporting users creating Pipelines with 50+ tasks with many
uses of params and results ([tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)) I can
confidently upgrade without worrying about a performance degradation
* As any user of Tekton Pipelines, I can upgrade without being afraid of performance regressions
* As a maintainer of Tekton Pipelines, I can swap out one library for another in the controller code (e.g. different
serialization library or upgrade to knative/pkg) and understand how this impacts performance
* As a maintainer of Tekton Pipelines, I do not have to be nervous that our users will be exposed to serious performance
regressions
* As a possible user evaluating Tekton Pipelines, I can understand what performance to expect and choose the machines to
run it on accordingly
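
The first use case (finding where in the commit history a regression was introduced) can be sketched as a scan over
recorded measurements. The Go example below assumes measurements are stored in commit order with one SLI value each;
the `Measurement` type, the commit identifiers, and the threshold are hypothetical. If measurements are only taken
daily or per release, the same scan would narrow the regression to the range of commits between two measurements
rather than to a single commit.

```go
package main

import "fmt"

// Measurement is a hypothetical record pairing a commit with one SLI value.
type Measurement struct {
	Commit     string
	P95Seconds float64
}

// FirstViolation returns the index of the earliest measurement whose value
// exceeds maxSeconds, or -1 if every measurement is within the objective.
// A linear scan is used on purpose: noisy results are not guaranteed to be
// monotonic, so a binary search could skip the first bad measurement.
func FirstViolation(history []Measurement, maxSeconds float64) int {
	for i, m := range history {
		if m.P95Seconds > maxSeconds {
			return i
		}
	}
	return -1
}

func main() {
	// Made-up history, oldest first.
	history := []Measurement{
		{Commit: "commit-a", P95Seconds: 3.1},
		{Commit: "commit-b", P95Seconds: 3.0},
		{Commit: "commit-c", P95Seconds: 6.4},
		{Commit: "commit-d", P95Seconds: 6.2},
	}

	if i := FirstViolation(history, 5.0); i >= 0 {
		fmt.Printf("objective first exceeded at %s (%.1fs)\n", history[i].Commit, history[i].P95Seconds)
	} else {
		fmt.Println("no violation found")
	}
}
```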

## Requirements

* Start with a bare minimum set of SLIs/SLOs and iterate from there
* Access to the new infrastructure should be given to all build captains
* All members of the community should be able to view metrics via public dashboards
* Metrics should be aggregated carefully with a preference toward distributions and percentiles vs averages (see
[the Aggregation section in this chapter on SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/))
* It should be clear who is expected to be observing and acting on these results (e.g. as part of the build captain
rotation? As an implementer of a feature?)

## References (optional)

* Original Issue: [tektoncd/pipeline#540](https://github.com/tektoncd/pipeline/issues/540)
* Recent performance issue: [tektoncd/pipeline#3521](https://github.com/tektoncd/pipeline/issues/3521)
1 change: 1 addition & 0 deletions teps/README.md
@@ -146,3 +146,4 @@ This is the complete list of Tekton teps:
|[TEP-0031](0031-tekton-bundles-cli.md) | tekton-bundles-cli | proposed | 2020-11-18 |
|[TEP-0032](0032-tekton-notifications.md) | Tekton Notifications | proposed | 2020-11-18 |
|[TEP-0035](0035-document-tekton-position-around-policy-authentication-authorization.md) | document-tekton-position-around-policy-authentication-authorization | implementable | 2020-12-09 |
|[TEP-0036](0036-start-measuring-tekton-pipelines-performance.md) | Start Measuring Tekton Pipelines Performance | proposed | 2020-11-20 |