
TEP-0036: Start measuring Pipelines performance #277

Merged 1 commit into tektoncd:master on Jan 6, 2021

Conversation

bobcatfish
Contributor

This PR starts a TEP to begin to measure Tekton Pipelines performance
and to address tektoncd/pipeline#540.

This first iteration just tries to describe the problem rather than
suggest a solution.

It DOES recommend measuring SLOs and SLIs as a goal, which is kind of
part of the solution, so if we think it's useful we could step back even
further. But I think this is a reasonable path forward; curious what
other folks think!

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 20, 2020
@bobcatfish bobcatfish changed the title Add a TEP to start measuring Pipelines performance TEP-0036: Start measuring Pipelines performance Nov 20, 2020
@vdemeester vdemeester added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Nov 23, 2020
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2020
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 1, 2020

@ghost ghost left a comment


🎉 Some questions but otherwise looks good to me!


* To be able to understand and communicate about Tekton performance on an ongoing basis
* Be able to answer questions like:
* How much overhead does a Task add?

I read this and immediately wondered, "add to what?" Is that intentionally left out because it could be... anything? Or would it be possible to specify this a tad more to something like "How much overhead does a Task add to Pod performance?"?

I think the sticking point for me here is probably just the word "overhead" which implies a baseline over and above which the Task adds "something". What are those baselines? Are we saying that identifying those baselines is itself a Motivation that's worth calling out? (Reading ahead it sounds like it?)

Similar question for "How much does a Pipeline add?" and "How much does a Bundle add?" below.

Member


"How much overhead does a Task add to Pod performance?"

I think pod performance depends on the cloud provider or on the environment that it's getting created in. I believe whatever benchmarking we put in place will be against our own CI system to begin with; please correct me if I am wrong. I think what we are trying to evaluate here is "how much overhead do these changes or this release add (or improve) when creating a TaskRun", which includes setting up resources, workspaces, volumes, etc. before a pod is actually created.

Contributor Author


Great point @sbwsg! I've added a bit more detail; still really vague tho, so lemme know if you want me to be more precise.

@pritidesai it's true that the performance depends on the cloud provider - specifically what we're interested in here is the relative performance, and @sbwsg pointed out that half the statement was missing XD

The idea is: if I run some specific image in a pod, it takes X length of time to execute (probably on average, but maybe we'll be more interested in some other way of looking at the metric over time). How much overhead is added if I execute that same image / pod via a Task? And what if I put that Task in a Pipeline? You're right that the overhead might be different in different environments; it could be interesting to compare.
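
A minimal sketch of the kind of comparison described above, assuming Tekton is installed in the cluster and that `pod.yaml` and `taskrun.yaml` (hypothetical manifests) run the exact same image and command; the resource names and the polling approach are illustrative assumptions, not anything specified in the TEP:

```python
# Sketch: time the same workload run as a bare Pod vs. via a Tekton TaskRun and
# report the difference as "Task overhead". Manifest and resource names are
# hypothetical placeholders.
import subprocess
import time


def wait_until(kind: str, name: str, jsonpath: str, expected: str, timeout: float = 600.0) -> None:
    """Poll `kubectl get` until the field at `jsonpath` equals `expected`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", kind, name, "-o", f"jsonpath={jsonpath}"],
            capture_output=True, text=True,
        ).stdout.strip()
        if out == expected:
            return
        time.sleep(2)
    raise TimeoutError(f"{kind}/{name} did not reach {expected!r} within {timeout}s")


def timed_run(manifest: str, kind: str, name: str, jsonpath: str, expected: str) -> float:
    """Apply a manifest and return seconds from apply until the resource reports completion."""
    start = time.monotonic()
    subprocess.run(["kubectl", "apply", "-f", manifest], check=True)
    wait_until(kind, name, jsonpath, expected)
    return time.monotonic() - start


if __name__ == "__main__":
    pod_seconds = timed_run("pod.yaml", "pod", "baseline-pod", "{.status.phase}", "Succeeded")
    taskrun_seconds = timed_run(
        "taskrun.yaml", "taskrun", "baseline-taskrun",
        "{.status.conditions[?(@.type=='Succeeded')].status}", "True",
    )
    print(f"bare pod: {pod_seconds:.1f}s")
    print(f"taskrun:  {taskrun_seconds:.1f}s")
    print(f"overhead: {taskrun_seconds - pod_seconds:.1f}s")
```

The same pattern would extend to a PipelineRun wrapping that Task, or to a TaskRun that references a Tekton Bundle, to get the other deltas mentioned in the excerpt below.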

* How much overhead does using a Tekton Bundle vs referencing a Task in my cluster add?
* If we switch out X component for Y component in the controller implementation, how does this impact performance?
* How much better or worse does release X perform than release Y?
* What characteristics should I look for when choosing what kind of machine to tun the Tekton Pipelines controller on?

nit: tun -> run


### Goals

The goal of this proposal is to avoid trying to boil the performance and loadtesting ocean all at once, and to instead

Non-goal: boiling the performance and loadtesting ocean

😄

* Chaos testing, i.e. how does the system perform in the presence of errors
* Other Tekton projects (e.g. Triggers, CLI, Dashboard, etc.)

### Use Cases (optional)

Suggestion: As a maintainer of Tekton Pipelines I can identify through some documented process (or dashboard or tool) where in the commit history a performance regression was introduced.

It sounds to me like (at least initially?) we don't plan to perform this performance measurement against every PR. It'd be nice if we built this stuff out with an eye to working backwards such that on a bleary-eyed pre-coffee Monday morning I can start my build captain rotation by following the runbook (or something) to figure out where a performance regression started rearing its head.

## Requirements

* Start with a bare minimum set of SLIs/SLOs and iterate from there
* Access to the new infrastructure should be given to all “build cops”

Is there an assumption here that this role will also now be reviewing and acting on the perf information being collected? Or is this purely about managing the machines the performance measurements are running on?

If the role is also doing something tangible with the info, suggest maybe putting that in a Goal / Requirement too?

Contributor Author


I think I'm gonna cheat a bit and add a requirement about deciding who is responsible for acting on this info and what they're expected to do 😇

Reference: [Definitions of SLIs and SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
These are worded in terms of running observable services. Since Tekton Pipelines is providing a service that can be run
by others (vs hosting a "Tekton Pipelines" instance we expect users to use) our situation is a bit different, but I
think the same SLI and SLO concepts can be applied.
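
For illustration only, here is a minimal sketch of what checking one such SLI against an SLO could look like; the indicator ("p95 TaskRun overhead in seconds"), the 10-second target, and the sample numbers are placeholders, not values proposed by this TEP:

```python
# Sketch: compute an SLI (p95 of measured TaskRun overhead, in seconds) and
# compare it to an SLO target. All numbers here are made up for illustration.
from statistics import quantiles

def p95(samples: list[float]) -> float:
    # quantiles(..., n=100) returns the 1st..99th percentiles; index 94 is the 95th.
    return quantiles(samples, n=100)[94]

overhead_samples = [4.2, 5.1, 3.8, 6.0, 4.9, 12.3, 5.5]  # hypothetical measurements
SLO_TARGET_SECONDS = 10.0

sli = p95(overhead_samples)
status = "met" if sli <= SLO_TARGET_SECONDS else "violated"
print(f"SLI (p95 overhead): {sli:.1f}s / SLO target: {SLO_TARGET_SECONDS}s -> {status}")
```

In practice the samples would come from whatever measurement infrastructure this TEP leads to, rather than a hard-coded list.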

I agree that the concepts can be applied, but I wonder if we should just call them Performance Indicators and Performance Objectives? Reason being that SLI + SLO implies (to my ear) SLA. And SLA is kinda more of a contractual thing that sometimes gets used by corporations to bash each other over the head or hold back payments or get money back or w/e.

What we're offering here isn't really contractual, I don't think? So I wonder if just naming them something slightly different (while 100% still linking to the SRE book as reference material for meaning) will nullify some of that impression?

Conversely, let's say we do stick with the SLI/SLO language. What's the SLA? Or, in other words, what are the outcomes for the Tekton Pipelines project when a performance regression is identified? Do we hold back releases? Drop all other work to focus on it? Have a dedicated build captain role specific to performance? Perhaps this also needs to be a Goal / Requirement that we have to hash out in WGs?

Member


Yup, I agree; I like Performance Indicators and Objectives better. Great suggestion @sbwsg.

Contributor Author


I'm tempted to stick with SLO and SLI b/c they are well-known terms, and switching to performance indicators and objectives feels a bit like we're trying to invent something new when we don't have to.

@sbwsg I think you're totally right that SLO and SLI are often used in the same context as SLA, but really I don't think they imply SLA.


sgtm!

Member


Sure, I am open to going with SLO and SLI; we can always change it if needed.

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2020
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 4, 2021
@bobcatfish
Contributor Author

I think this is ready for another look @sbwsg and @pritidesai !

Also quick update on my plan going forward: I'd like to merge the problem statement, but after that I'm planning to put this on the back burner and would be happy for anyone else to take this over if they are interested.

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 5, 2021
@pritidesai
Member

/lgtm

This is a great start.

Also quick update on my plan going forward: I'd like to merge the problem statement, but after that I'm planning to put this on the back burner and would be happy for anyone else to take this over if they are interested.

@psschwei and I are interested in giving it a go if that's fine. Our goal would be to start small with just one SLI and measure it with the Pipeline.

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 6, 2021
@tekton-robot tekton-robot merged commit 95d74ad into tektoncd:master Jan 6, 2021
@bobcatfish
Contributor Author

That sounds great @pritidesai and @psschwei !!! 🙏 🎉 Thank you :D
