Perform webhook validation for remote pipelines #6887

lbernick · 2023-06-28T18:51:57Z

Prior to this commit, remote pipelines were only validated by calling pipelineSpec.Validate in the PipelineRun reconciler. This omits some validation that is only done when validating Pipelines, rather than Pipeline specs, such as validation for propagated params and workspaces. In addition, if a cluster operator or vendor defines any validating admission webhooks for Pipelines, this validation would apply only to local Pipelines but not remote Pipelines.

This commit issues a dry-run create request for remote Pipelines and fails the PipelineRun if the apiserver rejects the request. This allows us to do webhook-based validation of remote Pipelines without ever having to create them on the cluster, ensuring validation of remote Pipelines matches validation of local Pipelines.

Similar validation will be added for remote Tasks in a separate commit.

/kind bug
closes #6670

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

n/a Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
Has Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
n/a Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

bug fix: Remote Pipelines do not support propagated parameters and workspaces

tekton-robot · 2023-06-28T18:51:59Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tekton-robot · 2023-06-28T19:01:40Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.2%	-0.3

tekton-robot · 2023-06-28T19:06:06Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.2%	-0.3

tekton-robot · 2023-06-28T19:32:19Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-06-28T19:33:41Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-06-28T19:36:16Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.2%	-0.3
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

JeromeJu · 2023-06-28T19:36:38Z

/assign

tekton-robot · 2023-06-28T20:45:35Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-06-28T20:47:18Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-06-29T12:10:33Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-06-29T12:10:49Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

pkg/reconciler/pipelinerun/pipelinerun.go

pkg/reconciler/pipelinerun/pipelinerun_test.go

chitrangpatel · 2023-07-05T15:33:49Z

/lgtm

pkg/reconciler/pipelinerun/pipelinerun.go

pkg/reconciler/pipelinerun/resources/pipelineref.go

dibyom · 2023-07-06T17:27:03Z

Overall, I like this approach. One concern is that this is adding the webhook to the runtime path of executing the pipeline. So, if the webhook is down (or is being rate limited) we'd have to handle retries using a backoff.

The Kuberentes API server usually has fairly low rate limits (for GKE its 3k per min - does the client already retry if the rate limit is exhausted?

tekton-robot · 2023-07-06T17:47:19Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-07-06T17:51:43Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

tekton-robot · 2023-07-06T17:52:04Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.5%	91.5%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	93.5%	0.6

lbernick · 2023-07-06T18:15:44Z

Overall, I like this approach. One concern is that this is adding the webhook to the runtime path of executing the pipeline. So, if the webhook is down (or is being rate limited) we'd have to handle retries using a backoff.
The Kuberentes API server usually has fairly low rate limits (for GKE its 3k per min - does the client already retry if the rate limit is exhausted?

We already make a number of calls to the k8s api server during pipelinerun and taskrun execution; I'd expect us to have already run into rate limiting issues if they exist.

It looks like we set a QPS and burst for our k8s clients here based on config documented here and default to 20 qps. Client-go does seem to have some backoff handling (based on poking around the rest package) but it's not well documented. I don't know of a good way to handle rate limiting for our webhooks separately from other api server requests, but since platforms seem to have global limits for all api server requests I doubt this would be desirable.

I'm not totally sure I understand what you are suggesting (maybe we just need to investigate if we have a reasonable rest config for our clients?) so LMK if I've addressed your concern.

lbernick · 2023-07-06T18:16:24Z

/retest
Known flake: #6772

dibyom · 2023-07-07T16:43:34Z

pkg/reconciler/pipelinerun/resources/pipelineref.go

+		// Issue a dry-run request to create the remote Pipeline, so that it can undergo validation from validating admission webhooks
+		// without actually creating the Pipeline on the cluster
+		if _, err := tekton.TektonV1().Pipelines(namespace).Create(ctx, obj, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}); err != nil {
+			if apierrors.IsBadRequest(err) { // Pipeline rejected by validating webhook


What happens if err is not a BadRequest but a different kind of error (say, InternalError because the Webhook server is down) - it seems like we'd pass through an invalid pipeline in that case?

dibyom · 2023-07-07T17:10:01Z

We already make a number of calls to the k8s api server during pipelinerun and taskrun execution; I'd expect us to have already run into rate limiting issues if they exist.

Most of our calls are get/list/watch calls that are cached via informers instead of actually making it to the api server. Off the top of my head, calls that make it to the server are the update calls to update statuses/labels/annotations if something changes and create calls to pods/taskruns/resolution requests etc. Also, calls to create/update statefulSets if affinity assistant is on and to manage PVCs if a workspace is being used.

I think the client side burst limits should help ensure we do not overwhelm the API server. I do think client-go retries on some errors but only for GET calls.

There is a difference between this call and the others. The other calls have are more "async" - if the resolution request CRD is created when the resolver is temporarily down, the processing will continue once the resolver server is back up. However, if the webhook happens to be down when the call is made, the error is permanent unless we add some kind of retry/backoff on our side.

I'm not totally sure I understand what you are suggesting (maybe we just need to investigate if we have a reasonable rest config for our clients?) so LMK if I've addressed your concern.

I think we should investigate what happens if we try making a dry run request when the webhook is down - is that a permanent error requiring the user to retry their run or no. Also, worth investigating if we'd run into API server rate limits - given the client side throttling that we have that seems less likely we'll hit this

lbernick · 2023-07-11T19:07:04Z

@dibyom I tried this out by deleting the webhook immediately after creating the pipelinerun. It seems like if the webhook is down, the controller just won't be able to update the pipelinerun at all (which makes sense) and the pipelinerun gets requeued while stuck in its most recent state.

I'm not sure how to test the webhook going down at the point in time when the Pipeline is verified. My guess is that it would hit this block and reconcile would return a permanent error, but updateLabelsAndAnnotations would fail before returning a permanent error, so this would also be requeued. This guess is supported by the logs I saw in the controller:

{"severity":"warn","timestamp":"2023-07-11T18:44:30.480Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:308","message":"Failed to update PipelineRun labels/annotations{error 26 0  Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s\": no endpoints available for service \"tekton-pipelines-webhook\"}","commit":"94fc900-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"b180743a-692a-4c2f-ac44-b73429406127","knative.dev/key":"default/invalid-remote-pipeline-fcs8s"}

I think this is the same behavior we'd observe with the existing codebase if the webhook went down during pipelinerun reconciliation. The main difference would be the extra dry-run API call on each reconcile, as you pointed out, but the pipelinerun status would be the same.

I'm curious what you think the intended behavior should be? Ideally, we'd be able to distinguish between "webhook is down and not responding to any requests" (ideally in this case we'd fail the pipelinerun but in practice we probably cannot update the pipelinerun at all) vs "webhook is overloaded and some requests are timing out" (in which case we'd want to retry with backoff i.e. the current behavior). I'm not convinced it makes sense to handle this as part of this PR.

I also agree that we're unlikely to run into apiserver rate limits with our client side throttling.

Prior to this commit, remote pipelines were only validated by calling `pipelineSpec.Validate` in the PipelineRun reconciler. This omits some validation that is only done when validating Pipelines, rather than Pipeline specs, such as validation for propagated params and workspaces. In addition, if a cluster operator or vendor defines any validating admission webhooks for Pipelines, this validation would apply only to local Pipelines but not remote Pipelines. This commit issues a dry-run create request for remote Pipelines and fails the PipelineRun if the apiserver rejects the request. This allows us to do webhook-based validation of remote Pipelines without ever having to create them on the cluster, ensuring validation of remote Pipelines matches validation of local Pipelines. Similar validation will be added for remote Tasks in a separate commit.

lbernick · 2023-07-14T17:00:15Z

Synced offline with @dibyom; I was wrong earlier in that the pipelinerun reconciler will return a permanent error with this PR if the call to the webhook fails. I've updated the PR to better handle errors returned by the webhook.

I also realized the dry-run request will fail if an object exists with the same name, so I updated the dry-run request to use a UUID as the name.

tekton-robot · 2023-07-14T17:04:39Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.8%	91.8%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	89.1%	-3.8

tekton-robot · 2023-07-14T17:05:17Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/pipelinerun/pipelinerun.go	91.8%	91.8%	0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go	92.9%	89.1%	-3.8

tekton-robot · 2023-07-18T15:35:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dibyom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chitrangpatel · 2023-07-18T17:03:29Z

/lgtm

tekton-robot requested review from abayer and vdemeester June 28, 2023 18:52

lbernick mentioned this pull request Jun 28, 2023

Validate parameters and workspaces for referenced pipelines #6693

Closed

7 tasks

lbernick force-pushed the validate-remote-pipelines-2 branch from a5de06d to 7ee7ecf Compare June 28, 2023 18:58

lbernick force-pushed the validate-remote-pipelines-2 branch 2 times, most recently from 6f7ecb7 to e32e559 Compare June 28, 2023 19:26

lbernick marked this pull request as ready for review June 28, 2023 19:26

tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2023

tekton-robot requested review from afrittoli and dibyom June 28, 2023 19:26

tekton-robot assigned JeromeJu Jun 28, 2023

lbernick force-pushed the validate-remote-pipelines-2 branch from e32e559 to 3938825 Compare June 28, 2023 20:39

lbernick force-pushed the validate-remote-pipelines-2 branch from 3938825 to 52255b6 Compare June 29, 2023 12:04

chitrangpatel reviewed Jun 29, 2023

View reviewed changes

pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved

chitrangpatel reviewed Jun 29, 2023

View reviewed changes

pkg/reconciler/pipelinerun/pipelinerun_test.go Show resolved Hide resolved

tekton-robot assigned chitrangpatel Jul 5, 2023

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023

dibyom reviewed Jul 6, 2023

View reviewed changes

pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved

pkg/reconciler/pipelinerun/resources/pipelineref.go Outdated Show resolved Hide resolved

lbernick force-pushed the validate-remote-pipelines-2 branch from 52255b6 to 7c1a704 Compare July 6, 2023 17:39

tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 6, 2023

lbernick force-pushed the validate-remote-pipelines-2 branch from 7c1a704 to 7f0c14e Compare July 6, 2023 17:43

dibyom reviewed Jul 7, 2023

View reviewed changes

afrittoli added this to the Pipelines v0.50 (LTS) milestone Jul 11, 2023

lbernick force-pushed the validate-remote-pipelines-2 branch from 7f0c14e to b8fbeab Compare July 14, 2023 16:58

dibyom approved these changes Jul 18, 2023

View reviewed changes

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2023

tekton-robot merged commit b3cd760 into tektoncd:main Jul 18, 2023
2 checks passed

lbernick mentioned this pull request Jul 19, 2023

Add webhook validation for remote Tasks #6942

Merged

5 tasks

vdemeester mentioned this pull request Aug 2, 2023

After upgarding to v0.50.0, get PipelineValidationFailed when using cluster resolver in a PipelineRun #7015

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perform webhook validation for remote pipelines #6887

Perform webhook validation for remote pipelines #6887

lbernick commented Jun 28, 2023 •

edited

Loading

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

JeromeJu commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 29, 2023

tekton-robot commented Jun 29, 2023

chitrangpatel commented Jul 5, 2023

dibyom commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

lbernick commented Jul 6, 2023

lbernick commented Jul 6, 2023

dibyom Jul 7, 2023

dibyom commented Jul 7, 2023

lbernick commented Jul 11, 2023

lbernick commented Jul 14, 2023

tekton-robot commented Jul 14, 2023

tekton-robot commented Jul 14, 2023

tekton-robot commented Jul 18, 2023

chitrangpatel commented Jul 18, 2023

Perform webhook validation for remote pipelines #6887

Perform webhook validation for remote pipelines #6887

Conversation

lbernick commented Jun 28, 2023 • edited Loading

Submitter Checklist

Release Notes

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

JeromeJu commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 28, 2023

tekton-robot commented Jun 29, 2023

tekton-robot commented Jun 29, 2023

chitrangpatel commented Jul 5, 2023

dibyom commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

tekton-robot commented Jul 6, 2023

lbernick commented Jul 6, 2023

lbernick commented Jul 6, 2023

dibyom Jul 7, 2023

Choose a reason for hiding this comment

dibyom commented Jul 7, 2023

lbernick commented Jul 11, 2023

lbernick commented Jul 14, 2023

tekton-robot commented Jul 14, 2023

tekton-robot commented Jul 14, 2023

tekton-robot commented Jul 18, 2023

chitrangpatel commented Jul 18, 2023

lbernick commented Jun 28, 2023 •

edited

Loading