Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform webhook validation for remote pipelines #6887

Merged
merged 1 commit into from
Jul 18, 2023

Conversation

lbernick
Copy link
Member

@lbernick lbernick commented Jun 28, 2023

Prior to this commit, remote pipelines were only validated by calling pipelineSpec.Validate in the PipelineRun reconciler. This omits some validation that is only done when validating Pipelines, rather than Pipeline specs, such as validation for propagated params and workspaces. In addition, if a cluster operator or vendor defines any validating admission webhooks for Pipelines, this validation would apply only to local Pipelines but not remote Pipelines.

This commit issues a dry-run create request for remote Pipelines and fails the PipelineRun if the apiserver rejects the request. This allows us to do webhook-based validation of remote Pipelines without ever having to create them on the cluster, ensuring validation of remote Pipelines matches validation of local Pipelines.

Similar validation will be added for remote Tasks in a separate commit.

/kind bug
closes #6670

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • n/a Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • n/a Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

bug fix: Remote Pipelines do not support propagated parameters and workspaces

@tekton-robot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2023
@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch from a5de06d to 7ee7ecf Compare June 28, 2023 18:58
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.2% -0.3

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.2% -0.3

@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch 2 times, most recently from 6f7ecb7 to e32e559 Compare June 28, 2023 19:26
@lbernick lbernick marked this pull request as ready for review June 28, 2023 19:26
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2023
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.2% -0.3
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@JeromeJu
Copy link
Member

/assign

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch from 3938825 to 52255b6 Compare June 29, 2023 12:04
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@chitrangpatel
Copy link
Contributor

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@dibyom
Copy link
Member

dibyom commented Jul 6, 2023

Overall, I like this approach. One concern is that this is adding the webhook to the runtime path of executing the pipeline. So, if the webhook is down (or is being rate limited) we'd have to handle retries using a backoff.

The Kuberentes API server usually has fairly low rate limits (for GKE its 3k per min - does the client already retry if the rate limit is exhausted?

@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch from 52255b6 to 7c1a704 Compare July 6, 2023 17:39
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 6, 2023
@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch from 7c1a704 to 7f0c14e Compare July 6, 2023 17:43
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.5% 91.5% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 93.5% 0.6

@lbernick
Copy link
Member Author

lbernick commented Jul 6, 2023

Overall, I like this approach. One concern is that this is adding the webhook to the runtime path of executing the pipeline. So, if the webhook is down (or is being rate limited) we'd have to handle retries using a backoff.
The Kuberentes API server usually has fairly low rate limits (for GKE its 3k per min - does the client already retry if the rate limit is exhausted?

We already make a number of calls to the k8s api server during pipelinerun and taskrun execution; I'd expect us to have already run into rate limiting issues if they exist.

It looks like we set a QPS and burst for our k8s clients here based on config documented here and default to 20 qps. Client-go does seem to have some backoff handling (based on poking around the rest package) but it's not well documented. I don't know of a good way to handle rate limiting for our webhooks separately from other api server requests, but since platforms seem to have global limits for all api server requests I doubt this would be desirable.

I'm not totally sure I understand what you are suggesting (maybe we just need to investigate if we have a reasonable rest config for our clients?) so LMK if I've addressed your concern.

@lbernick
Copy link
Member Author

lbernick commented Jul 6, 2023

/retest
Known flake: #6772

// Issue a dry-run request to create the remote Pipeline, so that it can undergo validation from validating admission webhooks
// without actually creating the Pipeline on the cluster
if _, err := tekton.TektonV1().Pipelines(namespace).Create(ctx, obj, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}); err != nil {
if apierrors.IsBadRequest(err) { // Pipeline rejected by validating webhook
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if err is not a BadRequest but a different kind of error (say, InternalError because the Webhook server is down) - it seems like we'd pass through an invalid pipeline in that case?

@dibyom
Copy link
Member

dibyom commented Jul 7, 2023

We already make a number of calls to the k8s api server during pipelinerun and taskrun execution; I'd expect us to have already run into rate limiting issues if they exist.

Most of our calls are get/list/watch calls that are cached via informers instead of actually making it to the api server. Off the top of my head, calls that make it to the server are the update calls to update statuses/labels/annotations if something changes and create calls to pods/taskruns/resolution requests etc. Also, calls to create/update statefulSets if affinity assistant is on and to manage PVCs if a workspace is being used.

I think the client side burst limits should help ensure we do not overwhelm the API server. I do think client-go retries on some errors but only for GET calls.

There is a difference between this call and the others. The other calls have are more "async" - if the resolution request CRD is created when the resolver is temporarily down, the processing will continue once the resolver server is back up. However, if the webhook happens to be down when the call is made, the error is permanent unless we add some kind of retry/backoff on our side.

I'm not totally sure I understand what you are suggesting (maybe we just need to investigate if we have a reasonable rest config for our clients?) so LMK if I've addressed your concern.

I think we should investigate what happens if we try making a dry run request when the webhook is down - is that a permanent error requiring the user to retry their run or no. Also, worth investigating if we'd run into API server rate limits - given the client side throttling that we have that seems less likely we'll hit this

@afrittoli afrittoli added this to the Pipelines v0.50 (LTS) milestone Jul 11, 2023
@lbernick
Copy link
Member Author

@dibyom I tried this out by deleting the webhook immediately after creating the pipelinerun. It seems like if the webhook is down, the controller just won't be able to update the pipelinerun at all (which makes sense) and the pipelinerun gets requeued while stuck in its most recent state.

I'm not sure how to test the webhook going down at the point in time when the Pipeline is verified. My guess is that it would hit this block and reconcile would return a permanent error, but updateLabelsAndAnnotations would fail before returning a permanent error, so this would also be requeued. This guess is supported by the logs I saw in the controller:

{"severity":"warn","timestamp":"2023-07-11T18:44:30.480Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:308","message":"Failed to update PipelineRun labels/annotations{error 26 0  Internal error occurred: failed calling webhook \"webhook.pipeline.tekton.dev\": failed to call webhook: Post \"https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s\": no endpoints available for service \"tekton-pipelines-webhook\"}","commit":"94fc900-dirty","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"b180743a-692a-4c2f-ac44-b73429406127","knative.dev/key":"default/invalid-remote-pipeline-fcs8s"}

I think this is the same behavior we'd observe with the existing codebase if the webhook went down during pipelinerun reconciliation. The main difference would be the extra dry-run API call on each reconcile, as you pointed out, but the pipelinerun status would be the same.

I'm curious what you think the intended behavior should be? Ideally, we'd be able to distinguish between "webhook is down and not responding to any requests" (ideally in this case we'd fail the pipelinerun but in practice we probably cannot update the pipelinerun at all) vs "webhook is overloaded and some requests are timing out" (in which case we'd want to retry with backoff i.e. the current behavior). I'm not convinced it makes sense to handle this as part of this PR.

I also agree that we're unlikely to run into apiserver rate limits with our client side throttling.

Prior to this commit, remote pipelines were only validated
by calling `pipelineSpec.Validate` in the PipelineRun reconciler.
This omits some validation that is only done when validating Pipelines,
rather than Pipeline specs, such as validation for propagated params
and workspaces. In addition, if a cluster operator or vendor defines
any validating admission webhooks for Pipelines, this validation would
apply only to local Pipelines but not remote Pipelines.

This commit issues a dry-run create request for remote Pipelines
and fails the PipelineRun if the apiserver rejects the request.
This allows us to do webhook-based validation of remote Pipelines
without ever having to create them on the cluster, ensuring validation
of remote Pipelines matches validation of local Pipelines.

Similar validation will be added for remote Tasks in a separate commit.
@lbernick lbernick force-pushed the validate-remote-pipelines-2 branch from 7f0c14e to b8fbeab Compare July 14, 2023 16:58
@lbernick
Copy link
Member Author

Synced offline with @dibyom; I was wrong earlier in that the pipelinerun reconciler will return a permanent error with this PR if the call to the webhook fails. I've updated the PR to better handle errors returned by the webhook.

I also realized the dry-run request will fail if an object exists with the same name, so I updated the dry-run request to use a UUID as the name.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.8% 91.8% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 89.1% -3.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 91.8% 91.8% 0.0
pkg/reconciler/pipelinerun/resources/pipelineref.go 92.9% 89.1% -3.8

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2023
@chitrangpatel
Copy link
Contributor

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2023
@tekton-robot tekton-robot merged commit b3cd760 into tektoncd:main Jul 18, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accidental support of propagated params for remote (not local) referenced pipelines
7 participants