Perform webhook validation for remote pipelines #6887
Conversation
Skipping CI for Draft Pull Request.
Force-pushed a5de06d to 7ee7ecf
The following is the coverage report on the affected files.
Force-pushed 6f7ecb7 to e32e559
/assign
Force-pushed e32e559 to 3938825
Force-pushed 3938825 to 52255b6
/lgtm
Overall, I like this approach. One concern is that this is adding the webhook to the runtime path of executing the pipeline. So, if the webhook is down (or is being rate limited) we'd have to handle retries using a backoff. The Kubernetes API server usually has fairly low rate limits (for GKE it's 3k per min). Does the client already retry if the rate limit is exhausted?
Force-pushed 52255b6 to 7c1a704
Force-pushed 7c1a704 to 7f0c14e
We already make a number of calls to the k8s api server during pipelinerun and taskrun execution; I'd expect us to have already run into rate limiting issues if they exist. It looks like we set a QPS and burst for our k8s clients here based on config documented here, defaulting to 20 QPS. Client-go does seem to have some backoff handling (based on poking around the rest package) but it's not well documented. I don't know of a good way to handle rate limiting for our webhooks separately from other api server requests, and since platforms seem to have global limits for all api server requests, I doubt separate handling would be desirable anyway. I'm not totally sure I understand what you are suggesting (maybe we just need to investigate whether we have a reasonable rest config for our clients?), so LMK if I've addressed your concern.
/retest
```go
// Issue a dry-run request to create the remote Pipeline, so that it can undergo validation from validating admission webhooks
// without actually creating the Pipeline on the cluster
if _, err := tekton.TektonV1().Pipelines(namespace).Create(ctx, obj, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}); err != nil {
	if apierrors.IsBadRequest(err) { // Pipeline rejected by validating webhook
```
What happens if err is not a BadRequest but a different kind of error (say, InternalError because the Webhook server is down) - it seems like we'd pass through an invalid pipeline in that case?
Most of our calls are get/list/watch calls that are cached via informers instead of actually making it to the api server. Off the top of my head, calls that make it to the server are the update calls to update statuses/labels/annotations if something changes, and create calls to pods/taskruns/resolution requests etc. Also, calls to create/update statefulSets if the affinity assistant is on, and to manage PVCs if a workspace is being used. I think the client-side burst limits should help ensure we do not overwhelm the API server. I do think client-go retries on some errors, but only for GET calls.

There is a difference between this call and the others, though. The other calls are more "async": if the resolution request CRD is created while the resolver is temporarily down, the processing will continue once the resolver server is back up. However, if the webhook happens to be down when this call is made, the error is permanent unless we add some kind of retry/backoff on our side.
I think we should investigate what happens if we try making a dry-run request when the webhook is down: is that a permanent error requiring the user to retry their run, or not? Also, it's worth investigating whether we'd run into API server rate limits; given the client-side throttling that we have, it seems less likely we'll hit this.
@dibyom I tried this out by deleting the webhook immediately after creating the pipelinerun. It seems like if the webhook is down, the controller just won't be able to update the pipelinerun at all (which makes sense) and the pipelinerun gets requeued while stuck in its most recent state. I'm not sure how to test the webhook going down at the point in time when the Pipeline is verified. My guess is that it would hit this block and
I think this is the same behavior we'd observe with the existing codebase if the webhook went down during pipelinerun reconciliation. The main difference would be the extra dry-run API call on each reconcile, as you pointed out, but the pipelinerun status would be the same. I'm curious what you think the intended behavior should be? Ideally, we'd be able to distinguish between "webhook is down and not responding to any requests" (ideally in this case we'd fail the pipelinerun, but in practice we probably cannot update the pipelinerun at all) vs "webhook is overloaded and some requests are timing out" (in which case we'd want to retry with backoff, i.e. the current behavior). I'm not convinced it makes sense to handle this as part of this PR. I also agree that we're unlikely to run into apiserver rate limits with our client side throttling.
Prior to this commit, remote pipelines were only validated by calling `pipelineSpec.Validate` in the PipelineRun reconciler. This omits some validation that is only done when validating Pipelines, rather than Pipeline specs, such as validation for propagated params and workspaces. In addition, if a cluster operator or vendor defines any validating admission webhooks for Pipelines, this validation would apply only to local Pipelines but not remote Pipelines.

This commit issues a dry-run create request for remote Pipelines and fails the PipelineRun if the apiserver rejects the request. This allows us to do webhook-based validation of remote Pipelines without ever having to create them on the cluster, ensuring validation of remote Pipelines matches validation of local Pipelines.

Similar validation will be added for remote Tasks in a separate commit.
Force-pushed 7f0c14e to b8fbeab
Synced offline with @dibyom; I was wrong earlier in that the pipelinerun reconciler will return a permanent error with this PR if the call to the webhook fails. I've updated the PR to better handle errors returned by the webhook. I also realized the dry-run request will fail if an object exists with the same name, so I updated the dry-run request to use a UUID as the name.
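The UUID-as-name change avoids the dry-run create colliding with an existing Pipeline of the same name. A stdlib-only sketch of generating a version-4 UUID for the throwaway name follows; the actual PR may use a UUID library, and the helper name here is illustrative:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// dryRunName returns a random RFC 4122 version-4 style UUID, usable as
// a unique throwaway object name for the dry-run create request. Hex
// digits and hyphens are also valid characters in a Kubernetes name.
func dryRunName() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	name, err := dryRunName()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(name)) // 36: 32 hex digits plus 4 hyphens
}
```

Because the object is never persisted (the request is dry-run), the random name only has to avoid matching an existing Pipeline, which a fresh UUID does with overwhelming probability.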
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dibyom
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
/kind bug
closes #6670
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
`/kind <type>`. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release Notes