This repository has been archived by the owner on Sep 5, 2019. It is now read-only.

Add possibility for a client to cancel a build… #510

Merged
merged 1 commit into from
Jan 22, 2019

Conversation

vdemeester
Member

… by updating the status of a build.

It will mark the build as failed and clean any pods related to it.

Related to tektoncd/pipeline#272
I tried that on knative/build first (as it's simpler 🙃)

cc @abayer @imjasonh

Signed-off-by: Vincent Demeester [email protected]

Member

@imjasonh imjasonh left a comment


This looks great, thanks for adding this. Two questions:

1. Is it common practice for a client to be able to update a k8s resource's status this way? I'm thinking of something like Job, which runs to completion, but which you basically just delete if you want to cancel it. In this case, we want to stop the build's execution without also deleting any record of that resource existing.

2. Once this is done we should also make the same change for TaskRun, and then PipelineRun, which should cancel (but not delete) its created TaskRuns. We might even want to go straight to TaskRun and drop this PR. I'll leave that decision up to you.

@vdemeester
Member Author

@imjasonh

1. Is it common practice for a client to be able to update a k8s resource's status this way? I'm thinking of something like Job, which runs to completion, but which you basically just delete if you want to cancel it. In this case, we want to stop the build's execution without also deleting any record of that resource existing.

I really am not sure… I'm trying to look around for that 😅

2. Once this is done we should also make the same change for TaskRun, and then PipelineRun, which should cancel (but not delete) its created TaskRuns. We might even want to go straight to TaskRun and drop this PR. I'll leave that decision up to you.

Yes, I did it here as it was simpler/quicker, but I intend to follow up on build-pipeline. I'm also fine doing it directly on build-pipeline; I just thought that if we want it in build-pipeline, we may want it here too 👼

@vdemeester vdemeester force-pushed the cancel-build branch 2 times, most recently from 641b3c4 to d8a3947 on January 7, 2019 15:32
@shashwathi
Contributor

@vdemeester Great PR

If the build is completed, then what is the point of updating the status to Cancelled?
The build is completed, so all associated resources are deleted.
If the build is running, then the cluster name would be set in the build status.
So I don't understand why there is an API call to get the pod resource; why not just call the delete API directly?

@vdemeester
Member Author

@vdemeester Great PR

If the build is completed, then what is the point of updating the status to Cancelled?
The build is completed, so all associated resources are deleted.

So, if you want to cancel the build (currently via the spec), then you want to mark it as Cancelled as soon as possible and stop the execution, thus deleting the pod right away 👼, the same way we do on build timeout 🙃 (which makes me think there could be a small refactor to share code there).

If the build is running, then the cluster name would be set in the build status.
So I don't understand why there is an API call to get the pod resource; why not just call the delete API directly?

Indeed, there is no need.

@abayer

abayer commented Jan 7, 2019

At Kubecon, @bobcatfish told me (in the context of tektoncd/pipeline#355) that I shouldn't be making changes to the status of one CRD while reconciling another one, so I jumped through some hoops to handle having the PipelineRun's timeout apply to its TaskRuns. Not sure if that is a general thing where nothing should be modifying a CRD's status but its reconciler or if it's just that one CRD's reconciler shouldn't mess with another CRD's status.

@shashwathi
Contributor

shashwathi commented Jan 7, 2019

So, if you want to cancel the build (currently using the spec), then you want to mark it as Cancelled as soon as possible and stop the execution, thus deleting the pod real quick 👼

I agree with this completely. I do not think you understood my question, so let me rephrase:
If the build is completed (the pods are not present and only the build object exists), then what is the point of marking the build as "Cancelled"?

I am thinking of this like a cycle.
build starts -> pod creation -> pod finished and deleted -> build completed.

If the user marks the build as "Cancelled" after completion, I do not see how that has any effect on the underlying resource (the pod). I would expect the status not to change at all in this case. If the build has finished, then the status should not be changed (IMO).

same way we do on build timeout 🙃. (making me think there could be a small refactor to share code there).

The controller checks whether the build is still running and then cancels it if the timeout has passed. It doesn't cancel a build that has already completed.

@vdemeester
Member Author

I agree with this completely. I do not think you understood my question, so let me rephrase:
If the build is completed (the pods are not present and only the build object exists), then what is the point of marking the build as "Cancelled"?

I am thinking of this like a cycle.
build starts -> pod creation -> pod finished and deleted -> build completed.

If the user marks the build as "Cancelled" after completion, I do not see how that has any effect on the underlying resource (the pod). I would expect the status not to change at all in this case. If the build has finished, then the status should not be changed (IMO).

Indeed, there is no point, and if a user marks the build as Cancelled after completion, nothing should happen. That should already be the case (https://github.com/knative/build/pull/510/files#diff-f2d9d6d36e3340295720d4f3e028cb14R152): if the build is completed (successfully or not), it will never go into the "cancel" block 👼

The controller checks whether the build is still running and then cancels it if the timeout has passed. It doesn't cancel a build that has already completed.

Yes, this is what happens.

Contributor

@shashwathi shashwathi left a comment


@vdemeester my apologies for not understanding the flow correctly :) Thanks for pinging and explaining it. I appreciate it
I have reviewed again. 👍 Please take a look at my comments


// Wait for a little while
// FIXME(vdemeester) I would prefer something less flaky-prone
time.Sleep(20 * time.Second)
Contributor


Consider adding a polling function that checks that the build status is "Running", for 10s (or more?). That would be better than a fixed wait, IMO. WDYT @vdemeester?

Member Author


@shashwathi yeah, that's why the FIXME 👼 not sure why I didn't use the polling functions here 😅 will update 😉

}

if _, err := clients.buildClient.watchBuild(buildName); err == nil {
t.Fatalf("watchBuild did not return expected `cancelled` error")
Contributor


The watchBuild function returns the build object, so you could consider verifying here that the build status is updated with the expected reason ("BuildCancelled").

t.Errorf("error syncing build: %v", err)
}

// Check that the build has the expected timeout status.
Contributor


typo: I think you meant cancelled status here

logger.Warnf("build %q has no pod running yet", build.Name)
return nil
}
p, err := c.kubeclientset.CoreV1().Pods(build.Namespace).Get(build.Status.Cluster.PodName, metav1.GetOptions{})
Contributor


I think I already mentioned this in a previous comment (?) about reusing build.Status.Cluster.PodName for the Delete API directly. Sorry to repeat it.

Member Author


oh right, didn't update that 😓

@vdemeester vdemeester force-pushed the cancel-build branch 3 times, most recently from a2c94b6 to 5d8630b on January 8, 2019 19:02
Contributor

@shashwathi shashwathi left a comment


/lgtm

Thank you for addressing my comments.

@shashwathi
Contributor

/hold

@shashwathi
Contributor

I noticed @imjasonh had some comments earlier in the thread, so I am adding a hold for him to add his review.

@imjasonh
Member

/lgtm
/approve
/hold cancel

Oh my goodness I didn't realize this was blocking on me, so so sorry to keep this held up so long. Thanks for this change! ❤️

@vdemeester
Member Author

ah.. I may need to rebase and fix 😅

Contributor

@shashwathi shashwathi left a comment


/approve
/lgtm

@knative-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH, shashwathi, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ImJasonH,shashwathi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… by updating the status of a build.

It will mark the build as failed and clean any pods related to it.

Signed-off-by: Vincent Demeester <[email protected]>
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-build-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/build/build.go | 76.8% | 75.9% | -0.9

@vdemeester
Member Author

Hum… the coverage build is a bit of a pain 😅: since the threshold is higher than the current coverage on that file, any change triggers a failure… 😓

@knative-prow-robot

knative-prow-robot commented Jan 22, 2019

@vdemeester: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-knative-build-go-coverage | 0e37001 | link | /test pull-knative-build-go-coverage

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vdemeester
Member Author

/test pull-knative-build-integration-tests

@bobcatfish
Contributor

Hum… the coverage build is a bit of a pain 😅: since the threshold is higher than the current coverage on that file, any change triggers a failure… 😓

I agree @vdemeester - feel free to open an issue and we could work on making this better - tbh I think we should drop the threshold for the reconcilers themselves, and/or add end-to-end test coverage measurement to the equation

/lgtm
/meow space

@knative-prow-robot

@bobcatfish: [cat image]

In response to this:

Hum… the coverage build is a bit of a pain 😅: since the threshold is higher than the current coverage on that file, any change triggers a failure… 😓

I agree @vdemeester - feel free to open an issue and we could work on making this better - tbh I think we should drop the threshold for the reconcilers themselves, and/or add end-to-end test coverage measurement to the equation

/lgtm
/meow space

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot merged commit addfaf4 into knative:master Jan 22, 2019
@vdemeester vdemeester deleted the cancel-build branch January 22, 2019 19:13
vdemeester added a commit to vdemeester/knative-build that referenced this pull request Apr 3, 2019
… by updating the status of a build.

It will mark the build as failed and clean any pods related to it.

Signed-off-by: Vincent Demeester <[email protected]>