Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid picking old helm-delete job #232

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 37 additions & 3 deletions pkg/controllers/chart/chart.go
Original file line number Diff line number Diff line change
Expand Up @@ -226,11 +226,31 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
return nil, nil
}

if chart.DeletionTimestamp == nil {
return nil, nil
}

expectedJob, objs, err := c.getJobAndRelatedResources(chart)
if err != nil {
return nil, err
}

// remove old helm-delete job if it was left
job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name)
Comment on lines +238 to +239
Copy link
Member

@brandond brandond Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to do this here, the job patcher already deletes and re-creates the job as necessary, as part of the Apply logic:

  • func (c *Controller) jobPatcher(namespace, name string, pt types.PatchType, data []byte) (runtime.Object, error) {
    err := c.jobs.Delete(namespace, name, &metav1.DeleteOptions{PropagationPolicy: &deletePolicy})
    if err == nil || apierrors.IsNotFound(err) {
    return nil, fmt.Errorf("create or replace job")
    }
    return nil, err
    }
  • c.apply = apply.
    WithCacheTypes(helms, confs, jobs, crbs, sas, cm, s).
    WithStrictCaching().
    WithPatcher(jobs.GroupVersionKind(), c.jobPatcher)
  • // if the job already exists but is tied to an install or upgrade, there will be a need to patch so
    // the apply will execute the jobPatcher, which will delete the install/upgrade job and recreate a uninstall job
    err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{

Is there some other bit of logic that the patcher needs to capture?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the bug is due to WithOwner(chart). in

First round of create & delete a HelmChart X

HelmChart X with ID A 
  helm-delete job 1
  helm-delete job 2 (and possiblely left orphaned)

Second round of create & delete a HelmChart X

Chart X with ID B
  helm-delete job 2 will not be replaced due to `WithOwner`

But

job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name)
will be hit and also
c.recorder.Eventf(chart, corev1.EventTypeNormal, "RemoveJob", "Uninstalled HelmChart using Job %s/%s, removing resources", job.Namespace, job.Name)

That's the gap in the logic.

Copy link
Author

@w13915984028 w13915984028 Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we manually create a same namespaced & named helm-delete job before deleting a HelmChart, I guess it will also shortcut the HelmChart's real deleting.

Copy link
Member

@brandond brandond Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the WithOwner stuff uses the owning object's name, namespace, and GVK to track resources across namespaces. As long as the owning HelmChart object shares those same attributes across delete/create cycles it should track it properly. If you run the controller with --debug you should get debug logs from the DesiredSet stuff showing you what it's trying to do - does that shed any light on the situation?

Copy link
Author

@w13915984028 w13915984028 Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandond

I add some debug #177 (comment) on Harvester environment, the logs /events can clearly show that, the generic.ConfigureApplyForObject is not able to correctly identify the old left helm-delete job.

From the debug, it looks essential to remove the old job.

Btw the created helm-delete job, has no Owner field; the apply has a interl owner field.

harv41:/home/rancher # kk get jobs -n cattle-logging-system helm-delete-rancher-logging -oyaml
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
    objectset.rio.cattle.io/applied: ...
    objectset.rio.cattle.io/id: helm-chart-registration
    objectset.rio.cattle.io/owner-gvk: helm.cattle.io/v1, Kind=HelmChart
    objectset.rio.cattle.io/owner-name: rancher-logging
    objectset.rio.cattle.io/owner-namespace: cattle-logging-system
  creationTimestamp: "2024-04-19T08:32:28Z"
  finalizers:
  - wrangler.cattle.io/promote-node-controller
  generation: 1
  labels:
    helmcharts.helm.cattle.io/chart: rancher-logging
    objectset.rio.cattle.io/hash: ea9e4622f05ef91896488127ad4be76f709ee325
  name: helm-delete-rancher-logging
  namespace: cattle-logging-system
  resourceVersion: "739101"
  uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
spec:
  backoffLimit: 1000
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
  suspend: false
  template:
    metadata:
      annotations:
        helmcharts.helm.cattle.io/configHash: SHA256=0832CE8EBAB899509A3AB82C345F3DF72C4D3300D5CDB2E0C47BB943F9630B34
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
        batch.kubernetes.io/job-name: helm-delete-rancher-logging
        controller-uid: bcc55913-0ad5-48e9-8dcf-8b380f037ee8
        helmcharts.helm.cattle.io/chart: rancher-logging
        job-name: helm-delete-rancher-logging
    spec:
...  

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The orphaned job, was left per this return

	// once we have run the above logic, we can now check if the job is complete
	job, err = c.jobCache.Get(chart.Namespace, expectedJob.Name)
	if apierrors.IsNotFound(err) {
		// the above apply should have created it, something is wrong.
		// if you are here, there must be a bug in the code.
	@@ -269,7 +289,7 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
		chartCopy.Status.JobName = job.Name
// the `chart` may be on on apiserver side
		newChart, err := c.helms.Update(chartCopy)
		if err != nil {
			return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s, %w", chartCopy.Status.JobName, err)
		}

And the normal running path has this call

	// note: an empty apply removes all resources owned by this chart
	err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
		AllowClusterScoped: true,
	}).
		WithOwner(chart).
		WithSetID("helm-chart-registration").
		ApplyObjects()

I will test if add this call to the above return works.

if err == nil && job.CreationTimestamp.Before(chart.DeletionTimestamp) {
err = c.jobs.Delete(chart.Namespace, expectedJob.Name, &metav1.DeleteOptions{PropagationPolicy: &deletePolicy})
if err != nil {
if !apierrors.IsNotFound(err) {
return nil, fmt.Errorf("fail to delete old helm-delete job %w", err)
}
// if IsNotFound, continue
} else {
// wait old job to be removed
c.helms.EnqueueAfter(chart.Namespace, chart.Name, 1*time.Second)
return nil, nil
}
}

// note: on the logic of running an apply here...
// if the uninstall job does not exist, it will create it
// if the job already exists and it is uninstalling, nothing will change since there's no need to patch
Expand All @@ -251,7 +271,7 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
time.Sleep(3 * time.Second)

// once we have run the above logic, we can now check if the job is complete
job, err := c.jobCache.Get(chart.Namespace, expectedJob.Name)
job, err = c.jobCache.Get(chart.Namespace, expectedJob.Name)
if apierrors.IsNotFound(err) {
// the above apply should have created it, something is wrong.
// if you are here, there must be a bug in the code.
Expand All @@ -269,7 +289,21 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
chartCopy.Status.JobName = job.Name
newChart, err := c.helms.Update(chartCopy)
if err != nil {
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s", chartCopy.Status.JobName)
// if chart is gone, clean resources
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") {
Copy link
Member

@brandond brandond Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the error you shared: StorageError: invalid object, Code: 4, Key: /registry/helm.cattle.io/helmcharts/cattle-logging-system/rancher-logging, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 78e8d91d-430e-4c7d-a3f9-156c461eedab, UID in object meta: , requeuing"

I think you can use https://pkg.go.dev/k8s.io/apiserver/pkg/storage#IsInvalidObj

Suggested change
if apierrors.IsNotFound(err) || strings.Contains(err.Error(), "StorageError") {
if apierrors.IsNotFound(err) || storage.IsInvalidObj(err) {

It is odd that there's not an apierror for this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, it is not like normal k8s package for client usage.

apierrors "k8s.io/apimachinery/pkg/api/errors"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything that prevents you from importing "k8s.io/apiserver/pkg/storage"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait my validation, it may not work as expected when using this package.

The log shows it does not go to the expected if, and the job is still left.

image

harv41:/home/rancher # kk get jobs -A
NAMESPACE               NAME                               COMPLETIONS   DURATION   AGE
cattle-logging-system   helm-delete-rancher-logging        1/1           4s         4m4s

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage.IsInvalidObj is not working as expected, revert first.

// note: an empty apply removes all resources owned by this chart
err = generic.ConfigureApplyForObject(c.apply, chart, &generic.GeneratingHandlerOptions{
AllowClusterScoped: true,
}).
WithOwner(chart).
WithSetID("helm-chart-registration").
ApplyObjects()
if err != nil {
return nil, fmt.Errorf("chart is gone, but unable to remove resources tied to HelmChart %s/%s, %w", chart.Namespace, chart.Name, err)
}
return chart, nil
}
return chart, fmt.Errorf("unable to update status of helm chart to add uninstall job name %s, %w", chartCopy.Status.JobName, err)
}
return newChart, fmt.Errorf("waiting for delete of helm chart for %s by %s", key, job.Name)
}
Expand All @@ -285,7 +319,7 @@ func (c *Controller) OnRemove(key string, chart *v1.HelmChart) (*v1.HelmChart, e
WithSetID("helm-chart-registration").
ApplyObjects()
if err != nil {
return nil, fmt.Errorf("unable to remove resources tied to HelmChart %s/%s: %s", chart.Namespace, chart.Name, err)
return nil, fmt.Errorf("unable to remove resources tied to HelmChart %s/%s: %w", chart.Namespace, chart.Name, err)
}

return chart, nil
Expand Down