
[Feature] Enable automatic cleanup of finished compaction jobs and make TTL configurable #556

Open
Tracked by #708
unmarshall opened this issue Mar 16, 2023 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age)
Milestone

Comments

@unmarshall
Contributor

Feature (What you would like to be added):

Druid creates compaction jobs for every etcd cluster based on crossing a defined threshold of accumulated events since the last compaction run. These jobs run to completion but are not automatically removed. An operator today has to manually delete these jobs.

Since Kubernetes 1.21, spec.ttlSecondsAfterFinished can be set on a Job to configure how long after completion the Job resource is retained before being automatically removed.
Introduce a new flag that the consumer can pass to druid to configure this TTL. Assume a default value (> 0) for ttlSecondsAfterFinished and apply it to all compaction jobs across all etcd clusters whenever the consumer does not explicitly pass a value.
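For reference, on a plain Kubernetes Job this is a single field in the spec. The manifest below is only a sketch (name, image, and TTL value are illustrative, not druid defaults):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-compaction-job        # illustrative name
spec:
  ttlSecondsAfterFinished: 3600    # Job object is deleted 1h after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: compact
          image: example/etcd-compactor:latest   # illustrative image
```

The TTL controller deletes the Job (and, via cascading deletion, its pods) once the TTL elapses after the Job reaches Complete or Failed.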

Motivation (Why is this needed?):
To prevent additional work for an operator to manually remove completed jobs.
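The proposed flag handling could be sketched as below. The function name and the default value are hypothetical, not druid's actual API; the point is only the fallback semantics (use the consumer-supplied TTL if present, else a default > 0):

```go
package main

import "fmt"

// defaultTTLSecondsAfterFinished is a hypothetical default; the issue only
// requires that the default be > 0.
const defaultTTLSecondsAfterFinished int32 = 3600

// resolveJobTTL returns the value to set on spec.ttlSecondsAfterFinished:
// the consumer-supplied TTL if one was passed, otherwise the default.
// A nil pointer models "flag not set".
func resolveJobTTL(consumerTTL *int32) int32 {
	if consumerTTL != nil {
		return *consumerTTL
	}
	return defaultTTLSecondsAfterFinished
}

func main() {
	custom := int32(600)
	fmt.Println(resolveJobTTL(&custom)) // consumer-supplied value wins
	fmt.Println(resolveJobTTL(nil))     // falls back to the default
}
```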

@unmarshall unmarshall added the kind/enhancement Enhancement, improvement, extension label Mar 16, 2023
@seshachalam-yv
Contributor

/assign

@renormalize
Member

#711 introduced functionality that deletes compaction jobs once they enter the Completed or Failed state.

```go
// Delete job if the job succeeded
if job.Status.Succeeded > 0 {
	metricJobsCurrent.With(prometheus.Labels{druidmetrics.EtcdNamespace: etcd.Namespace}).Set(0)
	if job.Status.CompletionTime != nil {
		metricJobDurationSeconds.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededTrue, druidmetrics.EtcdNamespace: etcd.Namespace}).Observe(job.Status.CompletionTime.Time.Sub(job.Status.StartTime.Time).Seconds())
	}
	if err := r.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationForeground)); err != nil {
		logger.Error(err, "Couldn't delete the successful job", "namespace", etcd.Namespace, "name", etcd.GetCompactionJobName())
		return ctrl.Result{
			RequeueAfter: 10 * time.Second,
		}, fmt.Errorf("error while deleting successful compaction job: %v", err)
	}
	metricJobsTotal.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededTrue, druidmetrics.EtcdNamespace: etcd.Namespace}).Inc()
}

// Delete job and requeue if the job failed
if job.Status.Failed > 0 {
	metricJobsCurrent.With(prometheus.Labels{druidmetrics.EtcdNamespace: etcd.Namespace}).Set(0)
	if job.Status.StartTime != nil {
		metricJobDurationSeconds.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededFalse, druidmetrics.EtcdNamespace: etcd.Namespace}).Observe(time.Since(job.Status.StartTime.Time).Seconds())
	}
	err := r.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationForeground))
	if err != nil {
		return ctrl.Result{
			RequeueAfter: 10 * time.Second,
		}, fmt.Errorf("error while deleting failed compaction job: %v", err)
	}
	metricJobsTotal.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededFalse, druidmetrics.EtcdNamespace: etcd.Namespace}).Inc()
	return ctrl.Result{
		RequeueAfter: 10 * time.Second,
	}, nil
}
```

The controller reconciles whenever there is a job status change event, or when the snapshot lease gets updated. These events would make the controller:

  • Delete the old job which enters the Completed state.
  • Delete the old job which enters the Failed state, and then create a new job.

Even for a job which fails when etcd-druid is down, the next snapshot lease update would cause the controller to reconcile, and delete the old job which it now sees to be in the Failed state.

Halting work on this for now, until further brainstorming reveals edge cases which might not be covered in #711.

@renormalize renormalize removed their assignment Feb 26, 2024
@shreyas-s-rao
Contributor

We will need to revisit the compaction controller design and rethink garbage collection of completed compaction jobs.

@shreyas-s-rao shreyas-s-rao modified the milestones: v0.23.0, v0.24.0 Jun 5, 2024