
[Feature] Enable automatic cleanup of finished compaction jobs and make TTL configurable #556

Open
Tracked by #708
unmarshall opened this issue Mar 16, 2023 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age)
Milestone

Comments

@unmarshall
Contributor

Feature (What you would like to be added):

Druid creates compaction jobs for every etcd cluster based on crossing a defined threshold of accumulated events since the last compaction run. These jobs run to completion but are not automatically removed. An operator today has to manually delete these jobs.

Since Kubernetes 1.21, spec.ttlSecondsAfterFinished can be set on a Job to configure how long after completion the Job resource is retained before being automatically removed.
Introduce a new flag that the consumer can pass to druid to configure this TTL. Assume a default value (> 0) for ttlSecondsAfterFinished and apply it to all compaction jobs across all etcd clusters whenever the consumer does not explicitly pass a value.
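For reference, on a plain Kubernetes Job this is a single field in the spec. The manifest below is only a sketch (name, image, and TTL value are illustrative, not druid defaults):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etcd-compaction-job        # illustrative name
spec:
  ttlSecondsAfterFinished: 3600    # Job object is deleted 1h after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: compact
          image: example/etcd-compactor:latest   # illustrative image
```

The TTL controller deletes the Job (and, via cascading deletion, its pods) once the TTL elapses after the Job reaches Complete or Failed.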

Motivation (Why is this needed?):
To prevent additional work for an operator to manually remove completed jobs.
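The proposed flag handling could be sketched as below. The function name and the default value are hypothetical, not druid's actual API; the point is only the fallback semantics (use the consumer-supplied TTL if present, else a default > 0):

```go
package main

import "fmt"

// defaultTTLSecondsAfterFinished is a hypothetical default; the issue only
// requires that the default be > 0.
const defaultTTLSecondsAfterFinished int32 = 3600

// resolveJobTTL returns the value to set on spec.ttlSecondsAfterFinished:
// the consumer-supplied TTL if one was passed, otherwise the default.
// A nil pointer models "flag not set".
func resolveJobTTL(consumerTTL *int32) int32 {
	if consumerTTL != nil {
		return *consumerTTL
	}
	return defaultTTLSecondsAfterFinished
}

func main() {
	custom := int32(600)
	fmt.Println(resolveJobTTL(&custom)) // consumer-supplied value wins
	fmt.Println(resolveJobTTL(nil))     // falls back to the default
}
```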

@unmarshall unmarshall added the kind/enhancement Enhancement, improvement, extension label Mar 16, 2023
@seshachalam-yv
Contributor

/assign

@renormalize
Member

#711 introduced functionality that deletes compaction jobs once they enter the Completed or Failed state.

```go
// Delete job if the job succeeded
if job.Status.Succeeded > 0 {
	metricJobsCurrent.With(prometheus.Labels{druidmetrics.EtcdNamespace: etcd.Namespace}).Set(0)
	if job.Status.CompletionTime != nil {
		metricJobDurationSeconds.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededTrue, druidmetrics.EtcdNamespace: etcd.Namespace}).Observe(job.Status.CompletionTime.Time.Sub(job.Status.StartTime.Time).Seconds())
	}
	if err := r.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationForeground)); err != nil {
		logger.Error(err, "Couldn't delete the successful job", "namespace", etcd.Namespace, "name", etcd.GetCompactionJobName())
		return ctrl.Result{
			RequeueAfter: 10 * time.Second,
		}, fmt.Errorf("error while deleting successful compaction job: %v", err)
	}
	metricJobsTotal.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededTrue, druidmetrics.EtcdNamespace: etcd.Namespace}).Inc()
}

// Delete job and requeue if the job failed
if job.Status.Failed > 0 {
	metricJobsCurrent.With(prometheus.Labels{druidmetrics.EtcdNamespace: etcd.Namespace}).Set(0)
	if job.Status.StartTime != nil {
		metricJobDurationSeconds.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededFalse, druidmetrics.EtcdNamespace: etcd.Namespace}).Observe(time.Since(job.Status.StartTime.Time).Seconds())
	}
	err := r.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationForeground))
	if err != nil {
		return ctrl.Result{
			RequeueAfter: 10 * time.Second,
		}, fmt.Errorf("error while deleting failed compaction job: %v", err)
	}
	metricJobsTotal.With(prometheus.Labels{druidmetrics.LabelSucceeded: druidmetrics.ValueSucceededFalse, druidmetrics.EtcdNamespace: etcd.Namespace}).Inc()
	return ctrl.Result{
		RequeueAfter: 10 * time.Second,
	}, nil
}
```

The controller reconciles whenever there is a job status change event, or when the snapshot lease gets updated. These events would make the controller:

  • Delete the old job which enters the Completed state.
  • Delete the old job which enters the Failed state, and then create a new job.

Even for a job which fails when etcd-druid is down, the next snapshot lease update would cause the controller to reconcile, and delete the old job which it now sees to be in the Failed state.

Halting work on this for now, until further brainstorming reveals edge cases which might not be covered in #711.

@renormalize renormalize removed their assignment Feb 26, 2024
@shreyas-s-rao
Contributor

We will need to revisit the compaction controller design and rethink garbage collection of completed compaction jobs.

@shreyas-s-rao shreyas-s-rao modified the milestones: v0.23.0, v0.24.0 Jun 5, 2024