Job TTLs not working #1533

arashd · 2022-02-09T19:23:29Z

I'm running into an issue where the "Ttl Seconds After Finished" on my TFJobs isn't being respected and I suspect it's because the reconcile loop where CleanupJob runs isn't run frequently enough for all jobs.

As an example: I start a TFJob with a TTL of 1 minute. The reconcile loop runs any time there's a state change (including when the job finishes successfully). The job isn't deleted at that point because the TTL hasn't elapsed yet. In this case, the reconcile loop then doesn't run again until 7 hours after the job finished. At that point the job does get cleaned up, along with other jobs that are past their TTL.

My question is: Is it expected behaviour that the reconcile loop doesn't run at shorter intervals for finished jobs? Is there a way to find out where that default 7-hour period is set? If so, would it make sense to change it to a smaller default to accommodate the TTL feature?

Here are the training-operator logs for the job from start to finish:

time="2022-02-09T01:42:43Z" level=info msg="TFJob test-ttl-6 is created."
time="2022-02-09T01:42:43Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:43Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:43Z" level=info msg="Need to create new pod: chief-0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created pod test-ttl-6-chief-0" job=.test-ttl-6 pod=.test-ttl-6-chief-0 uid=
time="2022-02-09T01:42:44Z" level=info msg="need to create new service: chief-0" job=adelijani.test-ttl-6 replica-type=chief uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.424Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreatePod", "message": "Created pod: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created service test-ttl-6-chief-0"
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.436Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreateService", "message": "Created service: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (8.171651ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (5.933567ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=warning msg="Reconcile Tensorflow Job error Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"
2022-02-09T01:42:44.450Z ERROR controller-runtime.manager.controller.tfjob-controller Reconciler error {"name": "test-ttl-6", "namespace": "adelijani", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39


zw0610 · 2022-02-10T02:43:00Z

> Is there a way to find out where that default 7-hour period is set?

I believe so. When the manager is constructed, an option named SyncPeriod should do the job. You will need to modify the main.go file to set that option.

However, I believe we could find another way to support this feature again now that the workqueue has been disabled in reconcile mode. @Jeffwan

arashd · 2022-02-10T18:35:55Z

Thanks for the helpful guidance. Looking forward to hearing if there's a way to change this without touching code.

Garrybest mentioned this issue on Jun 14, 2022, in pull request #1614, "fix: requeue when expire time is not up yet" (merged).

google-oss-prow bot closed this as completed in #1614 Jun 16, 2022
