I'm running into an issue where the "Ttl Seconds After Finished" on my TFJobs isn't being respected, and I suspect it's because the reconcile loop, where CleanupJob runs, doesn't run frequently enough for all jobs.
As an example: I start a TFJob with a TTL of 1 minute. The reconcile loop runs any time there's a state change (including when the job finishes successfully). The job isn't deleted at that point because the TTL hasn't passed yet. The reconcile loop then doesn't run again until, in this case, 7 hours after the job finished. At that point the job does get cleaned up, along with other jobs that are past their TTL.
My question is: is it expected behaviour that the reconcile loop doesn't run at shorter intervals for finished jobs? Is there a way to find out where that default 7-hour period is set? If so, would it make sense to change it to a smaller default to accommodate the TTL feature?
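The timeline described above can be replayed with a small sketch. It is not the operator's code: `firstCleanup` and `cleanupLag` are hypothetical helpers, the completion time is made up, and the only assumption taken from the report is that cleanup happens inside the reconcile loop, which here runs once on completion and once 7 hours later.

```go
package main

import (
	"fmt"
	"time"
)

// firstCleanup returns the first reconcile time at which a finished job's TTL
// has already expired, i.e. the earliest moment a TTL check that lives inside
// the reconcile loop could delete the job. ok is false if none is late enough.
func firstCleanup(finishedAt time.Time, ttl time.Duration, reconciles []time.Time) (at time.Time, ok bool) {
	expiry := finishedAt.Add(ttl)
	for _, r := range reconciles {
		if !r.Before(expiry) {
			return r, true
		}
	}
	return time.Time{}, false
}

// cleanupLag replays the reported scenario: one reconcile when the job
// finishes, then nothing until nextReconcileAfter later. It returns how long
// after completion the job is actually deleted, or -1 if it never is.
func cleanupLag(ttl, nextReconcileAfter time.Duration) time.Duration {
	finishedAt := time.Date(2022, 2, 9, 1, 45, 0, 0, time.UTC) // hypothetical completion time
	reconciles := []time.Time{finishedAt, finishedAt.Add(nextReconcileAfter)}
	at, ok := firstCleanup(finishedAt, ttl, reconciles)
	if !ok {
		return -1
	}
	return at.Sub(finishedAt)
}

func main() {
	// TTL of 1 minute, next reconcile 7 hours after completion.
	fmt.Println(cleanupLag(time.Minute, 7*time.Hour)) // 7h0m0s
}
```

With a 1 minute TTL the deletion lands on the 7 hour reconcile, because no reconcile happens between the TTL expiring and that run.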
Here are the training-operator logs for the job from start to finish:
time="2022-02-09T01:42:43Z" level=info msg="TFJob test-ttl-6 is created."
time="2022-02-09T01:42:43Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:43Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:43Z" level=info msg="Need to create new pod: chief-0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created pod test-ttl-6-chief-0" job=.test-ttl-6 pod=.test-ttl-6-chief-0 uid=
time="2022-02-09T01:42:44Z" level=info msg="need to create new service: chief-0" job=adelijani.test-ttl-6 replica-type=chief uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.424Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreatePod", "message": "Created pod: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created service test-ttl-6-chief-0"
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.436Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreateService", "message": "Created service: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (8.171651ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (5.933567ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=warning msg="Reconcile Tensorflow Job error Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"
2022-02-09T01:42:44.450Z ERROR controller-runtime.manager.controller.tfjob-controller Reconciler error {"name": "test-ttl-6", "namespace": "adelijani", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
> Is there a way to find out where that default 7-hour period is set?
I believe so. When the manager is constructed, an option named SyncPeriod should do the job. You will need to modify main.go to set that option.
However, I think we may be able to find another way to support this feature again, now that the workqueue is disabled in reconcile mode. @Jeffwan
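One possible shape for such an alternative, offered only as a sketch and not as the operator's implementation, is to compute the remaining TTL whenever a finished job is reconciled and ask to be called again after exactly that long. `requeueDelay` and `delayAfter` below are hypothetical names and the completion time is invented. In a controller-runtime reconciler a delay like this is usually handed back through the `RequeueAfter` field of the reconcile result; whether the training operator can do that in its current reconcile mode is an open question here.

```go
package main

import (
	"fmt"
	"time"
)

// requeueDelay reports whether a finished job is past its TTL at time now and,
// if it is not, how long to wait before the TTL check should run again.
func requeueDelay(finishedAt time.Time, ttl time.Duration, now time.Time) (expired bool, wait time.Duration) {
	remaining := finishedAt.Add(ttl).Sub(now)
	if remaining <= 0 {
		return true, 0 // TTL has passed: the job can be deleted now
	}
	return false, remaining // not yet: come back once the TTL has elapsed
}

// delayAfter is a convenience wrapper for the example: elapsed is the time
// since the job finished. The completion time itself is hypothetical.
func delayAfter(elapsed, ttl time.Duration) (bool, time.Duration) {
	finishedAt := time.Date(2022, 2, 9, 1, 45, 0, 0, time.UTC)
	return requeueDelay(finishedAt, ttl, finishedAt.Add(elapsed))
}

func main() {
	// Reconcile triggered by the job finishing, TTL of 1 minute.
	expired, wait := delayAfter(0, time.Minute)
	fmt.Println(expired, wait) // false 1m0s

	// Reconcile 7 hours later, as in the report.
	expired, wait = delayAfter(7*time.Hour, time.Minute)
	fmt.Println(expired, wait) // true 0s
}
```

Compared with lowering SyncPeriod, this keeps the periodic resync infrequent and only schedules extra work for jobs that actually have a pending TTL.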