How to handle long pending pods in a TF-job? #1282
Comments
Eventually, the pod should get into the Running state, shouldn't it? Why does it remain Pending?
Thanks for your reply. The pod stays in the Pending stage (i.e., hangs) forever. We have to run a monitor script to find these pods and clean them up, and as a result the TFJob fails. Thus, we hope the tf-operator can find these pods and restart them.
Interesting. Does TensorFlow support restarting one or more workers? /cc @gaocegege
It depends on the logic of the training script.
I think the problem is how to find these bad pods. We cannot tell whether a pod is just temporarily Pending or hanging for good.
Can we have a white list that records the cases where these pods hang forever? Then the tf-operator can find the pods to restart based on that white list.
@merlintang Can you elaborate more on your proposal?
For these pending pods, we can look at the state of their containers. For example, if one of a pod's containers (an init container or the job container) is stuck in "ImagePullBackOff", "CreateContainerConfigError", or "CreateContainerError", and its restart_count > upper_bound, we can say this pod will not resume work on its own. In that case we need to restart (delete and recreate) the pod so the scheduler allocates it to a new node. This way, we avoid a job staying pending forever. A sketch of this check is below.
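For illustration, here is a minimal sketch in Go with client-go (not actual tf-operator code) of such a check: it flags Pending pods whose containers are waiting for one of the white-listed reasons, or have restarted more than an upper bound, and deletes them so their controller recreates them. The reason list, the restartUpperBound value, and the job-name label selector are assumptions made for this example.

```go
// stuckpods: a sketch of a monitor that finds pods stuck in Pending for
// unrecoverable reasons and deletes them so they get rescheduled.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical "white list" of container waiting reasons we treat as unrecoverable.
var unrecoverableReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"CreateContainerConfigError": true,
	"CreateContainerError":       true,
}

// Hypothetical restart threshold (restart_count > upper_bound in the proposal).
const restartUpperBound = 3

// isStuck reports whether a Pending pod should be restarted: any container
// (init or main) is waiting for an unrecoverable reason, or has restarted
// more times than the upper bound.
func isStuck(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	statuses := make([]corev1.ContainerStatus, 0,
		len(pod.Status.InitContainerStatuses)+len(pod.Status.ContainerStatuses))
	statuses = append(statuses, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if cs.State.Waiting != nil && unrecoverableReasons[cs.State.Waiting.Reason] {
			return true
		}
		if cs.RestartCount > restartUpperBound {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The label selector is an assumption; real TFJob pods carry operator-specific labels.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "job-name=my-tfjob"})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if isStuck(&pod) {
			// Deleting the pod lets its controller recreate it, possibly on another node.
			fmt.Printf("pod %s looks stuck; deleting so it can be rescheduled\n", pod.Name)
			_ = client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
		}
	}
}
```

The same logic could live inside the operator's reconcile loop instead of a standalone monitor, which is what this issue is asking for.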
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the production environment, we run into several kinds of pod scheduling problems.
For example, a pod fails to mount its volume, a pod fails to pull its image, or a pod fails to schedule because the related node is broken. In these cases, the pods stay in the Pending stage. As a result, we have to ask users to delete the current TFJob and start a new one.
However, this wastes resources. For example, suppose we have a TFJob with 100 workers: 99 workers start and only one pod is pending. After a period of time, all the pods besides the pending one have already spent resources on training, so it is not a good idea to restart the whole job.
Therefore, we hope the tf-operator can retry these long-pending pods when they meet a certain rule (a sketch of one possible rule follows). We would like to hear your advice. What do you think? Thanks in advance.
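For reference, a minimal sketch of the kind of rule we have in mind, assuming a simple time-based deadline rather than any existing tf-operator feature; the PendingDeadline value and the IsLongPending name are purely illustrative:

```go
// pendingcheck: a sketch of a "long pending" rule the operator could apply
// before deleting and recreating a single pod instead of failing the TFJob.
package pendingcheck

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// PendingDeadline is a hypothetical, user-configurable threshold.
const PendingDeadline = 15 * time.Minute

// IsLongPending reports whether the pod has been stuck in the Pending phase
// longer than the deadline, measured from its creation timestamp.
func IsLongPending(pod *corev1.Pod, now time.Time) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	return now.Sub(pod.CreationTimestamp.Time) > PendingDeadline
}
```

A rule like this could be combined with the container-state white list above, so that pods are recreated only when they are both long pending and waiting for an unrecoverable reason.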