I have tried a couple of different models (yolox and convmixer) with the PyTorchJob CRD. Watching the logs from the master and worker pods, I can see training progress and complete. However, once training is done and the logs finish, the pods go from the "Running" status to "Not Ready." I would expect them to go to "Completed" so that the pipeline I built to train and serve the model can progress, but instead the pipeline times out and fails. I am running Kubeflow v1.5 on EKS with the latest version of training-operator.
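To narrow this down, it may help to check which containers in the pod are still running after training exits. The sketch below is only a diagnostic aid, not a confirmed explanation: it assumes the pod JSON comes from `kubectl get pod <pod-name> -o json`, and the sample data (including the `sidecar` container name) is made up for illustration. One unverified possibility it can help check is that the training container has terminated while some other container in the same pod keeps running.

```python
import json
import sys


def summarize_container_states(pod):
    """Map each container name to 'running', 'waiting', 'terminated(exit=N)' or 'unknown'."""
    summary = {}
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        state = cs.get("state", {})
        if "terminated" in state:
            exit_code = state["terminated"].get("exitCode", -1)
            summary[cs["name"]] = "terminated(exit=%d)" % exit_code
        elif "running" in state:
            summary[cs["name"]] = "running"
        elif "waiting" in state:
            summary[cs["name"]] = "waiting"
        else:
            summary[cs["name"]] = "unknown"
    return summary


def still_running(pod):
    """Return the names of containers that have not terminated yet."""
    states = summarize_container_states(pod)
    return [name for name, state in states.items() if state == "running"]


# Hypothetical sample: the training container finished, a second container did not.
# The container names and values here are illustrative, not taken from a real cluster.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {"name": "pytorch", "state": {"terminated": {"exitCode": 0}}},
            {"name": "sidecar", "state": {"running": {"startedAt": "2022-01-01T00:00:00Z"}}},
        ]
    }
}

if __name__ == "__main__":
    # With piped input, read real pod JSON from stdin; otherwise use the sample.
    pod = json.load(sys.stdin) if not sys.stdin.isatty() else SAMPLE_POD
    print(summarize_container_states(pod))
    print("still running:", still_running(pod))
```

Usage, assuming the script is saved as `pod_states.py` (a hypothetical file name): `kubectl get pod <pod-name> -o json | python pod_states.py`. If every container shows as terminated with exit code 0, the cause is likely somewhere else.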