PyTorchJob Pods "Not Ready" After Completing Training #1577

Closed
TrevorM15 opened this issue Apr 20, 2022 · 1 comment

Comments

TrevorM15 commented Apr 20, 2022

I have tried a couple of different models (YOLOX and ConvMixer) with the PyTorchJob CRD. Watching the logs from the master and worker pods, I can see training progress and complete, but once training is done and the logs finish, the pods go from "Running" to "Not Ready". I would expect them to go to "Completed" so that the pipeline I built to train and serve the model can progress; instead, it times out and fails. I am running Kubeflow v1.5 on EKS with the latest version of training-operator.

TrevorM15 (Author) commented

I was putting the Istio sidecar annotation in the wrong spot.
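For anyone landing here with the same symptom, here is a rough sketch of what the fix might look like. The issue does not say which annotation or which location was involved, so everything here is an assumption: that the annotation is `sidecar.istio.io/inject: "false"`, and that the right spot is the pod template metadata of each replica spec rather than the top-level PyTorchJob metadata. The helper names (`replica_spec`, `build_pytorchjob`, `replicas_missing_annotation`) and the image name are hypothetical, for illustration only.

```python
import json

# Assumption: this is the annotation meant in the issue; the thread does not name it.
SIDECAR_ANNOTATION = {"sidecar.istio.io/inject": "false"}


def replica_spec(replicas, image):
    """Build one replica spec with the annotation on the pod template metadata."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            # Assumed correct spot: pod template metadata, so it applies to each pod.
            "metadata": {"annotations": dict(SIDECAR_ANNOTATION)},
            "spec": {"containers": [{"name": "pytorch", "image": image}]},
        },
    }


def build_pytorchjob(name, image, workers):
    """Build a PyTorchJob manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        # No sidecar annotation here: job-level metadata is the assumed "wrong spot".
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, image),
                "Worker": replica_spec(workers, image),
            }
        },
    }


def replicas_missing_annotation(job):
    """Return the replica roles whose pod template lacks the annotation."""
    missing = []
    for role, spec in job["spec"]["pytorchReplicaSpecs"].items():
        annotations = (
            spec.get("template", {}).get("metadata", {}).get("annotations", {})
        )
        if annotations.get("sidecar.istio.io/inject") != "false":
            missing.append(role)
    return missing


if __name__ == "__main__":
    job = build_pytorchjob("example-job", "example/train:latest", workers=2)
    print(json.dumps(job, indent=2))
    print("Replicas missing the annotation:", replicas_missing_annotation(job))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. The likely reason this matters: if a sidecar container is injected and keeps running after the training container exits, the pod cannot reach "Completed" and shows as "Not Ready" instead.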
