[backend] Intermittent Failures of GPU-Enabled KFP Tasks with Exit Code 255 in Init Phase #10379
Comments
Would you like to consider upgrading your KFP version to the latest and trying again? Currently you are using a v2 alpha release.
@zijianjoy Thank you for your reply! I am under the impression that KFP is "backward compatible" from V2 to V1, because we use the 1.8.22 version of the KFP SDK. Also, we are on Kubeflow version 1.7, as this is the most recent published release in the AWS Labs kubeflow-manifests. So overall, I didn't think it was going to be very easy or safe to just "upgrade the KFP version" because of the factors I mentioned above. Please let me know if I am misunderstanding any of this.
KFP v2 is already GA. The latest Kubeflow release, 1.8, is already using it. Please contact AWS in order to obtain a newer version of the AWS distribution.
@zijianjoy AWS is still undecided on whether they will create a new distro for Kubeflow 1.8 (awslabs/kubeflow-manifests#794). I am also unconvinced this would even resolve our issue, as we are still using the KFP SDK at 1.8.22. Furthermore, even in the newest version of KFP v2, platform-specific features such as creating PVCs on Kubernetes are still buggy, and there is no label or toleration setting for the pods in v2 pipelines, which we need in order to isolate our KFP task pods onto our karpenter-autoscaled EC2 instances.
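For reference, this is roughly how we do that isolation today; a minimal sketch using the v1 SDK's legacy `dsl.ContainerOp` API, where the image, node label, taint key, and pod label are illustrative placeholders for whatever a karpenter NodePool actually applies, not values from our real setup:

```python
import kfp.dsl as dsl
from kubernetes.client import V1Toleration


@dsl.pipeline(name="gpu-isolation-example")
def pipeline():
    train = dsl.ContainerOp(
        name="train",
        image="my-registry/train:latest",  # placeholder image
        command=["python", "train.py"],
    )
    train.set_gpu_limit(1)  # request one NVIDIA GPU via the device plugin
    # Pin the pod to karpenter-provisioned GPU nodes (label name/value are
    # assumptions about the NodePool, not KFP defaults):
    train.add_node_selector_constraint("karpenter.sh/nodepool", "gpu")
    # Tolerate the taint placed on those nodes (taint key is an assumption):
    train.add_toleration(
        V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
    )
    # Extra pod label, purely illustrative:
    train.add_pod_label("workload-type", "gpu-training")
```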
Understood about the situation. However, we are currently focused on supporting v2, so I will keep this issue open and lean on the community to chime in and help.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
Environment
How did you deploy Kubeflow Pipelines (KFP)?
We used the AWS Labs provided kubeflow-manifests rds-s3 terraform full deployment option, v1.7.0-aws-b1.0.3:
https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.3/docs/deployment/rds-s3/guide-terraform/
KFP version: 2.0.0-alpha.7
KFP SDK version: 1.8.22
Steps to reproduce
1. Deployed the NVIDIA gpu-operator v23.9.1 (https://github.com/NVIDIA/gpu-operator/tree/v23.9.1) by setting the relevant option in the AWS Labs kubeflow-manifests terraform config: https://github.com/awslabs/kubeflow-manifests/blob/v1.7.0-aws-b1.0.3/deployments/rds-s3/terraform/main.tf#L218
2. Deployed karpenter v0.32.1 for autoscaling of GPU nodes using helm: https://karpenter.sh/v0.32/getting-started/migrating-from-cas/#deploy-karpenter
3. Ran GPU-enabled KFP tasks on the cluster (a minimal sketch of such a pipeline is shown below). These tasks intermittently fail with "This step is in Error state with this message: Unknown (exit code 255)". Most of the time they are successful.

We have found it to be perhaps more likely to happen when there are more concurrent GPU tasks/nodes running in the cluster at once.

EDIT: This seems not to be true; we have now observed the exit code 255 behavior with only one GPU node and one GPU-enabled task running on the entire cluster. The failure is probably just intermittent, and appears more often with more concurrent GPU tasks simply because more tasks means more chances for at least one of them to hit it.
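For completeness, here is a minimal sketch of the kind of pipeline that triggers this for us, assuming the KFP v1 SDK (1.8.x) compiling to Argo YAML; the image, command, and number of parallel tasks are placeholders rather than our actual workload:

```python
import kfp
import kfp.dsl as dsl


@dsl.pipeline(name="gpu-smoke-test")
def gpu_smoke_test():
    # Fan out a few concurrent GPU tasks; each one may trigger a karpenter
    # scale-up of a fresh GPU node.
    for i in range(4):
        op = dsl.ContainerOp(
            name=f"gpu-task-{i}",
            image="nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",  # placeholder
            command=["nvidia-smi"],
        )
        op.set_gpu_limit(1)


if __name__ == "__main__":
    # Compile locally; how it gets submitted depends on the deployment's auth.
    kfp.compiler.Compiler().compile(gpu_smoke_test, "gpu_smoke_test.yaml")
```

Any of the fanned-out tasks can hit the init-phase failure described above; rerunning the same pipeline usually succeeds.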
Expected result
These GPU KFP tasks should never encounter this issue and should instead always succeed (or at least fail for an understandable reason rooted in a bug in the task's own source code).
Materials and Reference
- The `init` container (`gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance`) failed with 255 for an unknown reason, with no `error` logs in that container's logs, just normal `info` logs downloading the input s3 artifacts. It seems like this container isn't actually what is causing the failure.
- Our GPU nodes are EC2 instances (`g4dn`). The AMI we are using is the Amazon EKS optimized accelerated Amazon Linux 2 AMI, for example `amazon-eks-gpu-node-1.25-v20231230`.
- `task.set_retry(num_retries)` does not work, because the task's proper pipeline task container never actually gets started: the pod fails in the init stage, so the KFP retries never come into play (see the sketch after this list).
- Setting `restartPolicy` on the KFP pod itself to `onFailure` is not an option because:
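For reference, this is what the retry attempt looks like on our side; a minimal sketch with the KFP v1 SDK (1.8.x), where the task definition is an illustrative placeholder:

```python
import kfp.dsl as dsl


@dsl.pipeline(name="gpu-task-with-retries")
def pipeline():
    op = dsl.ContainerOp(
        name="train",
        image="my-registry/train:latest",  # placeholder
        command=["python", "train.py"],
    )
    op.set_gpu_limit(1)
    # Task-level retries, compiled into the Argo retryStrategy. As described
    # above, they never fire for us because the pod dies in the init container
    # before the task's own container ever starts.
    op.set_retry(3)
```

`set_retry` in 1.8.x also accepts a retry policy and backoff arguments, but since the main container never starts in this failure mode, none of the task-level retry settings come into play.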