
[backend] Intermittent Failures of GPU-Enabled KFP Tasks with Exit Code 255 in Init Phase #10379

Closed · tom-pavz opened this issue Jan 11, 2024 · 7 comments
Labels: area/backend, kind/bug, lifecycle/stale

Comments

tom-pavz commented Jan 11, 2024

Environment

Steps to reproduce

task.set_gpu_limit("1")

These tasks intermittently fail with the message "This step is in Error state with this message: Unknown (exit code 255)". Most of the time they succeed.

We initially found the failure to be seemingly more likely when more concurrent GPU tasks/nodes were running in the cluster at once.
EDIT: This does not appear to be true; we have since observed the exit code 255 behavior with only a single GPU node and a single GPU-enabled task running on the entire cluster. The failure is most likely simply intermittent, and running more GPU tasks just makes it more likely that at least one of them hits it.
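
For reference, here is a minimal sketch of the kind of pipeline that hits this, assuming the KFP v1 SDK (kfp==1.8.22); the component, base image, and pipeline name are illustrative placeholders rather than our actual code.

from kfp import dsl
from kfp.components import create_component_from_func


def train() -> None:
    # Stand-in for the real GPU workload.
    print("training on GPU")


# Build a lightweight component from the placeholder function.
train_op = create_component_from_func(train, base_image="python:3.9")


@dsl.pipeline(name="gpu-repro")
def gpu_pipeline():
    task = train_op()
    # Requesting a GPU is the only special thing about these tasks.
    task.set_gpu_limit("1")

The pipeline is compiled and submitted the usual way, e.g. with kfp.compiler.Compiler().compile(gpu_pipeline, "gpu_repro.yaml").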

Expected result

These GPU KFP tasks should never fail in the init phase; they should always succeed, or at least fail for an understandable reason rooted in the task's own source code.

Materials and Reference

  • At least in the one instance we closely monitored as it happened, the init container gcr.io/ml-pipeline/argoexec:v3.3.8-license-compliance exited with 255 for an unknown reason; its logs contained no errors, only the normal info logs for downloading the input S3 artifacts. This container does not appear to be what is actually causing the failure.
  • Because the issue is intermittent, hitting retry on these pipelines/tasks makes them succeed. In fact, other tasks in the pipeline can be scheduled onto the same node after the first task fails there and run to completion successfully, so the node itself does not appear to be broken.
  • We have only ever seen this issue on tasks that request a GPU-enabled EC2 instance (g4dn). The AMI we use is the Amazon EKS optimized accelerated Amazon Linux 2 AMI, for example amazon-eks-gpu-node-1.25-v20231230.
  • Setting task.set_retry(num_retries) does not help, because the pod fails in the init stage before the task's main container ever starts, so the KFP retries never come into play (see the sketch after this list).
  • Setting the Kubernetes restartPolicy on the KFP pod itself to OnFailure is not an option because:
  1. the KFP SDK does not provide an interface to do this, and
  2. it would cause every pipeline with a legitimate failure (say, a bug in the task's source code) to retry indefinitely and never actually fail the pipeline as it should.
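
For completeness, a sketch of the retry setting referenced in the list above (KFP v1 SDK, kfp==1.8.22), reusing the illustrative train_op component from the sketch under Steps to reproduce:

from kfp import dsl


@dsl.pipeline(name="gpu-repro-with-retry")
def gpu_pipeline_with_retry():
    task = train_op()
    task.set_gpu_limit("1")
    # Retries are only applied once the main task container runs; when the
    # pod dies in the init phase with exit code 255, they never trigger.
    task.set_retry(num_retries=3)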

Impacted by this bug? Give it a 👍.

zijianjoy (Collaborator) commented:

Would you like to consider upgrading to the latest KFP version and trying again? Currently you are using a v2 alpha release.

tom-pavz (Author) commented:

Would you like to consider upgrading to the latest KFP version and trying again? Currently you are using a v2 alpha release.

@zijianjoy Thank you for your reply!

I am under the impression that KFP is backward compatible from v2 to v1. Because we use version 1.8.22 of the kfp Python SDK, we are effectively still using v1 of KFP, which is not an unstable release. Please let me know if I am misunderstanding this.

Also, we are on Kubeflow 1.7, as this is the most recent published release in the AWS Labs kubeflow-manifests repo (https://github.com/awslabs/kubeflow-manifests/releases), and even in the first-party kubeflow/manifests repo, Kubeflow 1.7 ships the v2 alpha KFP version: https://github.com/kubeflow/manifests/tree/v1.7.0.

So overall, I didn't think it was going to be very easy or safe to just "upgrade the KFP version" because of the factors I mentioned above. Please let me know if I am misunderstanding any of this.

zijianjoy (Collaborator) commented:

KFP v2 is already GA. The latest Kubeflow version, 1.8, is already using it. Please contact AWS to obtain a newer version of the AWS distribution.

tom-pavz commented Jan 12, 2024

KFP v2 is already GA. The latest Kubeflow version, 1.8, is already using it. Please contact AWS to obtain a newer version of the AWS distribution.

@zijianjoy AWS is still undecided on whether they will create a new distribution for Kubeflow 1.8 (awslabs/kubeflow-manifests#794).

Also, I am unconvinced this would even resolve our issue: we are still using KFP SDK 1.8.22, so I don't see how bumping to a newer 2.x.x server version would help. It doesn't seem like other KFP 1.8.x users are encountering this issue, so I was hoping for some help resolving it in our current deployment.

Also, even in the newest version of KFP v2, platform-specific features such as creating PVCs on Kubernetes are still buggy right now, and there is no way to set labels or tolerations on the pods in v2 pipelines, which we need in order to isolate our KFP task pods onto our Karpenter-autoscaled EC2 instances (a sketch of the v1 SDK calls we rely on is below).
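
For context, here is a sketch of the v1 SDK placement calls we depend on, assuming kfp==1.8.22; the node label key, instance type, and taint are illustrative values rather than our exact configuration.

from kfp import dsl
from kfp.components import create_component_from_func
from kubernetes.client import V1Toleration


def train() -> None:
    # Stand-in for the real GPU workload.
    print("training on GPU")


train_op = create_component_from_func(train, base_image="python:3.9")


@dsl.pipeline(name="gpu-isolated")
def gpu_isolated_pipeline():
    task = train_op()
    task.set_gpu_limit("1")
    # Pin the pod onto our Karpenter-provisioned GPU nodes and tolerate the
    # GPU taint; we have not found equivalents for these calls in the v2 SDK.
    task.add_node_selector_constraint("node.kubernetes.io/instance-type", "g4dn.xlarge")
    task.add_toleration(
        V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
    )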

zijianjoy (Collaborator) commented:

Understood about the situation. However, we are currently focusing on supporting v2, so I will keep this issue open and lean on the community to chime in with help.

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Mar 14, 2024

github-actions bot commented Apr 5, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as completed on Apr 5, 2024