inventory-operator: doesn't detect when nvdp-nvidia-device-plugin marks GPU as unhealthy #249

Open
andy108369 opened this issue Aug 16, 2024 · 1 comment
Labels
P2 repo/provider Akash provider-services repo issues

Comments

@andy108369
Contributor

Logs https://gist.github.com/andy108369/cac9f968f1c6a3eb7c6e92135b8afd42

Querying the 8443/status endpoint reports all 8 GPUs as available, even though at least one of them was marked as unhealthy.

In rare cases you can recover from this error by restarting the nvdp-nvidia-device-plugin pod on the node where the GPU was marked unhealthy.
The point, though, is that inventory-operator should ideally detect this condition. Otherwise GPU deployments stay stuck in "Pending" until all 8 GPUs become available again:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m40s  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m38s  default-scheduler  0/8 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 7 Insufficient nvidia.com/gpu. preemption: 0/8 nodes are available: 1 Preemption is not helpful for scheduling, 7 No preemption victims found for incoming pod..
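
A rough sketch of one way such a check could work, not what inventory-operator actually does: it assumes that when the device plugin reports a GPU as unhealthy, the kubelet lowers the node's allocatable nvidia.com/gpu count while capacity stays the same, so a gap between the two hints at unhealthy devices. The function names are hypothetical, and the input is assumed to be the JSON object printed by `kubectl get nodes -o json`.

```python
from typing import Dict

# Extended resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def unhealthy_gpu_count(node: dict, resource: str = GPU_RESOURCE) -> int:
    """Return capacity minus allocatable for the GPU resource on one node.

    Assumption: unhealthy devices still count towards capacity but are
    dropped from allocatable, so the difference approximates how many
    GPUs are currently unschedulable on this node.
    """
    status = node.get("status", {})
    capacity = int(status.get("capacity", {}).get(resource, "0"))
    allocatable = int(status.get("allocatable", {}).get(resource, "0"))
    return max(capacity - allocatable, 0)


def nodes_with_unhealthy_gpus(node_list: dict) -> Dict[str, int]:
    """Map node name to suspected unhealthy GPU count, skipping clean nodes."""
    result: Dict[str, int] = {}
    for node in node_list.get("items", []):
        missing = unhealthy_gpu_count(node)
        if missing > 0:
            result[node["metadata"]["name"]] = missing
    return result
```

To try it, load the output of `kubectl get nodes -o json` with `json.load` and pass the resulting dict to `nodes_with_unhealthy_gpus`. A non-empty result would mean the GPU count reported as available should be reduced for those nodes.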
@chainzero chainzero added repo/provider Akash provider-services repo issues and removed awaiting-triage labels Aug 21, 2024
@chainzero chainzero added the P2 label Aug 21, 2024
@andy108369
Contributor Author

related:
#244
#240
#207
