
K3s not detecting GPUs for nodes #8575

Closed · sarahwooders opened this issue Oct 9, 2023 · 3 comments

sarahwooders commented Oct 9, 2023

Environmental Info:
K3s Version: v1.27.6+k3s1 (bd04941)
go version go1.20.8

Node OS: Ubuntu 20.04.6 LTS
Node(s) CPU architecture:

> lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             4
On-line CPU(s) list:                0-3
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                           7
CPU MHz:                            2200.150
BogoMIPS:                           4400.30
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          64 KiB
L1i cache:                          64 KiB
L2 cache:                           2 MiB
L3 cache:                           38.5 MiB
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT Host state unknown
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdts
                                    cp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2a
                                    pic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanc
                                    ed fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd a
                                    vx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

GPUs:

> nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   50C    P0    30W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Cluster Configuration:
1 server (running on GCP e2-standard-2 instance), 2 agents (running on GCP g2-standard-4 instances with L4 GPUs)

Describe the bug:
I am trying to run pods on GPU nodes; however, K3s does not seem to detect the GPUs.
Steps To Reproduce:

Expected behavior:
I expect pods to be able to be scheduled on GPU nodes.

Actual behavior:
When I try to run a pod on the cluster, the pod is stuck with the status "PENDING".

When I run kubectl get events, I see the following error:

42m         Normal    NodeReady                        node/worker-0             Node worker-0 status is now: NodeReady
21m         Warning   FailedScheduling                 pod/gpu-pod               0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
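
One quick check here (a sketch, assuming `kubectl` access to this cluster) is to print each node's allocatable `nvidia.com/gpu` count. If every entry is empty, the device plugin never registered the resource with the kubelet, which matches the "Insufficient nvidia.com/gpu" message above:

```shell
# Print "<node-name> <allocatable GPU count>" for every node.
# Dots inside the resource name must be escaped in jsonpath.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```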

When I run kubectl describe nodes, the "Allocated resources" section does not show any GPUs:

Name:               worker-0
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=worker-0
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"7e:17:3f:a5:6c:93"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.128.0.27
                    k3s.io/hostname: worker-0
                    k3s.io/internal-ip: 10.128.0.27
                    k3s.io/node-args: ["agent"]
                    k3s.io/node-config-hash: OB3EXJBFKZISCZ6MYWHG5CQIGVW2SVAONUDFFWLCNATQ4AKKHJGA====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/3dfc950bd39d2e2b435291ab8c1333aa6051fcaf46325aee898819f3b99d4b21","K3S_TOKEN":"********","K3S_U...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 09 Oct 2023 02:11:22 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  worker-0
  AcquireTime:     <unset>
  RenewTime:       Mon, 09 Oct 2023 04:06:24 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 09 Oct 2023 04:02:39 +0000   Mon, 09 Oct 2023 02:11:22 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 09 Oct 2023 04:02:39 +0000   Mon, 09 Oct 2023 02:11:22 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 09 Oct 2023 04:02:39 +0000   Mon, 09 Oct 2023 02:11:22 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 09 Oct 2023 04:02:39 +0000   Mon, 09 Oct 2023 03:16:42 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.128.0.27
  Hostname:    worker-0
Capacity:
  cpu:                4
  ephemeral-storage:  203056560Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16369944Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  197533421414
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16369944Ki
  pods:               110
System Info:
  Machine ID:                 7abeed5a838ae7b19b18b5e927865544
  System UUID:                7abeed5a-838a-e7b1-9b18-b5e927865544
  Boot ID:                    f690f684-9079-4c53-b954-1eb79c1cb727
  Kernel Version:             5.15.0-1044-gcp
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.6-k3s1.27
  Kubelet Version:            v1.27.6+k3s1
  Kube-Proxy Version:         v1.27.6+k3s1
PodCIDR:                      10.42.1.0/24
PodCIDRs:                     10.42.1.0/24
ProviderID:                   k3s://worker-0
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 svclb-traefik-e554b86f-lm8s4            0 (0%)        0 (0%)      0 (0%)           0 (0%)         115m
  kube-system                 nvidia-device-plugin-daemonset-9ddlc    0 (0%)        0 (0%)      0 (0%)           0 (0%)         103m
  kube-system                 traefik-64f55bb67d-qq9f8                0 (0%)        0 (0%)      0 (0%)           0 (0%)         115m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:
  Type     Reason                   Age                  From                   Message
  ----     ------                   ----                 ----                   -------
  Normal   Starting                 112m                 kube-proxy             
  Normal   Starting                 115m                 kube-proxy             
  Normal   Starting                 49m                  kube-proxy             
  Normal   Starting                 77m                  kube-proxy             
  Normal   Starting                 115m                 kubelet                Starting kubelet.
  Normal   Synced                   115m                 cloud-node-controller  Node synced successfully
  Warning  InvalidDiskCapacity      115m                 kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  115m (x2 over 115m)  kubelet                Node worker-0 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    115m (x2 over 115m)  kubelet                Node worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     115m (x2 over 115m)  kubelet                Node worker-0 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  115m                 kubelet                Updated Node Allocatable limit across pods
  Normal   NodeReady                115m                 kubelet                Node worker-0 status is now: NodeReady
  Normal   RegisteredNode           115m                 node-controller        Node worker-0 event: Registered Node worker-0 in Controller
  Normal   NodeAllocatableEnforced  112m                 kubelet                Updated Node Allocatable limit across pods
  Normal   NodeNotReady             112m                 kubelet                Node worker-0 status is now: NodeNotReady
  Warning  InvalidDiskCapacity      112m                 kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  112m                 kubelet                Node worker-0 status is now: NodeHasSufficientMemory
  Normal   Starting                 112m                 kubelet                Starting kubelet.
  Normal   NodeHasNoDiskPressure    112m                 kubelet                Node worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     112m                 kubelet                Node worker-0 status is now: NodeHasSufficientPID
  Warning  Rebooted                 112m                 kubelet                Node worker-0 has been rebooted, boot id: f690f684-9079-4c53-b954-1eb79c1cb727
  Normal   NodeReady                112m                 kubelet                Node worker-0 status is now: NodeReady
  Normal   NodeHasNoDiskPressure    77m                  kubelet                Node worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeAllocatableEnforced  77m                  kubelet                Updated Node Allocatable limit across pods
  Normal   Starting                 77m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      77m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeNotReady             77m                  kubelet                Node worker-0 status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  77m                  kubelet                Node worker-0 status is now: NodeHasSufficientMemory
  Normal   NodeHasSufficientPID     77m                  kubelet                Node worker-0 status is now: NodeHasSufficientPID
  Normal   NodeReady                77m                  kubelet                Node worker-0 status is now: NodeReady
  Normal   RegisteredNode           77m                  node-controller        Node worker-0 event: Registered Node worker-0 in Controller
  Normal   RegisteredNode           76m                  node-controller        Node worker-0 event: Registered Node worker-0 in Controller
  Normal   RegisteredNode           59m                  node-controller        Node worker-0 event: Registered Node worker-0 in Controller
  Normal   RegisteredNode           53m                  node-controller        Node worker-0 event: Registered Node worker-0 in Controller
  Normal   Starting                 49m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      49m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  49m                  kubelet                Node worker-0 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    49m                  kubelet                Node worker-0 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     49m                  kubelet                Node worker-0 status is now: NodeHasSufficientPID
  Normal   NodeNotReady             49m                  kubelet                Node worker-0 status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  49m                  kubelet                Updated Node Allocatable limit across pods
  Normal   NodeReady                49m                  kubelet                Node worker-0 status is now: NodeReady
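
Note that the Capacity and Allocatable sections above list only cpu, memory, storage, hugepages, and pods, with no `nvidia.com/gpu` entry at all, even though an nvidia-device-plugin-daemonset pod is running on the node. That suggests the plugin is up but failing to register the GPU, which on k3s is often a container-runtime issue. A debugging sketch (daemonset name taken from the pod list above; the `nvidia` RuntimeClass handler assumes k3s auto-detected the NVIDIA Container Toolkit in its containerd config, which is not guaranteed):

```shell
# 1. Check the device plugin's own logs for registration errors
#    (e.g. it cannot find the NVIDIA driver or libnvidia-ml).
kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset

# 2. On k3s, the plugin pod typically must run under the NVIDIA runtime.
#    Create a RuntimeClass pointing at the auto-detected handler, then a
#    minimal test pod that requests one GPU and runs nvidia-smi.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
EOF
```

If step 2 fails with "RuntimeClass not supported" or the pod still cannot see the GPU, the containerd config on the agent (`/var/lib/rancher/k3s/agent/etc/containerd/config.toml`) is worth inspecting for an `nvidia` runtime entry.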

sarahwooders changed the title from "K3 not detecting GPUs for nodes" to "K3s not detecting GPUs for nodes" on Oct 9, 2023
athithya-raj commented

How was this resolved?

Clasyc commented Apr 3, 2024

I have a similar issue. I see an available AMD GPU on my node:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                290m (2%)   18 (150%)
  memory             570Mi (1%)  35938Mi (114%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  amd.com/gpu        1           1
Events:              <none>

The pod container requests the GPU:

Containers:
  immich:
    Image:      altran1502/immich-server:v1.100.0
    Port:       32003/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
    Args:
      -c
      /usr/src/app/start-microservices.sh
    Limits:
      amd.com/gpu:  1
      cpu:          4
      memory:       8Gi
    Requests:
      amd.com/gpu:  1
      cpu:          10m
      memory:       50Mi

but I still see errors from the scheduler:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  29m                  default-scheduler  0/1 nodes are available: 1 Insufficient amd.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Warning  FailedScheduling  9m18s (x3 over 24m)  default-scheduler  0/1 nodes are available: 1 Insufficient amd.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

k3s service logs:

Apr 03 21:06:19 truenas k3s[7743]: E0403 11:06:19.479080    7743 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"immich-microservices-77c66b674f-ks6ww.17c2d711b8df5452", GenerateName:"", Namespace:"ix-immich", SelfLink:"", UID:"32dfb47c-e6a4-4f0f-a08e-9ccb67699728", ResourceVersion:"6273", Generation:0, CreationTimestamp:time.Date(2024, time.April, 3, 11, 1, 19, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"k3s", Operation:"Update", APIVersion:"events.k8s.io/v1", Time:time.Date(2024, time.April, 3, 11, 1, 19, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc00cafe078), Subresource:""}}}, EventTime:time.Date(2024, time.April, 3, 11, 1, 19, 470263000, time.Local), Series:(*v1.EventSeries)(0xc005f568e0), ReportingController:"default-scheduler", ReportingInstance:"default-scheduler-truenas", Action:"Scheduling", Reason:"FailedScheduling", Regarding:v1.ObjectReference{Kind:"Pod", Namespace:"ix-immich", Name:"immich-microservices-77c66b674f-ks6ww", UID:"a92e5d34-0d52-4577-be81-353eaf8c1df3", APIVersion:"v1", ResourceVersion:"5607", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"0/1 nodes are available: 1 Insufficient amd.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..", Type:"Warning", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event "immich-microservices-77c66b674f-ks6ww.17c2d711b8df5452" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
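
One thing worth noting about this case: in the "Allocated resources" output above, `amd.com/gpu` already shows requests of 1, so if the node only has one GPU, the scheduler may be behaving correctly; the single GPU is already claimed by some other pod. A sketch (assuming `kubectl` access) to list which pods are requesting it:

```shell
# Print "namespace  pod  amd.com/gpu request" for every pod, then keep
# only the rows where the third field (the GPU request) is non-empty.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.amd\.com/gpu}{"\n"}{end}' \
  | awk -F'\t' '$3 != ""'
```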

tiberio-baptista commented
How was this resolved? We're having a similar issue.
