
[Bug] Issue with efa device plugin running as root #6222

Closed
vsoch opened this issue Jan 31, 2023 · 16 comments · Fixed by #6302
Labels
kind/bug priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@vsoch
Contributor

vsoch commented Jan 31, 2023

Hi! I opened aws-samples/aws-efa-eks#8 as well so the two issues can be tracked in sync. I just updated my version of eksctl, which pulled in the new changes, and we started seeing the issue reported here. We are creating an EKS cluster with eksctl, specifically like this:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: flux-cluster
  region: us-east-2
  version: "1.23"

availabilityZones: ["us-east-2b", "us-east-2c"]
managedNodeGroups:
  - name: workers
    instanceType: hpc6a.48xlarge
    minSize: 64
    maxSize: 64
    labels: { "fluxoperator": "true" }
    availabilityZones: ["us-east-2b"]
    efaEnabled: true
    placement:
      groupName: eks-efa-testing
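
For reference, a config like the one above is applied with eksctl's create command (the filename here is just a placeholder):

$ eksctl create cluster -f cluster-config.yaml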

And when I request a job asking for EFA for my pods, e.g. (this is our operator CRD, which has worked before):

# Resource limits to enable efa
resources:
    limits:
        vpc.amazonaws.com/efa: 1
        memory: "340G"
        cpu: 94

the pods are stuck in pending. Further inspection reveals:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  27s (x11 over 13m)  default-scheduler  0/64 nodes are available: 64 Insufficient vpc.amazonaws.com/efa.
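
That scheduler message means no node is advertising the vpc.amazonaws.com/efa extended resource, which is exactly what the device plugin is responsible for registering with the kubelet. A quick way to confirm is to check what the nodes report (a generic command, not from the original report):

$ kubectl describe nodes | grep -i efa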

Then I realized I could inspect the pod that is supposed to provide the EFA devices (which is where I found the container name and config provided in this repo's manifest folder), and I saw:

$ kubectl describe pods -n kube-system aws-efa-k8s-device-plugin-daemonset-zpg2s
...
  Warning  Failed     64m (x12 over 66m)    kubelet            Error: container has runAsNonRoot and image will run as root (pod: "aws-efa-k8s-device-plugin-daemonset-zpg2s_kube-system(1b46d2ac-c922-449b-b630-bab344976d9f)", container: aws-efa-k8s-device-plugin)
  Normal   Pulled     115s (x303 over 66m)  kubelet            Container image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3" already present on machine

I traced this to change 943de83, which must have come with the updated eksctl. Unless there is a plan to update the container image, I want to suggest removing this added boolean. This is likely the version I was using before the update, which worked (and mirrors the one in your example repo): https://github.com/weaveworks/eksctl/blob/7ad54ae5d60d730e6d2ca8741d866f5415bab518/pkg/addons/assets/efa-device-plugin.yaml. Thanks!
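
Based on the kubelet error above, the change appears to have added a runAsNonRoot requirement to the plugin container. A sketch of the kind of securityContext involved, assuming that placement; this is not the exact diff:

securityContext:
  runAsNonRoot: true   # kubelet refuses to start the container because the image's default user is root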

@vsoch vsoch added the kind/bug label Jan 31, 2023
@vsoch
Contributor Author

vsoch commented Jan 31, 2023

Wanted to post an update that I cloned main, removed that one line, rebuilt, recreated my cluster, and it works correctly as it did before! So I am fairly certain this is a bug.

@vsoch
Contributor Author

vsoch commented Jan 31, 2023

Let me know if you'd like me to open a PR to fix this one detail - would be happy to!

@cPu1
Collaborator

cPu1 commented Jan 31, 2023

Let me know if you'd like me to open a PR to fix this one detail - would be happy to!

@vsoch, sure, please go ahead. We are happy to accept contributions.

@Himangini
Collaborator

Wanted to post an update that I cloned main, removed that one line, rebuilt, recreated my cluster, and it works correctly as it did before! So I am fairly certain this is a bug.

@vsoch Can you try using this securityContext?

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  runAsUser: 1000

@vsoch
Contributor Author

vsoch commented Jan 31, 2023

I can't offer testing that soon; I won't be running experiments again for a bit (they are expensive), but I could maybe next month. For the time being I'm just restoring the original securityContext.

@vsoch
Contributor Author

vsoch commented Feb 4, 2023

heyo! Got a chance to try your suggestion - no go:

2023/02/04 00:52:22 Could not start device plugin: listen unix /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock: bind: permission denied
2023/02/04 00:52:22 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?

It works fine when I remove that block and restore the previously suggested one (i.e., without runAsNonRoot):

2023/02/04 01:30:28 EFA Device list: [{rdmap0s6 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s6}]
2023/02/04 01:30:28 Starting FS watcher.
2023/02/04 01:30:28 Starting OS watcher.
2023/02/04 01:30:28 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6

2023/02/04 01:30:28 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2023/02/04 01:30:28 Registered device plugin with Kubelet
2023/02/04 01:30:35 Request IDs: [&ContainerAllocateRequest{DevicesIDs:[rdmap0s6],}]
2023/02/04 01:30:35 Checking if device:`rdmap0s6` exists
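
The permission error is consistent with how device plugins work: on startup the plugin must create a gRPC socket under /var/lib/kubelet/device-plugins, a hostPath directory that is typically owned by root, so a container running as UID 1000 cannot bind there. If the goal is to tighten the securityContext without breaking the socket bind, something along these lines might work (a sketch under that assumption, not the merged fix):

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]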

@github-actions
Contributor

github-actions bot commented Mar 6, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Mar 6, 2023
@vsoch
Contributor Author

vsoch commented Mar 6, 2023

Ping - I opened a PR to fix this! #6302

@cPu1 cPu1 removed the stale label Mar 6, 2023
@DanielJuravski

This Helm chart resolved the issue: https://github.com/aws-samples/efa-device-plugin-helm
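
Hypothetical usage, assuming the chart can be installed from a clone of that repo; check its README for the actual instructions:

$ git clone https://github.com/aws-samples/efa-device-plugin-helm
$ helm install aws-efa-k8s-device-plugin ./efa-device-plugin-helm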

@vsoch vsoch closed this as completed Apr 2, 2023
@vsoch
Contributor Author

vsoch commented Jun 25, 2023

Hi - you still haven't fixed this. I just installed a fresh eksctl and created a cluster, and my pods are erroring:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                        READY   STATUS                       RESTARTS   AGE
kube-system   aws-efa-k8s-device-plugin-daemonset-5v7mh   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-6t5qx   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-9djsw   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-kpb99   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-ltmhb   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-mqt75   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-rsx9p   0/1     CreateContainerConfigError   0          2m43s
kube-system   aws-efa-k8s-device-plugin-daemonset-xhs2k   0/1     CreateContainerConfigError   0          2m43s

and the issue is:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m28s                 default-scheduler  Successfully assigned kube-system/aws-efa-k8s-device-plugin-daemonset-5v7mh to ip-192-168-31-106.us-east-2.compute.internal
  Normal   Pulling    3m27s                 kubelet            Pulling image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3"
  Normal   Pulled     3m22s                 kubelet            Successfully pulled image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3" in 5.255980889s (5.256008801s including waiting)
  Warning  Failed     71s (x12 over 3m22s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "aws-efa-k8s-device-plugin-daemonset-5v7mh_kube-system(d058715c-af39-47de-aa92-220f9adab871)", container: aws-efa-k8s-device-plugin)
  Normal   Pulled     71s (x11 over 3m22s)  kubelet            Container image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3" already present on machine

If the efaEnabled parameter is no longer functional here, maybe that should be made clear, and a link to the Helm chart with instructions should be provided?
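
In the meantime, one way to unblock an affected cluster is to patch the offending field out of the DaemonSet. This assumes runAsNonRoot sits on the container's securityContext, as the kubelet error suggests (a sketch, not an endorsed fix):

$ kubectl -n kube-system patch daemonset aws-efa-k8s-device-plugin-daemonset \
    --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/securityContext/runAsNonRoot"}]'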

@vsoch vsoch reopened this Jun 25, 2023
@bollig

bollig commented Jun 26, 2023

Double checking that you mean efaEnabled rather than eksEnabled?

@vsoch
Contributor Author

vsoch commented Jun 26, 2023

Yes, just a typo - fixed!

@vsoch
Contributor Author

vsoch commented Jul 28, 2023

@cPu1 and @Himangini, this is still an issue almost 7 months later. I've tested your suggestions, and I've now opened two PRs, #6302 and #6743, that fix this. The Helm chart is not a solution for us because we are using the plugin YAML provided here. What is your plan to fix this in eksctl, and what else can I do to help?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Oct 29, 2023
@vsoch
Contributor Author

vsoch commented Oct 29, 2023

Please don't close the issue, stalebot - I think a resolution would be either to fix the config here or to remove the efaEnabled flag (which will not work without root).

@github-actions github-actions bot removed the stale label Oct 30, 2023
@cPu1 cPu1 added the priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases label Oct 30, 2023
@cPu1 cPu1 closed this as completed in #6302 Jul 4, 2024
@vsoch
Contributor Author

vsoch commented Jul 4, 2024

Thank you! Really happy to see this go through.
