fix: Ensure EFA installer uses full path to EFA bin and does not install if already present #1780

Merged
merged 1 commit into from
Oct 3, 2023

bryantbiggs
Contributor

Description

  • User data has been updated to check whether the EFA driver is already installed and, if so, to skip installing it. Starting with EKS v1.28, the EKS GPU AMI comes with the EFA driver installed
    • A note has been added that the preferred method is to install the EFA driver on the AMI instead of during provisioning
  • The previous command fi_info -p efa -t FI_EP_RDM was failing because the fi_info executable is not on the PATH yet (that requires a reboot). Using the full path to the executable gets around this
  • The EFA device plugin has been converted to a Terraform daemonset to drop the use of the kubectl provider
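
The user data change described above can be sketched roughly as follows. This is a minimal illustration and not the exact script from this PR: the function names (efa_is_installed, install_efa, ensure_efa) are hypothetical, the directory /opt/amazon/efa/bin is assumed to be the installer's default location, and the download URL and installer flags are assumptions that should be checked against the EFA documentation.

```shell
#!/usr/bin/env bash
# Assumption: the EFA installer places its binaries in this directory.
EFA_BIN_DIR="${EFA_BIN_DIR:-/opt/amazon/efa/bin}"

# Succeeds if the fi_info executable already exists, i.e. the driver
# was baked into the AMI or installed on a previous run.
efa_is_installed() {
  [ -x "${EFA_BIN_DIR}/fi_info" ]
}

# Hypothetical install step; URL and flags are assumptions.
install_efa() {
  curl -sSfO https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
  tar -xf aws-efa-installer-latest.tar.gz
  (cd aws-efa-installer && ./efa_installer.sh -y)
}

# Install only when missing, then verify using the full path, since
# fi_info is not on the PATH until after a reboot.
ensure_efa() {
  if efa_is_installed; then
    echo "EFA driver already installed, skipping"
  else
    install_efa
  fi
  "${EFA_BIN_DIR}/fi_info" -p efa -t FI_EP_RDM
}
```

In user data, ensure_efa would be called before the bootstrap script, so that a node whose AMI already includes the driver goes straight to verification and then to bootstrap.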

Motivation and Context

  • The current pattern throws an error and prevents nodes from joining: user data fails on fi_info -p efa -t FI_EP_RDM and never reaches the bootstrap script portion

How was this change tested?

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Yes, I have updated the docs for this feature
  • Yes, I ran pre-commit run -a with this PR

Additional Notes

NAMESPACE               NAME                                                         READY   STATUS      RESTARTS   AGE
gpu-operator            gpu-feature-discovery-n9rmj                                  1/1     Running     0          5m7s
gpu-operator            gpu-operator-7c44d8f7d-kcbbl                                 1/1     Running     0          30m
gpu-operator            gpu-operator-node-feature-discovery-master-b44f595bf-rr2x8   1/1     Running     0          30m
gpu-operator            gpu-operator-node-feature-discovery-worker-9pflc             1/1     Running     0          28m
gpu-operator            gpu-operator-node-feature-discovery-worker-d52gc             1/1     Running     0          5m11s
gpu-operator            gpu-operator-node-feature-discovery-worker-jmcch             1/1     Running     0          28m
gpu-operator            nvidia-container-toolkit-daemonset-bxhc4                     1/1     Running     0          5m7s
gpu-operator            nvidia-cuda-validator-ztzj6                                  0/1     Completed   0          4m51s
gpu-operator            nvidia-device-plugin-daemonset-jq9wv                         1/1     Running     0          5m7s
gpu-operator            nvidia-device-plugin-validator-mvnjm                         0/1     Completed   0          4m35s
gpu-operator            nvidia-operator-validator-9k6t9                              1/1     Running     0          5m7s
kube-prometheus-stack   alertmanager-kube-prometheus-stack-alertmanager-0            2/2     Running     0          27m
kube-prometheus-stack   kube-prometheus-stack-grafana-66b97ddb5d-v9mcq               3/3     Running     0          27m
kube-prometheus-stack   kube-prometheus-stack-kube-state-metrics-5b5c595697-ct8jt    1/1     Running     0          27m
kube-prometheus-stack   kube-prometheus-stack-operator-6876c99f4-ss2qq               1/1     Running     0          27m
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-mf2m5         1/1     Running     0          5m16s
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-svtmd         1/1     Running     0          27m
kube-prometheus-stack   kube-prometheus-stack-prometheus-node-exporter-tjbdl         1/1     Running     0          27m
kube-prometheus-stack   prometheus-kube-prometheus-stack-prometheus-0                2/2     Running     0          27m
kube-system             aws-node-68dw5                                               1/1     Running     0          27m
kube-system             aws-node-8jv2l                                               1/1     Running     0          5m17s
kube-system             aws-node-p62j6                                               1/1     Running     0          26m
kube-system             coredns-6ff9c46cd8-5mfqf                                     1/1     Running     0          34m
kube-system             coredns-6ff9c46cd8-srkzh                                     1/1     Running     0          34m
kube-system             efs-csi-controller-54b8d456b8-lgssf                          3/3     Running     0          26m
kube-system             efs-csi-controller-54b8d456b8-s9vsv                          3/3     Running     0          26m
kube-system             efs-csi-node-4b62w                                           3/3     Running     0          26m
kube-system             efs-csi-node-7hg22                                           3/3     Running     0          5m17s
kube-system             efs-csi-node-qp7z4                                           3/3     Running     0          26m
kube-system             fsx-csi-controller-fbbc5458b-cbbg8                           4/4     Running     0          26m
kube-system             fsx-csi-controller-fbbc5458b-rwqvj                           4/4     Running     0          26m
kube-system             fsx-csi-node-5xgr4                                           3/3     Running     0          26m
kube-system             fsx-csi-node-6plsp                                           3/3     Running     0          5m16s
kube-system             fsx-csi-node-9kmcc                                           3/3     Running     0          26m
kube-system             kube-proxy-d86mz                                             1/1     Running     0          28m
kube-system             kube-proxy-sp2lc                                             1/1     Running     0          28m
kube-system             kube-proxy-zlpcw                                             1/1     Running     0          5m16s
kube-system             metrics-server-76c55fc4fc-vbf6v                              1/1     Running     0          30m
prometheus-adapter      prometheus-adapter-69599fc86b-67hbv                          1/1     Running     0          30m
prometheus-adapter      prometheus-adapter-69599fc86b-gmjd2                          1/1     Running     0          30m

@bryantbiggs bryantbiggs requested a review from a team as a code owner October 3, 2023 00:55
@bryantbiggs bryantbiggs temporarily deployed to EKS Blueprints Test October 3, 2023 00:55 — with GitHub Actions Inactive
@bryantbiggs bryantbiggs merged commit 9e407f0 into main Oct 3, 2023
56 checks passed
@bryantbiggs bryantbiggs deleted the fix/efa branch October 3, 2023 11:44
2 participants