Upgrade openscapes AWS EKS k8s version 1.21 to 1.24 #2125

Closed
consideRatio opened this issue Feb 1, 2023 · 0 comments · Fixed by #2139

consideRatio commented Feb 1, 2023

The oldest k8s version we run is now in the openscapes cluster, at k8s 1.21. Let's get it upgraded so that the oldest version we run becomes 1.22.

This issue was branched off from #2057.

I most recently upgraded an EKS cluster in #2085, and will reuse the notes from there, adjusting and iterating on them as I go.

  # For reference, these are the steps I took when upgrading carbonplan from k8s 1.19 to
  # k8s 1.24, Jan 24th 2023.
  #
  # 1. Updated the version field in this config from 1.19 to 1.20
  #
  #    - It is not allowed to upgrade the control plane by more than one minor version at a time
  #
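  #    - For context, the version field lives under metadata in the eksctl
  #      ClusterConfig. An illustrative sketch only; the region shown here is an
  #      assumption, not necessarily this cluster's actual one:
  #
  #      apiVersion: eksctl.io/v1alpha5
  #      kind: ClusterConfig
  #      metadata:
  #        name: carbonplanhub
  #        region: us-west-2  # assumed region, check the real config
  #        version: "1.20"
  #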
  # 2. Upgraded the control plane (takes ~10 minutes)
  #
  #    - I ran into permission errors, so I visited the AWS cloud console to
  #      create an access key for my user and set it up as temporary environment
  #      variables.
  #
  #      export AWS_ACCESS_KEY_ID="..."
  #      export AWS_SECRET_ACCESS_KEY="..."
  #
  #    eksctl upgrade cluster --config-file eksctl-cluster-config.yaml --approve
  #
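  #    - To verify the control plane version afterwards, standard commands like
  #      these should work:
  #
  #      eksctl get cluster --name=carbonplanhub
  #      kubectl version --short
  #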
  # 3. Deleted all non-core nodegroups
  #
  #    - I had to add a --drain=false flag due to an error likely related to a
  #      very old EKS cluster.
  #
  #    - I used --include="nb-*,dask-*" because I saw that the core node pool
  #      was named "core-a", and the other nodes started with "nb-" or "dask-".
  #
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "nb-*,dask-*" --approve --drain=false
  #
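  #    - To confirm which nodegroups and nodes remain after the deletion,
  #      standard listing commands can be used:
  #
  #      eksctl get nodegroup --cluster=carbonplanhub
  #      kubectl get nodes
  #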
  # 4. Updated the version field in this config from 1.20 to 1.22
  #
  #    - It is allowed to have a nodegroup +-2 minor versions away from the control plane version
  #
  # 5. Created a new core nodepool (core-b)
  #
  #    - I ran into "Unauthorized" errors and resolved them by first using the
  #      deployer to acquire credentials to modify a ConfigMap named "aws-auth"
  #      in the k8s namespace kube-system.
  #
  #      deployer use-cluster-credentials carbonplan
  #
  #      kubectl edit cm -n kube-system aws-auth
  #
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "core-b" --install-nvidia-plugin=false
  #
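  #    - For reference, the relevant part of the "aws-auth" ConfigMap is its
  #      mapUsers (or mapRoles) entry. A minimal sketch with placeholder values,
  #      not the actual entries used:
  #
  #      data:
  #        mapUsers: |
  #          - userarn: arn:aws:iam::<account-id>:user/<iam-username>
  #            username: <iam-username>
  #            groups:
  #              - system:masters
  #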
  # 6. Deleted the old core nodepool (core-a)
  #
  #    - I first updated the eksctl config file to include a "core-a" entry,
  #      because I hadn't really added a "core-b" previously; I had just renamed
  #      the "core-a" entry to "core-b".
  #
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "core-a" --approve
  #
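  #    - After the deletion, a quick sanity check that everything got
  #      rescheduled off the old core nodes can be done with standard commands:
  #
  #      kubectl get nodes
  #      kubectl get pods -A --field-selector=status.phase!=Running
  #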
  # 7. Upgraded add-ons (takes ~3*5s)
  #
  #    eksctl utils update-kube-proxy --cluster=carbonplanhub --approve
  #    eksctl utils update-aws-node --cluster=carbonplanhub --approve
  #    kubectl patch daemonset -n kube-system aws-node --patch='{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"aws-node"}],"containers":[{"name":"aws-node","securityContext":{"allowPrivilegeEscalation":null,"runAsNonRoot":null}}]}}}}'
  #    eksctl utils update-coredns --cluster=carbonplanhub --approve
  #
  #    - I diagnosed two separate errors following this:
  #
  #      kubectl get pod -n kube-system
  #      kubectl describe pod -n kube-system aws-node-7rcsw
  #
  #      Warning  Failed     9s (x7 over 69s)  kubelet            Error: container has runAsNonRoot and image will run as root
  #
  #      - the aws-node daemonset's pods failed to start because of a too
  #        restrictive container securityContext that disallowed running as root.
  #
  #        aws-node issue: https://github.com/weaveworks/eksctl/issues/6048.
  #
  #        Resolved by removing `runAsNonRoot: true` and
  #        `allowPrivilegeEscalation: false`. Using --output-patch=true led me
  #        to a `kubectl patch` command to use.
  #
  #        kubectl edit ds -n kube-system aws-node --output-patch=true
  #
  #      - the kube-proxy daemonset's pods failed to pull the image, because it
  #        was not found.
  #
  #        This didn't need to be resolved midway through the upgrades, and was
  #        an issue that went away in k8s 1.23.
  #
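  #    - To check that the add-on daemonsets are healthy after each round of
  #      upgrades, standard commands like these can be used:
  #
  #      kubectl get daemonset -n kube-system
  #      kubectl rollout status daemonset -n kube-system aws-node
  #      kubectl rollout status daemonset -n kube-system kube-proxy
  #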
  # 8. Updated the version field in this config from 1.22 to 1.21
  #
  # 9. Upgraded the control plane, as in step 2.
  #
  # A. Upgraded add-ons, as in step 7.
  #
  # B. Updated the version field in this config from 1.21 to 1.22
  #
  # C. Upgraded the control plane, as in step 2.
  #
  # D. Upgraded add-ons, as in step 7.
  #
  # E. I refreshed the ekscluster config's .jsonnet file based on
  #    template.jsonnet, which has been updated to declare an addon related to
  #    EBS storage. In practice, I realize this was probably not used by the
  #    subsequent commands, but it feels good to have it in the ekscluster
  #    config to reflect that the addon was added manually.
  #
  #    addons: [
  #        {
  #            // aws-ebs-csi-driver ensures that our PVCs are bound to PVs that
  #            // couple to AWS EBS based storage. Without it, expect to see pods
  #            // mounting a PVC fail to schedule and PVC resources remain
  #            // unbound.
  #            //
  #            // Related docs: https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi.html
  #            //
  #            name: 'aws-ebs-csi-driver',
  #            wellKnownPolicies: {
  #                ebsCSIController: true,
  #            },
  #        },
  #    ],
  #
  #    eksctl create iamserviceaccount \
  #             --name=ebs-csi-controller-sa \
  #             --namespace=kube-system \
  #             --cluster=carbonplanhub \
  #             --attach-policy-arn=arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  #             --approve \
  #             --role-only \
  #             --role-name=AmazonEKS_EBS_CSI_DriverRole
  #    
  #    eksctl create addon --name=aws-ebs-csi-driver --cluster=carbonplanhub --service-account-role-arn=arn:aws:iam::631969445205:role/AmazonEKS_EBS_CSI_DriverRole --force
  #
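  #    - To verify that the addon got installed and that its pods are running,
  #      something like this should do (pod names may differ):
  #
  #      eksctl get addon --cluster=carbonplanhub
  #      kubectl get pods -n kube-system | grep ebs-csi
  #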
  # F. Updated the version field in this config from 1.22 to 1.23
  #
  # G. Upgraded the control plane, as in step 2.
  #
  # H. Upgraded add-ons, as in step 7.
  #
  # I. Updated the version field in this config from 1.23 to 1.24
  #
  # J. Upgraded the control plane, as in step 2.
  #
  # K. Upgraded add-ons, as in step 7.
  #
  # L. I created a new core node pool and deleted the old one, as in steps 5-6.
  #
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "core-a" --install-nvidia-plugin=false
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "core-b" --approve
  #
  # M. I recreated all other nodegroups.
  #
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "nb-*,dask-*" --install-nvidia-plugin=false
  #
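  #    - As a final sanity check, standard listing commands can confirm that
  #      all nodes report the expected kubelet version and that pods come back
  #      up across namespaces:
  #
  #      kubectl get nodes -o wide
  #      kubectl get pods -A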