
Remove block-device-mapping from attach limit #1843

Merged

Conversation

jsafrane
Contributor

@jsafrane jsafrane commented Nov 21, 2023

Is this a bug fix or adding new feature?
/kind bug
Fixes: #1808

What is this PR about? / Why do we need it?
meta-data/block-device-mapping reports the devices that were attached to the VM at the time the VM was started. It does not reflect devices attached or detached later.

In addition, even if it reported the attachments accurately, the CSI driver reads this information only at startup. At that time, there may still be CSI volumes attached from the previous node boot. Those volumes are then counted as permanently attached to the node, even though Kubernetes may detach them later.

Therefore, do not count block-device-mapping at all and always reserve just 1 attachment for the root disk. Assume all other attachments are CSI volumes, and let Kubernetes do the accounting of attached volumes instead.
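
For illustration, a simplified Go sketch of the before/after calculation (the function names and the limit of 27 are placeholders inferred from the counts shown below; this is not the driver's actual code):

package main

import "fmt"

// allocatableBefore models the old behavior: every block-device-mapping entry
// (root disk plus any volume attached at boot, including stale CSI volumes)
// is treated as permanently reserved.
func allocatableBefore(maxAttachments, bootAttachments int) int {
	return maxAttachments - bootAttachments
}

// allocatableAfter models the new behavior: reserve exactly one slot for the
// root disk and let Kubernetes account for all CSI attachments itself.
func allocatableAfter(maxAttachments int) int {
	return maxAttachments - 1
}

func main() {
	// 4 stale CSI volumes + 1 root disk were in block-device-mapping at boot.
	fmt.Println(allocatableBefore(27, 5)) // 22
	fmt.Println(allocatableAfter(27))     // 26
}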

What testing is done?
Tested with Kubernetes 1.28 (OpenShift): a node was shut down without being drained first and then started again with some CSI driver volumes still attached.

Before this PR:

$ kubectl get csinode -o yaml
...
kind: CSINode
spec:
  drivers:
  - allocatable:
      count: 22   # 4 CSI volumes are counted as permanently attached by the driver + 1 root disk
    name: ebs.csi.aws.com
    nodeID: i-0d86f24259c8ce1bd
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

After this PR:

kind: CSINode
spec:
  drivers:
  - allocatable:
      count: 26 # just the root disk is counted as permanently attached.
    name: ebs.csi.aws.com
    nodeID: i-0d86f24259c8ce1bd
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 21, 2023

Code Coverage Diff

File Old Coverage New Coverage Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/metadata.go 86.4% 85.7% -0.6
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/metadata_ec2.go 84.1% 84.2% 0.1
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go 80.2% 80.0% -0.3

Member

@torredil torredil left a comment


Thank you @jsafrane /lgtm

At the very least, blockDevMappings should definitely not take into account CSI-managed volumes. Users relying on the current behavior can make use of the node.volumeAttachLimit parameter to "reserve" attachment slots.
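
For example, a minimal sketch of such an override, assuming a Helm-based install (the value 24 is only an illustrative number; pick one that leaves room for your non-CSI attachments):

node:
  volumeAttachLimit: 24

This corresponds to the node plugin's --volume-attach-limit flag mentioned later in this thread.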

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2023
@torredil
Member

cc: @ConnorJC3 @AndrewSirenko to review before approval.

@AndrewSirenko
Contributor

Big thank you @jsafrane for diving deep on #1808.

It's quite hard and error-prone to recover from this situation. The cluster admin must drain and shut down a node that has a misleading capacity count and then start it again; it's not enough to restart the CSI driver.

^^ is quite the workload interruption to pay without this PR.

@torredil
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 29, 2023
@k8s-ci-robot k8s-ci-robot merged commit 61c721b into kubernetes-sigs:master Nov 29, 2023
19 checks passed
@zerkms
Contributor

zerkms commented Nov 29, 2023

@jsafrane I don't fully understand this particular fix, but does it take into account network interfaces? They also contribute to the same limit; e.g., if you attach 10 more network interfaces, the number of EBS volumes you can attach drops by the same amount.
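
As a concrete illustration (the slot count is an assumption for the example): many Nitro instance types share a single pool of 28 attachment slots across ENIs, EBS volumes, and NVMe instance store devices, so with 1 root volume, 1 primary ENI, and 10 extra ENIs, only 28 - 1 - 1 - 10 = 16 slots remain for CSI-managed EBS volumes.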

@kon-angelo

kon-angelo commented Nov 30, 2023

I also need to express concerns. Wouldn't this PR break the use of any non-boot volume if it is not attached via Kubernetes?

For example, some of our users in Gardener opt for additional data volumes, which will now break completely. It is quite an assumption that no other volume can be used at creation time, and it is a highly breaking change that should be marked as such.

@ConnorJC3
Contributor

I didn't get a chance to review this PR before it was merged, but I do agree that it breaks the ability to use non-CSI volumes (other than the root volume) on nodes where the EBS CSI driver is running.

That's a significant enough breaking change that we need to look into a fix or a revert of this PR before the next release.

@torredil
Member

torredil commented Dec 1, 2023

Hi @kon-angelo thanks for chiming in and starting this discussion, really appreciate the feedback.

How are you folks solving this problem with other providers? It would be great to work on a solution that works well with the broader community. Currently, K8s does not support managing both non-CSI and CSI volumes simultaneously, and other CSI drivers do not take non-CSI-managed volumes into account.

Not at all familiar with Gardener, so pardon me if this is completely unfeasible or does not meet your requirements, but given that the driver is already deployed by default on every cluster: would it be possible to follow the standard approach and use the driver to manage these additional data volumes?


I'm not inclined to revert the changes made by this PR. It fixes a critical bug in the driver that affects all users relying on it to manage the lifecycle of their volumes (the supported, common case), and as previously mentioned, the resulting failure mode requires manual intervention to fix (once the scheduler schedules a pod incorrectly, the stateful pod needs to be deleted and the node cordoned so that the same mistake does not happen again).

Wouldn't this PR break the use of any non-boot volume if it is not done via kubernetes ?

it breaks the ability to use non-CSI volumes (other than the root volume) on nodes where the EBS CSI driver is running.

For clarity: the changes in this PR do not inherently break the ability to use non-CSI volumes on nodes where the EBS CSI driver is running. If the node has capacity to support the attachment, it will succeed (these data volumes are practically guaranteed to successfully grab a slot - I'd be more worried about the consequences of that). That said, this has been actively discouraged in the community: kubernetes/kubernetes#81533 (comment) cc: @gnufied

As Jan already mentioned, the driver reports the allocatable count on the node only once during driver initialization. There is no reconciliation. Even when reverting, if users attach (or detach) volumes or ENIs outside of K8s such as via the CLI, the allocatable count will not be updated after plugin registration.

Finally, I also wanted to highlight that since the driver is unaware of block devices attached outside of K8s, attempting to force the CSI driver to be aware of non-CSI attachments makes the hard-coded table of device names unreliable: if the driver attempts to re-use a device name, the attachment will fail indefinitely. This async failure mode is extra troublesome because it's not very obvious and is quite hard to track down.

@jsafrane
Contributor Author

jsafrane commented Dec 1, 2023

There is no Kubernetes way to count volume attachments done outside of Kubernetes. Different CSI drivers solve it in different ways; a very quick check of a few random cloud-based CSI drivers shows:

  • OpenStack Cinder reads the attach limit from a config file.
  • vSphere and Alibaba Disk can optionally read the attach limit from an env. var.
  • Azure Disk can optionally read the attach limit from node labels (which defeats the CSI goal of being container-orchestrator agnostic!).
  • IBM VPC and Google Persistent Disk reserve 1 slot for the root disk and do not have any way to override the limit.

None of these CSI drivers counts existing attachments at driver startup.
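
As a rough sketch of the static-override pattern several of these drivers use (the env var name is a placeholder, not any particular driver's actual variable):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// attachLimit returns a statically configured limit when one is set in the
// environment, otherwise a provider-specific default. This mirrors the
// env-var approach described above; it is not any driver's real code.
func attachLimit(defaultLimit int) int {
	if v := os.Getenv("VOLUME_ATTACH_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultLimit
}

func main() {
	fmt.Println(attachLimit(25))
}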

Does AWS provide a reliable way to distinguish CSI from non-CSI attachments in the instance metadata, or with very limited AWS permissions? And is driver pod start an acceptable time to count the non-CSI attachments? It can happen at an arbitrary time, so those non-CSI attachments would basically need to stay attached forever, which does not sound very reliable to me. But it would work for Gardener's data disks, if I understood it correctly.

I agree this PR could break some clusters and it should be treated as a breaking change. In the worst case, Gardener and other clusters that need reserved attachment slots could set --volume-attach-limit to override the driver's calculation.

@kon-angelo

kon-angelo commented Dec 1, 2023

Hey @torredil

Not at all familiar with Gardener

Not necessary; I will be happy to answer any questions.

How are you folks solving this problem with other providers?

As @jsafrane analyzed, it is not that easy to support this with other providers, but it was working as expected on AWS. Before this PR, the driver's behavior meant that EBS volumes present at boot time were taken into account in the allocatable calculation.

We faced a different (but related) issue in the past where the AWS API broke this convention and reported EBS volumes via the metadata service even without a node reboot. In Gardener we also don't do node reboots; as part of normal operation we opt to create new nodes instead, which is why the old behavior made more sense, at least for us. We treat boot-time volumes as part of an immutable node/VM spec.

For clarity: the changes in this PR do not inherently break the ability to use non-CSI volumes on nodes where the EBS CSI driver is running

I think you already gave the example I would use. Before this PR, you could rely on the kube-scheduler correctly scheduling pods even on nodes with non-CSI volumes. Now you will face the issue that a pod is scheduled and everything looks "healthy" from the k8s side, but the volume can never be attached.

I'd be more worried about the consequences of that). That said, this has been actively discouraged in the community: kubernetes/kubernetes#81533 (comment) cc: @gnufied

That comment is also dangerous then, as it mentions a "docker" volume, which is already a problem with this PR.

As Jan already mentioned, the driver reports the allocatable count on the node only once during driver initialization. There is no reconciliation. Even when reverting, if users attach (or detach) volumes or ENIs outside of K8s such as via the CLI, the allocatable count will not be updated after plugin registration.

Which is what made the behavior correct, as it allows you to separate CSI and non-CSI volumes by virtue of the latter being there at VM creation and the metadata service accounting for them.

Does AWS provide a reliable way to distinguish CSI from non-CSI attachments in the instance metadata, or with very limited AWS permissions? But it would work for Gardener's data disks, if I understood it correctly.

At least not from the metadata service. But this would be the optimal solution.

And is driver pod start an acceptable time to count the non-CSI attachments? It can happen at an arbitrary time, so those non-CSI attachments would basically need to stay attached forever, which does not sound very reliable to me.

This may be a bit Gardener-specific, but non-CSI attachments are there for the lifetime of the node. In some cases they are also used to set up the node, e.g. for kubelet, containerd, additional drivers, etc. Gardener simply draws the line at VM boot-up, and everything past that happens in the k8s/CSI world only.
I would otherwise agree that arbitrary volume operations on the node, outside of CSI, should not be in scope.

@jsafrane
Contributor Author

jsafrane commented Dec 4, 2023

[...] as it allows you to separate CSI and non-CSI volumes by virtue of the latter being there at VM creation and the metadata service accounting for them.

That is actually the problem here. AWS metadata reports attachments not at VM creation time but at node startup time. If there were CSI volumes attached at startup time, then the instance metadata contains both CSI and non-CSI volumes, and the driver has no way to tell which is which. We can teach cluster admins to always drain nodes before shutdown/reboot, but in reality there are always some nodes that get shut down ungracefully.
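
For reference, this boot-time snapshot can be inspected from the instance itself via IMDS (IMDSv2 shown; the listing reflects what was attached at startup, not the current attachments):

$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    "http://169.254.169.254/latest/meta-data/block-device-mapping/"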

torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023
torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023
@torredil torredil mentioned this pull request Dec 11, 2023
torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023