
Remove block-device-mapping from attach limit #1843

Merged

Conversation

jsafrane
Contributor

@jsafrane jsafrane commented Nov 21, 2023

Is this a bug fix or adding new feature?
/kind bug
Fixes: #1808

What is this PR about? / Why do we need it?
meta-data/block-device-mapping reports the devices that were attached to the VM at the time the VM was started. It does not reflect devices attached or detached later.

In addition, even if it reported the attachments accurately, the CSI driver reads this information only at startup. At that time, there may still be CSI volumes attached from the previous node boot. Those volumes are then counted as permanently attached to the node, even though Kubernetes may detach them later.

Therefore, do not count block-device-mapping at all and always reserve just 1 attachment for the root disk. Assume all other attachments are CSI volumes, and let Kubernetes do the accounting of attached volumes instead.
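
For illustration, a simplified Go sketch of the before/after calculation (the function names and the limit of 27 are placeholders inferred from the counts shown below; this is not the driver's actual code):

package main

import "fmt"

// allocatableBefore models the old behavior: every block-device-mapping entry
// (root disk plus any volume attached at boot, including stale CSI volumes)
// is treated as permanently reserved.
func allocatableBefore(maxAttachments, bootAttachments int) int {
	return maxAttachments - bootAttachments
}

// allocatableAfter models the new behavior: reserve exactly one slot for the
// root disk and let Kubernetes account for all CSI attachments itself.
func allocatableAfter(maxAttachments int) int {
	return maxAttachments - 1
}

func main() {
	// 4 stale CSI volumes + 1 root disk were in block-device-mapping at boot.
	fmt.Println(allocatableBefore(27, 5)) // 22
	fmt.Println(allocatableAfter(27))     // 26
}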

What testing is done?
Tested with Kubernetes 1.28 (OpenShift): a node was shut down without being drained first and then started again with some CSI driver volumes still attached.

Before this PR:

$ kubectl get csinode -o yaml
...
kind: CSINode
spec:
  drivers:
  - allocatable:
      count: 22   # 4 CSI volumes are counted as permanently attached by the driver + 1 root disk
    name: ebs.csi.aws.com
    nodeID: i-0d86f24259c8ce1bd
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

After this PR:

kind: CSINode
spec:
  drivers:
  - allocatable:
      count: 26 # just the root disk is counted as permanently attached.
    name: ebs.csi.aws.com
    nodeID: i-0d86f24259c8ce1bd
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 21, 2023

Code Coverage Diff

File Old Coverage New Coverage Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/metadata.go 86.4% 85.7% -0.6
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/metadata_ec2.go 84.1% 84.2% 0.1
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/node.go 80.2% 80.0% -0.3

Member

@torredil torredil left a comment


Thank you @jsafrane /lgtm

At the very least, blockDevMappings should definitely not take into account CSI-managed volumes. Users relying on the current behavior can make use of the node.volumeAttachLimit parameter to "reserve" attachment slots.
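
For example, a minimal sketch of such an override, assuming a Helm-based install (the value 24 is only an illustrative number; pick one that leaves room for your non-CSI attachments):

node:
  volumeAttachLimit: 24

This corresponds to the node plugin's --volume-attach-limit flag mentioned later in this thread.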

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2023
@torredil
Member

cc: @ConnorJC3 @AndrewSirenko to review before approval.

@AndrewSirenko
Contributor

Big thank you @jsafrane for diving deep on #1808.

It's quite hard and error-prone to recover from this situation. The cluster admin must drain and shut down a node that has a misleading capacity count and then start it again; it's not enough to restart the CSI driver.

^^ is quite the workload interruption to pay without this PR.

@torredil
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 29, 2023
@k8s-ci-robot k8s-ci-robot merged commit 61c721b into kubernetes-sigs:master Nov 29, 2023
19 checks passed
@zerkms
Contributor

zerkms commented Nov 29, 2023

@jsafrane I don't fully understand this particular fix, but does it take into account network interfaces? They also contribute to the same limit; e.g., if you attach 10 more network interfaces, the number of EBS volumes you can attach drops by the same amount.
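
As a concrete illustration (the slot count is an assumption for the example): many Nitro instance types share a single pool of 28 attachment slots across ENIs, EBS volumes, and NVMe instance store devices, so with 1 root volume, 1 primary ENI, and 10 extra ENIs, only 28 - 1 - 1 - 10 = 16 slots remain for CSI-managed EBS volumes.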

@kon-angelo

kon-angelo commented Nov 30, 2023

I also need to express concerns. Wouldn't this PR break the use of any non-boot volume if it is not attached via Kubernetes?

For example, some of our users in Gardener opt for additional data volumes, which will now break completely. It is quite an assumption that no other volume can be used at creation time, and it is a highly breaking change that should be marked as such.

@ConnorJC3
Contributor

I didn't get a chance to review this PR before it was merged, but I do agree that it breaks the ability to use non-CSI volumes (other than the root volume) on nodes where the EBS CSI driver is running.

That's a significant enough breaking change that we need to look into a fix or a revert of this PR before the next release.

@torredil
Member

torredil commented Dec 1, 2023

Hi @kon-angelo thanks for chiming in and starting this discussion, really appreciate the feedback.

How are you folks solving this problem with other providers? It would be great to work on a solution that works well with the broader community. Currently, K8s does not support managing both non-CSI and CSI volumes simultaneously, and other CSI drivers do not take non-CSI-managed volumes into account.

Not at all familiar with Gardener, so pardon me if this is completely unfeasible or does not meet your requirements, but given that the driver is already deployed by default on every cluster: would it be possible to follow the standard approach and use the driver to manage these additional data volumes?


I'm not inclined to revert the changes made by this PR. It fixes a critical bug in the driver that affects all users relying on it to manage the lifecycle of their volumes (the supported, common case), and as previously mentioned, the resulting failure mode requires manual intervention to fix (once the scheduler schedules a pod incorrectly, the stateful pod needs to be deleted and the node cordoned so that the same mistake does not happen again).

Wouldn't this PR break the use of any non-boot volume if it is not done via kubernetes ?

it breaks the ability to use non-CSI volumes (other than the root volume) on nodes where the EBS CSI driver is running.

For clarity: the changes in this PR do not inherently break the ability to use non-CSI volumes on nodes where the EBS CSI driver is running. If the node has capacity to support the attachment, it will succeed (these data volumes are practically guaranteed to successfully grab a slot - I'd be more worried about the consequences of that). That said, this has been actively discouraged in the community: kubernetes/kubernetes#81533 (comment) cc: @gnufied

As Jan already mentioned, the driver reports the allocatable count on the node only once during driver initialization. There is no reconciliation. Even when reverting, if users attach (or detach) volumes or ENIs outside of K8s such as via the CLI, the allocatable count will not be updated after plugin registration.

Finally, I also wanted to highlight that since the driver is unaware of block devices attached outside of K8s, attempting to force the CSI driver to be aware of non-CSI attachments makes the hard-coded table of device names unreliable: if the driver attempts to re-use a device name, the attachment will fail indefinitely. This async failure mode is extra troublesome because it's not very obvious and is quite hard to track down.

@jsafrane
Contributor Author

jsafrane commented Dec 1, 2023

There is no Kubernetes way to count volume attachments done outside of Kubernetes. Different CSI drivers solve it in different ways; a very quick check of a few random cloud-based CSI drivers shows:

  • OpenStack Cinder reads the attach limit from a config file.
  • vSphere and Alibaba Disk can optionally read the attach limit from an env. var.
  • Azure Disk can optionally read the attach limit from node labels (which defeats the CSI goal of being container-orchestrator agnostic!).
  • IBM VPC and Google Persistent Disk reserve 1 slot for the root disk and do not have any way to override the limit.

None of these CSI drivers counts existing attachments at driver startup.
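
As a rough sketch of the static-override pattern several of these drivers use (the env var name is a placeholder, not any particular driver's actual variable):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// attachLimit returns a statically configured limit when one is set in the
// environment, otherwise a provider-specific default. This mirrors the
// env-var approach described above; it is not any driver's real code.
func attachLimit(defaultLimit int) int {
	if v := os.Getenv("VOLUME_ATTACH_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultLimit
}

func main() {
	fmt.Println(attachLimit(25))
}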

Does AWS provide a reliable way to distinguish CSI from non-CSI attachments in the instance metadata, or with very limited AWS permissions? And is driver pod start an acceptable time to count the non-CSI attachments? It can happen at an arbitrary time, so those non-CSI attachments would basically need to stay attached forever, which does not sound very reliable to me. But it would work for Gardener's data disks, if I understood it correctly.

I agree this PR could break some clusters and it should be treated as a breaking change. In the worst case, Gardener and other clusters that need reserved attachment slots could set --volume-attach-limit to override the driver's calculation.

@kon-angelo

kon-angelo commented Dec 1, 2023

Hey @torredil

Not at all familiar with Gardener

Not necessary; I will be happy to answer any questions.

How are you folks solving this problem with other providers?

As @jsafrane analyzed, it is not that easy to support this with other providers, but it was working as expected on AWS. Before this PR, the driver's behavior meant that EBS volumes present at boot time were taken into account in the allocatable calculation.

We faced a different (but related) issue in the past where the AWS API broke this convention and reported EBS volumes via the metadata service even without a node reboot. In Gardener we also don't do node reboots; as part of normal operation we opt to create new nodes instead, which is why the old behavior made more sense, at least for us. We treat boot-time volumes as part of an immutable node/VM spec.

For clarity: the changes in this PR do not inherently break the ability to use non-CSI volumes on nodes where the EBS CSI driver is running

I think you already gave the example I would use. Before this PR, you could rely on the kube-scheduler correctly scheduling pods even on nodes with non-CSI volumes. Now you will face the issue that a pod is scheduled and everything looks "healthy" from the k8s side, but the volume can never be attached.

I'd be more worried about the consequences of that). That said, this has been actively discouraged in the community: kubernetes/kubernetes#81533 (comment) cc: @gnufied

That comment is also dangerous then, as it mentions a "docker" volume, which is already a problem with this PR.

As Jan already mentioned, the driver reports the allocatable count on the node only once during driver initialization. There is no reconciliation. Even when reverting, if users attach (or detach) volumes or ENIs outside of K8s such as via the CLI, the allocatable count will not be updated after plugin registration.

Which is what made the behavior correct, as it allows you to separate CSI and non-CSI volumes by virtue of the latter being there at VM creation and the metadata service accounting for them.

Does AWS provide a reliable way to distinguish CSI from non-CSI attachments in the instance metadata, or with very limited AWS permissions? But it would work for Gardener's data disks, if I understood it correctly.

At least not from the metadata service. But this would be the optimal solution.

And is driver pod start an acceptable time to count the non-CSI attachments? It can happen at an arbitrary time, so those non-CSI attachments would basically need to stay attached forever, which does not sound very reliable to me.

This may be a bit Gardener-specific, but non-CSI attachments are there for the lifetime of the node. In some cases they are also used to set up the node, e.g. for kubelet, containerd, additional drivers, etc. Gardener simply draws the line at VM boot-up, and everything past that happens in the k8s/CSI world only.
I would otherwise agree that arbitrary volume operations on the node, outside of CSI, should not be in scope.

@jsafrane
Contributor Author

jsafrane commented Dec 4, 2023

[...] as it allows you to separate CSI and non-CSI volumes by virtue of the latter being there at VM creation and the metadata service accounting for them.

That is actually the problem here. AWS metadata reports attachments not at VM creation time but at node startup time. If there were CSI volumes attached at startup time, then the instance metadata contains both CSI and non-CSI volumes, and the driver has no way to tell which is which. We can teach cluster admins to always drain nodes before shutdown/reboot, but in reality there are always some nodes that get shut down ungracefully.
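
For reference, this boot-time snapshot can be inspected from the instance itself via IMDS (IMDSv2 shown; the listing reflects what was attached at startup, not the current attachments):

$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    "http://169.254.169.254/latest/meta-data/block-device-mapping/"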

torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023
torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023
@torredil torredil mentioned this pull request Dec 11, 2023
torredil added a commit to torredil/aws-ebs-csi-driver that referenced this pull request Dec 11, 2023