Support clustered allocation when formatting ext4 filesystem #1706

AndrewSirenko · 2023-08-01T19:58:16Z

Is this a bug fix or adding new feature?
New feature

What is this PR about? / Why do we need it?
Close #1687

This PR:

Adds ext4BigAlloc and ext4ClusterSize storage class parameters that expose enabling clustered allocation on ext4 filesystems.
Allows developers to enable torn write prevention on their dynamically provisioned volumes by setting by setting ext4BigAlloc to true and 'ext4ClusterSize` to the desired cluster size.

The ext4BigAlloc parameter enables clustered allocation on ext4 filesystems by enabling the bigalloc feature (ie performing mkfs.ext4 -O bigalloc) when formatting a dynamically-provisioned Persistent Volume (PV). See this ext4 man page for more details about bigalloc.

The ext4ClusterSize parameter allows users to set a custom cluster size when formatting a PV with the ext4 filesystem IF they have set ext4BigAlloc to true. (appends -C ext4ClusterSize to other mkfs.ext4 formatting options).

One benefit to exposing this formatting option is that developers can enable torn write prevention on their volumes to improve the performance of their I/O-intensive relational database workloads and reduce latency without negatively impacting data resiliency. Relational databases that use InnoDB or XtraDB as the database engine, such as MySQL and MariaDB, will benefit from torn write prevention.

What testing is done?
Unit Tests + Manual Testing on EKS Cluster.

Manual Testing

Create an EKS cluster, deploy the ebs-csi-driver a version of the driver with an image built from this PR
Test Happy Path (valid ext4ClusterSize parameter value) and Unhappy Path (Invalid ext4ClusterSize parameter value)

Happy Path: Storage class with valid `ext4ClusterSize` parameter value yields properly formatted filesystem on PV

Create a pod with the following manifest (Note the ext4ClusterSize: "16384" in the storage class definition):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4
  type: gp3
  ext4BigAlloc: "true"
  ext4ClusterSize: "16384"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: docker.io/ubuntu
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim

Run kubectl exec -it app -- tune2fs -l /dev/nvme1n1 which contains the cluster-size parameter:

Cluster size:                   16384
Clusters per group:       16384

Unhappy Path: Storage class with invalid `ext4ClusterSize` parameter value yields failed mount and `mkfs.ext4` error

Create a pod with happy case manifest but change the ext4ClusterSize parameter

-  ext4ClusterSize: "16384"
+  ext4ClusterSize: "-1"

Run kubectl describe pod app to yield the following events:

Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        36s                default-scheduler        0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
  Normal   Scheduled               31s                default-scheduler        Successfully assigned kube-system/app to i-08389b759ac9c01e8
  Normal   SuccessfulAttachVolume  29s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e99db0dc-7d06-46b7-8789-f8541b54764f"
  Warning  FailedMount             13s (x6 over 29s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-e99db0dc-7d06-46b7-8789-f8541b54764f" : rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/fcceddc2f6cd5b18fd811ba25411e9e130f23a2d461e73daa39658c9011642b0/globalmount": format of disk "/dev/nvme1n1" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/fcceddc2f6cd5b18fd811ba25411e9e130f23a2d461e73daa39658c9011642b0/globalmount") options:("defaults") errcode:(exit status 1) output:(mkfs.ext4: invalid cluster size - 1)

Note the FailedMount due to the error mkfs.ext4: invalid cluster size - 1 at the end of the last line.

k8s-ci-robot · 2023-08-01T19:58:24Z

Hi @AndrewSirenko. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pkg/driver/node_test.go

torredil · 2023-08-02T10:48:30Z

/ok-to-test

ConnorJC3

lgtm other than my one suggestion, open to opinions on that

ConnorJC3 · 2023-08-04T16:08:05Z

docs/parameters.md

@@ -18,6 +18,7 @@ The AWS EBS CSI Driver supports [tagging](tagging.md) through `StorageClass.para
 | "inodeSize"                  |                                                    |         | The inode size to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`, or `xfs`.                                                                                                                                                                                                                                               |
 | "bytesPerINode"              |                                                    |         | The `bytes-per-inode` to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`.                                                                                                                                                                                                                                                  |
 | "numberOfINodes"             |                                                    |         | The `number-of-inodes` to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`.                                                                                                                                                                                                                                                 |
+| "bigAllocClusterSize         |                                                    |         | The cluster size (in bytes) to use when formatting an `ext4` filesystem with the `bigalloc` option enabled. Specifying a value will enable the `bigalloc` formatting option.                                                                                                                                                                                                                                                                                                |


From my understanding (and based on the kernel wiki here: https://ext4.wiki.kernel.org/index.php/Bigalloc) bigalloc is just the parameter to enable clustered allocation on ext4.

Thus, I propose this parameter (and associated internal variable/function names) is renamed to 'clusterSize because that is what this is actually doing. This also would make sense if we add support in the future for another filesystem that supports specifying cluster size.

I disagree because I think bigAllocClusterSize helps the customer more easily look up what this parameter does. My only grievance with clusterSize is that Windows renamed block size to cluster size. This is why I added the explicit mention of bigalloc in. https://support.microsoft.com/en-us/topic/default-cluster-size-for-ntfs-fat-and-exfat-9772e6f1-e31a-00d7-e18f-73169155af95

This issue is exacerbated by the first link coming up when you search "cluster size" is that windows document.

TLDR: Yes this parameter name SHOULD be clusterSize, but because Windows overloaded the term I think we need the explicit mention of ext4's bigAlloc formatting option.

In that case, I propose the name ext4ClusterSize. If we're going to limit the name of the parameter to only ext4 for whatever reason, we should be explicit about that.

That said, I still do not prefer using a name that is intentionally limiting our future uses.

Suggest ext4ClusterSize as proposed above.

torredil · 2023-08-04T16:57:21Z

/test pull-aws-ebs-csi-driver-test-helm-chart

torredil · 2023-08-04T17:24:51Z

/test pull-aws-ebs-csi-driver-test-helm-chart

torredil

Thanks @AndrewSirenko /lgtm

ConnorJC3 · 2023-08-07T16:54:40Z

/retest

mvladev-skysql · 2023-08-10T08:04:05Z

This change would break the fsresizer:

E0810 07:49:42.399370       1 driver.go:122] "GRPC error" err=<
        rpc error: code = Internal desc = Could not resize volume "vol-***" ("/dev/nvme1n1"):  resize of device /dev/nvme1n1 failed: exit status 1. resize2fs output: resize2fs 1.42.9 (28-Dec-2013)

        Resizing bigalloc file systems has not been fully tested.  Proceed at
        your own risk!  Use the force option if you want to go ahead anyway.

 >

If you look at the resizer it doesn't add the -f option https://github.com/kubernetes/mount-utils/blob/41e8de37ef8a3782f9cd6c915699b98b2b24b2c4/resizefs_linux.go#L71-L80

AndrewSirenko · 2023-09-11T16:26:19Z

Just an update:

The new Linux AL2023 Base image for the driver contains an updated version of e2fsprogs which will prevent the breaking of fsresizer.

I'm working on adding an e2e formatting option parameter test suite to the driver before finishing this PR (so that I can add proper e2e testing of this feature along with the other formatting option parameters).

Thank you again Martin for exposing a gap in our testing.

AndrewSirenko · 2023-09-14T17:42:34Z

/hold

wmesard · 2023-10-11T14:59:02Z

docs/faq.md

+### `ext4BigAlloc` and `ext4ClusterSize`
+
+Warnings:
+- The `bigalloc` feature may not be fully supported with your node's Linux kernel or may have various bugs. Please see the [ext4(5) man-page](https://man7.org/linux/man-pages/man5/ext4.5.html) for details.


I realize that you're just quoting from the linked man page. But this still seems unnecessarily alarming. It makes it sound like the CSI driver's implementation may have various bugs, which is not the case.

Suggest:

Ext4's bigalloc is an experimental feature, under active development. Please pay particular attention to your kernel version. See the ext4(5) man-page for details.

wmesard · 2023-10-11T14:59:53Z

docs/faq.md

+
+Warnings:
+- The `bigalloc` feature may not be fully supported with your node's Linux kernel or may have various bugs. Please see the [ext4(5) man-page](https://man7.org/linux/man-pages/man5/ext4.5.html) for details.
+- Linux kernel release 4.15 added support for resizing ext4 filesystems using clustered allocation. **Resizing volumes mounted to nodes running a Linux kernel version prior to 4.15 will fail and require manual intervention and downtime of the volume.** 


and require manual intervention and downtime of the volume.

Either say what "manual intervention" is required or just delete this part of the sentence.

k8s-ci-robot · 2023-10-11T17:47:11Z

@wmesard: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

AndrewSirenko · 2023-10-11T18:09:47Z

/retest

AndrewSirenko · 2023-10-12T14:31:11Z

When deciding between customer experience 1 or 2 for:

a. Enabling clustered allocation for an ext4 fs (mkfs.ext4 -O bigalloc)
b. Enabling clustered allocation for an ext4 fs with a specified cluster size (mkfs.ext4 -O bigalloc -C 16384)

1a.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4BigAlloc: "true"

1b.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4BigAlloc: "true"
  ext4ClusterSize: "16384"

2a

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4ClusterSize: auto

2b.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4ClusterSize: "16384"

Decided for two separate parameters (option 1) due to:

We are not inventing this interface in a vacuum. We are exposing an ext4 feature. As such, it is more predictable, to remain consistent with the existing ext4 interface as much as possible.
Future proofing in the unlikely case the ext4 interface changes
Explicitly separating -O bigalloc and -C 16384

At the cost of:

Needing two different parameters to be set when specifying cluster size. If 'ext4BigAlloc' is not set to true, but 'ext4ClusterSize' is, there will be an InvalidArgument error, which may inconvenience some customers.

AndrewSirenko · 2023-10-12T14:31:25Z

/unhold

AndrewSirenko · 2023-10-12T14:31:47Z

/approve

k8s-ci-robot · 2023-10-12T14:31:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [AndrewSirenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2023

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 1, 2023

k8s-ci-robot requested review from gtxu and rdpsin August 1, 2023 19:58

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 1, 2023

hanyuel reviewed Aug 1, 2023

View reviewed changes

pkg/driver/node_test.go Outdated Show resolved Hide resolved

AndrewSirenko requested a review from hanyuel August 4, 2023 15:58

ConnorJC3 reviewed Aug 4, 2023

View reviewed changes

torredil approved these changes Aug 7, 2023

View reviewed changes

k8s-ci-robot assigned torredil Aug 7, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023

hanyuel approved these changes Aug 8, 2023

View reviewed changes

k8s-ci-robot assigned hanyuel Aug 8, 2023

AndrewSirenko mentioned this pull request Aug 24, 2023

REQUEST: New membership for AndrewSirenko kubernetes/org#4411

Closed

9 tasks

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2023

AndrewSirenko force-pushed the ext4BigAllocFormatOptions branch from a7de387 to 0333c47 Compare October 4, 2023 14:21

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023

k8s-ci-robot requested a review from torredil October 10, 2023 22:45

AndrewSirenko marked this pull request as draft October 10, 2023 23:11

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023

AndrewSirenko removed request for ConnorJC3, rdpsin, gtxu and torredil October 10, 2023 23:12

AndrewSirenko force-pushed the ext4BigAllocFormatOptions branch from c24bd3f to 2f8f6f2 Compare October 10, 2023 23:18

AndrewSirenko marked this pull request as ready for review October 10, 2023 23:30

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023

k8s-ci-robot requested review from ConnorJC3 and torredil October 10, 2023 23:30

wmesard reviewed Oct 11, 2023

View reviewed changes

Support clustered allocation when formatting ext4 filesystem

0734fbe

AndrewSirenko force-pushed the ext4BigAllocFormatOptions branch from 2f8f6f2 to 0734fbe Compare October 11, 2023 17:15

wmesard approved these changes Oct 11, 2023

View reviewed changes

torredil approved these changes Oct 11, 2023

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2023

AndrewSirenko removed the request for review from ConnorJC3 October 12, 2023 14:33

k8s-ci-robot merged commit 1b9d078 into kubernetes-sigs:master Oct 12, 2023
17 checks passed

AndrewSirenko deleted the ext4BigAllocFormatOptions branch November 1, 2023 13:49

jsafrane mentioned this pull request Nov 21, 2023

STOR-1400: Rebase to v1.25.0 for OCP 4.15 openshift/aws-ebs-csi-driver#244

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support clustered allocation when formatting ext4 filesystem #1706

Support clustered allocation when formatting ext4 filesystem #1706

AndrewSirenko commented Aug 1, 2023 •

edited

Loading

k8s-ci-robot commented Aug 1, 2023

torredil commented Aug 2, 2023

ConnorJC3 left a comment

ConnorJC3 Aug 4, 2023

AndrewSirenko Aug 4, 2023

AndrewSirenko Aug 4, 2023

ConnorJC3 Aug 7, 2023 •

edited

Loading

torredil Oct 9, 2023

torredil commented Aug 4, 2023

torredil commented Aug 4, 2023

torredil left a comment

ConnorJC3 commented Aug 7, 2023

mvladev-skysql commented Aug 10, 2023 •

edited

Loading

AndrewSirenko commented Sep 11, 2023

AndrewSirenko commented Sep 14, 2023

wmesard Oct 11, 2023

wmesard Oct 11, 2023

k8s-ci-robot commented Oct 11, 2023

AndrewSirenko commented Oct 11, 2023

AndrewSirenko commented Oct 12, 2023

AndrewSirenko commented Oct 12, 2023

AndrewSirenko commented Oct 12, 2023

k8s-ci-robot commented Oct 12, 2023

Support clustered allocation when formatting ext4 filesystem #1706

Support clustered allocation when formatting ext4 filesystem #1706

Conversation

AndrewSirenko commented Aug 1, 2023 • edited Loading

Manual Testing

Happy Path: Storage class with valid ext4ClusterSize parameter value yields properly formatted filesystem on PV

Unhappy Path: Storage class with invalid ext4ClusterSize parameter value yields failed mount and mkfs.ext4 error

k8s-ci-robot commented Aug 1, 2023

torredil commented Aug 2, 2023

ConnorJC3 left a comment

Choose a reason for hiding this comment

ConnorJC3 Aug 4, 2023

Choose a reason for hiding this comment

AndrewSirenko Aug 4, 2023

Choose a reason for hiding this comment

AndrewSirenko Aug 4, 2023

Choose a reason for hiding this comment

ConnorJC3 Aug 7, 2023 • edited Loading

Choose a reason for hiding this comment

torredil Oct 9, 2023

Choose a reason for hiding this comment

torredil commented Aug 4, 2023

torredil commented Aug 4, 2023

torredil left a comment

Choose a reason for hiding this comment

ConnorJC3 commented Aug 7, 2023

mvladev-skysql commented Aug 10, 2023 • edited Loading

AndrewSirenko commented Sep 11, 2023

AndrewSirenko commented Sep 14, 2023

wmesard Oct 11, 2023

Choose a reason for hiding this comment

wmesard Oct 11, 2023

Choose a reason for hiding this comment

k8s-ci-robot commented Oct 11, 2023

AndrewSirenko commented Oct 11, 2023

AndrewSirenko commented Oct 12, 2023

AndrewSirenko commented Oct 12, 2023

AndrewSirenko commented Oct 12, 2023

k8s-ci-robot commented Oct 12, 2023

AndrewSirenko commented Aug 1, 2023 •

edited

Loading

Happy Path: Storage class with valid `ext4ClusterSize` parameter value yields properly formatted filesystem on PV

Unhappy Path: Storage class with invalid `ext4ClusterSize` parameter value yields failed mount and `mkfs.ext4` error

ConnorJC3 Aug 7, 2023 •

edited

Loading

mvladev-skysql commented Aug 10, 2023 •

edited

Loading