Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support clustered allocation when formatting ext4 filesystem #1706

Merged

Conversation

AndrewSirenko
Copy link
Contributor

@AndrewSirenko AndrewSirenko commented Aug 1, 2023

Is this a bug fix or adding new feature?
New feature

What is this PR about? / Why do we need it?
Close #1687

This PR:

  • Adds ext4BigAlloc and ext4ClusterSize storage class parameters that expose enabling clustered allocation on ext4 filesystems.
  • Allows developers to enable torn write prevention on their dynamically provisioned volumes by setting by setting ext4BigAlloc to true and 'ext4ClusterSize` to the desired cluster size.

The ext4BigAlloc parameter enables clustered allocation on ext4 filesystems by enabling the bigalloc feature (ie performing mkfs.ext4 -O bigalloc) when formatting a dynamically-provisioned Persistent Volume (PV). See this ext4 man page for more details about bigalloc.

The ext4ClusterSize parameter allows users to set a custom cluster size when formatting a PV with the ext4 filesystem IF they have set ext4BigAlloc to true. (appends -C ext4ClusterSize to other mkfs.ext4 formatting options).

One benefit to exposing this formatting option is that developers can enable torn write prevention on their volumes to improve the performance of their I/O-intensive relational database workloads and reduce latency without negatively impacting data resiliency. Relational databases that use InnoDB or XtraDB as the database engine, such as MySQL and MariaDB, will benefit from torn write prevention.

What testing is done?
Unit Tests + Manual Testing on EKS Cluster.


Manual Testing

  1. Create an EKS cluster, deploy the ebs-csi-driver a version of the driver with an image built from this PR
  2. Test Happy Path (valid ext4ClusterSize parameter value) and Unhappy Path (Invalid ext4ClusterSize parameter value)

Happy Path: Storage class with valid ext4ClusterSize parameter value yields properly formatted filesystem on PV

  1. Create a pod with the following manifest (Note the ext4ClusterSize: "16384" in the storage class definition):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4
  type: gp3
  ext4BigAlloc: "true"
  ext4ClusterSize: "16384"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: docker.io/ubuntu
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim
  1. Run kubectl exec -it app -- tune2fs -l /dev/nvme1n1 which contains the cluster-size parameter:
Cluster size:                   16384
Clusters per group:       16384

Unhappy Path: Storage class with invalid ext4ClusterSize parameter value yields failed mount and mkfs.ext4 error

  1. Create a pod with happy case manifest but change the ext4ClusterSize parameter
-  ext4ClusterSize: "16384"
+  ext4ClusterSize: "-1"
  1. Run kubectl describe pod app to yield the following events:
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        36s                default-scheduler        0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
  Normal   Scheduled               31s                default-scheduler        Successfully assigned kube-system/app to i-08389b759ac9c01e8
  Normal   SuccessfulAttachVolume  29s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e99db0dc-7d06-46b7-8789-f8541b54764f"
  Warning  FailedMount             13s (x6 over 29s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-e99db0dc-7d06-46b7-8789-f8541b54764f" : rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/fcceddc2f6cd5b18fd811ba25411e9e130f23a2d461e73daa39658c9011642b0/globalmount": format of disk "/dev/nvme1n1" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/fcceddc2f6cd5b18fd811ba25411e9e130f23a2d461e73daa39658c9011642b0/globalmount") options:("defaults") errcode:(exit status 1) output:(mkfs.ext4: invalid cluster size - 1)

Note the FailedMount due to the error mkfs.ext4: invalid cluster size - 1 at the end of the last line.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @AndrewSirenko. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 1, 2023
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 1, 2023
pkg/driver/node_test.go Outdated Show resolved Hide resolved
@torredil
Copy link
Member

torredil commented Aug 2, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 2, 2023
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm other than my one suggestion, open to opinions on that

@@ -18,6 +18,7 @@ The AWS EBS CSI Driver supports [tagging](tagging.md) through `StorageClass.para
| "inodeSize" | | | The inode size to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`, or `xfs`. |
| "bytesPerINode" | | | The `bytes-per-inode` to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`. |
| "numberOfINodes" | | | The `number-of-inodes` to use when formatting the underlying filesystem. Only supported on linux nodes and with fstype `ext2`, `ext3`, `ext4`. |
| "bigAllocClusterSize | | | The cluster size (in bytes) to use when formatting an `ext4` filesystem with the `bigalloc` option enabled. Specifying a value will enable the `bigalloc` formatting option. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding (and based on the kernel wiki here: https://ext4.wiki.kernel.org/index.php/Bigalloc) bigalloc is just the parameter to enable clustered allocation on ext4.

Thus, I propose this parameter (and associated internal variable/function names) is renamed to 'clusterSize because that is what this is actually doing. This also would make sense if we add support in the future for another filesystem that supports specifying cluster size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree because I think bigAllocClusterSize helps the customer more easily look up what this parameter does. My only grievance with clusterSize is that Windows renamed block size to cluster size. This is why I added the explicit mention of bigalloc in. https://support.microsoft.com/en-us/topic/default-cluster-size-for-ntfs-fat-and-exfat-9772e6f1-e31a-00d7-e18f-73169155af95

This issue is exacerbated by the first link coming up when you search "cluster size" is that windows document.
Screenshot 2023-08-04 at 2 35 56 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: Yes this parameter name SHOULD be clusterSize, but because Windows overloaded the term I think we need the explicit mention of ext4's bigAlloc formatting option.

Copy link
Contributor

@ConnorJC3 ConnorJC3 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I propose the name ext4ClusterSize. If we're going to limit the name of the parameter to only ext4 for whatever reason, we should be explicit about that.

That said, I still do not prefer using a name that is intentionally limiting our future uses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest ext4ClusterSize as proposed above.

@torredil
Copy link
Member

torredil commented Aug 4, 2023

/test pull-aws-ebs-csi-driver-test-helm-chart

1 similar comment
@torredil
Copy link
Member

torredil commented Aug 4, 2023

/test pull-aws-ebs-csi-driver-test-helm-chart

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AndrewSirenko /lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2023
@ConnorJC3
Copy link
Contributor

/retest

@mvladev-skysql
Copy link

mvladev-skysql commented Aug 10, 2023

This change would break the fsresizer:

E0810 07:49:42.399370       1 driver.go:122] "GRPC error" err=<
        rpc error: code = Internal desc = Could not resize volume "vol-***" ("/dev/nvme1n1"):  resize of device /dev/nvme1n1 failed: exit status 1. resize2fs output: resize2fs 1.42.9 (28-Dec-2013)

        Resizing bigalloc file systems has not been fully tested.  Proceed at
        your own risk!  Use the force option if you want to go ahead anyway.

 >

If you look at the resizer it doesn't add the -f option https://github.com/kubernetes/mount-utils/blob/41e8de37ef8a3782f9cd6c915699b98b2b24b2c4/resizefs_linux.go#L71-L80

@AndrewSirenko
Copy link
Contributor Author

Just an update:

The new Linux AL2023 Base image for the driver contains an updated version of e2fsprogs which will prevent the breaking of fsresizer.

I'm working on adding an e2e formatting option parameter test suite to the driver before finishing this PR (so that I can add proper e2e testing of this feature along with the other formatting option parameters).

Thank you again Martin for exposing a gap in our testing.

@AndrewSirenko
Copy link
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 4, 2023
@AndrewSirenko AndrewSirenko marked this pull request as draft October 10, 2023 23:11
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023
@AndrewSirenko AndrewSirenko marked this pull request as ready for review October 10, 2023 23:30
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2023
docs/faq.md Outdated
### `ext4BigAlloc` and `ext4ClusterSize`

Warnings:
- The `bigalloc` feature may not be fully supported with your node's Linux kernel or may have various bugs. Please see the [ext4(5) man-page](https://man7.org/linux/man-pages/man5/ext4.5.html) for details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that you're just quoting from the linked man page. But this still seems unnecessarily alarming. It makes it sound like the CSI driver's implementation may have various bugs, which is not the case.

Suggest:

Ext4's bigalloc is an experimental feature, under active development. Please pay particular attention to your kernel version. See the ext4(5) man-page for details.

docs/faq.md Outdated

Warnings:
- The `bigalloc` feature may not be fully supported with your node's Linux kernel or may have various bugs. Please see the [ext4(5) man-page](https://man7.org/linux/man-pages/man5/ext4.5.html) for details.
- Linux kernel release 4.15 added support for resizing ext4 filesystems using clustered allocation. **Resizing volumes mounted to nodes running a Linux kernel version prior to 4.15 will fail and require manual intervention and downtime of the volume.**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and require manual intervention and downtime of the volume.

Either say what "manual intervention" is required or just delete this part of the sentence.

@k8s-ci-robot
Copy link
Contributor

@wmesard: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AndrewSirenko
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023
@AndrewSirenko
Copy link
Contributor Author

When deciding between customer experience 1 or 2 for:

a. Enabling clustered allocation for an ext4 fs (mkfs.ext4 -O bigalloc)
b. Enabling clustered allocation for an ext4 fs with a specified cluster size (mkfs.ext4 -O bigalloc -C 16384)

1a.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4BigAlloc: "true"

1b.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4BigAlloc: "true"
  ext4ClusterSize: "16384"

2a

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4ClusterSize: auto

2b.

parameters:
  [csi.storage.k8s.io/fstype](http://csi.storage.k8s.io/fstype): ext4
  ext4ClusterSize: "16384"

Decided for two separate parameters (option 1) due to:

  • We are not inventing this interface in a vacuum. We are exposing an ext4 feature. As such, it is more predictable, to remain consistent with the existing ext4 interface as much as possible.
  • Future proofing in the unlikely case the ext4 interface changes
  • Explicitly separating -O bigalloc and -C 16384

At the cost of:

  • Needing two different parameters to be set when specifying cluster size. If 'ext4BigAlloc' is not set to true, but 'ext4ClusterSize' is, there will be an InvalidArgument error, which may inconvenience some customers.

@AndrewSirenko
Copy link
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2023
@AndrewSirenko
Copy link
Contributor Author

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2023
@AndrewSirenko AndrewSirenko removed the request for review from ConnorJC3 October 12, 2023 14:33
@k8s-ci-robot k8s-ci-robot merged commit 1b9d078 into kubernetes-sigs:master Oct 12, 2023
17 checks passed
@AndrewSirenko AndrewSirenko deleted the ext4BigAllocFormatOptions branch November 1, 2023 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Torn write prevention with formatOptions
7 participants