BRT: Linux FICLONE truncates large files with dirty blocks #15728
Comments
Sadly, this issue and #15715 still prevent me from enabling BRT on prod.
If FICLONE expects the clone to always be atomic, I can see two possible work-arounds (if we can recognize that it is FICLONE calling zfs_clone_range()):
@pjd I investigated more here and upon a closer look FICLONE does not need to be atomic, but it does need to report success or failure for the entire file since FICLONE does not accept ranges. So I believe all that is missing is a check for result < file size after the call to
Are we sure it doesn't need to be atomic? The documentation doesn't really say. If that's truly the case then we could wait for as many transactions as it takes to clone the entire file.
Well I'm not sure, though atomic semantics are pretty rare in filesystems. Non-atomic behavior would be the same as The lack of documented semantics for the output upon error implies it does not need to be atomic especially since the atomicity of the contents is documented:
Allowing partial cloning would defeat this since a filesystem that can only partially clone cannot consistently clone an arbitrary range or entire file. Meanwhile none of the defined error codes describe a partial clone outcome... Documentation/filesystems/locking.rst repeats the atomicity of contents copying without describing errors:
Documentation/filesystems/vfs.rst seems to say both that it can return a partial result ("any bytes") and also cannot be fewer unless
Reconciling this, it seems the vfs should not shorten the cloned range absent It seems allowable to treat dirty blocks as uncloneable and thus unavoidable, though it makes the feature less useful. Nothing at all is said about the effect on the destination file upon error. Lacking a way to return a length, it seems reasonable that the output file state is undefined upon error i.e. not atomically overwritten. (It would be surprising to clobber the output in arbitrary ways, so maybe only states consistent with attempting clone should be allowed.) Given all this, it seems adding a simple length check is sufficient, though a better implementation could do more. (Ideally recording pending clones in dirty dbufs - not forcing frequent performance-harming txg syncs.)
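The "simple length check" discussed above could look roughly like the following. This is only a sketch and not the actual OpenZFS code: the helper name `whole_file_clone_status` and its parameters are made up for illustration, and `-EINVAL` is used because that is the error suggested in the issue description.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an OpenZFS function): map the byte count that a
 * range clone reports onto an all-or-nothing status for a whole-file
 * FICLONE request, which has no way to return a length to the caller.
 */
static int
whole_file_clone_status(int64_t cloned, uint64_t file_size)
{
	if (cloned < 0)
		return ((int)cloned);	/* pass an errno-style failure through */
	if ((uint64_t)cloned < file_size)
		return (-EINVAL);	/* partial clone: report failure */
	return (0);			/* every byte of the file was cloned */
}
```

With this shape a short clone never reaches the caller as success, although the destination file may still hold the partially cloned data.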
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file rangelock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Signed-off-by: Brian D Behlendorf <[email protected]> Issue openzfs#15728
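The policy in the commit message above can be modelled with a toy in-memory file. Everything here is an assumption made for illustration: `TOY_CAN_SHORTEN` stands in for the kernel's REMAP_FILE_CAN_SHORTEN, the `toy_*` names are not ZFS functions, "waiting" simply marks every block as written, and `-EIO` is an arbitrary error code for a failed wait.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's REMAP_FILE_CAN_SHORTEN flag (assumption). */
#define	TOY_CAN_SHORTEN	0x1

/* Toy file: dirty[i] is true while block i has a pending dirty record. */
struct toy_file {
	bool	*dirty;
	size_t	nblocks;
	bool	sync_fails;	/* simulate an error while waiting */
};

/* Clone from 'start' up to the first dirty block; return blocks cloned. */
static size_t
toy_clone_synced(const struct toy_file *f, size_t start)
{
	size_t i = start;

	while (i < f->nblocks && !f->dirty[i])
		i++;
	return (i - start);
}

/* Stand-in for waiting until pending records have been written out. */
static int
toy_wait_synced(struct toy_file *f)
{
	if (f->sync_fails)
		return (-EIO);	/* arbitrary error code for this sketch */
	for (size_t i = 0; i < f->nblocks; i++)
		f->dirty[i] = false;
	return (0);
}

/*
 * With TOY_CAN_SHORTEN: return the number of blocks cloned, possibly short.
 * Without it: wait and retry until all blocks are cloned, or return a
 * negative errno instead of a partial count.
 */
static long
toy_clone_range(struct toy_file *f, int flags)
{
	size_t done = 0;

	while (done < f->nblocks) {
		done += toy_clone_synced(f, done);
		if (done == f->nblocks)
			break;
		if (flags & TOY_CAN_SHORTEN)
			return ((long)done);	/* caller accepts a short clone */
		int err = toy_wait_synced(f);	/* block until written out */
		if (err != 0)
			return (err);	/* partial clone becomes an error */
	}
	return ((long)done);
}
```

In the error case the blocks cloned before the failure stay cloned, which mirrors the "destination file range is left in an undefined state" caveat.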
@rrevans @pjd @robn can you please take a look at the proposed fix in PR #15842. I'd appreciate the feedback. Where I ended up on this is a version of 2. from Pawel's comment above.
My reasoning is that with FICLONE / FICLONERANGE we want to make every reasonable effort to clone the blocks. Particularly given our inability to return any useful progress information to the caller. Waiting on the transaction groups when there's outstanding dirty records may be slow but it's what I'd personally expect in this case. We could consider adding yet-another-module-option if there's a scenario where we think this is less than ideal.
As I mentioned in the PR this implementation doesn't provide strict atomicity but I think it comes close enough. It is possible to leave the destination file range in an undefined state if there's an error part way through the clone, or if the node crashes during the system call. I considered zeroing that portion of the destination file on error but in the end decided that just added more complexity and it wasn't helpful anyway.
Thanks @behlendorf for the fix. I'll rerun my reproducer and post results. I'm still concerned that forcing syncs is prohibitively expensive for some workloads like builds that may write and immediately copy various files, though having the feature work without regression is certainly strictly better.
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never|auto|always` works correctly. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #15728 Closes #15842
System information
Describe the problem you're observing
`cp --reflink=always` succeeds but the output file is truncated for previously synced files that are recently modified at a large enough offset such that the range clone partially succeeds.
In this case clone sees one or more transactions' worth of synced block pointers that can be cloned followed by dirty blocks that are unable to be cloned yet. ZFS only makes a partial clone but `cp` later reports success.
From a quick read of coreutils, linux, and zfs code:
`ioctl(int dest_fd, FICLONE, int src_fd)`
`zpl_remap_file_range(src, 0, dst, 0, 0, 0)`
`__zpl_clone_file_range` calls `zfs_clone_range`
`ioctl_file_clone` which only fails if negative when len == 0
TL;DR the kernel seems to expect that length == 0 implies the whole file cloning operation must succeed or fail atomically. ZFS should return `-EINVAL` when len == 0 and the operation fails to clone all blocks of the file. (IOW, it seems that the kernel API makes it the responsibility of the vfs implementation to check, though the documentation does not spell that out.)
Describe how to reproduce the problem
`recordsize=128k`
`cp --reflink=always` to copy the file
Expected result: target file is the same as source (or cp fails to clone)
Actual result: target file is a truncated copy of the source file
Example script:
Example output:
I have also reproduced this same issue using `zfs-2.2.0` release (`95785196f2`) on kernel `6.5.12-100.fc37.x86_64` (Fedora 37).
This issue also occurs with sparse files - replacing the initial `count=1022` with `count=2` above has the same result.
Note that reproducing this is dependent on `recordsize`, but can be reproduced with other sizes by changing the `bs=` argument accordingly.
Also YMMV reproducing if the ZIL buffer sizes are different (and thus fit more than or fewer than 1022 block pointers per transaction).
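For checking the symptom from user space, a minimal sketch is to issue the same FICLONE ioctl that `cp --reflink=always` uses and then compare file sizes. The helper name `clone_and_verify` is made up for this example, and returning `-EIO` on a size mismatch is an arbitrary choice. The size comparison only catches the truncation described here; it does not compare file contents.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>	/* FICLONE */

/*
 * Hypothetical helper: clone src_fd into dst_fd, then verify the sizes.
 * Returns 0 on a verified clone, -errno if the ioctl or fstat fails, and
 * -EIO (arbitrary for this sketch) if the ioctl reported success but the
 * destination ended up with a different size than the source.
 */
static int
clone_and_verify(int src_fd, int dst_fd)
{
	struct stat src_st, dst_st;

	if (ioctl(dst_fd, FICLONE, src_fd) != 0)
		return (-errno);
	if (fstat(src_fd, &src_st) != 0 || fstat(dst_fd, &dst_st) != 0)
		return (-errno);
	if (dst_st.st_size != src_st.st_size)
		return (-EIO);	/* "successful" clone that is truncated */
	return (0);
}
```

On a filesystem without reflink support the ioctl itself fails (typically with EOPNOTSUPP), so a nonzero return does not by itself indicate this bug; only the `-EIO` path does.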
Include any warning/errors/backtraces from the system logs
No warnings or errors in `dmesg`, `/proc/spl/kstat/zfs/dbgmsg` (same for debug build), or `cp`.