dmu: Allow buffer fills to fail #15665

amotin · 2023-12-12T02:55:19Z

When ZFS overwrites a whole block, it does not bother to read the old content from disk. This is a good optimization, but if the buffer fill fails, due to a page fault or something else, the buffer ends up corrupted, neither keeping the old content nor getting the new.

On FreeBSD this is further complicated by the VFS layer blocking page faults, so an attempt to write from an mmap()'ed but not yet cached address range always returns EFAULT. Normally this is not a big problem, since after the original failure VFS retries the write after reading the required data. The problem becomes worse in the specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data gets corrupted on the page fault, and the following retries only cement the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike a traditional file copy, which may work by chance.
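The I/O pattern described above can be sketched in user space as follows. This is only an illustration of the pattern, not the test program from the linked issue: the helper name `self_mmap_write_check()` and the scratch path are hypothetical, and on a healthy filesystem the check simply reports that the data survived. Actually hitting the bug would additionally require the mapped pages to be uncached at the time of the write (for example, right after a block clone), which this sketch does not arrange.

```c
#define	_XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill a scratch file with a known pattern, map it,
 * then write the mapping back into the same file at the same offset.
 * Returns 0 if the contents survive, 1 if they changed, -1 on any error.
 */
int
self_mmap_write_check(size_t len)
{
	char path[] = "/tmp/selfwrite.XXXXXX";	/* assumed scratch location */
	char *expect, *actual, *map = MAP_FAILED;
	int fd, rc = -1;

	assert(len > 0);
	if ((fd = mkstemp(path)) < 0)
		return (-1);
	(void) unlink(path);

	expect = malloc(len);
	actual = malloc(len);
	if (expect == NULL || actual == NULL)
		goto out;

	memset(expect, 0xA5, len);
	if (pwrite(fd, expect, len, 0) != (ssize_t)len)
		goto out;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* The source buffer of this write is the file's own mapping. */
	if (pwrite(fd, map, len, 0) != (ssize_t)len)
		goto out;

	if (pread(fd, actual, len, 0) != (ssize_t)len)
		goto out;
	rc = (memcmp(expect, actual, len) == 0) ? 0 : 1;
out:
	if (map != MAP_FAILED)
		(void) munmap(map, len);
	free(expect);
	free(actual);
	(void) close(fd);
	return (rc);
}
```

A return value of 1 from such a check would mean the file no longer holds the pattern that was both its old and its new content.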

This patch passes the fill status to dmu_buf_fill_done(), which in case of error can destroy the corrupted buffer as if no write happened. One more complication in the case of block cloning is that if an error is possible during the fill, dmu_buf_will_fill() must read the data by falling back to dmu_buf_will_dirty(). This is required so that, in case of error, the buffer can be restored to its state after the cloning, not before it, which is what would happen if we just called dbuf_undirty().
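To make the idea concrete, here is a toy model of a fill that fails midway, with and without the failure being reported at fill-done time. Everything in it (`struct toy_buf`, the `toy_*` functions, the three states) is a hypothetical stand-in written for illustration; it is not the real dbuf code, and it does not model the block cloning fall-back.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define	BLKSZ	8

/* Toy stand-ins; none of these are the real dbuf structures. */
enum toy_state { TOY_UNCACHED, TOY_FILL, TOY_CACHED };

struct toy_buf {
	enum toy_state	state;
	bool		dirty;
	char		data[BLKSZ];	/* in-memory copy */
	char		ondisk[BLKSZ];	/* what a read from disk returns */
};

/* "I will overwrite the whole block": the old content is not read. */
static void
toy_will_fill(struct toy_buf *b)
{
	b->state = TOY_FILL;
	b->dirty = true;
}

/* Copy that "faults" after ok_bytes; returns false if it was cut short. */
static bool
toy_copy(struct toy_buf *b, const char *src, size_t ok_bytes)
{
	size_t n = (ok_bytes < BLKSZ) ? ok_bytes : BLKSZ;

	memcpy(b->data, src, n);
	return (n == BLKSZ);
}

/* On failure, drop the half-filled buffer as if no write had happened. */
static void
toy_fill_done(struct toy_buf *b, bool failed)
{
	assert(b->state == TOY_FILL);
	if (failed) {
		memset(b->data, 0, BLKSZ);
		b->dirty = false;
		b->state = TOY_UNCACHED;
	} else {
		b->state = TOY_CACHED;
	}
}

/* Reads come from memory when cached, otherwise from "disk". */
static const char *
toy_read(struct toy_buf *b)
{
	if (b->state == TOY_UNCACHED) {
		memcpy(b->data, b->ondisk, BLKSZ);
		b->state = TOY_CACHED;
	}
	return (b->data);
}

/* Returns 1 if the old content is still readable after a failed fill. */
int
toy_demo(bool report_failure)
{
	struct toy_buf b = { .state = TOY_UNCACHED };
	bool ok;

	memcpy(b.ondisk, "OLDOLDOL", BLKSZ);
	toy_will_fill(&b);
	ok = toy_copy(&b, "NEWNEWNE", 3);	/* fault after 3 bytes */
	toy_fill_done(&b, report_failure ? !ok : false);
	return (memcmp(toy_read(&b), "OLDOLDOL", BLKSZ) == 0);
}
```

In this model, `toy_demo(false)` leaves a buffer holding neither the old nor the new content, while `toy_demo(true)` lets a later read return the old content again.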

How Has This Been Tested?

This bug is triggered on FreeBSD by the test program from #15654; with this patch it no longer happens. It does not happen on Linux, since all written data are pre-faulted before the write, which is impossible to do on FreeBSD.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

behlendorf

Since we have a reproducer for this case, let's make sure to pull it in as part of this PR. We've already got a mmapwrite.c test binary, so we'll need to rename it.

robn · 2023-12-12T23:34:25Z

Well done, this is subtle.

I don't think I love the new third arg to dmu_buf_will_fill(), because it doesn't match how I think of this API. I think of it as "I intend to modify this dbuf in this way"; I don't really know what "can fail" means in that context. Maybe it's "... and I can work around it if you can't give me that."? I don't really have a concrete suggestion though. Maybe a separate call rather than a bool? dmu_buf_will_fill_canfail()? And then dmu_buf_fill_failed() to go with it? I dunno. I'm not really sold on my suggestion either. Maybe do nothing, but if there's a way it could be more obvious, that'd be super.

Nit: double "not" in the commit message.

robn · 2023-12-12T23:36:18Z

module/zfs/dbuf.c

 			DTRACE_SET_STATE(db,
 			    "fill done handling freed in flight");
+			failed = B_FALSE;
+		} else if (failed) {
+			VERIFY(!dbuf_undirty(db, tx));


amotin · 2023-12-13T02:03:41Z

@robn dmu_buf_will_fill() means "I will completely overwrite this dbuf", unlike dmu_buf_will_dirty() which means "I intend to modify this dbuf". Partial modification is expected in the second case, but is a serious fault in the first. The added canfail argument, if true, says "I intend to completely overwrite this dbuf, but I may fail to do so". I am open to other ideas, but so far it looks logical to me.

robn · 2023-12-13T04:09:40Z

Oh hmm, "and I may fail to do so". I can live with that. I can't think of anything that would make it significantly clearer. If I do I'll send a patch :D

When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one.

On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance.

This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15665

In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Fixes: openzfs#15665 Closes: openzfs#15802 Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc.
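The leak described in this follow-up commit message can be pictured with a small counter-based model. All names here (`struct leak_dbuf`, `leak_noread()`, `leak_fill_done()`, `leak_live_bufs`) are hypothetical and stand in for the dbuf and its ARC buffer only loosely; the `free_on_fail` flag exists purely to contrast behaviour with and without the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Number of toy "ARC buffers" currently allocated. */
static int leak_live_bufs;

struct leak_dbuf {
	bool	filling;
	void	*buf;	/* stand-in for the buffer allocated for the fill */
};

/* Entering the fill state allocates a buffer to be filled. */
static void
leak_noread(struct leak_dbuf *db, size_t size)
{
	db->buf = malloc(size);
	assert(db->buf != NULL);
	leak_live_bufs++;
	db->filling = true;
}

/*
 * On failure the dbuf goes back to "uncached" and forgets its buffer.
 * Unless the buffer is freed first, nothing references it any more.
 */
static void
leak_fill_done(struct leak_dbuf *db, bool failed, bool free_on_fail)
{
	db->filling = false;
	if (!failed)
		return;		/* buffer stays attached as cached data */
	if (free_on_fail) {
		free(db->buf);
		leak_live_bufs--;
	}
	db->buf = NULL;
}

/* Returns how many buffers remain allocated after one failed fill. */
int
leak_after_failed_fill(bool free_on_fail)
{
	struct leak_dbuf db = { 0 };
	void *saved;
	int live;

	leak_live_bufs = 0;
	leak_noread(&db, 4096);
	saved = db.buf;
	leak_fill_done(&db, true, free_on_fail);
	live = leak_live_bufs;
	if (!free_on_fail)
		free(saved);	/* keep the demo itself leak-free */
	return (live);
}
```

With `free_on_fail` set, the count returns to zero after the failed fill; without it, one buffer stays allocated with no owner.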

In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216

In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Jorgen Lundman <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15665 Closes openzfs#15802 Closes openzfs#16216 (cherry picked from commit 02c5aa9)

amotin added the Status: Code Review Needed Ready for review and testing label Dec 12, 2023

amotin force-pushed the fill_fail branch from 4f82c54 to aafdc9c Compare December 12, 2023 21:45

behlendorf reviewed Dec 12, 2023

robn reviewed Dec 12, 2023

robn approved these changes Dec 13, 2023

behlendorf approved these changes Dec 13, 2023

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 13, 2023

behlendorf merged commit 9b1677f into openzfs:master Dec 15, 2023
24 of 26 checks passed

amotin deleted the fill_fail branch December 17, 2023 18:11

mmatuska mentioned this pull request Dec 26, 2023

[2.2] BRT and other fixes into 2.2.3-staging #15714

Merged

13 tasks

usaleem-ix mentioned this pull request Dec 28, 2023

Test for clone, mmap and write for block cloning #15717

Merged

13 tasks

amotin mentioned this pull request May 22, 2024

Destroy ARC buffer in case of fill error #16216

Merged

13 tasks

amotin mentioned this pull request Jun 14, 2024

Direct IO Support #10018

Merged

17 tasks
