fix some assign arc buf with brt clone and O_TRUNC (#15139)
Conversation
I'll try to reproduce it tonight. It sounds plausible; there's a lot of
Thanks @robn. EDIT: BTW, it looks like truncate is much faster now? A performance improvement as well?
Bad news: I think I trapped it again with this fix. I have some analysis. When I started to debug this I checked a lot of states about
I think commit 0426e13 itself is totally fine, but before we changed it, it expected no dirty records left in any
So far my thoughts on this topic. I'm left with quite a lot of question marks, but I hope this helps to dive deeper into this error.
Force-pushed from a1a87b2 to 40a1173
I have tested this again with the zfs-2.2-release branch and master and rebased it to the current master. I can still reproduce this error.
Sorry, I got pulled away onto something else and then forgot about this. I'll try to have a look this week.
I wasn't able to reproduce this, so I'll have to do a code read and see if I can understand what you're seeing. Maybe tomorrow, but more likely next week; really tight on time right now. With
Oh, that doesn't sound good. I will try it with different disk speeds later or tomorrow. Maybe it depends on disk speed? I have some hardware lying around, or I can use some RAM to simulate super fast disks. Currently I'm using just one setup (a virtual machine) for all tests.
I will add an option later to this
@oromenahar I just added. I've got a little time today, so I'm gonna keep pushing on this.
@oromenahar since I can't reproduce it yet, I'm trying to understand it from your descriptions, hopefully leading to a more reliable reproduction. So I might have questions as I go! For now, can you give me a full backtrace from the crash site in
I can reproduce it and I think I know what's happening. Patch incoming.
Ok, I somewhat know what's happening. Here's a preliminary patch: cfa6264. Explanation, quoted from the commit:
This only updates assertions though, and I'm not sure that's enough. I suspect this may need a call to
I'm also not sure if there's anything that can go wrong if the txg moves in between any of these stages. That one I have thought very hard about and I'm not seeing it, but this stuff is nightmarishly difficult to hold in my head.
I'm also curious about the patch. I don't think it's right on its own, as it's causing frees to always be issued and never delayed. If it appeared to solve the problem though, I suppose that must mean that frees also invalidate the dbuf cache in some way. I ran out of time to look into that.
I'm also not sure why this requires huge files to reproduce. At 4G, an L2 indirect block is involved, which may have enough overhead to sufficiently delay something and make this easier to catch?
It'd be great to have a smaller and faster test case to help test the above conditions. I'll keep thinking about how to produce that.
Additionally: that's now two places where
Sorry to jump in on this as an ordinary user who doesn't understand a quarter of the internals. But... the more I think of it: trying to clone a file and at the same time writing to it... wouldn't a clone operation never finish if the process writing to the file never stops? That's assuming you can track the changed bits over and over again. It sounds at least like the mother of all race conditions to me. I don't know how other filesystems handle this scenario (I tried a web search but didn't find anything useful), but the thought of an "infinity snapshot" sounds nightmarish to me. Wouldn't it be better to just say:
First:
I had the same thoughts while reading the code :D What's a clone and what's not a clone now?
Second, your patch cfa6264:
To be honest, I don't fully understand
Force-pushed from 40a1173 to c9ef4ab
@robn I have added tests to cover this; basically the four or five shell commands from above, to run it on a CLI. I also used your
I found some more errors which are difficult to reproduce. I vaguely remember why I didn't add the asserts you suggested.
@AllKind no, the kernel panics if we don't fix it, and this should work in any case. On Linux, several users and processes can access a file at the same time; this shouldn't fail. And it should be possible to truncate a file and then write to it a few moments later.
We are writing to the dst file. The src file is never touched; that case is already covered by another patch.
Yeah, that's leaking an
I am pretty sure now that the final version of this is the patch to fix up the assertions, and then something to undirty the clone, and that's it.
@AllKind there's your first mistake! 😆 Cloning into and writing into the same block at the same time is conceptually the same as two programs writing at the same time; the difference is just in where we get the data from. The whole trick is making sure that if we haven't written the first change to disk yet, after the second one comes in we forget the first one ever existed. I agree that there's probably not many uses for cloning into and writing into the same block at the same time, but it fits ok into ZFS' data management model. |
I think that's correct ;-) :-p Thanks for explaining! |
@amotin do you have this PR in your watch list? |
Just for a note: I did trigger this assertion on latest ZFS without the patch on FreeBSD, with a quickly ported version of the test: https://people.freebsd.org/~mav/zzz1.sh . Will look further.
@amotin I don't know if you saw the hint in my own review that the dirty record in the list is from a txg < the current txg. I think this is important; I could confirm it several times while debugging.
@oromenahar If the dirty record's txg were always smaller than the current txg, it would be fine. But I suspect that if it is from the current txg, it may be a mess, since the single dirty record would have to represent both block cloning and overwrite (assign) at the same time, which can't be. And that, I guess, may indeed end up in a buffer leak, data corruption, or both. I think we must call dbuf_undirty() here to clean up the mess. I am now testing it and will open the PR.
Thinking more about this, I guess dbuf_free_range() called by O_TRUNC should already call dbuf_undirty() on all buffers. And if that is the only way to get into this issue in dbuf_assign_arcbuf(), then the lack of dbuf_undirty() should not be a problem. But I think it would still be better to have it, just in case. PS: Several times already I've thought about removing dbuf_assign_arcbuf() completely. It is used in only one place, and IMO its benefits are questionable.
@amotin FYI, I could not trigger anything on Linux with your PR #15653, possibly a fix. I'm thinking about the following line: What is your opinion?
PS: If I remember correctly, I had your solution as well and could still trigger the bug, but not super often. I'll try to find anything left in my git stash history from debugging this bug. Maybe it was a little bit different from your PR.
@oromenahar I was unable to reproduce it. I am not sure how ctrl+c can change anything.
dbuf_unoverride() converts existing dirty record from override (sync/cloning/etc) into a normal write. dbuf_undirty() removes the dirty record completely.
IIRC there is no significant performance benefit in most cases. It was added to move the memory copy, with its possible page faults, out of the transaction group in case of writing from some slow memory, like an mmap'ed NFS file. But on Linux there is a pre-fault implemented that should make faults during the memory copy unlikely. And on FreeBSD there is no page-fault waiting at all at that point: the VM subsystem just returns EFAULT, and ZFS aborts the write and allows the VFS to handle the faults outside the transaction and any locks.
1) Killing the process with SIGKILL might have an effect on this, which is different from ctrl+c. But I think it doesn't matter.
1.1) This was just information that my idea and thoughts didn't work out:
Sorry to confuse you.
2) Thanks for explaining dbuf_unoverride() and dbuf_undirty().
3) What about the following case:
This case should be possible, shouldn't it? In this case we need the undirty in
4) I'm confused about delaying the frees. In the first version of this patch, without @robn's fixed asserts, I couldn't trap the issue; I had just changed the code where frees are delayed. I also got a huge speedup while freeing data (freeing can slow down other operations). Should we dig into this as well?
this adds a truncate option to clonefile, which is useful for the test suite. Signed-off-by: Kay Pedersen <[email protected]> Original-patched-by: Rob Norris <[email protected]>
In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If that happened in the current TXG, we must undo the block cloning first, since the single dirty record per TXG can't and shouldn't mean both cloning and overwrite at the same time. For example, this can happen while writing to a file and cloning a file at the same time. This is also covered by a test. The file size must be huge, like 4G, to trigger this bug. For that reason a 4G file is created from /dev/urandom, and then a file-clone loop with FICLONE starts. After that, the whole file is overwritten twice. The test can trigger the bug most times, but not every time. Signed-off-by: oromenahar <[email protected]>
Force-pushed from c9ef4ab to a230b1d
@behlendorf I cleaned up everything and added the code from @amotin to this PR. I changed the test a little bit, to be sure the background loop is stopped every time. Hope it doesn't need that much time in the CI.
function loop
{
	while $NO_LOOP_BREAK; do clonefile -c -t -q /$TESTPOOL/file /$TESTPOOL/clone; done
}
In the CI environment it appears this loop can completely starve out the dd's below. Can we both reduce the size of this file and cap the number of iterations here, and still have this test be useful?
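A capped version of the loop might look like the following sketch. MAX_ITER and the command guard are illustrative assumptions, not part of the PR; the actual test invokes clonefile -c -t -q against $TESTPOOL, as shown in the diff above.

```shell
# Hedged sketch of capping the clone loop per the review comment above.
# MAX_ITER and the clonefile guard are assumptions for illustration only.
MAX_ITER=${MAX_ITER:-5}
i=0
while [ "$i" -lt "$MAX_ITER" ]; do
	# Only invoke clonefile where the ZFS test-suite helper exists.
	if command -v clonefile >/dev/null 2>&1; then
		clonefile -c -t -q "/$TESTPOOL/file" "/$TESTPOOL/clone"
	fi
	i=$((i + 1))
done
echo "ran $i clone iterations"
```

A bounded loop also makes the test's runtime predictable in CI, which addresses the starvation concern directly.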
@oromenahar I've merged the fix in #15653 to master. It'd be great if you could rebase this PR on master and then iterate on the test case until we're happy with it. |
The problem is fixed by another PR.
These are definitely some strange bugs I found, and I don't think I understand them fully. This is a draft for now.
This fixes some bugs when opening a file with O_TRUNC and writing to the same file at the same time, with dd for example. O_TRUNC in the VFS calls set_attr, and set_attr will figure out a truncate based on the O_TRUNC flag; dmu_free_long_range_impl will free the data after that. If a reflink and a write are running at the same time, it will fail with:
Motivation and Context
Bug fix; I don't think this closes an open issue in the list. And I like while true loops.
Description
How this can be reproduced:
After that, open a second shell and use:
A bigger file is necessary to reproduce it.
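The exact commands didn't survive in this thread. Based on the descriptions here and in the commit message (a 4G file from /dev/urandom, a clonefile loop, then overwriting the file), a sketch of the reproduction could look like the following; POOL_DIR, the file names, and the small default size are placeholders, and for a real attempt you would point POOL_DIR at a ZFS dataset and raise SIZE_MB to about 4096.

```shell
#!/bin/sh
# Sketch of the reproduction described above; the exact commands were
# not preserved in this thread. POOL_DIR and SIZE_MB are placeholders:
# for a real run, point POOL_DIR at a ZFS dataset and raise SIZE_MB to
# ~4096, since the report says a huge (~4G) file is needed to trigger
# the assertion.
POOL_DIR=${POOL_DIR:-/tmp/zfs-clone-repro}
SIZE_MB=${SIZE_MB:-4}
mkdir -p "$POOL_DIR"

# Shell 1: create the source file, then clone it in a tight loop.
# clonefile is the helper from the ZFS test suite; the loop is skipped
# if it is not installed, so the sketch still runs end to end.
dd if=/dev/urandom of="$POOL_DIR/file" bs=1M count="$SIZE_MB" 2>/dev/null
if command -v clonefile >/dev/null 2>&1; then
	while :; do clonefile -c "$POOL_DIR/file" "$POOL_DIR/clone"; done &
	CLONE_PID=$!
fi

# Shell 2: concurrently overwrite the clone. dd opens its output file
# with O_TRUNC by default, which is the other half of the race.
dd if=/dev/urandom of="$POOL_DIR/clone" bs=1M count="$SIZE_MB" 2>/dev/null

if [ -n "${CLONE_PID:-}" ]; then kill "$CLONE_PID" 2>/dev/null; fi
echo "clone size: $(wc -c < "$POOL_DIR/clone") bytes"
```

Note that dd's default O_TRUNC open is what pairs the overwrite with the truncate path discussed in this thread; conv=notrunc would sidestep it.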
clonefile.c: I will improve the patch after we understand the bug fully and are sure what happens here. Then I can provide a patch with options and maybe some tests as well.
@robn @behlendorf maybe you should check it out as well, because we got the reflink stuff off the ground together.
How Has This Been Tested?
By hand; I left it running for two hours and nothing happened. But maybe that was just luck?
Types of changes
Checklist:
Signed-off-by.