
fix some assign arc buf with brt clone and O_TRUNC #15139

Closed
wants to merge 2 commits into from

Conversation

oromenahar
Contributor

@oromenahar oromenahar commented Aug 1, 2023

This is definitely a strange bug I found, and I don't think I understand it fully yet. This is a draft for now.

This fixes some bugs when opening a file with O_TRUNC and writing to the same file at the same time, with dd for example. O_TRUNC in the VFS calls setattr, and setattr works out a truncate based on the O_TRUNC flag.
dmu_free_long_range_impl will free the data after that. If a reflink and a write are running at the same time, it fails with:

kernel:VERIFY(db->db_state == DB_CACHED || db->db_state == DB_UNCACHED) failed
kernel:PANIC at dbuf.c:2925:dbuf_assign_arcbuf()
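
For context, the userspace side of the race is roughly the following (a minimal sketch of what the dd invocation in the repro below does, not the actual dd source):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	/*
	 * O_TRUNC makes the VFS call setattr, ZFS frees the old blocks via
	 * dmu_free_long_range_impl(), and the large writes that follow race
	 * with the clonefile loop running in another shell.
	 */
	static char buf[4 << 20];	/* 4M writes, like bs=4M */
	int fd = open("/tank/test/test.img2", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return (1);
	memset(buf, 0xab, sizeof (buf));
	for (int i = 0; i < 1000; i++)	/* like count=1000 */
		(void) write(fd, buf, sizeof (buf));
	close(fd);
	return (0);
}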

Motivation and Context

Bug fix; I don't think this closes an open issue in the list. And I like while true loops.

Description

How this can be reproduced:

zpool create -f tank /dev/sdb && zfs create tank/test
dd if=/dev/random of=/tank/test/test.img bs=4M count=1000 status=progress
zpool sync
while true; do clonefile -c /tank/test/test.img /tank/test/test.img2 && date; done

after that open a second shell and use:

dd if=/dev/random of=/tank/test/test.img2 bs=4M count=1000 status=progress

A bigger file is necessary to reproduce it.

clonefile.c: I will improve the patch after we understand the bug fully and are sure what happens here. Then I can provide a patch with options and maybe some tests as well.
@robn @behlendorf maybe you should check it out as well, because we got the reflink stuff off the ground together.

How Has This Been Tested?

By hand; I left it running for two hours and nothing happened. But maybe that was just luck?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn
Member

robn commented Aug 1, 2023

I'll try to reproduce it tonight. It sounds plausible; there are a lot of db_state assertions and I didn't check (m)any outside the main dbuf paths.

@oromenahar oromenahar mentioned this pull request Aug 1, 2023
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 1, 2023
@oromenahar
Contributor Author

oromenahar commented Aug 1, 2023

Thanks @robn, the O_TRUNC in clonefile.c mirrors recent coreutils behavior; that's why I tested it and checked whether I could find any bugs. Sorry for messing up your clonefile.c. O_WRONLY|O_CREAT|O_TRUNC should be the same as creat() and more or less the same as O_WRONLY|O_CREAT, but I wanted exactly O_WRONLY|O_TRUNC because that's what happens in coreutils if the file exists, if I remember correctly. I don't remember the exact line, I think it was this: coreutils, but I'm not sure and it's getting late.
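
For reference, a small sketch of the flag equivalence mentioned above (the helper names are made up for illustration, this is not clonefile.c code):

#include <fcntl.h>

/* Per POSIX, creat(path, mode) is identical to open() with these flags. */
static int
open_like_creat(const char *path)
{
	return (open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644));
}

/*
 * What the modified clonefile used instead: truncate an existing file
 * without creating it, which is (if I remember correctly) what coreutils
 * cp does when the destination already exists.
 */
static int
open_existing_truncated(const char *path)
{
	return (open(path, O_WRONLY | O_TRUNC));
}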

EDIT: BTW, it looks like truncate is much faster now? A performance improvement as well?

@oromenahar
Contributor Author

oromenahar commented Aug 2, 2023

Bad news: I think I hit it again even with this fix.
I'm not sure if this happens because of my virtual machine, which I'm using for the test. It's a Rocky Linux 9 install about 5 weeks old; I think it's independent of the base OS. It would be nice if somebody could reproduce the error and give a quick result about the test. In all of my tests I could reproduce the bug pretty quickly.

I have some analysis:
This is definitely triggered by the free_range path and a clone running at the same time.
Maybe commit 0426e13 should be reconsidered, but more on that later.

When I started to debug this I checked a lot of db->db_state values. When I run my while true loops the code runs through the following functions:

zpl_iter_write+0xfb/0x1f0 [zfs]
? sysvec_apic_timer_interrupt+0xb/0x90
zfs_write+0x783/0xde0 [zfs]
dmu_assign_arcbuf_by_dbuf+0x41/0x60 [zfs]
dmu_assign_arcbuf_by_dnode+0x182/0x220 [zfs]
dbuf_assign_arcbuf+0x3c7/0x650 [zfs]

The db->db_state is DB_NOFILL in dbuf_assign_arcbuf, which I would not expect. The ASSERT doesn't expect it either.
EDIT: in zfs_write there is no dirty record in the list, as far as I could tell while debugging. This changes a few function calls later (multi-threading and co.). The new dirty record should be independent of the current operation/txg (which is confirmed a little later in this analysis).
In dbuf_assign_arcbuf there is one dirty record in db->db_dirty_records; it is a brt clone with dr->dr_txg < txg. The brt-cloned block is older and the record exists. This shouldn't be a problem, and db->db_buf is NULL.
If the file that should be cloned is not truncated, there is no problem, I think. Keep in mind that dd truncates the file when opening it. But that doesn't happen that often, and maybe dbuf_assign_arcbuf is only triggered in some cases (a kind of race condition?).
Maybe we are leaking a dirty dbuf from a previous transaction into the current one? Some kind of dbuf_fix_old_data(), but I don't think dbuf_fix_old_data should be done in dbuf_assign_arcbuf?

I think commit 0426e13 is totally fine, but before we changed it, it expected no dirty records left in any txg. That doesn't make sense in combination with the truncate; for that reason we changed it...

Those are my thoughts on this topic so far. I'm left with quite a lot of question marks, but I hope this helps to dive deeper into this error.

@oromenahar
Contributor Author

oromenahar commented Sep 19, 2023

I have tested this again with the zfs-2.2-release branch and master, and rebased it onto the current master. I can still reproduce this error.
@robn @behlendorf could you reproduce and verify this bug? Or do you have any new information or details, even if you couldn't reproduce it?

@robn
Member

robn commented Sep 20, 2023

Sorry, I got pulled away onto something else and then forgot about this. I'll try to have a look this week.

@robn
Member

robn commented Sep 21, 2023

I wasn't able to reproduce this, so I'll have to do a code read and see if I can understand what you're seeing. Maybe tomorrow, but more likely next week - really tight on time right now.

With O_TRUNC, it likely needs to be an option (-t?) to enable it. We still need O_CREAT, but we don't always want to truncate (range clones that aren't the whole file).

@oromenahar
Contributor Author

Oh, that doesn't sound good. I will try it with different disk speeds later or tomorrow. Maybe it depends on disk speed? I have some hardware lying around, or I can use some RAM to simulate super-fast disks. Currently I'm using just one setup (a virtual machine) for all tests.

With O_TRUNC, it likely needs to be an option (-t?) to enable it. We still need O_CREAT, but we don't always want to truncate (range clones that aren't the whole file).

I will add an option for this later; -t sounds good for trunc. It was done this way for quick testing, because I used cp from coreutils and wanted to be sure this also happens in a testing environment. For that reason there is no option right now.

@robn
Member

robn commented Sep 21, 2023

@oromenahar I just added -t to clonefile upstream: https://github.com/robn/clonefile/blob/main/clonefile.c. I'll pull it into the next PR that needs it for a test (ie this one, if I can track it down).

I've a little time today so I'm gonna keep pushing on this.

@robn
Member

robn commented Sep 21, 2023

@oromenahar since I can't reproduce it yet, I'm trying to understand it from your descriptions, hopefully leading to a more reliable reproduction. So I might have questions as I go!

For now, can you give me a full backtrace from the crash site in dbuf_assign_arcbuf, and also a trace from dmu_free_long_range_impl if you can?

@robn
Member

robn commented Sep 22, 2023

I can reproduce it and I think I know what's happening. Patch incoming.

@robn
Member

robn commented Sep 22, 2023

Ok, I somewhat know what's happening. Here's a preliminary patch: cfa6264

Explanation, quoted from the commit:

		/*
		 * The only valid way for this dbuf to be DB_NOFILL at this
		 * point is for this sequence to happen:
		 *
		 * 1. zfs_clone_range() clones into this dbuf. This leaves the
		 *    dbuf as DB_NOFILL with a dirty record with brtwrite set.
		 *
		 * 2. dmu_free_long_range() (via zfs_trunc() or
		 *    zfs_free_range()) frees the block under this dbuf to the
		 *    end of the object, such that z_size is now behind the
		 *    start of this dbuf.
		 *
		 * 3. zfs_write() attempts to write a full block to this
		 *    offset. The write is past z_size and for a full block,
		 *    so it ends up on this path; it calls dmu_request_arcbuf(),
		 *    fills the buffer, and then dmu_assign_arcbuf_by_dbuf(),
		 *    which loads the NOFILL dbuf and ends up here.
		 *
		 * It shouldn't be possible for a NOFILL dbuf to arrive here
		 * any other way, so we assert that there's also a dirty
		 * brtwrite record attached.
		 */

This only updates assertions though, and I'm not sure that's enough.

I suspect this may need a call to dbuf_undirty() (or possibly dbuf_unoverride()) first to remove the dirty "brtwrite" record, not unlike what dmu_buf_will_dirty_impl() does. Missing it feels like it would cause a clone to happen (bumping the refcount) and then overwrite it, leaking a reference?

I'm also not sure if there's anything that can go wrong if the txg moves in between any of these stages. That one I have thought very hard about and I'm not seeing it, but this stuff is nightmarishly difficult to hold in my head.

I'm also curious about that patch. I don't think it's right on its own, as it causes frees to always be issued and never delayed. If it appeared to solve the problem though, I suppose that must mean that frees also invalidate the dbuf cache in some way. I ran out of time to look into that.

I'm also not sure why this requires huge files to produce. At 4G, an L2 indirect is involved, which may have enough overhead to sufficiently delay something to make this easier to catch?

It'd be great to have a smaller and faster test case to help test the above conditions. I'll keep thinking on how to produce that.

@robn
Member

robn commented Sep 22, 2023

Additional: that's now two places where DB_NOFILL has additional transitions, not covered in the state diagram, that only happen when a dbuf is cloned. At the least I want to update the diagram, but I'm also strongly tempted to introduce a DB_CLONE state for this, just to make it easier to follow the code. I wouldn't try such an invasive change without decent test coverage though, so I'm not likely to look into that soon.

@AllKind
Contributor

AllKind commented Sep 22, 2023

Sorry to jump in on this as an ordinary user, who doesn't understand a quarter of the internals.

But... the more I think of it...

Trying to clone a file and at the same time writing to it...
Is this even solvable?
Is this even something anyone would want?

Wouldn't a clone operation never stop if the process writing to the file never stops? That is assuming you can track the changed bits over and over again. Sounds at least like the mother of race conditions to me.
Even if the process stops writing to the "source" file, wouldn't it be necessary to restart the whole cloning operation, in order to get a correct result?

I don't know how other filesystems handle this scenario (I tried a web search, but didn't find anything useful), but the thought of an "infinity snapshot" sounds nightmarish to me.

Wouldn't it be better to just say:
Hey, you tried to clone that file, but it already changed before I could complete the operation. Sorry, can't do that.

@oromenahar
Contributor Author

First:

but I'm also strongly tempted to introduce a DB_CLONE state for this, just to make it easier to follow the code.

I had the same thoughts while reading the code :D What's a clone and what's not a clone now?

Second, your patch cfa6264:
I think I added the same asserts as you and removed them later for some reason, but I don't actually remember why I removed them and changed it to this patch. I will try your patch, read it again, and try to remember what I did and why I removed the asserts you suggest. But I need some time.
I think it makes sense to add it to this patch set as well. Let me think about it.

I'm also curious about that patch. I don't think it's right on its own, as it causes frees to always be issued and never delayed. If it appeared to solve the problem though, I suppose that must mean that frees also invalidate the dbuf cache in some way. I ran out of time to look into that.

To be honest I don't fully understand dnode_free_range(dn, chunk_begin, chunk_len, tx); but there is a big speed improvement with this patch when running the test case from the opening of this PR.

@oromenahar
Contributor Author

@robn I have added tests to cover this. Basically the four or five shell commands from above, runnable on the CLI. I also used your clonefile.c patch. I think I forgot a Sponsored-by, but you can check it out.

I found some more errors which are difficult to reproduce. I vaguely remember why I didn't add the asserts you suggest.
When using this patch everything works fine until you kill some of the processes with kill -9, or sometimes with ctrl+c. I'm not 100% sure whether this belongs to this error as well; I'll try to understand it a little more.
If you kill one of the processes and then export the pool, the following happens. I couldn't reproduce it when the free is not delayed.
I don't know much about it yet; I'll try to reproduce it later and hopefully track it down.

[ 3018.513124] =============================================================================
[ 3018.513296] BUG arc_buf_hdr_t_full (Tainted: P    B      OE    --------  --- ): Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown()
[ 3018.513478] -----------------------------------------------------------------------------

[ 3018.513855] Slab 0x00000000ef92309f objects=37 used=34 fp=0x00000000ed2ccd2e flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 3018.514051] CPU: 2 PID: 47138 Comm: rmmod Kdump: loaded Tainted: P    B      OE    --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
[ 3018.514243] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 3018.514450] Call Trace:
[ 3018.514638]  <TASK>
[ 3018.514838]  dump_stack_lvl+0x34/0x48
[ 3018.515025]  slab_err.cold+0x53/0x67
[ 3018.515211]  ? _printk+0x58/0x73
[ 3018.515394]  ? cpumask_next+0x1f/0x30
[ 3018.515574]  __kmem_cache_shutdown+0x16e/0x320
[ 3018.515754]  kmem_cache_destroy+0x51/0x160
[ 3018.515945]  spl_kmem_cache_destroy+0x100/0x490 [spl]
[ 3018.516134]  ? __vunmap+0x2ee/0x340
[ 3018.516321]  arc_fini+0x2d8/0x400 [zfs]
[ 3018.516566]  dmu_fini+0xa/0x40 [zfs]
[ 3018.516820]  spa_fini+0x37/0x230 [zfs]
[ 3018.517084]  zfs_kmod_fini+0x6b/0xc0 [zfs]
[ 3018.517339]  openzfs_fini+0xa/0x1004 [zfs]
[ 3018.517599]  __do_sys_delete_module.constprop.0+0x178/0x280
[ 3018.517807]  ? syscall_trace_enter.constprop.0+0x145/0x1d0
[ 3018.518008]  do_syscall_64+0x5c/0x90
[ 3018.518211]  ? __rseq_handle_notify_resume+0x32/0x50
[ 3018.518418]  ? exit_to_user_mode_loop+0xd0/0x130
[ 3018.518626]  ? exit_to_user_mode_prepare+0xb6/0x100
[ 3018.518850]  ? syscall_exit_to_user_mode+0x12/0x30
[ 3018.519061]  ? do_syscall_64+0x69/0x90
[ 3018.519274]  ? syscall_exit_to_user_mode+0x12/0x30
[ 3018.519490]  ? do_syscall_64+0x69/0x90
[ 3018.519708]  ? exc_page_fault+0x62/0x150
[ 3018.519935]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 3018.520158] RIP: 0033:0x7fb4d243f5ab
[ 3018.520383] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48
[ 3018.520894] RSP: 002b:00007ffe68d74e78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 3018.521160] RAX: ffffffffffffffda RBX: 00005568be9ff7c0 RCX: 00007fb4d243f5ab
[ 3018.521425] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005568be9ff828
[ 3018.521692] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3018.521970] R10: 00007fb4d259eac0 R11: 0000000000000206 R12: 00007ffe68d750d0
[ 3018.522242] R13: 00007ffe68d75680 R14: 00005568be9ff2a0 R15: 00005568be9ff7c0
[ 3018.522517]  </TASK>

@oromenahar
Contributor Author

oromenahar commented Sep 22, 2023

@AllKind no, the kernel panics if we don't fix it, and this should work in any case. On Linux several users and processes can access a file at the same time; this shouldn't fail. And it should be possible to truncate a file and write to it a few moments later.
I don't know if you can use the data after that at all. But it doesn't matter if some process destroys your data: the filesystem should stay up and keep running happily.

Wouldn't a clone operation never stop if the process writing to the file never stops? That is assuming you can track the changed bits over and over again. Sounds at least like the mother of race conditions to me.
Even if the process stops writing to the "source" file, wouldn't it be necessary to restart the whole cloning operation, in order to get a correct result?

We are writing to the dst file. The src file is never touched; that case is already covered by another patch.

@robn
Member

robn commented Sep 22, 2023

Yeah, that's leaking an arc_buf_hdr_t, which is what I was wondering about. My guess: we assign the arcbuf to the dbuf, but we don't remove the brtwrite dirty state. So we still write it down as a clone, and never bother to clean up the arcbuf, because clones don't have them.

I am pretty sure now that the final version of this is the patch to fix up the assertions, and then something to undirty the clone, and that's it.

@robn
Member

robn commented Sep 22, 2023

But... the more I think of it...

@AllKind there's your first mistake! 😆

Cloning into and writing into the same block at the same time is conceptually the same as two programs writing at the same time; the difference is just in where we get the data from. The whole trick is making sure that if we haven't written the first change to disk yet, after the second one comes in we forget the first one ever existed.

I agree that there's probably not many uses for cloning into and writing into the same block at the same time, but it fits ok into ZFS' data management model.

@AllKind
Contributor

AllKind commented Sep 22, 2023

But... the more I think of it...
@AllKind there's your first mistake! 😆

I think that's correct ;-) :-p

Thanks for explaining!

@robn robn mentioned this pull request Nov 15, 2023
@mmatuska
Contributor

@amotin do you have this PR in your watch list?

@amotin
Member

amotin commented Dec 2, 2023

Just as a note, I did trigger this assertion on the latest ZFS without the patch, on FreeBSD, with a quickly ported version of the test: https://people.freebsd.org/~mav/zzz1.sh . Will look further.

@oromenahar
Contributor Author

@amotin I don't know if you saw the hint in my own review that the dirty record in the list is from a txg < the current txg. I think this is important. I could confirm it several times while debugging.

@amotin
Member

amotin commented Dec 7, 2023

dirty record in the list is from a txg < the current txg. I think this is important.

@oromenahar If the dirty record txg were always smaller than the current txg, it would be fine. But I suspect that if it is from the current txg it may be a mess, since the single dirty record would have to represent both block cloning and an overwrite (assign), which can't be. And that, I guess, may indeed end up in a buffer leak, data corruption, or both. I think we must call dbuf_undirty() here to clean up the mess. I am now testing it and will open the PR.
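
A rough sketch of the shape of that change (hypothetical code, not the actual patch; it only reuses dirty-record fields and helpers that already exist in dbuf.c):

	/*
	 * Hypothetical sketch for dbuf_assign_arcbuf(): if the pending dirty
	 * record is a block clone from the *current* TXG, drop it before
	 * assigning the new ARC buffer, so a single dirty record never has
	 * to mean both "clone" and "overwrite" in the same TXG.
	 */
	dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
	if (dr != NULL && dr->dr_txg == tx->tx_txg && dr->dt.dl.dr_brtwrite)
		(void) dbuf_undirty(db, tx);	/* forget the pending clone */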

@amotin
Member

amotin commented Dec 7, 2023

Thinking more about this, I guess dbuf_free_range() called by O_TRUNC should already call dbuf_undirty() on all buffers. And if that is the only way to get into this issue in dbuf_assign_arcbuf(), then the lack of dbuf_undirty() should not be a problem. But I think it would still be better to have it, just in case.

PS: Several times already I've thought about removing dbuf_assign_arcbuf() completely. It is used in only one place and IMO its benefits are questionable.

@oromenahar
Contributor Author

oromenahar commented Dec 9, 2023

@amotin FYI, I could not trigger anything on Linux with your PR #15653, which is possibly a fix.

I'm thinking about the following line:
If we have a dirty record and it is from the current txg, if (dr->dr_txg == txg) {, we call dbuf_unoverride(dr);. If we now apply @robn's solution we get a leaked arc header. I guess we don't release all the arcs? This should be fixed by #15656 as well?
E: tried it and got Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown().
I wonder what's the difference between dbuf_unoverride and dbuf_undirty?

What is your opinion?

dbuf_assign_arcbuf() should improve performance, right? Are there any other benefits?

PS: If I remember correctly I had your solution as well and could trigger the bug, but not super often. I'll try to find anything left over in my git stash history from debugging this bug. Maybe it was a little bit different from your PR.

@amotin
Member

amotin commented Dec 10, 2023

tried it and got Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown().

@oromenahar I was unable to reproduce it. I am not sure how ctrl+c can change anything.

I wonder what's the differents between dbuf_unoverride and dbuf_undirty?

dbuf_unoverride() converts an existing dirty record from an override (sync/cloning/etc) into a normal write. dbuf_undirty() removes the dirty record completely.

dbuf_assign_arcbuf() should improve performance or? Are there any other benefits?

IIRC there is no significant performance benefit in most cases. It was added to move the memory copy, with its possible page faults, out of the transaction group when writing from slow memory, like an mmap'ed NFS file. But on Linux there is a pre-fault implemented that should make faults during the memory copy unlikely. And on FreeBSD there is no page fault waiting at all at that point: the VM subsystem just returns EFAULT, ZFS aborts the write and lets the VFS handle the faults outside the transaction and any locks.

@oromenahar
Contributor Author

oromenahar commented Dec 10, 2023

1)

tried it and got Objects remaining in arc_buf_hdr_t_full on __kmem_cache_shutdown().
@oromenahar I was unable to reproduce it. I am not sure how ctrl+c can change anything.

Killing the process with SIGKILL might have an effect on this, which is different from ctrl+c. But I think it doesn't matter.

1.1)

This was just information that my idea and thoughts didn't work out:

I'm thinking about the following line:
If we have a dirty record and it is from the current txg, if (dr->dr_txg == txg) {, we call dbuf_unoverride(dr);. If we now apply @robn's solution we get a leaked arc header. I guess we don't release all the arcs? This should be fixed by #15656 as well?

Sorry to confuse you.

2)

Thanks for explaining dbuf_unoverride() and dbuf_undirty().
If I'm understanding it correctly we need the dbuf_undirty() in dbuf_assign_arcbuf(). I read @robn's messages again, and he also thought about the undirty. This should be the right solution.
Should I update this PR with your code and my tests? Or do you want to update your PR with my test? I don't like the test much because of the huge 4G files, but I don't have a better idea for testing this.

3)

Thinking more about this, I guess dbuf_free_range() called by O_TRUNC should already call dbuf_undirty() on all buffers. And if that is the only way to get into this issue in dbuf_assign_arcbuf(), then the lack of dbuf_undirty() should not be a problem. But I think it would still be better to have it, just in case.

What about the following case:

dbuf_free_range()
	for (db != NULL)
		mutex()
		dbuf_undirty()
		mutex_exit()
mutex()
dmu_buf_will_clone()
mutex_exit()
mutex()
dbuf_assign_arcbuf()
mutex_exit()

This case should be possible, shouldn't it? In that case we need the undirty in dbuf_assign_arcbuf().

4)

I'm also curious about that patch. I don't think it's right on its own, as it causes frees to always be issued and never delayed. If it appeared to solve the problem though, I suppose that must mean that frees also invalidate the dbuf cache in some way.

I'm confused about delaying the frees. In the first version of this patch, without @robn's fixed asserts, I couldn't trigger the issue. I just changed the code where frees are delayed. I also got a huge speedup while freeing data (which can otherwise slow down other operations). Should we dig into this as well?
I mean the file module/zfs/dmu.c.

@amotin
Member

amotin commented Dec 11, 2023

@oromenahar

  1. My primary question was: what syscall can be aborted inside ZFS? I would not expect many, if any. Kernel threads cannot be killed normally; they would need explicit signal handling for that. At least I briefly looked at some of the ones participating here and haven't noticed anything. If it is just a way to make your test trigger the problem, then maybe the test itself could be improved to not depend on it.
  2. I generally don't care how exactly the patch lands, as long as it gets reviewed and lands. If you don't like the test, maybe it is not worth wasting CI resources on it once the issue is fixed.
  3. In theory it is possible; that is why I'd prefer to have the undirty there. But in practice, dbuf_assign_arcbuf() is now used only for file concatenation, and that should not happen after a clone.
  4. I don't see how that chunk is related to this issue other than affecting the reproduction. If you think there is an issue, create a separate PR and let's think about it. Otherwise IMO it should not be there, though I haven't looked too deeply at it.

This adds a truncate option to clonefile, which is useful
for the test suite.

Signed-off-by: Kay Pedersen <[email protected]>
Original-patched-by: Rob Norris <[email protected]>
In some cases dbuf_assign_arcbuf() may be called on a block that
was recently cloned. If that happened in the current TXG we must
undo the block cloning first, since the single dirty record per
TXG can't and shouldn't mean both cloning and an overwrite at the
same time. For example this can happen while writing to a file
and cloning the same file at the same time.

This is also covered by a test. The file size must be huge, like
4G, to trigger this bug. For that reason a 4G file is created
from /dev/urandom and then a file-clone loop with FICLONE
starts. After that the whole file is overwritten twice. The
test can trigger the bug most of the time, but not every time.

Signed-off-by: oromenahar <[email protected]>
@oromenahar oromenahar marked this pull request as ready for review December 12, 2023 00:30
@oromenahar
Contributor Author

oromenahar commented Dec 12, 2023

@amotin

  1. My signal test doesn't trigger it any more (maybe it never did and it was just a random side effect). Most/all kill signals for userspace processes don't end up in ZFS, and this shouldn't affect anything here.

@behlendorf I cleaned up everything and added @amotin's code to this PR. I changed the test a little bit to be sure the background loop is stopped every time. I hope it doesn't need that much time on the CI.


function loop
{
while $NO_LOOP_BREAK; do clonefile -c -t -q /$TESTPOOL/file /$TESTPOOL/clone; done
Contributor


In the CI environment it appears this loop can completely starve out the dd's below. Can we both reduce the size of this file and cap the number of iterations here and still have this test be useful?

@behlendorf
Contributor

@oromenahar I've merged the fix in #15653 to master. It'd be great if you could rebase this PR on master and then iterate on the test case until we're happy with it.

@amotin
Member

amotin commented Nov 1, 2024

The problem is fixed by another PR.

@amotin amotin closed this Nov 1, 2024
Labels
Status: Code Review Needed Ready for review and testing
6 participants