Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zvol: ensure device minors are properly cleaned up #16364

Closed
wants to merge 4 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Jul 19, 2024

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

While running the test suite to validate a ZIL change, I kept tripping #14872. Since it was important to me to ensure that I wasn't inadvertently making a bad situation worse, I decided I had to understand that issue first, and ideally fix it.

#14872 is not actually related to the ZIL; the assertion is just thing that noticed a problem at the end of a sequence of "impossible" events. This PR is attempting to tackle the root cause.

I should say that I am pretty unfamiliar with the zvol code, and since this was definitely not on my main quest line, I have not been as diligent with my research as I usually am. So I'll explain what I think is happening first, then talk about the patches and how they attempt to sort it out.

Fixes #14872.
May also fix #12621, #12733.


So as far as ZFS is concerned, a device node in /dev is for zvols what a mount is for filesystems: some connection point that makes a certain type of dataset available to the rest of the system. That is, creating or destroying a device node (what the code calls a "minor") is roughly analogous to mounting or unmounting a filesystem.

The equivalent to “unmounting” is ultimately handled in zvol_remove_minors_impl() and zvol_remove_minor_impl(), which are effectively identical; one just loops over all the zvols, the other just for a single zvol. They remove the node from /dev, remove the associated zvol_state_t from zvol_global_list, free up the block device resources in the kernel and release any other memory.

These functions contain this check:

		/* If in use, leave alone */
		if (zv->zv_open_count > 0 ||
			atomic_read(&zv->zv_suspend_ref)) {
				mutex_exit(&zv->zv_state_lock);
				continue;
		}

As the comment says, if the zvol is in use (someone has called open()) or has been suspended (incoming requests held, used during rollback and recv), then it is skipped.

The key problem here is that nothing ever comes back to actually remove the minor, so there’s a live node in /dev at a time when ZFS believes it’s gone.

Happily, there’s not many cases where this matters in practice. Minors are removed when changing the volmode and snapdev properties, and when renaming a volume. Since the effect of open()'ing a zvol device node is a name->objset_t lookup, the worst case is that you get a handle on a block device that used to be available by that name, or you get nothing because it no longer exists.

At pool export, it’s a different matter. At the top of spa_export_common(), we see:

	if (spa->spa_zvol_taskq) {
		zvol_remove_minors(spa, spa_name(spa), B_TRUE);
		taskq_wait(spa->spa_zvol_taskq);
	}

The zvol_remove_minors() call creates a task that calls zvol_remove_minors_impl(), and then waits until the the taskq it knows it will be dispatched on becomes empty. As above, if the zvol is in use at this moment, it is skipped, and ZFS continues with the export with the device node still visible place, and, crucially, a zvol_state_t for that zvol on the global zvol_state_list.

There is a check a little further down in spa_export_common() to prevent export if the pool is still in use:

        /*
         * A pool cannot be exported or destroyed if there are active
         * references.  If we are resetting a pool, allow references by
         * fault injection handlers.
         */
        if (!spa_refcount_zero(spa) || (spa->spa_inject_ref != 0)) {
                error = SET_ERROR(EBUSY);
                goto fail;
        }

However, it’s trivial to closing the last reference between after minor removal is skipped but before this check; there’s real work and possibly a whole transaction sync in that time.

So at this point there’s a device node in /dev, with the “old” zvol_state_t behind it. Remarkably, an attempt to open it at this point will make it as far as dmu_objset_own(). The pool hold will fail, because the pool isn’t there, so ENOENT will come back up and the open will be rejected.

And if the pool is reimported? Well. The named objset probably actually exists now, so we now have this extraordinarily strange situation where the zvol_state_t from the previous import of the pool is being used for the current import. A very vague description of the result is that the zvol state and the pool state look more-or-less right to each other, so they mostly accept each other, but they are not what they seem, and may perform strangely.


#14872 is merely a symptom of all this; something noticing the strange behaviour. It goes like this.

zvol_misc_trim and zvol_misc_fua both arrange for a write (a TRIM counts as a write for these purposes) to be issued very late in the pool’s lifecycle. In this case the kernel itself is actually the client; the filesystem on top of the zvol has been written to, and the kernel is flushing it.

At export, the minor is removed, but the kernel has the device open, so it gets skipped, and the kernel can continue. It’s issuing sync writes, which of course go to the ZIL. If the timing is just right, these writes can continue after the call to txg_wait_synced() in spa_export_common(). Since the next txg doesn’t pass, these writes are only on the ZIL attached to the zvol dataset.

At import, the ZIL blocks are found on the zvol dataset, and the claim is issued, setting zh_claim_txg. The ZIL now must be replayed before the dataset can be used. The assertion ASSERT(zh->zh_claim_txg == 0) in zil_create() ensures this, as the only thing that can zero zh_claim_txg is zil_sync() with zl_destroy_txg set, and that is only done by zil_replay().

ZIL replay is done at “mount” time, which for a zvol is when the minor is created. This happens in zvol_os_create_minor(), which is called from zvol_create_minors_recursive() during import. Early on, this calls zvol_find_by_name_hash() to see if the zvol_state_t already exists for the name. It finds the one left over from the previous pool run, and so just returns, because the minor already exists. Because of that, it never gets to calling zil_replay().

Soon, a write happens. zvol_write() calls zil_commit(), which calls zil_create(), and the unprocessed claim is discovered, and the assertion tripped.

Description

First things first: the first two commits are trivial cleanup. They could easily stand alone, and I will break them into a separate PR if wanted. Meanwhile, the final commit is just re-enabling the disabled zvol_misc tests that this should fix.

The real meat is in the third commit.

If the minor is not in use at the moment we check it, nothing changes - it just gets cleaned up on the spot.

If it is in use however, the behaviour changes. Instead of skipping the zvol, a new flag ZVOL_REMOVING gets set on the zvol_state_t, and then arrangements are made for something to wait on a new per-zvol condvar zv_removing_cv. When the last user finishes, if the flag is set, this condvar will be signaled and the waiter will go on to do the removal.

Since we know we’re about to wait, we want to throw off any new clients. So ZVOL_REMOVING also causes new open or IO requests to return ENXIO immediately.

For zvol_remove_minor_impl(), the calling thread immediately starts waiting. This is a utility function that implements the minor removal for snapdev and volmode changes and volume rename, which are all admin actions, so blocking the thread won’t affect anything else.

For zvol_remove_minors_impl() this is a little more complicated. This is always called on a task running from spa_zvol_taskq, and it runs through all the matching zvols on a loop. Rather than wait on the spot, we set ZVOL_REMOVING immediately and then dispatched a new task on that queue that immediately goes to sleep waiting for its zvol to finish.

The zvol_remove_minors_impl() task is dispatched from zvol_remove_minors(), which is always called with its async flag set, and only one of its callers actually cares about waiting for the minors to be removed. That caller is the snippet ofspa_export_common() shown earlier, and it waits for spa_zvol_taskq to be empty, and so will end up waiting until all minors are gone before completing the export.

Notes

Having spent some time in this code, there’s a lot I don’t love about about how it does its async work. However, as noted above, I really don’t have time for anything more here at the moment, so what I’ve done is kind of the first step towards the improvements I’d like to do, while still solving the immediate problem.

  • It seems like in general there’s a lot of stuff that can go wrong that isn’t considered at all, usually through void returns. I’d want to check everything and make sure failures are sensible. This seems particularly important when there’s sequences
  • zvol_remove_minors() and zvol_rename_minors() both have an “async” flag option that is always set. This is pointless. This also means that I’ve technically broken sync zvol_remove_minors(), because it only waits for the initial task to complete, not any followups. I decided for this that I didn’t care; the one place that might benefit from it (spa_export_common) is already doing its own thing to sync up. I could have just done a delay_list in zvol_remove_minors_impl() like I did in zvol_remove_minor_impl(), but I suspect in the long run a taskq is more likely to be a better model.
  • While this generally should be more robust, there’s still probably cases where the caller really should wait until the remove is done before proceeding. I have not considered these cases at all; this analysis is hard work and has gone on long enough anyway.
  • If its possible to remove the device nodes without removing the device as such, there might be a better structure where we can do this every time immediately and then clean up at our leisure.
  • Waiting “forever” at export might be too much; its quite possible for a caller to hold the device open. I don’t think a timeout is really the right thing here; maybe it could time out and return EBUSY, or maybe there could be a more forcible detach.

As with everything, I’ll get to it someday. For now, hopefully what’s here is enough to solve the problem at hand. If someone else wants a project, please take it on with my blessing 😬

How Has This Been Tested?

All testing done on Linux.

zvol_misc test tag has been run a few hundred times on kernel 6.1.94. Previously it would trip the assert described in #14872 after zvol_misc_trim and/or zvol_misc_fua maybe four out of five runs. Since then, no hits.

zvol test tag has been run a few dozen times, all passes. Full ZTS run twice, also all passes (modulo normal test suite flakiness).

Meanwhile on FreeBSD, this has been compile checked only. The OS-specific changes are very small and conceptually identical to the Linux one, so I just followed the local style. Obviously needs a test, which I will try to do myself, but I’d appreciate it if someone could take a closer look.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

module/zfs/zvol.c Outdated Show resolved Hide resolved
@tonyhutter
Copy link
Contributor

@robn great detective work on this! I will take a look

@tonyhutter tonyhutter added the Status: Code Review Needed Ready for review and testing label Jul 24, 2024
@tonyhutter
Copy link
Contributor

Hopefully we can get one more reviewer to approve this one

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with this revision of the code and will need a deeper look, but I have feeling that both existing and new design are sub-optimal for FreeBSD. Unlike Linux FreeBSD has a way to destroy open device without waiting indefinitely for application to close the device. Application may stay with open device as long as it desire, but any request other than close will fail fast within devfs layer and never reach the ZVOL code, which can safely destroy all it wants. Its being a while since I've done it last time, but whether it is a geom of dev volmode, zvols should be able to signal wanted destruction before waiting for the close at the OS-independent code.

robn added 4 commits July 26, 2024 12:01
ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
SET_ERROR is our facility for tracking errors internally. The negation
is to match the what the kernel expects from us. Thus, the negation
should happen outside of the SET_ERROR.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Last commit should fix the underlying problem, so these should be
passing reliably again.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn
Copy link
Member Author

robn commented Jul 28, 2024

FreeBSD has a way to destroy open device without waiting indefinitely for application to close the device.

@amotin thanks, I'll look into it!

@behlendorf behlendorf self-requested a review August 6, 2024 00:01
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robn thanks for running this down and the clear description of the issue. Oh and I agree, there's definitely lots of room for improvement here if somebody has the time.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 6, 2024
@behlendorf behlendorf closed this in 9b9a393 Aug 6, 2024
behlendorf pushed a commit that referenced this pull request Aug 6, 2024
SET_ERROR is our facility for tracking errors internally. The negation
is to match the what the kernel expects from us. Thus, the negation
should happen outside of the SET_ERROR.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes #16364
behlendorf pushed a commit that referenced this pull request Aug 6, 2024
Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes #14872
Closes #16364
behlendorf pushed a commit that referenced this pull request Aug 6, 2024
Last commit should fix the underlying problem, so these should be
passing reliably again.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes #16364
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
ZVOL_DUMPIFIED is a vestigial Solaris leftover, and not used anywhere.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16364
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
SET_ERROR is our facility for tracking errors internally. The negation
is to match the what the kernel expects from us. Thus, the negation
should happen outside of the SET_ERROR.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16364
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Currently, if a minor is in use when we try to remove it, we'll skip it
and never come back to it again. Since the zvol state is hung off the
minor in the kernel, this can get us into weird situations if something
tries to use it after the removal fails. It's even worse at pool export,
as there's now a vestigial zvol state with no pool under it. It's
weirder again if the pool is subsequently reimported, as the zvol code
(reasonably) assumes the zvol state has been properly setup, when it's
actually left over from the previous import of the pool.

This commit attempts to tackle that by setting a flag on the zvol if its
minor can't be removed, and then checking that flag when a request is
made and rejecting it, thus stopping new work coming in.

The flag also causes a condvar to be signaled when the last client
finishes. For the case where a single minor is being removed (eg
changing volmode), it will wait for this signal before proceeding.
Meanwhile, when removing all minors, a background task is created for
each minor that couldn't be removed on the spot, and those tasks then
wake and clean up.

Since any new tasks are queued on to the pool's spa_zvol_taskq,
spa_export_common() will continue to wait at export until all minors are
removed.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#14872
Closes openzfs#16364
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Last commit should fix the underlying problem, so these should be
passing reliably again.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#16364
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZTS zvol_misc_snapdev failure ZTS: zvol_misc_trim VERIFY(zh->zh_claim_txg == 0) on F38
5 participants