
Linux 5.13 compat: retry zvol_open() when contended #12759

Merged (1 commit, Dec 2, 2021)

Conversation

behlendorf (Contributor)

Motivation and Context

Issue #12301

Description

Due to a possible lock inversion in the zvol open call path on Linux,
we need to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reacquire it, then call the open callback
again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open(). This should always succeed except in the case where
a pool's vdevs are layered on zvols, in which case it may fail. To
handle this case, vdev_disk_open() has been updated to retry
opening a device when -ERESTARTSYS is returned.

How Has This Been Tested?

Tested locally by running the zfs_copies_003_pos test in a loop on
Fedora with the 5.14.16-301.fc35.x86_64 kernel. This test would
often fail with the new kernel because the block device could not
be opened and would return an error (ERESTARTSYS). With this
change applied the test now runs reliably.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)



lundman commented Nov 13, 2021

On macOS we had some locking issues with zvol_open, since requests come in from diskarbitration. We had to do a fairly ugly thing for it:
https://github.com/openzfsonosx/openzfs/blob/development/module/os/macos/zfs/zvol_os.c#L172-L248

Do we want to think about a shared code solution?


behlendorf commented Nov 13, 2021

Do we want to think about a shared code solution?

That would be grand; unfortunately, after looking at this code for a while I think that the requirements for each platform are just different enough to make it not really worthwhile. But if you see a way to nicely unify the FreeBSD, Linux, and macOS implementations I'm all ears.


jgottula commented Nov 13, 2021

Glad to see this; will test it out soon. 👍


Also, just double checking to make sure I have this right:

  • For "normal" zvol_open() situations, this should resolve the lock issues 100% of the time; essentially back to pre-v5.13 behavior where things "just worked".
  • For vdev-on-zvol situations, it will do up to 100-ish retries (every ~10ms for 1000ms); and the open isn't 100% guaranteed to succeed every time.
    • However, if the timeout is hit and failure occurs, the open() caller will at the very least get a sane, userspace-friendly errno value. (Er, wait, I guess only the internal zfs vdev code would even see that anyway...?)

Does that all sound correct?

behlendorf (Contributor, Author)

Not quite. In the normal zvol_open() case we'll retry up to 100 times at 10ms intervals. Since contention on the lock is relatively unlikely the retries should prevent failures during open. However, it is still technically possible, in which case an error will be returned.

When using a zvol as a vdev we'll do additional retries when an open fails with ERESTART. This handles the possible deadlock case mentioned in the comments.

jgottula (Contributor)

Okay; thanks for the clarification!

@satmandu satmandu mentioned this pull request Nov 16, 2021
Bronek commented Nov 24, 2021

Since I am affected by this as well (ZVOLs randomly not showing up under /dev/zvol when running ZFS with kernel 5.14), I have applied this patch on top of the 2.1.1 release; I only had to drop the change in tests/test-runner/bin/zts-report.py.in, the other changes applied cleanly. The system does not show any signs of regression and my problem with missing /dev/zvol is fixed. Can we please have it included in release 2.1.2 (#12718)?

Bronek commented Nov 24, 2021

As for the design of this change, would it be viable to have a r/w lock instead? This way all read accesses to ZVOL will not block each other, only when shared lock needs to be promoted to exclusive (or if there already is exclusive lock when read access is requested) will any failures/blocking happen. Even better if reads could access "old" data concurrently, even when the data is being written, exploiting the COW nature of ZFS.

Admittedly I do not know what I am talking about, since I never worked on ZFS internals, and won't be at all offended if this suggestion is dismissed 🤡

KevinBuettner

FWIW, this PR seems to fix most of the problems that I reported in #12764 in addition to the problem with zvol links going missing after VM shutdown in #12712.

I, too, would like to see this fix included in 2.1.2.

jgottula (Contributor)

I wanted to mention that I've been running a system with this patchset, on top of 8ac58acf56, for about 7 days now. (Arch Linux w/ standard Arch kernel config, version 5.15.2.)

Haven't noticed any particularly egregious problems or anything notable.

On the other hand, I haven't gone out of my way to do any particular tests or stressing.

(Also this is not-my-main-system; so there are only a few zvols, rather than literally dozens-to-hundreds of them. 😝 Might be a good idea to stress test on a system with hundreds or thousands of zvols, to see whether it ever does timeout or not... 🤔)

sempervictus (Contributor)

This retry mechanism seems like a code path to which someone will one day return and potentially facepalm under 100s of delegated namespaces contending on that lock, or some other strange condition producing thrash. Do we know what happened in 5.13 to cause this, and if so, is it feasible to push a fix upstream instead of implementing this workaround?

Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reacquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdevs are layered on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry
opening a device when -ERESTARTSYS is returned.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301

behlendorf commented Nov 30, 2021

Do we know what happened in 5.13 to cause this

We do. This change to the 5.13 kernel removed the kernel provided mechanism for retrying the open which we were depending on. Since it was explicitly removed I doubt upstream would be receptive to putting it back for us, but you could re-apply the change to a custom kernel.

Longer term I agree we're going to want to find a way to restructure the locking in ZFS to remove the need for this entirely. However, that's going to be a more disruptive change so it's something we're going to want to tackle in a different PR.

sempervictus (Contributor)

Thank you sir; looks like they were doing just about the same thing internally though, so not much of a "fix" in there either :-.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 1, 2021

jgottula commented Dec 1, 2021

@behlendorf @tonyhutter Oops, I meant to submit my code review comment a few days back; but only just now realized it was sitting in "pending" state, and therefore not actually visible AFAIK. 🤦‍♂️

Pretty much just poking around at possible edge cases, to get your opinion on whether anyone who ends up having both the belt (pre-5.13 ERESTARTSYS goto-retry path) and suspenders (new ZFS #ifndef HAVE_BLKDEV_GET_ERESTARTSYS code) engaged at the same time on a >=5.13 kernel would hypothetically have anything break, or whether it wouldn't be a problem.

@tonynguien tonynguien self-assigned this Dec 1, 2021
@tonynguien tonynguien merged commit 77e2756 into openzfs:master Dec 2, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 6, 2021
Due to a possible lock inversion the zvol open call path on Linux
needs to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reacquire it, then call the open callback
again.  However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open().  This should always succeed except in the case where
a pool's vdevs are layered on zvols, in which case it may fail.  To
handle this case vdev_disk_open() has been updated to retry
opening a device when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Tony Nguyen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#12301
Closes openzfs#12759
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 6, 2021
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
Labels
Component: ZVOL ZFS Volumes Status: Accepted Ready to integrate (reviewed, tested)
8 participants