Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs-2.2.3 patchset #15836

Merged
merged 91 commits into from
Feb 22, 2024
Merged

Conversation

tonyhutter
Copy link
Contributor

Motivation and Context

Support 6.7 kernel and various other fixes.

Description

Proposed patchset for zfs-2.2.3

How Has This Been Tested?

ZTS will test

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn and others added 30 commits December 21, 2023 11:03
6.7 changed the names of the time members in struct inode, so we can't
assign back to it because we don't know its name. In practice this
doesn't matter though - if we're missing current_time(), then we must be
on <4.9, and we know our fallback will need to return timespec.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://github.com/sponsors/robn
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and
i_mtime. This extends the method used for ctime in b37f293 to atime
and mtime as well.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://github.com/sponsors/robn
In 6.7 the superblock shrinker member s_shrink has changed from being an
embedded struct to a pointer. Detect this, and don't take a reference if
it already is one.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://github.com/sponsors/robn
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accomodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: https://github.com/sponsors/robn
The io_uring test fails on CentOS 9 with the following fio error.
Disable the test for the benefit of the CI until this can be fully
investigated.  This basic test passes as expected on newer kernels.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#15636
On some systems we already have blkdev_get_by_path() with 4 args
but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined.
The vdev_bdev_mode() function was added to handle this case
but there was no generic way to specify exclusive access.

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#15692
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Once we verified the ABDs and asserted the sizes we should never
see premature ABDs ends.  Assert that and remove extra branches
from production builds.

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15428
Add a dataset_kstats_rename function, and call it when renaming
a zvol on FreeBSD and Linux.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alan Somers <[email protected]>
Sponsored-by: Axcient
Closes openzfs#15482
Closes openzfs#15486
- Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl
output.
- Use much faster sbuf_cat() instead of sbuf_printf("%s").

Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes
to seconds, making dbufstat almost usable.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15495
It is unused for 3 years since openzfs#10576.

Reviewed-by: George Melikov <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15507
PR openzfs#15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than device size, instead of liming write it reset all the
system-wide tunables to their default.  Aside of being excessive,
it did not actually help with the problem, still allowing infinite
loop to happen.

This patch removes the tunables reverting logic, but instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15519
This should make sure we have log written without overflows.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15517
Since we use a limited set of kmem caches, quite often we have unused
memory after the end of the buffer.  Put there up to a 512-byte canary
when built with debug to detect buffer overflows at the free time.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15553
zil_claim_clone_range() takes references on cloned blocks before ZIL
replay.  Later zil_free_clone_range() drops them after replay or on
dataset destroy.  The total balance is neutral.  It means we do not
need to do anything (drop the references) for not implemented yet
TX_CLONE_RANGE replay for ZVOLs.

This is a logical follow up to openzfs#15603.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15612
ZIL claim can not handle block pointers cloned from the future,
since they are not yet allocated at that point.  It may happen
either if the block was just written when it was cloned, or if
the pool was frozen or somehow else rewound on import.

Handle it from two sides: prevent cloning of blocks with physical
birth time from not yet synced or frozen TXG, and abort ZIL claim
if we still detect such blocks due to rewind or something else.

While there, assert that any cloned blocks we claim are really
allocated by calling metaslab_check_free().

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15617
When two datasets share the same master encryption key, it is safe
to clone encrypted blocks. Currently only snapshots and clones
of a dataset share with it the same encryption key.

Added a test for:
- Clone from encrypted sibling to encrypted sibling with
  non encrypted parent
- Clone from encrypted parent to inherited encrypted child
- Clone from child to sibling with encrypted parent
- Clone from snapshot to the original datasets
- Clone from foreign snapshot to a foreign dataset
- Cloning from non-encrypted to encrypted datasets
- Cloning from encrypted to non-encrypted datasets

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Original-patch-by: Pawel Jakub Dawidek <[email protected]>
Signed-off-by: Kay Pedersen <[email protected]>
Closes openzfs#15544
Block pointers are not encrypted in TX_WRITE and TX_CLONE_RANGE
records, so we can dump them, that may be useful for debugging.

Related to openzfs#15543.
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15629
To improve 128KB block write performance in case of multiple VDEVs
ZIL used to spit those writes into two 64KB ones.  Unfortunately it
was found to cause LWB buffer overflow, trying to write maximum-
sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into
68KB buffer, since unlike TX_WRITE ZIL code can't split it.

This is a minimally-invasive temporary block cloning fix until the
following more invasive prediction code refactoring.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Ameer Hamza <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15634
Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone
takes much more time than copy, while heavily trashing dbgmsg for
no good reason, repeatedly dumping all vdevs BRTs again and again,
even unmodified ones.

I am generally not sure this dumping is not excessive, but decided
to keep it for now, just restricting its scope to more reasonable.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15625
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in
case dbuf_hold() failed.  I don't have reproduction for this, but
it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15644
In some cases dbuf_assign_arcbuf() may be called on a block that
was recently cloned.  If it happened in current TXG we must undo
the block cloning first, since the only one dirty record per TXG
can't and shouldn't mean both cloning and overwrite same time.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15653
Block cloning normally creates dirty record without dr_data.  But if
the block is read after cloning, it is moved into DB_CACHED state and
receives the data buffer.  If after that we call dbuf_unoverride()
to convert the dirty record into normal write, we should give it the
data buffer from dbuf and release one.

Reviewed-by: Kay Pedersen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15654
Closes openzfs#15656
While 763ca47 closes the situation of block cloning creating
unencrypted records in encrypted datasets, existing data still causes
panic on read. Setting zfs_recover bypasses this but at the cost of
potentially ignoring more serious issues.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chris Peredun <[email protected]>
Closes openzfs#15677
This patch adds check for `kernel_neon_*` symbols on arm and arm64
platforms to address the following issues:

1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so
   license compatibility must be checked before use.
2. On both arm and arm64, the definitions of these symbols are guarded
   by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still
   present. Checking in configuration phase only leads to MODPOST
   errors (undefined references).

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shengqi Chen <[email protected]>
Closes openzfs#15711 
Closes openzfs#14555 
Closes: openzfs#15401
- Fail if source block is smaller than destination.  We can only
grow blocks, not shrink them.
 - Fail if we do not have full znode range lock.  In that case grow
is not even called.  We should improve zfs_rangelock_cb() somehow
to know when cloning needs to grow the block size unlike write.
 - Fail of we tried to resize, but failed.  There are many reasons
for it to fail that we can not predict at this level, so be ready
for them.  Unlike write, that may proceed after growth failure,
block cloning can't and must return error.

This fixes assertion inside dmu_brt_clone() when it sees different
number of blocks held in destination than it got block pointers.
Builds without ZFS_DEBUG returned EXDEV, so are not affected much.

Reviewed-by: Pawel Jakub Dawidek <[email protected]>
Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15724 
Closes openzfs#15735
Two block pointers in livelist pointing to the same location may
be caused not only by dedup, but also by block cloning. We should
not assert D bit set in them.

Two block pointers in livelist pointing to the same location may
have different logical birth time in case of dedup or cloning. We
should assert identical physical birth time instead.

Assert identical physical block size between pointers in addition
to checksum, since that is what checksums are calculated on.

Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15732
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl().  In particular, this code triggers an assertion
failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.

Fixes: 6930ecb ("spa: make read/write queues configurable")
Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reported-by: Peter Holm <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#15719
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reported-by: Peter Holm <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes openzfs#15719
@mmatuska
Copy link
Contributor

The newly posted #15874 is an important BRT fix as well

@mmatuska
Copy link
Contributor

Please note #15883 was opened as well

@Vlad1mir-D
Copy link

2.2.3 is going to be such a great!

@tonyhutter
Copy link
Contributor Author

@mmatuska thanks, I'll pull those two in once they're merged.

@tonyhutter
Copy link
Contributor Author

All the builders are failing cp_files_002_pos. I'm looking into it.

@tonyhutter
Copy link
Contributor Author

All the builders are failing cp_files_002_pos

It was failing because zfs_bclone_enabled=0 by default on the 2.2.x branch. I just added a commit to enable it when running the tests.

tonyhutter and others added 3 commits February 12, 2024 14:04
cp_files_002_pos uses BRT, so enable block cloning in setup/cleanup.
This is only something we need to do in zfs-2.2.3, since 2.2.x ships
with block cloning disabled by default.

Signed-off-by: Tony Hutter <[email protected]>
Because "filesystem" and "volume" are just too long!

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#15864
(cherry picked from commit a5a7254)
Similar to deduplication, the size of data duplicated by block cloning
should not be included in the slop space calculation.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Yuxin Wang <[email protected]>
Closes openzfs#15874
This commit adds the zed_notify_ntfy() function and hooks it
into zed_notify(). This will allow ZED to send notifications
to ntfy.sh or a self-hosted Ntfy service, which can be received
on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC,
ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc.

Reviewed-by: @classabbyamp
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Dex Wood <[email protected]>
Closes openzfs#15584
Fix a misreport in 'zdb -d' where it falsely marked
BRT objects as leaked.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Yuxin Wang <[email protected]>
Closes openzfs#15882
@tonyhutter tonyhutter force-pushed the zfs-2.2.3-hutter branch 2 times, most recently from e695a02 to 5ee4020 Compare February 13, 2024 17:26
@Vlad1mir-D
Copy link

Release! Release! 👏

@tonyhutter
Copy link
Contributor Author

Looks like we're passing on everything except FreeBSD, which is failing on the block cloning tests:

Tests with results other than PASS that are unexpected:
    FAIL bclone/bclone_crossfs_data (expected PASS)
    FAIL bclone/bclone_diffprops_all (expected PASS)
    FAIL bclone/bclone_diffprops_checksum (expected PASS)
    FAIL bclone/bclone_diffprops_compress (expected PASS)
    FAIL bclone/bclone_diffprops_copies (expected PASS)
    FAIL bclone/bclone_diffprops_recordsize (expected PASS)
    FAIL bclone/bclone_prop_sync (expected PASS)
    FAIL bclone/bclone_samefs_data (expected PASS)
    FAIL block_cloning/block_cloning_clone_mmap_cached (expected PASS)
    FAIL block_cloning/block_cloning_clone_mmap_write (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_cross_dataset (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_fallback (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_partial (expected PASS)
    FAIL block_cloning/block_cloning_cross_enc_dataset (expected PASS)
    FAIL block_cloning/block_cloning_lwb_buffer_overflow (expected PASS)
    FAIL block_cloning/block_cloning_replay (expected PASS)
    FAIL block_cloning/block_cloning_replay_encrypted (expected PASS)

https://build.openzfs.org/builders/FreeBSD%20stable%2F13%20amd64%20%28TEST%29/builds/1584

The bclone module names are not prefixed with 'zfs' on FreeBSD.
This was causing test failues.

Signed-off-by: Tony Hutter <[email protected]>
Skip cross filesystem block cloning tests on FreeBSD if running
less than version 14.0.  Cross filesystem copy_file_range() was
added in FreeBSD 14.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tony Hutter <[email protected]>
Closes openzfs#15901
@tonyhutter tonyhutter force-pushed the zfs-2.2.3-hutter branch 2 times, most recently from 484f200 to 9f747a5 Compare February 20, 2024 19:26
When ZFS overwrites a whole block, it does not bother to read the
old content from disk. It is a good optimization, but if the buffer
fill fails due to page fault or something else, the buffer ends up
corrupted, neither keeping old content, nor getting the new one.

On FreeBSD this is additionally complicated by page faults being
blocked by VFS layer, always returning EFAULT on attempt to write
from mmap()'ed but not yet cached address range.  Normally it is
not a big problem, since after original failure VFS will retry the
write after reading the required data.  The problem becomes worse
in specific case when somebody tries to write into a file its own
mmap()'ed content from the same location.  In that situation the
only copy of the data is getting corrupted on the page fault and
the following retries only fixate the status quo.  Block cloning
makes this issue easier to reproduce, since it does not read the
old data, unlike traditional file copy, that may work by chance.

This patch provides the fill status to dmu_buf_fill_done(), that
in case of error can destroy the corrupted buffer as if no write
happened.  One more complication in case of block cloning is that
if error is possible during fill, dmu_buf_will_fill() must read
the data via fall-back to dmu_buf_will_dirty().  It is required
to allow in case of error restoring the buffer to a state after
the cloning, not not before it, that would happen if we just call
dbuf_undirty().

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes openzfs#15665
META file and changelog updated.

Signed-off-by: Tony Hutter <[email protected]>
@tonyhutter tonyhutter merged commit c883088 into openzfs:zfs-2.2-release Feb 22, 2024
22 of 26 checks passed
@tonyhutter
Copy link
Contributor Author

Released: https://github.com/openzfs/zfs/releases/tag/zfs-2.2.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.