zfs-2.2.3 patchset #15836

tonyhutter · 2024-01-29T23:18:16Z

Motivation and Context

Support 6.7 kernel and various other fixes.

Description

Proposed patchset for zfs-2.2.3

How Has This Been Tested?

ZTS will test

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

6.7 changed the names of the time members in struct inode, so we can't assign back to it because we don't know its name. In practice this doesn't matter though - if we're missing current_time(), then we must be on <4.9, and we know our fallback will need to return timespec. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn

6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in b37f293 to atime and mtime as well. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn

In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn

6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accomodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: https://github.com/sponsors/robn

The io_uring test fails on CentOS 9 with the following fio error. Disable the test for the benefit of the CI until this can be fully investigated. This basic test passes as expected on newer kernels. Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#15636

On some systems we already have blkdev_get_by_path() with 4 args but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined. The vdev_bdev_mode() function was added to handle this case but there was no generic way to specify exclusive access. Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#15692

We are finding that as customers get larger and faster machines (hundreds of cores, large NVMe-backed pools) they keep hitting relatively low performance ceilings. Our profiling work almost always finds that they're running into bottlenecks on the SPA IO taskqs. Unfortunately there's often little we can advise at that point, because there's very few ways to change behaviour without patching. This commit adds two load-time parameters `zio_taskq_read` and `zio_taskq_write` that can configure the READ and WRITE IO taskqs directly. This achieves two goals: it gives operators (and those that support them) a way to tune things without requiring a custom build of OpenZFS, which is often not possible, and it lets us easily try different config variations in a variety of environments to inform the development of better defaults for these kind of systems. Because tuning the IO taskqs really requires a fairly deep understanding of how IO in ZFS works, and generally isn't needed without a pretty serious workload and an ability to identify bottlenecks, only minimal documentation is provided. Its expected that anyone using this is going to have the source code there as well. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.

Once we verified the ABDs and asserted the sizes we should never see premature ABDs ends. Assert that and remove extra branches from production builds. Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15428

Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#15482 Closes openzfs#15486

- Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output. - Use much faster sbuf_cat() instead of sbuf_printf("%s"). Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15495

It is unused for 3 years since openzfs#10576. Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15507

PR openzfs#15457 exposed weird logic in L2ARC write sizing. If it appeared bigger than device size, instead of liming write it reset all the system-wide tunables to their default. Aside of being excessive, it did not actually help with the problem, still allowing infinite loop to happen. This patch removes the tunables reverting logic, but instead limits L2ARC writes (or at least eviction/trim) to 1/4 of the capacity. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Amanakis <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15519

This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15517

Since we use a limited set of kmem caches, quite often we have unused memory after the end of the buffer. Put there up to a 512-byte canary when built with debug to detect buffer overflows at the free time. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15553

zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means we do not need to do anything (drop the references) for not implemented yet TX_CLONE_RANGE replay for ZVOLs. This is a logical follow up to openzfs#15603. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15612

ZIL claim can not handle block pointers cloned from the future, since they are not yet allocated at that point. It may happen either if the block was just written when it was cloned, or if the pool was frozen or somehow else rewound on import. Handle it from two sides: prevent cloning of blocks with physical birth time from not yet synced or frozen TXG, and abort ZIL claim if we still detect such blocks due to rewind or something else. While there, assert that any cloned blocks we claim are really allocated by calling metaslab_check_free(). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15617

When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key. Added a test for: - Clone from encrypted sibling to encrypted sibling with non encrypted parent - Clone from encrypted parent to inherited encrypted child - Clone from child to sibling with encrypted parent - Clone from snapshot to the original datasets - Clone from foreign snapshot to a foreign dataset - Cloning from non-encrypted to encrypted datasets - Cloning from encrypted to non-encrypted datasets Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Original-patch-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Kay Pedersen <[email protected]> Closes openzfs#15544

Block pointers are not encrypted in TX_WRITE and TX_CLONE_RANGE records, so we can dump them, that may be useful for debugging. Related to openzfs#15543. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15629

To improve 128KB block write performance in case of multiple VDEVs ZIL used to spit those writes into two 64KB ones. Unfortunately it was found to cause LWB buffer overflow, trying to write maximum- sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer, since unlike TX_WRITE ZIL code can't split it. This is a minimally-invasive temporary block cloning fix until the following more invasive prediction code refactoring. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15634

Without this patch on pool of 60 vdevs with ZFS_DEBUG enabled clone takes much more time than copy, while heavily trashing dbgmsg for no good reason, repeatedly dumping all vdevs BRTs again and again, even unmodified ones. I am generally not sure this dumping is not excessive, but decided to keep it for now, just restricting its scope to more reasonable. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15625

dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in case dbuf_hold() failed. I don't have reproduction for this, but it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15644

In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If it happened in current TXG we must undo the block cloning first, since the only one dirty record per TXG can't and shouldn't mean both cloning and overwrite same time. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15653

Block cloning normally creates dirty record without dr_data. But if the block is read after cloning, it is moved into DB_CACHED state and receives the data buffer. If after that we call dbuf_unoverride() to convert the dirty record into normal write, we should give it the data buffer from dbuf and release one. Reviewed-by: Kay Pedersen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15654 Closes openzfs#15656

While 763ca47 closes the situation of block cloning creating unencrypted records in encrypted datasets, existing data still causes panic on read. Setting zfs_recover bypasses this but at the cost of potentially ignoring more serious issues. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Chris Peredun <[email protected]> Closes openzfs#15677

This patch adds check for `kernel_neon_*` symbols on arm and arm64 platforms to address the following issues: 1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so license compatibility must be checked before use. 2. On both arm and arm64, the definitions of these symbols are guarded by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still present. Checking in configuration phase only leads to MODPOST errors (undefined references). Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shengqi Chen <[email protected]> Closes openzfs#15711 Closes openzfs#14555 Closes: openzfs#15401

- Fail if source block is smaller than destination. We can only grow blocks, not shrink them. - Fail if we do not have full znode range lock. In that case grow is not even called. We should improve zfs_rangelock_cb() somehow to know when cloning needs to grow the block size unlike write. - Fail of we tried to resize, but failed. There are many reasons for it to fail that we can not predict at this level, so be ready for them. Unlike write, that may proceed after growth failure, block cloning can't and must return error. This fixes assertion inside dmu_brt_clone() when it sees different number of blocks held in destination than it got block pointers. Builds without ZFS_DEBUG returned EXDEV, so are not affected much. Reviewed-by: Pawel Jakub Dawidek <[email protected]> Reviewed-by: Brian Atkinson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15724 Closes openzfs#15735

Two block pointers in livelist pointing to the same location may be caused not only by dedup, but also by block cloning. We should not assert D bit set in them. Two block pointers in livelist pointing to the same location may have different logical birth time in case of dedup or cloning. We should assert identical physical birth time instead. Assert identical physical block size between pointers in addition to checksum, since that is what checksums are calculated on. Reviewed-by: Matthew Ahrens <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15732

Missed in openzfs#15695, backporting openzfs#15675. Signed-off-by: Rob Norris <[email protected]>

sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by sbuf_new_for_sysctl(). In particular, this code triggers an assertion failure in sbuf_clear(). Simplify by just using sysctl_handle_string() for both reading and setting the tunable. Fixes: 6930ecb ("spa: make read/write queues configurable") Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reported-by: Peter Holm <[email protected]> Signed-off-by: Mark Johnston <[email protected]> Closes openzfs#15719

For FreeBSD sysctls, we don't want the extra newline, since the sysctl(8) utility will format strings appropriately. Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reported-by: Peter Holm <[email protected]> Signed-off-by: Mark Johnston <[email protected]> Closes openzfs#15719

mmatuska · 2024-02-10T09:15:48Z

The newly posted #15874 is an important BRT fix as well

mmatuska · 2024-02-12T09:19:41Z

Please note #15883 was opened as well

Vlad1mir-D · 2024-02-12T12:38:53Z

2.2.3 is going to be such a great!

tonyhutter · 2024-02-12T17:31:33Z

@mmatuska thanks, I'll pull those two in once they're merged.

tonyhutter · 2024-02-12T17:33:08Z

All the builders are failing cp_files_002_pos. I'm looking into it.

tonyhutter · 2024-02-12T21:41:31Z

All the builders are failing cp_files_002_pos

It was failing because zfs_bclone_enabled=0 by default on the 2.2.x branch. I just added a commit to enable it when running the tests.

cp_files_002_pos uses BRT, so enable block cloning in setup/cleanup. This is only something we need to do in zfs-2.2.3, since 2.2.x ships with block cloning disabled by default. Signed-off-by: Tony Hutter <[email protected]>

Because "filesystem" and "volume" are just too long! Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#15864 (cherry picked from commit a5a7254)

Similar to deduplication, the size of data duplicated by block cloning should not be included in the slop space calculation. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Yuxin Wang <[email protected]> Closes openzfs#15874

@classabbyamp

This commit adds the zed_notify_ntfy() function and hooks it into zed_notify(). This will allow ZED to send notifications to ntfy.sh or a self-hosted Ntfy service, which can be received on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC, ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc. Reviewed-by: @classabbyamp Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Dex Wood <[email protected]> Closes openzfs#15584

Fix a misreport in 'zdb -d' where it falsely marked BRT objects as leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Yuxin Wang <[email protected]> Closes openzfs#15882

Vlad1mir-D · 2024-02-14T11:43:29Z

Release! Release! 👏

tonyhutter · 2024-02-14T20:03:26Z

Looks like we're passing on everything except FreeBSD, which is failing on the block cloning tests:

Tests with results other than PASS that are unexpected:
    FAIL bclone/bclone_crossfs_data (expected PASS)
    FAIL bclone/bclone_diffprops_all (expected PASS)
    FAIL bclone/bclone_diffprops_checksum (expected PASS)
    FAIL bclone/bclone_diffprops_compress (expected PASS)
    FAIL bclone/bclone_diffprops_copies (expected PASS)
    FAIL bclone/bclone_diffprops_recordsize (expected PASS)
    FAIL bclone/bclone_prop_sync (expected PASS)
    FAIL bclone/bclone_samefs_data (expected PASS)
    FAIL block_cloning/block_cloning_clone_mmap_cached (expected PASS)
    FAIL block_cloning/block_cloning_clone_mmap_write (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_cross_dataset (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_fallback (expected PASS)
    FAIL block_cloning/block_cloning_copyfilerange_partial (expected PASS)
    FAIL block_cloning/block_cloning_cross_enc_dataset (expected PASS)
    FAIL block_cloning/block_cloning_lwb_buffer_overflow (expected PASS)
    FAIL block_cloning/block_cloning_replay (expected PASS)
    FAIL block_cloning/block_cloning_replay_encrypted (expected PASS)

https://build.openzfs.org/builders/FreeBSD%20stable%2F13%20amd64%20%28TEST%29/builds/1584

The bclone module names are not prefixed with 'zfs' on FreeBSD. This was causing test failues. Signed-off-by: Tony Hutter <[email protected]>

Skip cross filesystem block cloning tests on FreeBSD if running less than version 14.0. Cross filesystem copy_file_range() was added in FreeBSD 14. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#15901

When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15665

META file and changelog updated. Signed-off-by: Tony Hutter <[email protected]>

tonyhutter · 2024-02-22T21:13:38Z

Released: https://github.com/openzfs/zfs/releases/tag/zfs-2.2.3

robn and others added 30 commits December 21, 2023 11:03

ZIL: Assert record sizes in different places

ad47eca

This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15517

freebsd: fix compile for spa_taskq_read/spa_taskq_write params

3bd23fd

Missed in openzfs#15695, backporting openzfs#15675. Signed-off-by: Rob Norris <[email protected]>

tonyhutter force-pushed the zfs-2.2.3-hutter branch from 871c8ae to f5b79f5 Compare February 12, 2024 22:00

tonyhutter and others added 3 commits February 12, 2024 14:04

[zfs-2.2.3] Enable zfs_bclone_enabled on cp_files tests

b699dac

cp_files_002_pos uses BRT, so enable block cloning in setup/cleanup. This is only something we need to do in zfs-2.2.3, since 2.2.x ships with block cloning disabled by default. Signed-off-by: Tony Hutter <[email protected]>

tonyhutter force-pushed the zfs-2.2.3-hutter branch from f5b79f5 to 917bc33 Compare February 12, 2024 22:04

tonyhutter force-pushed the zfs-2.2.3-hutter branch from 917bc33 to 1b8a34d Compare February 12, 2024 22:32

zdb: Fix false leak report for BRT objects

a4978d2

Fix a misreport in 'zdb -d' where it falsely marked BRT objects as leaked. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Yuxin Wang <[email protected]> Closes openzfs#15882

tonyhutter force-pushed the zfs-2.2.3-hutter branch 2 times, most recently from e695a02 to 5ee4020 Compare February 13, 2024 17:26

behlendorf approved these changes Feb 13, 2024

View reviewed changes

tonyhutter force-pushed the zfs-2.2.3-hutter branch from 5ee4020 to e90acba Compare February 14, 2024 22:31

tonyhutter added 2 commits February 16, 2024 09:33

[zfs-2.2.3] ZTS: Use correct bclone module param name on FreeBSD

d92fbe2

The bclone module names are not prefixed with 'zfs' on FreeBSD. This was causing test failues. Signed-off-by: Tony Hutter <[email protected]>

tonyhutter force-pushed the zfs-2.2.3-hutter branch 2 times, most recently from 484f200 to 9f747a5 Compare February 20, 2024 19:26

tonyhutter force-pushed the zfs-2.2.3-hutter branch from 9f747a5 to 114727d Compare February 20, 2024 23:54

Tag zfs-2.2.3

c883088

META file and changelog updated. Signed-off-by: Tony Hutter <[email protected]>

tonyhutter force-pushed the zfs-2.2.3-hutter branch from 114727d to c883088 Compare February 21, 2024 17:27

tonyhutter merged commit c883088 into openzfs:zfs-2.2-release Feb 22, 2024
22 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zfs-2.2.3 patchset #15836

zfs-2.2.3 patchset #15836

tonyhutter commented Jan 29, 2024

mmatuska commented Feb 10, 2024

mmatuska commented Feb 12, 2024

Vlad1mir-D commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

Vlad1mir-D commented Feb 14, 2024

tonyhutter commented Feb 14, 2024

tonyhutter commented Feb 22, 2024

zfs-2.2.3 patchset #15836

zfs-2.2.3 patchset #15836

Conversation

tonyhutter commented Jan 29, 2024

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

mmatuska commented Feb 10, 2024

mmatuska commented Feb 12, 2024

Vlad1mir-D commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

tonyhutter commented Feb 12, 2024

Vlad1mir-D commented Feb 14, 2024

tonyhutter commented Feb 14, 2024

tonyhutter commented Feb 22, 2024