Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

shaan1337 · 2021-11-24T20:30:41Z

Motivation and Context

Description

Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared.

This PR deals with 2 cases:

No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete.

One such example is a race condition that can occur in the linux kernel in filemap_write_and_wait_range(). Between the call to __filemap_fdatawrite_range and the call to filemap_fdatawait_range, it is possible that the WB_SYNC_ALL writeback completes, and another WB_SYNC_NONE writeback starts (e.g a background writeback by the kernel). The WB_SYNC_ALL writeback then ends up waiting for the WB_SYNC_NONE's writeback to complete.

A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback.

One such example can occur if a WB_SYNC_NONE writeback starts at this point:

zfs/module/os/linux/zfs/zpl_file.c

Line 167 in ddc026f

If a WB_SYNC_NONE writeback starts before entering filemap_write_and_wait_range and manages to clear the dirty flags on the page, the filemap_write_and_wait_range call will make an early return here in __filemap_fdatawrite_range:
https://github.com/torvalds/linux/blob/cb690f5238d71f543f4ce874aa59237cf53a877c/mm/filemap.c#L403

The function then calls filemap_fdatawait_range() and it'll wait for several seconds.

Both of the above cases have been identified with the repro application (included as a failing test) and through tracing.

The fix works by carefully keeping track of active sync/non-sync writebacks and makes sure that:
i) a sync writeback will speed up any active non-sync writeback by doing a commit
ii) a non-sync writeback will speed itself up by doing a commit if there are any active sync writebacks

How Has This Been Tested?

By following reproduction steps: msync occasionally takes several seconds (up to ~15 seconds) #12662 (comment)
With the included failing test

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

shaan1337 · 2021-12-03T04:57:29Z

ZFS test suite results:
20211202T174007.tar.gz

Results Summary
PASS	 1392
FAIL	 106
SKIP	  87

Running Time:	04:27:40
Percent passed:	87.8%
Log directory:	/var/tmp/test_results/20211202T174007

Tests with results other than PASS that are expected:
    FAIL alloc_class/alloc_class_009_pos (Known issue)
    FAIL alloc_class/alloc_class_010_pos (Known issue)
    FAIL alloc_class/alloc_class_013_pos (https://github.com/openzfs/zfs/issues/11888)
    FAIL casenorm/mixed_formd_delete (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_formd_lookup (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_formd_lookup_ci (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_none_lookup_ci (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/sensitive_formd_delete (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/sensitive_formd_lookup (https://github.com/openzfs/zfs/issues/7633)
    SKIP chattr/setup (Test user execute permissions required for utilities)
    FAIL cli_root/zfs_copies/zfs_copies_003_pos (https://github.com/openzfs/zfs/issues/12301)
    FAIL cli_root/zfs_rename/zfs_rename_002_pos (Known issue)
    FAIL cli_root/zfs_rollback/zfs_rollback_001_pos (Known issue)
    FAIL cli_root/zfs_rollback/zfs_rollback_002_pos (Known issue)
    FAIL cli_root/zpool_add/zpool_add_004_pos (Known issue)
    FAIL cli_root/zpool_expand/zpool_expand_001_pos (Known issue)
    FAIL cli_root/zpool_import/import_rewind_device_replaced (Arbitrary pool rewind is not guaranteed)
    SKIP cli_root/zpool_import/zpool_import_missing_003_pos (https://github.com/openzfs/zfs/issues/6839)
    SKIP crtime/crtime_001_pos (Kernel statx(2) system call required on Linux)
    SKIP delegate/setup (Test user execute permissions required for utilities)
    FAIL history/history_006_neg (https://github.com/openzfs/zfs/issues/5657)
    FAIL history/history_008_pos (Known issue)
    SKIP history/history_010_pos (Test user execute permissions required for utilities)
    FAIL largest_pool/largest_pool_001_pos (Known issue)
    FAIL no_space/enospc_002_pos (Exact free space reporting is not guaranteed)
    SKIP pam/setup (pamtester might be not available)
    SKIP projectquota/setup (Test user execute permissions required for utilities)
    SKIP pyzfs/pyzfs_unittest (Python modules missing: python-cffi)
    FAIL refreserv/refreserv_004_pos (Known issue)
    FAIL refreserv/refreserv_raidz (Known issue)
    SKIP removal/removal_with_zdb (Known issue)
    FAIL rsend/rsend_007_pos (Known issue)
    SKIP rsend/rsend_008_pos (https://github.com/openzfs/zfs/issues/6066)
    FAIL rsend/rsend_010_pos (Known issue)
    FAIL rsend/rsend_011_pos (Known issue)
    FAIL rsend/send-c_volume (https://github.com/openzfs/zfs/issues/6087)
    FAIL snapshot/clone_001_pos (Known issue)
    FAIL snapshot/rollback_003_pos (Known issue)
    SKIP userquota/setup (Test user execute permissions required for utilities)
    FAIL vdev_zaps/vdev_zaps_007_pos (Known issue)
    FAIL zvol/zvol_misc/zvol_misc_snapdev (https://github.com/openzfs/zfs/issues/12621)
    FAIL zvol/zvol_misc/zvol_misc_volmode (Known issue)

Tests with result of PASS that are unexpected:

Tests with results other than PASS that are unexpected:
    FAIL cli_root/zfs_copies/zfs_copies_006_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_001_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_005_neg (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_008_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_009_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_010_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_011_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_012_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_013_neg (expected PASS)
    FAIL cli_root/zfs_mount/zfs_mount_013_pos (expected PASS)
    FAIL cli_root/zfs_mount/zfs_multi_mount (expected PASS)
    FAIL cli_root/zfs_program/zfs_program_json (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_001_pos (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_004_neg (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_005_neg (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_006_pos (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_007_pos (expected PASS)
    FAIL cli_root/zfs_set/readonly_001_pos (expected PASS)
    FAIL cli_root/zfs_unmount/zfs_unmount_nested (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_014_neg (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_015_neg (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_features_007_pos (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_features_008_pos (expected PASS)
    FAIL cli_root/zpool_destroy/zpool_destroy_001_pos (expected PASS)
    FAIL cli_root/zpool_import/zpool_import_errata3 (expected PASS)
    FAIL cli_root/zpool_import/zpool_import_errata4 (expected PASS)
    FAIL cli_root/zpool_status/zpool_status_features_001_pos (expected PASS)
    FAIL cli_root/zpool_upgrade/zpool_upgrade_features_001_pos (expected PASS)
    FAIL cli_user/misc/zfs_allow_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_get_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_set_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_unallow_001_neg (expected PASS)
    SKIP cli_user/misc/zfs_upgrade_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_detach_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_export_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_get_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_remove_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_status_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_upgrade_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_wait_privilege (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_001_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_002_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_003_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_007_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_disable (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_homedir (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_searchpath (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_001_neg (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_002_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_004_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_005_pos (expected PASS)
    FAIL cli_user/zpool_list/zpool_list_001_pos (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_disable (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_homedir (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_searchpath (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_003_pos (expected PASS)
    FAIL devices/devices_001_pos (expected PASS)
    FAIL events/zed_fd_spill (expected PASS)
    FAIL fault/auto_replace_001_pos (expected PASS)
    FAIL limits/filesystem_limit (expected PASS)
    FAIL limits/snapshot_limit (expected PASS)
    FAIL nopwrite/nopwrite_volume (expected PASS)
    FAIL redacted_send/redacted_mounts (expected PASS)
    FAIL redacted_send/redacted_volume (expected PASS)
    FAIL rsend/recv_dedup_encrypted_zvol (expected PASS)
    FAIL rsend/rsend_004_pos (expected PASS)
    FAIL rsend/send-c_stream_size_estimate (expected PASS)
    FAIL rsend/send-wR_encrypted_zvol (expected PASS)
    FAIL slog/slog_replay_volume (expected PASS)
    FAIL xattr/xattr_004_pos (expected PASS)
    FAIL zvol/zvol_ENOSPC/setup (expected PASS)
    SKIP zvol/zvol_ENOSPC/zvol_ENOSPC_001_pos (expected PASS)
    FAIL zvol/zvol_misc/zvol_misc_002_pos (expected PASS)
    FAIL zvol/zvol_misc/zvol_misc_rename_inuse (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_001_pos (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_002_pos (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_004_pos (expected PASS)

shaan1337 · 2021-12-03T07:26:31Z

ok, ready for review. the last push just added tracepoints for the new counters so it shouldn't affect the zfs test suite results.

some things I've not tested yet:

the changes to FreeBSD fsync
the new tracepoints (not sure what the full tracepoint names are)

sempervictus · 2021-12-13T03:12:11Z

@shaan1337: in practice, does this end up increasing the number of IOPs issued to backing drives, and is there any increase in the number of partial-dnode-writes in the event that there are more sync-outs of small data volumes which may impact space efficiency over time?

shaan1337 · 2021-12-13T12:33:33Z

@sempervictus in the absence of a concurrent WB_SYNC_NONE page writeback (usually due to a kernel-triggered background page writeback), a WB_SYNC_ALL page writeback (e.g triggered by msync) would have called zil_commit() anyway for data integrity purposes.

The fix works by calling zil_commit() whenever there are concurrent WB_SYNC_ALL and WB_SYNC_NONE writebacks. So in principle, it should not affect the layers underneath in any negative way. Even if there are extra zil_commit() calls, they would be without effect most of the time since the transactions in the queue would have already been committed.

Also, in practice the issue itself (simultaneous WB_SYNC_NONE and WB_SYNC_ALL writebacks) occurs very rarely (~once per day on a busy system with continuous writes), so the extra zil_commit(), if any, will be negligible.

shaan1337 · 2021-12-17T04:45:50Z

@sempervictus one important case worth mentioning though is when you're doing a partial msync of a file. since we keep track of active sync/non-sync page writes at the znode level (not at the page level), this means that all dirty pages related to the file will be written to stable storage in the event of a concurrent WB_SYNC_NONE and WB_SYNC_ALL writeback.

This would apply mainly when one or more parts of a file are msync-ed for data integrity purposes while the other parts are not (they are left to the OS to write back). I'm not sure how common this pattern is.

Signed-off-by: Shaan Nobee <[email protected]>

shaan1337 · 2022-05-03T09:49:08Z

Can you just rebase this against the current master branch and for update the PR. That should get us a mostly clean CI run and this should be good to go.

Yup, i've just rebased it.

You should also be able to use ./scripts/zfs-tests.sh -v ./scripts/zfs-tests.sh -v -T mmap to run all the mmap tests.

noted 👍

Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shaan Nobee <[email protected]> Closes openzfs#12662 Closes openzfs#12790

Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Shaan Nobee <[email protected]> Closes #12662 Closes #12790

shaan1337 force-pushed the fix-zfs-putpage branch 12 times, most recently from 8466b2d to 8970d11 Compare December 1, 2021 12:47

behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 1, 2021

shaan1337 force-pushed the fix-zfs-putpage branch 7 times, most recently from 6ae2d88 to 1ea40f7 Compare December 2, 2021 16:58

shaan1337 force-pushed the fix-zfs-putpage branch from 1ea40f7 to 8e9b1a2 Compare December 3, 2021 07:21

shaan1337 marked this pull request as ready for review December 3, 2021 07:27

shaan1337 force-pushed the fix-zfs-putpage branch from 8e9b1a2 to f02f991 Compare December 5, 2021 08:21

behlendorf self-requested a review December 17, 2021 00:16

ahrens assigned tonynguien Jan 4, 2022

Add failing test

92a3cac

Signed-off-by: Shaan Nobee <[email protected]>

shaan1337 force-pushed the fix-zfs-putpage branch from 5d39f75 to 92a3cac Compare May 3, 2022 08:37

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 3, 2022

behlendorf merged commit 411f4a0 into openzfs:master May 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

shaan1337 commented Nov 24, 2021 •

edited

Loading

shaan1337 commented Dec 3, 2021

shaan1337 commented Dec 3, 2021 •

edited

Loading

sempervictus commented Dec 13, 2021

shaan1337 commented Dec 13, 2021 •

edited

Loading

shaan1337 commented Dec 17, 2021 •

edited

Loading

shaan1337 commented May 3, 2022

Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

Conversation

shaan1337 commented Nov 24, 2021 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

shaan1337 commented Dec 3, 2021

shaan1337 commented Dec 3, 2021 • edited Loading

sempervictus commented Dec 13, 2021

shaan1337 commented Dec 13, 2021 • edited Loading

shaan1337 commented Dec 17, 2021 • edited Loading

shaan1337 commented May 3, 2022

shaan1337 commented Nov 24, 2021 •

edited

Loading

shaan1337 commented Dec 3, 2021 •

edited

Loading

shaan1337 commented Dec 13, 2021 •

edited

Loading

shaan1337 commented Dec 17, 2021 •

edited

Loading