Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: Speed up WB_SYNC_NONE writebacks when a WB_SYNC_ALL writeback occurs simultaneously #12790

Merged
merged 2 commits into from
May 3, 2022

Conversation

shaan1337
Copy link
Contributor

@shaan1337 shaan1337 commented Nov 24, 2021

Motivation and Context

Fixes #12662

Description

Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared.

This PR deals with 2 cases:

  1. No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete.

One such example is a race condition that can occur in the linux kernel in filemap_write_and_wait_range(). Between the call to __filemap_fdatawrite_range and the call to filemap_fdatawait_range, it is possible that the WB_SYNC_ALL writeback completes, and another WB_SYNC_NONE writeback starts (e.g a background writeback by the kernel). The WB_SYNC_ALL writeback then ends up waiting for the WB_SYNC_NONE's writeback to complete.

  1. A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback.

One such example can occur if a WB_SYNC_NONE writeback starts at this point:

If a WB_SYNC_NONE writeback starts before entering filemap_write_and_wait_range and manages to clear the dirty flags on the page, the filemap_write_and_wait_range call will make an early return here in __filemap_fdatawrite_range:
https://github.com/torvalds/linux/blob/cb690f5238d71f543f4ce874aa59237cf53a877c/mm/filemap.c#L403

The function then calls filemap_fdatawait_range() and it'll wait for several seconds.

Both of the above cases have been identified with the repro application (included as a failing test) and through tracing.

The fix works by carefully keeping track of active sync/non-sync writebacks and makes sure that:
i) a sync writeback will speed up any active non-sync writeback by doing a commit
ii) a non-sync writeback will speed itself up by doing a commit if there are any active sync writebacks

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@shaan1337 shaan1337 force-pushed the fix-zfs-putpage branch 12 times, most recently from 8466b2d to 8970d11 Compare December 1, 2021 12:47
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 1, 2021
@shaan1337 shaan1337 force-pushed the fix-zfs-putpage branch 7 times, most recently from 6ae2d88 to 1ea40f7 Compare December 2, 2021 16:58
@shaan1337
Copy link
Contributor Author

ZFS test suite results:
20211202T174007.tar.gz

Results Summary
PASS	 1392
FAIL	 106
SKIP	  87

Running Time:	04:27:40
Percent passed:	87.8%
Log directory:	/var/tmp/test_results/20211202T174007

Tests with results other than PASS that are expected:
    FAIL alloc_class/alloc_class_009_pos (Known issue)
    FAIL alloc_class/alloc_class_010_pos (Known issue)
    FAIL alloc_class/alloc_class_013_pos (https://github.com/openzfs/zfs/issues/11888)
    FAIL casenorm/mixed_formd_delete (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_formd_lookup (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_formd_lookup_ci (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/mixed_none_lookup_ci (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/sensitive_formd_delete (https://github.com/openzfs/zfs/issues/7633)
    FAIL casenorm/sensitive_formd_lookup (https://github.com/openzfs/zfs/issues/7633)
    SKIP chattr/setup (Test user execute permissions required for utilities)
    FAIL cli_root/zfs_copies/zfs_copies_003_pos (https://github.com/openzfs/zfs/issues/12301)
    FAIL cli_root/zfs_rename/zfs_rename_002_pos (Known issue)
    FAIL cli_root/zfs_rollback/zfs_rollback_001_pos (Known issue)
    FAIL cli_root/zfs_rollback/zfs_rollback_002_pos (Known issue)
    FAIL cli_root/zpool_add/zpool_add_004_pos (Known issue)
    FAIL cli_root/zpool_expand/zpool_expand_001_pos (Known issue)
    FAIL cli_root/zpool_import/import_rewind_device_replaced (Arbitrary pool rewind is not guaranteed)
    SKIP cli_root/zpool_import/zpool_import_missing_003_pos (https://github.com/openzfs/zfs/issues/6839)
    SKIP crtime/crtime_001_pos (Kernel statx(2) system call required on Linux)
    SKIP delegate/setup (Test user execute permissions required for utilities)
    FAIL history/history_006_neg (https://github.com/openzfs/zfs/issues/5657)
    FAIL history/history_008_pos (Known issue)
    SKIP history/history_010_pos (Test user execute permissions required for utilities)
    FAIL largest_pool/largest_pool_001_pos (Known issue)
    FAIL no_space/enospc_002_pos (Exact free space reporting is not guaranteed)
    SKIP pam/setup (pamtester might be not available)
    SKIP projectquota/setup (Test user execute permissions required for utilities)
    SKIP pyzfs/pyzfs_unittest (Python modules missing: python-cffi)
    FAIL refreserv/refreserv_004_pos (Known issue)
    FAIL refreserv/refreserv_raidz (Known issue)
    SKIP removal/removal_with_zdb (Known issue)
    FAIL rsend/rsend_007_pos (Known issue)
    SKIP rsend/rsend_008_pos (https://github.com/openzfs/zfs/issues/6066)
    FAIL rsend/rsend_010_pos (Known issue)
    FAIL rsend/rsend_011_pos (Known issue)
    FAIL rsend/send-c_volume (https://github.com/openzfs/zfs/issues/6087)
    FAIL snapshot/clone_001_pos (Known issue)
    FAIL snapshot/rollback_003_pos (Known issue)
    SKIP userquota/setup (Test user execute permissions required for utilities)
    FAIL vdev_zaps/vdev_zaps_007_pos (Known issue)
    FAIL zvol/zvol_misc/zvol_misc_snapdev (https://github.com/openzfs/zfs/issues/12621)
    FAIL zvol/zvol_misc/zvol_misc_volmode (Known issue)

Tests with result of PASS that are unexpected:

Tests with results other than PASS that are unexpected:
    FAIL cli_root/zfs_copies/zfs_copies_006_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_001_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_005_neg (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_008_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_009_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_010_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_011_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_012_pos (expected PASS)
    FAIL cli_root/zfs_destroy/zfs_destroy_013_neg (expected PASS)
    FAIL cli_root/zfs_mount/zfs_mount_013_pos (expected PASS)
    FAIL cli_root/zfs_mount/zfs_multi_mount (expected PASS)
    FAIL cli_root/zfs_program/zfs_program_json (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_001_pos (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_004_neg (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_005_neg (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_006_pos (expected PASS)
    FAIL cli_root/zfs_rename/zfs_rename_007_pos (expected PASS)
    FAIL cli_root/zfs_set/readonly_001_pos (expected PASS)
    FAIL cli_root/zfs_unmount/zfs_unmount_nested (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_014_neg (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_015_neg (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_features_007_pos (expected PASS)
    FAIL cli_root/zpool_create/zpool_create_features_008_pos (expected PASS)
    FAIL cli_root/zpool_destroy/zpool_destroy_001_pos (expected PASS)
    FAIL cli_root/zpool_import/zpool_import_errata3 (expected PASS)
    FAIL cli_root/zpool_import/zpool_import_errata4 (expected PASS)
    FAIL cli_root/zpool_status/zpool_status_features_001_pos (expected PASS)
    FAIL cli_root/zpool_upgrade/zpool_upgrade_features_001_pos (expected PASS)
    FAIL cli_user/misc/zfs_allow_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_get_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_set_001_neg (expected PASS)
    FAIL cli_user/misc/zfs_unallow_001_neg (expected PASS)
    SKIP cli_user/misc/zfs_upgrade_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_detach_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_export_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_get_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_remove_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_status_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_upgrade_001_neg (expected PASS)
    FAIL cli_user/misc/zpool_wait_privilege (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_001_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_002_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_003_pos (expected PASS)
    FAIL cli_user/zfs_list/zfs_list_007_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_disable (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_homedir (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_-c_searchpath (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_001_neg (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_002_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_004_pos (expected PASS)
    FAIL cli_user/zpool_iostat/zpool_iostat_005_pos (expected PASS)
    FAIL cli_user/zpool_list/zpool_list_001_pos (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_disable (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_homedir (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_-c_searchpath (expected PASS)
    FAIL cli_user/zpool_status/zpool_status_003_pos (expected PASS)
    FAIL devices/devices_001_pos (expected PASS)
    FAIL events/zed_fd_spill (expected PASS)
    FAIL fault/auto_replace_001_pos (expected PASS)
    FAIL limits/filesystem_limit (expected PASS)
    FAIL limits/snapshot_limit (expected PASS)
    FAIL nopwrite/nopwrite_volume (expected PASS)
    FAIL redacted_send/redacted_mounts (expected PASS)
    FAIL redacted_send/redacted_volume (expected PASS)
    FAIL rsend/recv_dedup_encrypted_zvol (expected PASS)
    FAIL rsend/rsend_004_pos (expected PASS)
    FAIL rsend/send-c_stream_size_estimate (expected PASS)
    FAIL rsend/send-wR_encrypted_zvol (expected PASS)
    FAIL slog/slog_replay_volume (expected PASS)
    FAIL xattr/xattr_004_pos (expected PASS)
    FAIL zvol/zvol_ENOSPC/setup (expected PASS)
    SKIP zvol/zvol_ENOSPC/zvol_ENOSPC_001_pos (expected PASS)
    FAIL zvol/zvol_misc/zvol_misc_002_pos (expected PASS)
    FAIL zvol/zvol_misc/zvol_misc_rename_inuse (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_001_pos (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_002_pos (expected PASS)
    FAIL zvol/zvol_swap/zvol_swap_004_pos (expected PASS)

@shaan1337
Copy link
Contributor Author

shaan1337 commented Dec 3, 2021

ok, ready for review. the last push just added tracepoints for the new counters so it shouldn't affect the zfs test suite results.

some things I've not tested yet:

  • the changes to FreeBSD fsync
  • the new tracepoints (not sure what the full tracepoint names are)

@shaan1337 shaan1337 marked this pull request as ready for review December 3, 2021 07:27
@sempervictus
Copy link
Contributor

@shaan1337: in practice, does this end up increasing the number of IOPs issued to backing drives, and is there any increase in the number of partial-dnode-writes in the event that there are more sync-outs of small data volumes which may impact space efficiency over time?

@shaan1337
Copy link
Contributor Author

shaan1337 commented Dec 13, 2021

@sempervictus in the absence of a concurrent WB_SYNC_NONE page writeback (usually due to a kernel-triggered background page writeback), a WB_SYNC_ALL page writeback (e.g triggered by msync) would have called zil_commit() anyway for data integrity purposes.

The fix works by calling zil_commit() whenever there are concurrent WB_SYNC_ALL and WB_SYNC_NONE writebacks. So in principle, it should not affect the layers underneath in any negative way. Even if there are extra zil_commit() calls, they would be without effect most of the time since the transactions in the queue would have already been committed.

Also, in practice the issue itself (simultaneous WB_SYNC_NONE and WB_SYNC_ALL writebacks) occurs very rarely (~once per day on a busy system with continuous writes), so the extra zil_commit(), if any, will be negligible.

@behlendorf behlendorf self-requested a review December 17, 2021 00:16
@shaan1337
Copy link
Contributor Author

shaan1337 commented Dec 17, 2021

@sempervictus one important case worth mentioning though is when you're doing a partial msync of a file. since we keep track of active sync/non-sync page writes at the znode level (not at the page level), this means that all dirty pages related to the file will be written to stable storage in the event of a concurrent WB_SYNC_NONE and WB_SYNC_ALL writeback.

This would apply mainly when one or more parts of a file are msync-ed for data integrity purposes while the other parts are not (they are left to the OS to write back). I'm not sure how common this pattern is.

Signed-off-by: Shaan Nobee <[email protected]>
@shaan1337
Copy link
Contributor Author

Can you just rebase this against the current master branch and for update the PR. That should get us a mostly clean CI run and this should be good to go.

Yup, i've just rebased it.

You should also be able to use ./scripts/zfs-tests.sh -v ./scripts/zfs-tests.sh -v -T mmap to run all the mmap tests.

noted 👍

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 3, 2022
@behlendorf behlendorf merged commit 411f4a0 into openzfs:master May 3, 2022
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 12, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
beren12 pushed a commit to beren12/zfs that referenced this pull request Sep 19, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 2, 2023
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes openzfs#12662
Closes openzfs#12790
tonyhutter pushed a commit that referenced this pull request Jun 5, 2023
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Shaan Nobee <[email protected]>
Closes #12662
Closes #12790
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

msync occasionally takes several seconds (up to ~15 seconds)
6 participants