
Large sync writes perform worse with slog #14378

Merged: 2 commits merged into openzfs:master on Oct 13, 2023
Conversation

@jwk404 (Contributor) commented on Jan 11, 2023

Motivation and Context

For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil (columns: embedded zil, slog, relative change):

sequential_writes 1m sync ios, 16 threads
  Write IOPS:              1292          438   -66.10%
  Write Bandwidth:      1323570       448910   -66.08%
  Write Latency:       12128400     36330970      3.0x

sequential_writes 1m sync ios, 32 threads
  Write IOPS:              1293          430   -66.74%
  Write Bandwidth:      1324184       441188   -66.68%
  Write Latency:       24486278     74028536      3.0x

Description

The reason is the zil_slog_bulk tunable. In zil_lwb_write_open, once the ZIL data queued for the current commit exceeds 768K (the default value of zil_slog_bulk), the priority of log writes to the slog is downgraded from sync to async; a simplified sketch of this check follows the table below. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the following values for zil_slog_bulk:

    zil_slog_bulk    KiB/s
    1048576         422132
    2097152         478935
    4194304         533645
    8388608         623031
    12582912        827158
    16777216       1038359
    25165824       1142210
    33554432       1211472
    50331648       1292847
    67108864       1308506
    100663296      1306821
    134217728      1304998
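
A simplified sketch of the check being tuned (not a verbatim excerpt of zil_lwb_write_open; symbol names are from the OpenZFS tree, but the exact code differs slightly between releases):

    /*
     * Once the ZIL data queued for the current commit (zl_cur_used)
     * exceeds zil_slog_bulk, further log block writes to the slog are
     * issued at async rather than sync priority.
     */
    zio_priority_t prio;

    if (!lwb->lwb_slog || zilog->zl_cur_used <= zil_slog_bulk)
            prio = ZIO_PRIORITY_SYNC_WRITE;
    else
            prio = ZIO_PRIORITY_ASYNC_WRITE;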

How Has This Been Tested?

At 64M, the results with a slog are now improved to parity with an embedded zil (columns: slog before this change, slog at 64M, relative change):

sequential_writes 1m sync ios, 16 threads
  Write IOPS:               438         1288      2.9x
  Write Bandwidth:       448910      1319062      2.9x
  Write Latency:       36330970     12163408   -66.52%

sequential_writes 1m sync ios, 32 threads
  Write IOPS:               430         1290      3.0x
  Write Bandwidth:       441188      1321693      3.0x
  Write Latency:       74028536     24519698   -66.88%

None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets.
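
In substance, the change is a bump of the tunable's default in module/zfs/zil.c plus the matching zfs.4 update. A hedged sketch of the new default (the variable's exact type and comment in the tree may differ):

    /*
     * Limit SLOG write size per commit executed with synchronous priority.
     * Writes beyond this are issued with lower (asynchronous) priority to
     * limit potential SLOG device abuse by a single active ZIL writer.
     * This PR raises the default from 768K to 64M.
     */
    static uint64_t zil_slog_bulk = 64 * 1024 * 1024;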

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@behlendorf added the labels "Type: Performance" and "Status: Code Review Needed" on Jan 11, 2023
@tonynguien self-assigned this on Jan 13, 2023

@tonynguien (Contributor) left a comment

Raising the limit makes sense given the improvement in disk write performance since the tunable was introduced, though I wonder if it's necessary to keep this limit at all. Perhaps we can revisit that another time.

Change looks great! Thanks for chasing down this problem.

@amotin (Member) commented on Jan 19, 2023

@jwk404 I think your test is flawed, at least on the point that 64MB is the right value. With 32 threads doing 1MB writes you can never get more than 32MB of ZIL log, so setting zil_slog_bulk to 64MB simply disables the functionality. If you used 4MB writes, for example, I bet bigger values could have an effect. The problem with 64MB is that it can barely be reached by sync=always workloads. I think it can only be reached if some application did plenty of writes and then requested sync().

But the real question is why it really helps. What is so bad about async write priority that it hurts performance so much in your case? My first guess is that it can make latency less fair by switching from FIFO to elevator sorting, but I'm not sure why that should cause a 3x slowdown. The second guess is that async priority is limited in the number of concurrent requests by the I/O scheduler. It would be good to find out, for better understanding. Maybe my idea of using async priority there was not the best, or maybe it could be improved.

@jwk404 (Contributor, Author) commented on Jan 26, 2023

With respect to why this helps, I believe your second guess is the correct one. When comparing the output of zpool_influxdb between runs with zil vs. slog without this change, I noticed that the pool with the slog had significantly fewer IOs in the active sync write vdev IO queue. It was roughly a 10x difference.
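
That observation lines up with the per-priority active I/O limits in the vdev I/O scheduler. A sketch of the relevant defaults (as documented in zfs.4; exact names, types, and values in vdev_queue.c may vary by release):

    /*
     * Sync writes may keep up to 10 I/Os active per vdev, while async
     * writes start from a minimum of 2 and only scale toward their
     * maximum as the pool's dirty data grows -- which a ZIL-dominated
     * sync workload does not cause.
     */
    static uint32_t zfs_vdev_sync_write_min_active = 10;
    static uint32_t zfs_vdev_sync_write_max_active = 10;
    static uint32_t zfs_vdev_async_write_min_active = 2;
    static uint32_t zfs_vdev_async_write_max_active = 10;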

I agree that a setting of 64MB effectively disables the functionality. I considered removing it, but thought it better to leave the tunable in case some other workload regressed with the new value. I tested workloads where we have multiple datasets being written synchronously to see if changing the tunable had any effect, but it didn't. Can you describe a workload where this tunable improves performance? I wasn't able to create one.

@jwk404 (Contributor, Author) commented on Feb 7, 2023

@amotin Do you have thoughts on this?

@amotin (Member) commented on Feb 14, 2023

> Can you describe a workload where this tunable improves performance? I wasn't able to create one.

The idea behind this was that if you have two datasets, one latency-sensitive (like iSCSI with sync=always) and another (like async NFS) that generates huge sync bursts from time to time, the first should not suffer from the bursts of the second. With the growth of I/O sizes, queue depths, and speeds, it may be that 768K is indeed too small. I don't think I'd object to bumping it up, I'm just not sure how high. My gut feeling is ~16MB, but it would probably be good to check how big the sync bursts created by modern NFS clients that properly support server cache control (Linux, FreeBSD, etc.) actually are.

As an alternative or addition to this, I think we could investigate completely disabling offset sorting in the I/O scheduler for non-rotational (SSD) vdevs, even for async priorities, while keeping aggregation. SSDs should not care much about offsets when the writes are large, and it should make latency more stable and predictable. IIRC, at the time the I/O scheduler was written, ZFS had no information about whether a device is rotational or not.
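
A purely hypothetical sketch of that idea (not part of this PR; the helper name is illustrative and not an actual OpenZFS symbol, though vdev_nonrot and the priority constants are real):

    /*
     * Illustrative only: sort queued I/O by offset only for async classes
     * on rotational vdevs; everything else stays FIFO.  Aggregation of
     * adjacent I/Os would still happen at issue time.
     */
    static boolean_t
    vdev_queue_sort_by_offset(vdev_t *vd, zio_priority_t p)
    {
            if (p == ZIO_PRIORITY_SYNC_READ || p == ZIO_PRIORITY_SYNC_WRITE)
                    return (B_FALSE);
            return (vd->vdev_nonrot ? B_FALSE : B_TRUE);
    }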

@jwk404 (Contributor, Author) commented on Feb 22, 2023

> With the growth of I/O sizes, queue depths, and speeds, it may be that 768K is indeed too small. I don't think I'd object to bumping it up, I'm just not sure how high. My gut feeling is ~16MB, but it would probably be good to check how big the sync bursts created by modern NFS clients that properly support server cache control (Linux, FreeBSD, etc.) actually are.

I did some further testing along these lines. Using 4 NFS clients, I was able to create relatively high bursts of sync IO. Using a bcc script (and the 64MB change suggested in this PR), I was able to monitor zilog->zl_cur_used and saw it hovering under 100MB, with peaks in the 200-300MB range. With the default, it averaged in the 380MB range with a peak of almost 2GB. So I think 64MB is a reasonable default; I don't think a complex config (mixed workloads in a pool) should be prioritized over a simple config (any pool with a slog).

@amotin (Member) commented on Feb 22, 2023

@jwk404 What burst sizes do you see with one client doing sequential writes? I'd prefer the tunable to be set below that level.

@jwk404 (Contributor, Author) commented on Feb 27, 2023

I did a series of experiments where I launched a 60-second local sync workload and, after 15 seconds, launched a similar 30-second workload over NFS. The table below shows average throughput in MB/s at varying tunable sizes.

    zil_slog_bulk   1 local fs   1 NFS   local + NFS
    768K                   418     459     367 / 116
    16M                    686     802     706 / 618
    32M                    919     811     831 / 633
    64M                   1281     828    1093 / 463

Given that, what do you think about 32M for the tunable?

@amotin (Member) commented on Feb 28, 2023

I asked you to monitor zilog->zl_cur_used when only the NFS client is used, maybe with different intensities. From the "1 NFS" column I can only guess that it is likely somewhere between 768K and 16MB. Also, what was your "local" workload, that it required 64MB of simultaneous requests? The whole idea of this tunable is to separate "bulk" and "interactive" workloads, giving the latter a higher priority. 64MB does not sound interactive to me.

@jwk404 (Contributor, Author) commented on Feb 28, 2023

zilog->zl_cur_used varied from a few MB to peaks as high as 320MB. The average over 3 runs was 42MB.
Both workloads were sequential 1MB writes over 64 threads.

@amotin (Member) commented on Feb 28, 2023

"1MB writes over 64 threads" is not exactly an interactive workload, especially if all the threads are writing to the same dataset. It may not depend on individual request latency as much as would one thread doing 16KB writes.

@amotin (Member) commented on Oct 9, 2023

@jwk404 OK. Could you please update or remove the date change in zfs.4 to resolve the conflict, and let's merge it.

@behlendorf added the label "Status: Accepted" and removed the label "Status: Code Review Needed" on Oct 13, 2023
@behlendorf merged commit c0e5899 into openzfs:master on Oct 13, 2023
5 of 8 checks passed
ixhamza pushed a commit to truenas/zfs that referenced this pull request Nov 20, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023