
Large sync writes perform worse with slog #14378

Merged: 2 commits merged into openzfs:master on Oct 13, 2023
Conversation

@jwk404 (Contributor) commented on Jan 11, 2023

Motivation and Context

For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil (columns: embedded zil, slog, relative change):

sequential_writes 1m sync ios, 16 threads
  Write IOPS:              1292          438   -66.10%
  Write Bandwidth:      1323570       448910   -66.08%
  Write Latency:       12128400     36330970      3.0x

sequential_writes 1m sync ios, 32 threads
  Write IOPS:              1293          430   -66.74%
  Write Bandwidth:      1324184       441188   -66.68%
  Write Latency:       24486278     74028536      3.0x

Description

The reason is the zil_slog_bulk tunable. In zil_lwb_write_open, once the ZIL data queued for the current commit exceeds 768K (the default value of zil_slog_bulk), the priority of log writes to the slog is downgraded from sync to async; a simplified sketch of this check follows the table below. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the following values for zil_slog_bulk:

    zil_slog_bulk    KiB/s
    1048576         422132
    2097152         478935
    4194304         533645
    8388608         623031
    12582912        827158
    16777216       1038359
    25165824       1142210
    33554432       1211472
    50331648       1292847
    67108864       1308506
    100663296      1306821
    134217728      1304998
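
A simplified sketch of the check being tuned (not a verbatim excerpt of zil_lwb_write_open; symbol names are from the OpenZFS tree, but the exact code differs slightly between releases):

    /*
     * Once the ZIL data queued for the current commit (zl_cur_used)
     * exceeds zil_slog_bulk, further log block writes to the slog are
     * issued at async rather than sync priority.
     */
    zio_priority_t prio;

    if (!lwb->lwb_slog || zilog->zl_cur_used <= zil_slog_bulk)
            prio = ZIO_PRIORITY_SYNC_WRITE;
    else
            prio = ZIO_PRIORITY_ASYNC_WRITE;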

How Has This Been Tested?

At 64M, the results with a slog are now improved to parity with an embedded zil (columns: slog before this change, slog at 64M, relative change):

sequential_writes 1m sync ios, 16 threads
  Write IOPS:               438         1288      2.9x
  Write Bandwidth:       448910      1319062      2.9x
  Write Latency:       36330970     12163408   -66.52%

sequential_writes 1m sync ios, 32 threads
  Write IOPS:               430         1290      3.0x
  Write Bandwidth:       441188      1321693      3.0x
  Write Latency:       74028536     24519698   -66.88%

None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets.
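
In substance, the change is a bump of the tunable's default in module/zfs/zil.c plus the matching zfs.4 update. A hedged sketch of the new default (the variable's exact type and comment in the tree may differ):

    /*
     * Limit SLOG write size per commit executed with synchronous priority.
     * Writes beyond this are issued with lower (asynchronous) priority to
     * limit potential SLOG device abuse by a single active ZIL writer.
     * This PR raises the default from 768K to 64M.
     */
    static uint64_t zil_slog_bulk = 64 * 1024 * 1024;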

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@behlendorf added the labels "Type: Performance" and "Status: Code Review Needed" on Jan 11, 2023
@tonynguien self-assigned this on Jan 13, 2023

@tonynguien (Contributor) left a comment

Raising the limit makes sense given the improvement in disk write performance since the tunable was introduced, though I wonder if it's necessary to keep this limit at all. Perhaps we can revisit that another time.

Change looks great! Thanks for chasing down this problem.

@amotin (Member) commented on Jan 19, 2023

@jwk404 I think your test is flawed, at least on the point that 64MB is the right value. With 32 threads doing 1MB writes you can never get more than 32MB of ZIL log, so setting zil_slog_bulk to 64MB simply disables the functionality. If you used 4MB writes, for example, I bet bigger values could have an effect. The problem with 64MB is that it can barely be reached by sync=always workloads. I think it can only be reached if some application did plenty of writes and then requested sync().

But the real question is why it really helps. What is so bad about async write priority that it hurts performance so much in your case? My first guess is that it can make latency less fair by switching from FIFO to elevator sorting, but I'm not sure why that should cause a 3x slowdown. The second guess is that async priority is limited in the number of concurrent requests by the I/O scheduler. It would be good to find out, for better understanding. Maybe my idea of using async priority there was not the best, or maybe it could be improved.

@jwk404 (Contributor, Author) commented on Jan 26, 2023

With respect to why this helps, I believe your second guess is the correct one. When comparing the output of zpool_influxdb between runs with zil vs. slog without this change, I noticed that the pool with the slog had significantly fewer IOs in the active sync write vdev IO queue. It was roughly a 10x difference.
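
That observation lines up with the per-priority active I/O limits in the vdev I/O scheduler. A sketch of the relevant defaults (as documented in zfs.4; exact names, types, and values in vdev_queue.c may vary by release):

    /*
     * Sync writes may keep up to 10 I/Os active per vdev, while async
     * writes start from a minimum of 2 and only scale toward their
     * maximum as the pool's dirty data grows -- which a ZIL-dominated
     * sync workload does not cause.
     */
    static uint32_t zfs_vdev_sync_write_min_active = 10;
    static uint32_t zfs_vdev_sync_write_max_active = 10;
    static uint32_t zfs_vdev_async_write_min_active = 2;
    static uint32_t zfs_vdev_async_write_max_active = 10;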

I agree that a setting of 64MB effectively disables the functionality. I considered removing it, but thought it better to leave the tunable in case some other workload regressed with the new value. I tested workloads where we have multiple datasets being written synchronously to see if changing the tunable had any effect, but it didn't. Can you describe a workload where this tunable improves performance? I wasn't able to create one.

@jwk404 (Contributor, Author) commented on Feb 7, 2023

@amotin Do you have thoughts on this?

@amotin (Member) commented on Feb 14, 2023

> Can you describe a workload where this tunable improves performance? I wasn't able to create one.

The idea behind this was that if you have two datasets, one latency-sensitive (like iSCSI with sync=always) and another (like async NFS) that generates huge sync bursts from time to time, the first should not suffer from the bursts of the second. With the growth of I/O sizes, queue depths, and speeds, it may be that 768K is indeed too small. I don't think I'd object to bumping it up, I'm just not sure how high. My gut feeling is ~16MB, but it would probably be good to check how big the sync bursts created by modern NFS clients that properly support server cache control (Linux, FreeBSD, etc.) actually are.

As an alternative or addition to this, I think we could investigate completely disabling offset sorting in the I/O scheduler for non-rotational (SSD) vdevs, even for async priorities, while keeping aggregation. SSDs should not care much about offsets when the writes are large, and it should make latency more stable and predictable. IIRC, at the time the I/O scheduler was written, ZFS had no information about whether a device is rotational or not.
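
A purely hypothetical sketch of that idea (not part of this PR; the helper name is illustrative and not an actual OpenZFS symbol, though vdev_nonrot and the priority constants are real):

    /*
     * Illustrative only: sort queued I/O by offset only for async classes
     * on rotational vdevs; everything else stays FIFO.  Aggregation of
     * adjacent I/Os would still happen at issue time.
     */
    static boolean_t
    vdev_queue_sort_by_offset(vdev_t *vd, zio_priority_t p)
    {
            if (p == ZIO_PRIORITY_SYNC_READ || p == ZIO_PRIORITY_SYNC_WRITE)
                    return (B_FALSE);
            return (vd->vdev_nonrot ? B_FALSE : B_TRUE);
    }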

@jwk404 (Contributor, Author) commented on Feb 22, 2023

> With the growth of I/O sizes, queue depths, and speeds, it may be that 768K is indeed too small. I don't think I'd object to bumping it up, I'm just not sure how high. My gut feeling is ~16MB, but it would probably be good to check how big the sync bursts created by modern NFS clients that properly support server cache control (Linux, FreeBSD, etc.) actually are.

I did some further testing along these lines. Using 4 NFS clients, I was able to create relatively high bursts of sync IO. Using a bcc script (and the 64MB change suggested in this PR), I was able to monitor zilog->zl_cur_used and saw it hovering under 100MB, with peaks in the 200-300MB range. With the default, it averaged in the 380MB range with a peak of almost 2GB. So I think 64MB is a reasonable default; I don't think a complex config (mixed workloads in a pool) should be prioritized over a simple config (any pool with a slog).

@amotin (Member) commented on Feb 22, 2023

@jwk404 What burst sizes do you see with one client doing sequential writes? I'd prefer the tunable to be set below that level.

@jwk404 (Contributor, Author) commented on Feb 27, 2023

I did a series of experiments where I launched a 60-second local sync workload and, after 15 seconds, launched a similar 30-second workload over NFS. The table below shows average throughput in MB/s at varying tunable sizes.

    zil_slog_bulk   1 local fs   1 NFS   local + NFS
    768K                   418     459     367 / 116
    16M                    686     802     706 / 618
    32M                    919     811     831 / 633
    64M                   1281     828    1093 / 463

Given that, what do you think about 32M for the tunable?

@amotin (Member) commented on Feb 28, 2023

I asked you to monitor zilog->zl_cur_used when only the NFS client is used, maybe with different intensities. From the "1 NFS" column I can only guess that it is likely somewhere between 768K and 16MB. Also, what was your "local" workload, that it required 64MB of simultaneous requests? The whole idea of this tunable is to separate "bulk" and "interactive" workloads, giving the latter a higher priority. 64MB does not sound interactive to me.

@jwk404 (Contributor, Author) commented on Feb 28, 2023

zilog->zl_cur_used varied from a few MB to peaks as high as 320MB. The average over 3 runs was 42MB.
Both workloads were sequential 1MB writes over 64 threads.

@amotin (Member) commented on Feb 28, 2023

"1MB writes over 64 threads" is not exactly an interactive workload, especially if all the threads are writing to the same dataset. It may not depend on individual request latency as much as would one thread doing 16KB writes.

@amotin (Member) commented on Oct 9, 2023

@jwk404 OK. Could you please update or remove the date change in zfs.4 to resolve the conflict, and let's merge it.

@behlendorf added the label "Status: Accepted" and removed the label "Status: Code Review Needed" on Oct 13, 2023
@behlendorf merged commit c0e5899 into openzfs:master on Oct 13, 2023
5 of 8 checks passed
ixhamza pushed a commit to truenas/zfs that referenced this pull request Nov 20, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023