zio_compress: introduce max_size threshold #9416

gmelikov · 2019-10-05T19:38:07Z

Created brand new PR because of some problems with rebasing old one #9311, + updated description.

Motivation and Context

87.5% minimum compressratio is way to large, on 1TB with recordsize 16M such artifical limitation may lead to up to 125GB of wasted space in worst case (but you must be very, very unlucky).

So i propose to check only for max ashift of pool, it may save you nearly 99% of compressratio on large recordsizes.

And I think if you enable slow compression on dataset you will prefer better compressratio over decompression absence for data nowadays.

So, pros:

better compressratio
better IO if decompress is faster than disks
we're already compressing everything
it may be tuned for not so compressible data via module parameter

Cons:

if you have data, which used to be written uncompressible due to previous code - you will see better compressratio at the price of more CPU load

Description

Now default compression is lz4, which can stop
compression process by itself on incompressible data.
So we can check only for common sector size.

How Has This Been Tested?

Generate 12% (just as needed to get into 87.5% threshold) compressible file with fio:

# cat fio.conf
[workload]
bs=128k
ioengine=libaio
iodepth=1
numjobs=1
direct=1
runtime=10
size=100m
filename=bigfile
rw=read
thread
unified_rw_reporting=1
group_reporting=1
buffer_compress_percentage=12
refill_buffers
buffer_pattern=0xdeadbeef

zfs set compression=lz4 rpool/test

version	recordsize	ashift	written	diff (from original)
0.7.13	1M	12	100M
master_patched	1M	12	88.8M	~11.2%

So there may be up to ~12.5% compression gain for hardly compressible files.

Tested manually in VM, didn't run actual system on this code yet.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.

codecov · 2019-10-06T04:05:10Z

Codecov Report

Merging #9416 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #9416      +/-   ##
==========================================
+ Coverage   79.15%   79.15%   +<.01%     
==========================================
  Files         416      416              
  Lines      123655   123658       +3     
==========================================
+ Hits        97876    97880       +4     
+ Misses      25779    25778       -1

Flag	Coverage Δ
#kernel	`79.75% <85.71%> (+0.01%)`	⬆️
#user	`66.64% <85.71%> (-0.54%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 685abd5...39d1434. Read the comment docs.

GregorKopka

See inline comments.

Also it should be verified that changing the module parameter won't modify the result for existing data when called from https://github.com/zfsonlinux/zfs/blob/13a4027a7cd68069cb252e94c18ba1e5eb5af1cd/module/zfs/arc.c#L1625 as that would break encryption MAC validation.

Possibly this needs to be a feature tracking the TXG it's activated on to behave differently depending on the age of the data.

man/man5/zfs-module-parameters.5

module/zfs/zio_compress.c

PrivatePuffin · 2019-12-25T20:08:03Z

@gmelikov Thanks for pointing me here.
I can at least confirm that 12% compressible writes still don't get compressed as today on MASTER, so this is not just an issue with 0.7.13.

Good work on this, LGTM, this PR deserves some more love! 👍

edit
To be more clear:
This issue isn't just with extremely large record sizes, with 1M record sizes this can already contribute to significant savings!

module/zfs/zio_compress.c

ahrens · 2020-04-14T15:42:29Z

I would prefer to eliminate the module parameter, and instead "use sector size of this pool as threshold". I.e. always make the threshold 1 sector (spa_max_ashift).

gmelikov · 2020-04-14T17:36:26Z

@ahrens hmm, I agree with you, we don't need to use larger threshold here with lz4 and future zstd.

Will get back to this PR in a week.

codecov-io · 2020-04-18T02:07:28Z

Codecov Report

Merging #9416 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #9416      +/-   ##
==========================================
- Coverage   79.38%   79.38%   -0.01%     
==========================================
  Files         388      388              
  Lines      123392   123396       +4     
==========================================
  Hits        97953    97953              
- Misses      25439    25443       +4

Flag	Coverage Δ
#kernel	`79.83% <87.50%> (-0.10%)`	⬇️
#user	`66.07% <75.00%> (+0.56%)`	⬆️

Impacted Files	Coverage Δ
module/zfs/arc.c	`85.65% <100.00%> (+0.44%)`	⬆️
module/zfs/zio.c	`88.76% <100.00%> (+0.25%)`	⬆️
module/zfs/zio_compress.c	`92.85% <100.00%> (+0.54%)`	⬆️
module/os/linux/spl/spl-zlib.c	`55.35% <0.00%> (-28.58%)`	⬇️
module/os/linux/spl/spl-kmem-cache.c	`75.58% <0.00%> (-8.14%)`	⬇️
module/zfs/dsl_synctask.c	`92.40% <0.00%> (-2.54%)`	⬇️
module/zfs/zil.c	`91.25% <0.00%> (-1.87%)`	⬇️
module/zfs/vdev_indirect.c	`73.83% <0.00%> (-0.84%)`	⬇️
module/os/linux/zfs/vdev_disk.c	`83.27% <0.00%> (-0.73%)`	⬇️
module/zfs/vdev_trim.c	`95.66% <0.00%> (-0.73%)`	⬇️
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7929f3...f7523c3. Read the comment docs.

gmelikov · 2020-04-18T11:22:36Z

Updated,

I've removed module parameter and use max_ashift of pool;
tested interoperability with old code, manually imported pool with new code, created data with 89% effective compression, imported it with old code, scrub/read/write is ok, rewrote it, imported with new code, scrub/read/write ok;
nightly ztest is almost all green, I've got just one ASSERT at ztest.c:6050:ztest_scrub()/home/debian/zfs/cmd/ztest/.libs/ztest(+0x9444)[0x564c51a51444], but looks like it's not related to this patch
I've found only one place in code where we recompress data to checksum it - https://github.com/openzfs/zfs/pull/9416/files#diff-163982413ed6b6861e08633c89a77395R1759 , so looks like encrypted dataset and ARC without compression may trigger a checksum error on pool with 89+% compressed data on old code, @tcaputi could you please help me here and prove it or give some information. I can't reproduce this case, looks like I'm wrong;
I looked through all zio_compress_data() usages and can conclude that if data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold;
Old code (unless of uncompressed ARC and encryption case) doesn't recompress data for checksums, so if that case will be resolved - we can use this patch without feature flag!

tcaputi · 2020-05-07T10:11:52Z

@gmelikov without looking at it too closely i think i might know whats going on. I think there are 2 things as play which are combining to result in the ARC's "recompress before checksumming" to work ok, but I think it is a pretty brittle dependency.

For some quick background, we need to store data in the L2ARC exactly the same as it is on disk in order for the decryption to work. This is why we do the re-compression here in the ARC layer.

When the data gets recompressed, the ARC uses the compression algorithm that it stored in the ARC header. This info comes from the bp when the block first enters the ARC. This patch doesn't change the way compression is done (AFAICT), it just changes the threshold at which we decide the compression is "worth it". However, no matter what that threshold is, if ZFS decides not to compress the block it sets the compression algorithm to ZIO_COMPRESS_OFF. So the encryption code will only recompress it if it was actually compressed. This code does NOT have the same check for whether or not the compression was worth it. It simply relies on the value from the bp, so it ends up doing the right thing in all these cases.

That being said, this discussion is making me wonder if QAT compression support breaks L2ARC with encryption + no compression. I suspect it does, but that is a very niche bug with very minor repercussions in that case (the data simply fails to read from the L2ARC) and checks the main pool instead. Probably a problem for another time.

Anyway, hope that helps.

gmelikov · 2020-05-07T10:33:24Z

@tcaputi Thank you for insight, you're right, if it was compressed - we can use zero threshold here too, so I was wrong here (how glad I am)!

So, this patch is ready to review, I've rebased it.

gmelikov · 2020-05-07T12:59:58Z

module/zfs/arc.c

@@ -1755,7 +1756,7 @@ arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)
 		abd_take_ownership_of_buf(abd, B_TRUE);

 		csize = zio_compress_data(HDR_GET_COMPRESS(hdr),
-		    hdr->b_l1hdr.b_pabd, tmpbuf, lsize);


Nope, steps to get problem here:

Use code from this PR

get 89+% compressed encrypted block

L2ARC attached to pool

Load pool with pre-PR code

zio_compress_data will not compress this block again here due to old 12.5% threshold inside it

Heck, because it's the problem - I'll add a read-only feature flag only to make it safe for this case. In future zio_compress_data will have threshold argument so we won't get in this probem again.

I see, if you move the pool to a system with older software, it doesn't correctly write to the L2ARC, because even though the block was compressed, the old L2ARC-writing code will not compress it if it doesn't save at least 1/8th of the space. This only applies if the old system has disabled compressed ARC, right?

Introducing a readonly-compatible feature flag is a bit unfortunate. It would mean maintaining the old 1/8th code forever. I wonder how common it is to disable compressed ARC, nowadays? It might be reasonable to say that if you disable compressed ARC, and use L2ARC, and bring a newer pool back to old bits, then some of the data won't be able to "stick" in the L2ARC. I mean, we previously considered (but didn't implement) saying that disabling compressed ARC makes L2ARC not work with compressed blocks at all (i.e. removing the recompress-on-L2ARC-write code entirely).

@ahrens yes, you've got the point. The funny thing is that I can't reproduce problems in such theoretic scenario, l2arc works well under manual tests with data from new code, but with old module and disabled ARC compression.

There is a "best effort" thing near arc_hdr_authenticate function, maybe it's the case?

I've tried to add a feature flag, and code began to transform from beautiful and easy to read info ugly if-if monster. I don't want to do this :(

If it's already a problematic and rare case (compressed l2arc with encryption on old code with pool and data from new code), and we are ready to go with this - I propose to merge it as is without a feature flag, PR is ready for it. I can add this comment somewhere (I'd be glad if someone gives me exact place for this).

adamdmoss · 2020-05-08T22:14:01Z

(as an aside, the comments here talk about a >1MB recordsize a few times - I thought 1MB was the upper limit?)

gmelikov · 2020-05-10T21:07:28Z

@adamdmoss by default it is, but technically 16MB is an upper limit now, you may just change this module parameter and voila https://github.com/openzfs/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_max_recordsize

It's good if you have really large block workload with rare read.

behlendorf · 2024-08-20T22:24:06Z

@gmelikov the Fedora build failures do look like they're introduced by this PR.

gmelikov · 2024-08-29T18:16:16Z

@behlendorf not sure what was with fedora runner, after rebase I've got 2 green runs.

But I had flaky tests (they've failed only once):

import_devices_missing
redundancy_draid_spare1 (saw in some other PRs too)

Now default compression is lz4, which can stop compression process by itself on incompressible data. If there are additional size checks - we will only make our compressratio worse. New usable compression thresholds are: - less than BPE_PAYLOAD_SIZE (embedded_data feature); - at least one saved sector. Old 12.5% threshold is left to minimize affect on existing user expectations of CPU utilization. If data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold. So, we don't need a new feature flag here! Signed-off-by: George Melikov <[email protected]>

ZLE compressor needs additional bytes to process d_len argument efficiently. Don't use BPE_PAYLOAD_SIZE as d_len with it before we rework zle compressor somehow. Signed-off-by: George Melikov <[email protected]>

On compression we could be more explicit here for cases where we can not recompress the data. Co-authored-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]>

behlendorf

Thanks for updating this. Looks good, the CI failures were unrelated.

ZLE compressor needs additional bytes to process d_len argument efficiently. Don't use BPE_PAYLOAD_SIZE as d_len with it before we rework zle compressor somehow. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416

On compression we could be more explicit here for cases where we can not recompress the data. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes #9416

Now default compression is lz4, which can stop compression process by itself on incompressible data. If there are additional size checks - we will only make our compressratio worse. New usable compression thresholds are: - less than BPE_PAYLOAD_SIZE (embedded_data feature); - at least one saved sector. Old 12.5% threshold is left to minimize affect on existing user expectations of CPU utilization. If data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold. So, we don't need a new feature flag here! Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes openzfs#9416

ZLE compressor needs additional bytes to process d_len argument efficiently. Don't use BPE_PAYLOAD_SIZE as d_len with it before we rework zle compressor somehow. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes openzfs#9416

On compression we could be more explicit here for cases where we can not recompress the data. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Co-authored-by: Alexander Motin <[email protected]> Signed-off-by: George Melikov <[email protected]> Closes openzfs#9416