[WIP] Remove tag space initialization for rowHash #3426

Closed
wants to merge 17 commits

Conversation

@yoniko (Contributor) commented Jan 13, 2023

Based on #2971, with an added modification that solves the regression in zstd -b5e7 enwik8 -B128K runs.
This is still a WIP; it's up mostly so it can be tested in CI and so other people can review the approach.

The objective here is to remove the initialization of the tag space, as it's costly when dealing with small data.
However, there are two downsides to doing so, one of which is dealt with here:

  1. If the same tag space is reused, it can lead to performance regressions due to hash collisions with previous compressions. To avoid this I added a "salt" to the hash that changes on every match state reset and is XORed into the hash (see the sketch after this list).
  2. Valgrind will alert on usage of uninitialized memory; this isn't solved in this patch.
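
To make the salting idea concrete, here is a minimal standalone sketch; the names and the hash constant are illustrative, not the actual zstd internals:

#include <stdint.h>
#include <string.h>

/* Illustrative only, not zstd's real row hash: a salt that changes on
 * every match state reset is XORed into the hash, so stale tags left in
 * a reused (uninitialized) tag table won't match the new compression. */
static uint32_t salted_row_hash(const uint8_t* ip, uint64_t salt)
{
    uint64_t v;
    memcpy(&v, ip, sizeof(v));               /* read 8 input bytes */
    v = (v * 0x9E3779B185EBCA87ULL) ^ salt;  /* multiply-hash, then salt */
    return (uint32_t)(v >> 32);              /* top bits select row + tag */
}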

Benchmarks in different scenarios are available in this spreadsheet.

@terrelln (Contributor)

> To avoid this I added a "salt" to the hash that changes on every match state reset and is XORed into the hash.

Clever!

> Valgrind will alert on usage of uninitialized memory; this isn't solved in this patch.

I'd recommend initializing the memory once, when it is allocated, but then when memory is reused not re-initializing it. This matches the approach we take with our tables, and will avoid all uninitialized memory accesses.

You should be able to achieve that using the cwksp. You'd probably want to move the allocation above the opt parser space but below the table space. You could add another "phase", like aligned_initialized, that happens after tables but before aligned, and make sure it always returns memory that is initialized to something.
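
A minimal standalone model of that "initialize once, then reuse" behavior (a sketch under assumed semantics, not the actual cwksp API; like cwksp's aligned region, it allocates downward from the end of the workspace):

#include <stddef.h>
#include <string.h>

typedef struct {
    char* end;            /* current allocation frontier (grows downward) */
    char* initOnceStart;  /* lowest address that has ever been zeroed */
} MiniWksp;

/* Zero a range only the first time it is handed out; when the workspace
 * is cleared, `end` is reset but `initOnceStart` is left alone, so reuse
 * skips the memset entirely. */
static void* reserve_init_once(MiniWksp* ws, size_t bytes)
{
    char* const ptr = ws->end - bytes;
    if (ptr < ws->initOnceStart) {  /* part of this range is brand new */
        memset(ptr, 0, (size_t)(ws->initOnceStart - ptr));
        ws->initOnceStart = ptr;    /* remember it is now initialized */
    }
    ws->end = ptr;
    return ptr;
}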

CC @felixhandte

@yoniko (Contributor, Author) commented Jan 14, 2023

> I'd recommend initializing the memory once, when it is allocated, but then when memory is reused not re-initializing it. This matches the approach we take with our tables, and will avoid all uninitialized memory accesses.

I agree that this is probably a better approach than always resetting the space, but it can still carry a performance penalty.
Specifically, when using streaming compression of small data with a CCtx that is only used once.
Maybe it can be paired with another idea @Cyan4973 suggested, which is to use a lower hash log on a Z_STREAM_END directive that ends a small block.

@terrelln (Contributor)

> Specifically, when using streaming compression of small data with a CCtx that is only used once.

Yeah, but in this case we are already zeroing the hash table, which is 4x larger than the tag space. And generally, I'm more concerned about context-reuse performance.

@yoniko (Contributor, Author) commented Jan 14, 2023

> Yeah, but in this case we are already zeroing the hash table, which is 4x larger than the tag space. And generally, I'm more concerned about context-reuse performance.

Are you talking about the indices? There's no real reason to zero those out either.
In any case, you make a good point and suggestion.
This is probably the way to go here for now, and we can go back to single-use context optimizations another time.

@yoniko force-pushed the no-tag-space-init branch 5 times, most recently from 53f926f to ee15d46 on January 24, 2023 05:41
/* ZSTD_wildcopy() is used to copy into the literals buffer,
* so we have to oversize the buffer by WILDCOPY_OVERLENGTH bytes.
*/
zc->seqStore.litStart = ZSTD_cwksp_reserve_buffer(ws, blockSize + WILDCOPY_OVERLENGTH);
Review comment (Contributor):

Can we just continue to call _reserve_buffer() at all these callsites, to distinguish them from allocations that actually require aligned memory? And then in the cwksp implementation, we can have _reserve_buffer() just call _reserve_aligned().
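
Something along these lines, presumably (a sketch with signatures approximated from the existing cwksp code, not a definitive implementation):

/* Keep _reserve_buffer() at the callsites for readability, but make it a
 * thin wrapper that rounds the size up and defers to the aligned allocator. */
MEM_STATIC BYTE* ZSTD_cwksp_reserve_buffer(ZSTD_cwksp* ws, size_t bytes)
{
    return (BYTE*)ZSTD_cwksp_reserve_aligned(ws,
                ZSTD_cwksp_align(bytes, ZSTD_CWKSP_ALIGNMENT_BYTES));
}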

int needTagTableInit = 1;
#ifdef HAS_SECURE_RANDOM
if (forWho == ZSTD_resetTarget_CCtx) {
    size_t randomGenerated = getSecureRandom(&ms->hashSalt, sizeof(ms->hashSalt));
Review comment (Contributor):

I continue to think that you don't need a secure random on each reset (if ever), and instead you just need a nonce that can be incremented on each reset (maybe initialized as a secure random on context creation). As discussed, the speed of small compressions matters.

Have you benchmarked the cost of this call yet?
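
For illustration, the nonce variant could be as cheap as this hypothetical sketch (seed the salt once from a secure source at context creation, then derive each reset's value without a syscall):

#include <stdint.h>

/* Hypothetical, not the shipped code: the salt only needs to differ
 * between resets to defeat stale-tag collisions; it does not need to be
 * freshly unpredictable every time. */
static uint64_t next_salt(uint64_t prev)
{
    return prev + 0x9E3779B97F4A7C15ULL;  /* cheap odd-constant step */
}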

@@ -556,10 +570,11 @@ MEM_STATIC void ZSTD_cwksp_clear(ZSTD_cwksp* ws) {
 #endif

     ws->tableEnd = ws->objectEnd;
-    ws->allocStart = ws->workspaceEnd;
+    ws->allocStart = (void*)((size_t)ws->workspaceEnd & ~(ZSTD_CWKSP_ALIGNMENT_BYTES-1));
+    ws->initOnceStart = ws->workspaceEnd;
Review comment (Contributor):

Doesn't this mean that you are re-init'ing the memory on every compression? The workspace is cleared on every ctx reset, IIRC.

Reply from @yoniko (Contributor, Author):

Yup, this was the wrong fix on my part. I've updated the PR with a better solution, which is to not MSAN-poison the initOnce memory.
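
For reference, a hedged sketch of what "don't MSAN-poison the initOnce memory" can look like (the guard macro and field names are approximate; __msan_unpoison itself is the real MSAN interface):

#include <stddef.h>
#if defined(ZSTD_MEMORY_SANITIZER)
#  include <sanitizer/msan_interface.h>
#endif

/* When the workspace is cleared, the init-once region keeps its contents
 * by design, so mark it as initialized instead of re-poisoning it. */
static void markInitOnceValid(void* initOnceStart, void* workspaceEnd)
{
#if defined(ZSTD_MEMORY_SANITIZER)
    __msan_unpoison(initOnceStart,
                    (size_t)((char*)workspaceEnd - (char*)initOnceStart));
#else
    (void)initOnceStart; (void)workspaceEnd;
#endif
}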

* - Aligned: these buffers are used for various purposes that require 4 byte
* alignment, but don't require any initialization before they're used. These
* buffers are each aligned to 64 bytes.
* - Init once: these buffers require to be initialized at least once before
Review comment (Contributor):

It's not clear to me why introducing the init-once region requires the removal of the unaligned buffer region. I mean, sure, the buffer region would no longer be attached to the unaligned end of the workspace, and those allocations would now sit between two aligned regions. But it would be more compact to pad the alignment of the buffers once at the edges, rather than round each buffer up to 64 byte alignment.

@@ -237,7 +243,7 @@ MEM_STATIC size_t ZSTD_cwksp_bytes_to_align_ptr(void* ptr, const size_t alignBytes)
     size_t const alignBytesMask = alignBytes - 1;
     size_t const bytes = (alignBytes - ((size_t)ptr & (alignBytesMask))) & alignBytesMask;
     assert((alignBytes & alignBytesMask) == 0);
-    assert(bytes != ZSTD_CWKSP_ALIGNMENT_BYTES);
+    assert(bytes < alignBytes);
Review comment (Contributor):

👍

@yoniko force-pushed the no-tag-space-init branch 5 times, most recently from 0a759ed to 0014780 on January 25, 2023 22:50
…tag space initialization.

Add salting to the hash to reduce collisions when re-using the hash table across multiple compressions.
Salting the hash makes it so hashes from previous compressions won't match hashes of similar data in the current compression.
  1. Converted all unaligned buffer allocations to aligned buffer allocations
  2. Added init-once aligned memory buffers

- Moved the tag table to an init-once allocation when strong random is available

- Bugfix in hash salting
- Fix off-by-one bug in `ZSTD_cwksp_owns_buffer`
- Better handle MSAN for init-once memory
- Allow passing custom MOREFLAGS into msan-% targets in the Makefile
@yoniko (Contributor, Author) commented Jan 26, 2023

Due to its complexity vs. the added benefits, it has been decided to put this PR on hold.

@rincebrain commented:
Curious if there are any plans for an alternate approach. I was playing a bit with updating the version of zstd in OpenZFS, and it seems like this regression might be why I'm seeing a really terrible regression in performance at levels 9 and 12, to the point that using 15 was twice as fast as 12 in my early tests.

I'm going to rework the code to be cleaner so I can post things for people to see and experiment with, and confirm I didn't replace memcpy with a small woodland creature hand-copying bytes. I just wanted to ask whether there are plans for this that I should wait on, or whether I should try to come up with another solution that doesn't regress performance that badly.

@yoniko (Contributor, Author) commented Feb 19, 2023

@rincebrain - I doubt it, unless you are using streaming compression of small data without specifying an end directive (I haven't found this kind of usage in OpenZFS).
If you want to measure the impact this PR could provide, you might want to just apply #2971 and benchmark.

@rincebrain commented Feb 19, 2023

I did; it gets better but still not the same.

OpenZFS hands zstd multiple-of-two records between let's say 4k and 16M, not using the streaming interface, always independent startup/teardown.

I can, and will, go bisect between 1.4.5 and now and confirm which version it was that this changed, but since it's spending all its time in ZSTD_RowFindBestMatch, it seemed a reasonable guess, so I figured I'd ask.

@yoniko (Contributor, Author) commented Feb 22, 2023

> I did; it gets better but still not the same.

How much better for this one change?

> OpenZFS hands zstd multiple-of-two records between let's say 4k and 16M, not using the streaming interface, always independent startup/teardown.

Are you seeing the improvements across the different sizes or only for specific size ranges?
For context: different sizes result in different compression strategies for zstd.

> I can, and will, go bisect between 1.4.5 and now and confirm which version it was that this changed, but since it's spending all its time in ZSTD_RowFindBestMatch, it seemed a reasonable guess, so I figured I'd ask.

RowHash didn't exist in 1.4.5, so it'd make sense that you'd see a big change from the version where it was introduced (1.5.0).
You can also try disabling RowHash by setting the parameter ZSTD_c_useRowMatchFinder to ZSTD_ps_disable; a sketch of that follows below.
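
For reference, a minimal sketch of flipping that switch through the API (ZSTD_c_useRowMatchFinder and ZSTD_ps_disable live in the experimental section of zstd.h, hence ZSTD_STATIC_LINKING_ONLY):

#define ZSTD_STATIC_LINKING_ONLY  /* exposes ZSTD_c_useRowMatchFinder */
#include <zstd.h>

/* Compress a buffer with the row match finder forced off, to A/B it
 * against the default behavior. */
static size_t compress_no_rowhash(void* dst, size_t dstCap,
                                  const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t ret;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useRowMatchFinder, ZSTD_ps_disable);
    ret = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;
}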

Finally, it sounds like there might be a bigger issue here; it'd be a good idea to open an issue with some more information so it can be tracked properly.

@yoniko (Contributor, Author) commented Mar 7, 2023

Hi @rincebrain,
I'm trying to look into this some more, but I don't have an OpenZFS setup and can't reproduce.
Did you bisect?
Is there a simple way for me to reproduce the regression you are seeing?

@rincebrain commented:
Even if you had one, it wouldn't help, since OpenZFS ships 1.4.5, so you'd need to slam it in there.

No, I had been looking at other things recently; I'll prioritize getting back to this.

@yoniko (Contributor, Author) commented Mar 8, 2023

> Even if you had one, it wouldn't help, since OpenZFS ships 1.4.5, so you'd need to slam it in there.
>
> No, I had been looking at other things recently; I'll prioritize getting back to this.

I meant a dev setup.
In any case, I can't seem to reproduce the issue right now.
We're targeting another release (1.5.5) in the next few weeks; if you can help us reproduce the issue, we'll look into including a fix in the new release.

@rincebrain commented:
I've got it reproducing at the moment; I'm working on narrowing down a more useful test case than "feed 30 GB in and notice it takes markedly longer".

Though I will say, in my testing, flipping ZSTD_c_useRowMatchFinder to 0 removes the difference above the noise threshold...

@yoniko (Contributor, Author) commented Mar 8, 2023

> Though I will say, in my testing, flipping ZSTD_c_useRowMatchFinder to 0 removes the difference above the noise threshold...

That's great; it means we are certain this is where the issue is.
Generally, I'd expect the row match finder to be faster than the alternative, so it's worth looking into.

Can you also post your compile flags?

@rincebrain commented Mar 8, 2023

I'll do you one better. (All of these were built purely by just running "make -j" on a vanilla checkout on a Ryzen 5900X. --single-thread and -B1048576 are because ZFS chunks things up into fixed records that it compresses independently, and each compression run is single-threaded for it.)

$ ~/zstd_all/zstd-1.5.4/zstd -b1 -e15 --single-thread -B1048576 evil_zstd_repro
 1#evil_zstd_repro   :   7340032 ->   7340270 (x1.000), 1674.4 MB/s, 58756.2 MB/s
 2#evil_zstd_repro   :   7340032 ->   5751582 (x1.276), 2535.9 MB/s, 11283.1 MB/s
 3#evil_zstd_repro   :   7340032 ->   5757047 (x1.275), 2370.7 MB/s, 11896.4 MB/s
 4#evil_zstd_repro   :   7340032 ->   5757047 (x1.275), 2441.8 MB/s, 11904.0 MB/s
 5#evil_zstd_repro   :   7340032 ->   5719120 (x1.283),  200.2 MB/s, 13318.1 MB/s
 6#evil_zstd_repro   :   7340032 ->   5719069 (x1.283),  197.5 MB/s, 12677.9 MB/s
 7#evil_zstd_repro   :   7340032 ->   5719010 (x1.283),  190.0 MB/s, 13340.3 MB/s
 8#evil_zstd_repro   :   7340032 ->   5718994 (x1.283),  189.0 MB/s, 13347.2 MB/s
 9#evil_zstd_repro   :   7340032 ->   5718994 (x1.283),  173.6 MB/s, 13341.5 MB/s
10#evil_zstd_repro   :   7340032 ->   5718986 (x1.283),  177.5 MB/s, 13335.9 MB/s
11#evil_zstd_repro   :   7340032 ->   5718986 (x1.283),  185.0 MB/s, 13325.1 MB/s
12#evil_zstd_repro   :   7340032 ->   5718986 (x1.283),  184.9 MB/s, 13323.0 MB/s
13#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  225.4 MB/s, 13259.3 MB/s
14#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  226.4 MB/s, 13263.4 MB/s
15#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  218.5 MB/s, 13254.4 MB/s
$ ~/zstd_all/zstd-1.5.4/zstd -b1 -e15 --single-thread -B1048576 --no-row-match-finder evil_zstd_repro
 1#evil_zstd_repro   :   7340032 ->   7340270 (x1.000), 2377.6 MB/s, 58086.6 MB/s
 2#evil_zstd_repro   :   7340032 ->   5751582 (x1.276), 4295.5 MB/s, 11169.7 MB/s
 3#evil_zstd_repro   :   7340032 ->   5757047 (x1.275), 4089.6 MB/s, 10887.6 MB/s
 4#evil_zstd_repro   :   7340032 ->   5757047 (x1.275), 4253.6 MB/s, 10090.1 MB/s
 5#evil_zstd_repro   :   7340032 ->   5713005 (x1.285),  587.0 MB/s, 13188.2 MB/s
 6#evil_zstd_repro   :   7340032 ->   5712943 (x1.285),  663.0 MB/s, 11521.7 MB/s
 7#evil_zstd_repro   :   7340032 ->   5712948 (x1.285),  685.5 MB/s, 13215.1 MB/s
 8#evil_zstd_repro   :   7340032 ->   5712936 (x1.285),  672.1 MB/s, 13195.1 MB/s
 9#evil_zstd_repro   :   7340032 ->   5712880 (x1.285),  607.7 MB/s, 13204.6 MB/s
10#evil_zstd_repro   :   7340032 ->   5712880 (x1.285),  705.4 MB/s, 13185.5 MB/s
11#evil_zstd_repro   :   7340032 ->   5712880 (x1.285),  700.5 MB/s, 13222.8 MB/s
12#evil_zstd_repro   :   7340032 ->   5712880 (x1.285),  700.7 MB/s, 12062.6 MB/s
13#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  398.0 MB/s, 11147.4 MB/s
14#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  444.7 MB/s, 11139.7 MB/s
15#evil_zstd_repro   :   7340032 ->   5712884 (x1.285),  410.7 MB/s, 13195.7 MB/s
$ ~/zstd_all/zstd-1.4.5/zstd -b1 -e15 --single-thread -B1048576 evil_zstd_repro
 1#evil_zstd_repro   :   7340032 ->   7340270 (1.000),1346.7 MB/s ,58277.3 MB/s
 2#evil_zstd_repro   :   7340032 ->   5855542 (1.254),2149.3 MB/s ,13199.2 MB/s
 3#evil_zstd_repro   :   7340032 ->   5757003 (1.275),2066.0 MB/s ,13315.1 MB/s
 4#evil_zstd_repro   :   7340032 ->   5757003 (1.275),2096.2 MB/s ,13321.7 MB/s
 5#evil_zstd_repro   :   7340032 ->   5713014 (1.285), 272.4 MB/s ,13750.2 MB/s
 6#evil_zstd_repro   :   7340032 ->   5713006 (1.285), 269.2 MB/s ,13744.7 MB/s
 7#evil_zstd_repro   :   7340032 ->   5712954 (1.285), 265.5 MB/s ,13742.0 MB/s
 8#evil_zstd_repro   :   7340032 ->   5712938 (1.285), 262.5 MB/s ,13702.3 MB/s
 9#evil_zstd_repro   :   7340032 ->   5712935 (1.285), 270.2 MB/s ,13738.0 MB/s
10#evil_zstd_repro   :   7340032 ->   5712879 (1.285), 278.7 MB/s ,13750.6 MB/s
11#evil_zstd_repro   :   7340032 ->   5712879 (1.285), 268.0 MB/s ,13746.0 MB/s
12#evil_zstd_repro   :   7340032 ->   5712879 (1.285), 274.7 MB/s ,13818.3 MB/s
13#evil_zstd_repro   :   7340032 ->   5712881 (1.285), 223.4 MB/s ,13777.5 MB/s
14#evil_zstd_repro   :   7340032 ->   5712881 (1.285), 224.7 MB/s ,13781.0 MB/s
15#evil_zstd_repro   :   7340032 ->   5712881 (1.285), 224.7 MB/s ,13777.3 MB/s

Have fun

@yoniko (Contributor, Author) commented Mar 9, 2023

Thank you for reporting; we have reproduced the issue and pinpointed its origin.
A fix should be up in the next few days.
See issue #3539 for tracking.

@yoniko (Contributor, Author) commented Mar 9, 2023

This PR is deprecated; other PRs have been put up in its place.

@yoniko closed this Mar 9, 2023