
[WIP] fix #2966 part 2 : do not initialize tag space #2971

Closed

Conversation

@Cyan4973 (Contributor) commented Jan 4, 2022

This diff disables tag initialization when using the rowHash mode. This initialization was unconditional, but it becomes the dominating operation when compressing small data in streaming mode (see issue #2966).

I could not find a good reason to initialize tags. It just makes tag values start at 0, but 0 is a regular tag value, no more significant than any other. Worst case, a stale tag gives a wrong hint of match presence, but even that should be filtered out by distance analysis, which remains active through index validation. So this won't impact the compression result.

Initially, I suspected this wouldn't work, because the tag space is 2x larger than it should be, suggesting the additional space is used for something other than tag values, like determining the starting position in the row (which would be an overkill memory budget, but that's a different topic). To my surprise, the change passes all tests successfully, suggesting rowHash is resilient even to a random start position.
Edit: it finally breaks on the msan + fuzzer test below, so that's worth looking into.

The end result is significant. When combined with #2969, the compression speed of rowHash on small data increases dramatically, as can be seen below (#2969 is required, as otherwise the impact of tag initialization is just lost as part of a larger initialization issue).

The following measurement is taken on a Core i7-9700K (turbo disabled) with fullbench -b41, using geldings.txt (a small text file) as a sample. The test corresponds to a scenario using ZSTD_compressStream() without the benefit of knowing the small sample size beforehand.

| level | v1.5.1 | #2969 + this PR | comment |
|------:|-------:|----------------:|:--------|
| 1 | 101 MB/s | 113 MB/s | |
| 2 | 67 MB/s | 112 MB/s | |
| 3 | 31 MB/s | 94 MB/s | |
| 4 | 14 MB/s | 93 MB/s | |
| 5 | 9 MB/s | 54 MB/s | rowHash |
| 6 | 8.7 MB/s | 50 MB/s | rowHash |
| 7 | 4.6 MB/s | 50 MB/s | rowHash |
| 8 | 4.5 MB/s | 48 MB/s | rowHash |
| 9 | 1.7 MB/s | 48 MB/s | rowHash |
| 10 | 0.8 MB/s | 45 MB/s | rowHash |
| 11 | 0.8 MB/s | 39 MB/s | rowHash |
| 12 | 0.4 MB/s | 39 MB/s | rowHash |
| 13 | 0.6 MB/s | 38 MB/s | |
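For concreteness, the measured scenario looks roughly like the sketch below, written against ZSTD_compressStream2() (the modern entry point wrapping ZSTD_compressStream()). Buffer handling and names are illustrative, not the fullbench code itself:

```c
#include <zstd.h>

/* Sketch of the benchmarked scenario: streaming compression of a small
 * buffer without declaring its size up front, so the library sizes its
 * tables for the general case. Error handling is minimal. */
static size_t streamCompressSmall(ZSTD_CCtx* cctx, int level,
                                  const void* src, size_t srcSize,
                                  void* dst, size_t dstCapacity)
{
    ZSTD_inBuffer  input  = { src, srcSize, 0 };
    ZSTD_outBuffer output = { dst, dstCapacity, 0 };
    size_t remaining;
    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    /* No ZSTD_CCtx_setPledgedSrcSize(): the small size is unknown here. */
    remaining = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);
    return ZSTD_isError(remaining) ? remaining : output.pos;
}
```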

cc @terrelln: there might be side effects which are not properly captured by the tests, such as potential reproducibility issues with a probability low enough that they are too difficult to reproduce during CI tests, and maybe other side effects worth looking into.

Note: WIP, not for merge.

(Note: this might break due to the need to also track the starting candidate number per row.)
@Cyan4973 (Contributor, Author) commented Jan 4, 2022

MSAN tests fail because reading uninitialized tags is an automatic MSAN failure, even when the algorithm is designed to start from uninitialized values.
So I guess the next step is to mark the tag memory area as "fine" from an MSAN perspective.

@felixhandte (Contributor)

@Cyan4973, see here for an example of explicitly unpoisoning memory.
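For reference, explicit unpoisoning typically takes the shape below. This is a hedged sketch with illustrative macro, function, and variable names (zstd routes this through its own wrappers), not the example being linked to:

```c
#include <stddef.h>

/* Guarded so regular builds are unaffected; __has_feature is a
 * Clang-style check, false elsewhere. */
#if defined(__has_feature)
# if __has_feature(memory_sanitizer)
#  include <sanitizer/msan_interface.h>
#  define TAG_UNPOISON(p, s) __msan_unpoison((p), (s))
# endif
#endif
#ifndef TAG_UNPOISON
# define TAG_UNPOISON(p, s) do { (void)(p); (void)(s); } while (0)
#endif

/* Called right after reserving the tag space, instead of memset()ing it:
 * tells MSAN these bytes are intentionally read before being written. */
static void markTagSpaceReadable(void* tagTable, size_t tagTableSize)
{
    TAG_UNPOISON(tagTable, tagTableSize);
}
```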

@Cyan4973 (Contributor, Author) commented Jan 4, 2022

MSAN unpoisoning works for the MSAN tests, but there is still a remaining issue within the zlibwrapper tests.
I suspect it's another MSAN case, one where zlibwrapper is compiled with MSAN but the linked libzstd is not, thus failing the MSAN test.

This, by the way, is a good demonstration of what could happen if any other application linking libzstd attempts MSAN tests.

@terrelln (Contributor) commented Jan 4, 2022

> I suspect it's another MSAN case, one where zlibwrapper is compiled with MSAN but the linked libzstd is not, thus failing the MSAN test.

That is not supported by MSAN. I'm surprised it was working as-is. So we should fix that test to compile libzstd with MSAN.

MSAN needs all code to be compiled with MSAN in order to work correctly. This is unlike ASAN, which can work with linked code that isn't compiled with ASAN; it will just miss bugs in the non-ASAN code.

@terrelln (Contributor) left a review comment

This should be totally fine for determinism. But we should make sure to benchmark with context reuse, to ensure that eliding the memset doesn't slow down large-file compression in that scenario.
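"Context reuse" here means compressing many inputs through one long-lived ZSTD_CCtx. A minimal sketch of such a benchmark loop, with nbFiles, srcs, and srcSizes as hypothetical inputs (error handling elided):

```c
#include <zstd.h>

/* One long-lived cctx compresses many inputs in turn. Tables (and,
 * with this PR, the tag space) carry over between calls, which is
 * exactly the case eliding the memset must not regress. */
static void benchWithReuse(const void* const* srcs, const size_t* srcSizes,
                           int nbFiles, int level,
                           void* dst, size_t dstCapacity)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    int i;
    for (i = 0; i < nbFiles; i++) {
        ZSTD_compressCCtx(cctx, dst, dstCapacity,
                          srcs[i], srcSizes[i], level);
    }
    ZSTD_freeCCtx(cctx);
}
```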

@terrelln (Contributor) commented Jan 4, 2022

Oh, that's a Valgrind check, not MSAN.

@terrelln (Contributor) left a review comment

I don't see a way around the problems with Valgrind. We could add a suppression file, but then it would break for other people running Valgrind on zstd who don't have our suppressions.

And I worry that we may get spurious bug reports, and concern from people who run Valgrind and think that zstd is buggy because of that report.

Would it make sense to do the same thing for the tag table as we do for the other tables, and keep track of what's been initialized?

@felixhandte (Contributor)

I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.
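A minimal sketch of this scheme, with illustrative names (CleanRange and the function are hypothetical, not the eventual implementation):

```c
#include <string.h>

/* [cleanBegin, cleanEnd) is the tag memory known to be initialized.
 * A new tag table inside that range needs no work; outside it, only
 * the gap up to the known-good range is memset, and the range grows. */
typedef struct { char* cleanBegin; char* cleanEnd; } CleanRange;

static void ensureTagSpaceInitialized(CleanRange* r, char* tagBegin, char* tagEnd)
{
    if (r->cleanBegin == NULL) {        /* first use: initialize everything */
        memset(tagBegin, 0, (size_t)(tagEnd - tagBegin));
        r->cleanBegin = tagBegin;
        r->cleanEnd   = tagEnd;
        return;
    }
    if (tagBegin < r->cleanBegin) {     /* extend the clean range downward */
        memset(tagBegin, 0, (size_t)(r->cleanBegin - tagBegin));
        r->cleanBegin = tagBegin;
    }
    if (tagEnd > r->cleanEnd) {         /* extend the clean range upward */
        memset(r->cleanEnd, 0, (size_t)(tagEnd - r->cleanEnd));
        r->cleanEnd = tagEnd;
    }
    /* Common case: the table landed inside the clean range; no memset. */
}
```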

@felixhandte (Contributor)

The big question is whether we also want to get this in for 1.5.2.

@Cyan4973 (Contributor, Author) commented Jan 5, 2022

> I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. [...]

This solution looks good to me.

@Cyan4973 (Contributor, Author) commented Jan 5, 2022

> The big question is whether we also want to get this in for 1.5.2.

Yes, we do; otherwise it's effectively a significant speed regression (for a specific scenario) compared to v1.4.9.

@Cyan4973 (Contributor, Author) commented Jan 6, 2022

> I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. [...]

This solution looks good to me.

Although it's not guaranteed that these movements of the tag space generate a single expanding memory region: they may end up defining multiple distant regions, in which case tracking becomes impractical, so we'll have to accept some inefficiency (i.e. initialization will happen more often than it would if tracking of initialized bytes were perfect).

I wonder now how competitive initialization with calloc() could be.

Edit: on first look, it seems it could be made speed-equivalent.
Edit 2: well, not really; according to fullbench -b43, using calloc() to allocate and initialize the workspace makes the compression operation way slower when the workspace is continuously recreated and freed.

@terrelln (Contributor) commented Jan 7, 2022

I think a sane approach would be to change the order in which the cwksp stores elements.

Instead of:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... aligned][<- ... buffers]

Make it:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... buffers][<- ... aligned uninitialized][<- ... aligned initialized]

Now we just need to keep track of where the boundary between 'aligned uninitialized' and 'aligned initialized' lies, and initialize only when the initialized section grows.
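The only extra state this layout requires is a low-water mark for the 'aligned initialized' section. A sketch with illustrative names (not the actual cwksp code):

```c
#include <string.h>

/* initOnceStart is the lowest address of the 'aligned initialized'
 * section ever handed out. Reusing addresses at or above it costs
 * nothing; growing below it initializes only the newly covered bytes. */
typedef struct { char* initOnceStart; /* ... other cwksp fields ... */ } WkspSketch;

static void* allocAlignedInitOnce(WkspSketch* ws, char* newStart)
{
    if (newStart < ws->initOnceStart) {
        /* Only the newly covered prefix needs initialization. */
        memset(newStart, 0, (size_t)(ws->initOnceStart - newStart));
        ws->initOnceStart = newStart;
    }
    return newStart;
}
```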

@terrelln (Contributor) commented Jan 7, 2022

I have a branch that I modified to take that approach.

I do have a problem though... With context reuse, compression is slower when I don't memset the tag table. I think what's happening is that we're getting tags that match from the previous compression, but are out of our window. This causes extra work in cases where there otherwise wouldn't be any matches.

Basically, I think that this branch:

if (matchIndex < lowLimit)

is rarely taken when the tables are sized to fit the input, since we filter out at the tag step instead. But when we remove the memset and are compressing similar data, we may find matches that were from the previous file. The tag won't filter out those matches, so we hit this branch instead.
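A simplified sketch of the search loop being described, with illustrative types and names (the real row hash uses wider rows and SIMD tag matching):

```c
#include <stdint.h>

/* A candidate must first pass a cheap 1-byte tag comparison; survivors
 * are then validated against the window through their stored index.
 * With a zeroed tag table, stale entries usually die at the tag check;
 * with stale tags they survive it, making the lowLimit branch hot. */
typedef struct {
    uint8_t  tags[16];      /* 1-byte tag per row entry */
    uint32_t indices[16];   /* match position per row entry */
} RowSketch;

static uint32_t firstValidCandidate(const RowSketch* row,
                                    uint8_t hashTag, uint32_t lowLimit)
{
    int i;
    for (i = 0; i < 16; i++) {
        if (row->tags[i] != hashTag) continue;    /* tag filter */
        {   uint32_t const matchIndex = row->indices[i];
            if (matchIndex < lowLimit) continue;  /* out-of-window entry */
            return matchIndex;                    /* worth a full match check */
        }
    }
    return 0;   /* 0 = no candidate, for this sketch */
}
```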

I'm going to look into it.

@terrelln (Contributor) commented Jan 7, 2022

Running zstd -b5e5 enwik7 -B128K, I see that the branch:

if (matchIndex < lowLimit)

is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

@Cyan4973 (Contributor, Author) commented Jan 7, 2022

Testing zstd -b5e5 enwik7 -B128K on my desktop, I indeed notice a drop from 85 MB/s to 80 MB/s when not memsetting the tag area.

To be fair, I was expecting worse, given the 150x increase in branch count.

When removing -B128K, there is no longer any difference. And on the other side, with -B16K, the difference is very small (<2%).

Which makes -B128K the odd worst case in the middle?

@felixhandte (Contributor)

> I think a sane approach would be to change the order in which the cwksp stores elements. [...]

Yeah, I guess so. The reason it was laid out this way was so that, even if the buffers require a non-round number of bytes, all of the allocations could fit in a workspace with no extra padding bytes while respecting the alignment requirements of everything. I guess we've already backed away from this, though, by aligning the tables to 64 bytes. You will probably need to add 3 more padding bytes to the workspace budget.

@Cyan4973 (Contributor, Author) commented Jan 7, 2022

> I do have a problem though... With context reuse, compression is slower when I don't memset the tag table.

This PR started as a v1.5.2 performance fix for small data in streaming mode. If we now have to balance priorities with other scenarios that would be negatively impacted, that changes the scope somewhat.

Consequently, we may need more time to analyze trade-offs and consider alternatives. In that case, this no longer seems like an item for v1.5.2.

@Cyan4973 (Contributor, Author) commented Jan 8, 2022

> I see that the branch is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

What about:

  1. creating a mask from the index table
  2. AND-ing it with the tag mask
  3. now we have a single mask, where each position is guaranteed to be both within range and a tag match
  4. the branch can be removed

I presume the issue is that creating a mask from the index table is a complex (and costly) operation.
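A scalar sketch of the proposed combination, with illustrative names. The loop building the validity mask is exactly the cost suspected above; a SIMD compare over the index row would be the natural way to amortize it:

```c
#include <stdint.h>

/* Derive one validity bit per row entry from the index table, AND it
 * with the tag-match mask, and the per-candidate range branch is gone:
 * every set bit in the result is both in-window and a tag match. */
static uint32_t combinedMatchMask(const uint32_t* indices, int nbEntries,
                                  uint32_t tagMatchMask, uint32_t lowLimit)
{
    uint32_t validMask = 0;
    int i;
    for (i = 0; i < nbEntries; i++) {       /* the potentially costly part */
        validMask |= (uint32_t)(indices[i] >= lowLimit) << i;
    }
    return tagMatchMask & validMask;
}
```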

@Cyan4973 changed the title from "fix #2966 part 2 : do not initialize tag space" to "[WIP] fix #2966 part 2 : do not initialize tag space" on Jan 21, 2022
@Cyan4973 marked this pull request as draft on January 21, 2022
@terrelln (Contributor)

> I presume the issue is that creating a mask from the index table is a complex (and costly) operation.

It involves looping over the table. I think this idea could work, and could even reduce branches in all cases; it's definitely possible this is a speed gain in general. But it remains to be tested.

@yoniko (Contributor) commented Mar 13, 2023

A more complete solution has been merged; see #3528.

@yoniko closed this on Mar 13, 2023
@Cyan4973 deleted the fix2966_part2 branch on March 29, 2023