
[WIP] fix #2966 part 2 : do not initialize tag space #2971

Closed

Conversation

@Cyan4973 (Contributor) commented Jan 4, 2022

This diff disables tag initialization when using the rowHash mode. This initialization was unconditional, but it becomes the dominating operation when compressing small data in streaming mode (see issue #2966).

I could not find a good reason to initialize tags. It just makes tag values start at 0, but 0 is a regular tag value, no more significant than any other. Worst case, a stale tag gives a wrong hint of match presence, but even that should be filtered out by distance analysis, which remains active through index validation. So this won't impact the compression result.

Initially, I suspected this wouldn't work, because the tag space is 2x larger than it should be, suggesting the additional space is used for something other than tag values, like determining the starting position in the row (which would be an overkill memory budget, but that's a different topic). To my surprise, the change passes all tests successfully, suggesting rowHash is resilient even to a random start position.
Edit: it finally breaks on the msan + fuzzer test below, so that's worth looking into.

The end result is significant. When combined with #2969, the compression speed of rowHash on small data increases dramatically, as can be seen below (#2969 is required, as otherwise the impact of tag initialization is just lost as part of a larger initialization issue).

The following measurement is taken on a Core i7-9700K (turbo disabled) with fullbench -b41, using geldings.txt (a small text file) as a sample. The test corresponds to a scenario using ZSTD_compressStream() without the benefit of knowing the small sample size beforehand.

| level | v1.5.1 | #2969 + this PR | comment |
|------:|-------:|----------------:|:--------|
| 1 | 101 MB/s | 113 MB/s | |
| 2 | 67 MB/s | 112 MB/s | |
| 3 | 31 MB/s | 94 MB/s | |
| 4 | 14 MB/s | 93 MB/s | |
| 5 | 9 MB/s | 54 MB/s | rowHash |
| 6 | 8.7 MB/s | 50 MB/s | rowHash |
| 7 | 4.6 MB/s | 50 MB/s | rowHash |
| 8 | 4.5 MB/s | 48 MB/s | rowHash |
| 9 | 1.7 MB/s | 48 MB/s | rowHash |
| 10 | 0.8 MB/s | 45 MB/s | rowHash |
| 11 | 0.8 MB/s | 39 MB/s | rowHash |
| 12 | 0.4 MB/s | 39 MB/s | rowHash |
| 13 | 0.6 MB/s | 38 MB/s | |
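For concreteness, the measured scenario looks roughly like the sketch below, written against ZSTD_compressStream2() (the modern entry point wrapping ZSTD_compressStream()). Buffer handling and names are illustrative, not the fullbench code itself:

```c
#include <zstd.h>

/* Sketch of the benchmarked scenario: streaming compression of a small
 * buffer without declaring its size up front, so the library sizes its
 * tables for the general case. Error handling is minimal. */
static size_t streamCompressSmall(ZSTD_CCtx* cctx, int level,
                                  const void* src, size_t srcSize,
                                  void* dst, size_t dstCapacity)
{
    ZSTD_inBuffer  input  = { src, srcSize, 0 };
    ZSTD_outBuffer output = { dst, dstCapacity, 0 };
    size_t remaining;
    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    /* No ZSTD_CCtx_setPledgedSrcSize(): the small size is unknown here. */
    remaining = ZSTD_compressStream2(cctx, &output, &input, ZSTD_e_end);
    return ZSTD_isError(remaining) ? remaining : output.pos;
}
```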

cc @terrelln: there might be side effects which are not properly captured by the tests, such as potential reproducibility issues with a probability low enough that they are too difficult to reproduce during CI tests, and maybe other side effects worth looking into.

Note: WIP, not for merge.

(Note: this might break due to the need to also track the starting candidate number per row.)
@Cyan4973 (Contributor, Author) commented Jan 4, 2022

MSAN tests fail because reading uninitialized tags is an automatic MSAN failure, even when the algorithm is designed to start from uninitialized values.
So I guess the next step is to mark the tag memory area as "fine" from an MSAN perspective.

@felixhandte (Contributor)

@Cyan4973, see here for an example of explicitly unpoisoning memory.
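For reference, explicit unpoisoning typically takes the shape below. This is a hedged sketch with illustrative macro, function, and variable names (zstd routes this through its own wrappers), not the example being linked to:

```c
#include <stddef.h>

/* Guarded so regular builds are unaffected; __has_feature is a
 * Clang-style check, false elsewhere. */
#if defined(__has_feature)
# if __has_feature(memory_sanitizer)
#  include <sanitizer/msan_interface.h>
#  define TAG_UNPOISON(p, s) __msan_unpoison((p), (s))
# endif
#endif
#ifndef TAG_UNPOISON
# define TAG_UNPOISON(p, s) do { (void)(p); (void)(s); } while (0)
#endif

/* Called right after reserving the tag space, instead of memset()ing it:
 * tells MSAN these bytes are intentionally read before being written. */
static void markTagSpaceReadable(void* tagTable, size_t tagTableSize)
{
    TAG_UNPOISON(tagTable, tagTableSize);
}
```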

@Cyan4973 (Contributor, Author) commented Jan 4, 2022

MSAN unpoisoning works for the MSAN tests, but there is still a remaining issue within the zlibwrapper tests.
I suspect it's another MSAN case, one where zlibwrapper is compiled with MSAN but the linked libzstd is not, thus failing the MSAN test.

This, by the way, is a good demonstration of what could happen if any other application linking libzstd attempts MSAN tests.

@terrelln (Contributor) commented Jan 4, 2022

> I suspect it's another MSAN case, one where zlibwrapper is compiled with MSAN but the linked libzstd is not, thus failing the MSAN test.

That is not supported by MSAN. I'm surprised it was working as-is. So we should fix that test to compile libzstd with MSAN.

MSAN needs all code to be compiled with MSAN in order to work correctly. This is unlike ASAN, which can work with linked code that isn't compiled with ASAN; it will just miss bugs in the non-ASAN code.

@terrelln (Contributor) left a review comment

This should be totally fine for determinism. But we should make sure to benchmark with context reuse, to ensure that eliding the memset doesn't slow down large-file compression in that scenario.
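"Context reuse" here means compressing many inputs through one long-lived ZSTD_CCtx. A minimal sketch of such a benchmark loop, with nbFiles, srcs, and srcSizes as hypothetical inputs (error handling elided):

```c
#include <zstd.h>

/* One long-lived cctx compresses many inputs in turn. Tables (and,
 * with this PR, the tag space) carry over between calls, which is
 * exactly the case eliding the memset must not regress. */
static void benchWithReuse(const void* const* srcs, const size_t* srcSizes,
                           int nbFiles, int level,
                           void* dst, size_t dstCapacity)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    int i;
    for (i = 0; i < nbFiles; i++) {
        ZSTD_compressCCtx(cctx, dst, dstCapacity,
                          srcs[i], srcSizes[i], level);
    }
    ZSTD_freeCCtx(cctx);
}
```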

@terrelln (Contributor) commented Jan 4, 2022

Oh, that's a Valgrind check, not MSAN.

@terrelln (Contributor) left a review comment

I don't see a way around the problems with Valgrind. We could add a suppression file, but then it would break for other people running Valgrind on zstd who don't have our suppressions.

And I worry that we may get spurious bug reports, and concern from people who run Valgrind and think that zstd is buggy because of that report.

Would it make sense to do the same thing for the tag table as we do for the other tables, and keep track of what's been initialized?

@felixhandte (Contributor)

I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.
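A minimal sketch of this scheme, with illustrative names (CleanRange and the function are hypothetical, not the eventual implementation):

```c
#include <string.h>

/* [cleanBegin, cleanEnd) is the tag memory known to be initialized.
 * A new tag table inside that range needs no work; outside it, only
 * the gap up to the known-good range is memset, and the range grows. */
typedef struct { char* cleanBegin; char* cleanEnd; } CleanRange;

static void ensureTagSpaceInitialized(CleanRange* r, char* tagBegin, char* tagEnd)
{
    if (r->cleanBegin == NULL) {        /* first use: initialize everything */
        memset(tagBegin, 0, (size_t)(tagEnd - tagBegin));
        r->cleanBegin = tagBegin;
        r->cleanEnd   = tagEnd;
        return;
    }
    if (tagBegin < r->cleanBegin) {     /* extend the clean range downward */
        memset(tagBegin, 0, (size_t)(r->cleanBegin - tagBegin));
        r->cleanBegin = tagBegin;
    }
    if (tagEnd > r->cleanEnd) {         /* extend the clean range upward */
        memset(r->cleanEnd, 0, (size_t)(tagEnd - r->cleanEnd));
        r->cleanEnd = tagEnd;
    }
    /* Common case: the table landed inside the clean range; no memset. */
}
```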

@felixhandte (Contributor)

The big question is whether we also want to get this in for 1.5.2.

@Cyan4973 (Contributor, Author) commented Jan 5, 2022

> I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. [...]

This solution looks good to me.

@Cyan4973 (Contributor, Author) commented Jan 5, 2022

> The big question is whether we also want to get this in for 1.5.2.

Yes, we do; otherwise it's effectively a significant speed regression (for a specific scenario) compared to v1.4.9.

@Cyan4973 (Contributor, Author) commented Jan 6, 2022

> I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. [...]

This solution looks good to me.

Although it's not guaranteed that these movements of the tag space generate a single expanding memory region: they may end up defining multiple distant regions, in which case tracking becomes impractical, so we'll have to accept some inefficiency (i.e. initialization will happen more often than it would if tracking of initialized bytes were perfect).

I wonder now how competitive initialization with calloc() could be.

Edit: on first look, it seems it could be made speed-equivalent.
Edit 2: well, not really; according to fullbench -b43, using calloc() to allocate and initialize the workspace makes the compression operation way slower when the workspace is continuously recreated and freed.

@terrelln (Contributor) commented Jan 7, 2022

I think a sane approach would be to change the order in which the cwksp stores elements.

Instead of:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... aligned][<- ... buffers]

Make it:

[                        ... workspace ...                         ]
[objects][tables ... ->] free space [<- ... buffers][<- ... aligned uninitialized][<- ... aligned initialized]

Now we just need to keep track of where the boundary between 'aligned uninitialized' and 'aligned initialized' lies, and initialize only when the initialized section grows.
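The only extra state this layout requires is a low-water mark for the 'aligned initialized' section. A sketch with illustrative names (not the actual cwksp code):

```c
#include <string.h>

/* initOnceStart is the lowest address of the 'aligned initialized'
 * section ever handed out. Reusing addresses at or above it costs
 * nothing; growing below it initializes only the newly covered bytes. */
typedef struct { char* initOnceStart; /* ... other cwksp fields ... */ } WkspSketch;

static void* allocAlignedInitOnce(WkspSketch* ws, char* newStart)
{
    if (newStart < ws->initOnceStart) {
        /* Only the newly covered prefix needs initialization. */
        memset(newStart, 0, (size_t)(ws->initOnceStart - newStart));
        ws->initOnceStart = newStart;
    }
    return newStart;
}
```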

@terrelln (Contributor) commented Jan 7, 2022

I have a branch that I modified to take that approach.

I do have a problem though... With context reuse, compression is slower when I don't memset the tag table. I think what's happening is that we're getting tags that match from the previous compression, but are out of our window. This causes extra work in cases where there otherwise wouldn't be any matches.

Basically, I think that this branch:

if (matchIndex < lowLimit)

is rarely taken when the tables are sized to fit the input, since we filter out at the tag step instead. But when we remove the memset and are compressing similar data, we may find matches that were from the previous file. The tag won't filter out those matches, so we hit this branch instead.
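A simplified sketch of the search loop being described, with illustrative types and names (the real row hash uses wider rows and SIMD tag matching):

```c
#include <stdint.h>

/* A candidate must first pass a cheap 1-byte tag comparison; survivors
 * are then validated against the window through their stored index.
 * With a zeroed tag table, stale entries usually die at the tag check;
 * with stale tags they survive it, making the lowLimit branch hot. */
typedef struct {
    uint8_t  tags[16];      /* 1-byte tag per row entry */
    uint32_t indices[16];   /* match position per row entry */
} RowSketch;

static uint32_t firstValidCandidate(const RowSketch* row,
                                    uint8_t hashTag, uint32_t lowLimit)
{
    int i;
    for (i = 0; i < 16; i++) {
        if (row->tags[i] != hashTag) continue;    /* tag filter */
        {   uint32_t const matchIndex = row->indices[i];
            if (matchIndex < lowLimit) continue;  /* out-of-window entry */
            return matchIndex;                    /* worth a full match check */
        }
    }
    return 0;   /* 0 = no candidate, for this sketch */
}
```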

I'm going to look into it.

@terrelln (Contributor) commented Jan 7, 2022

Running zstd -b5e5 enwik7 -B128K, I see that the branch:

if (matchIndex < lowLimit)

is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

@Cyan4973 (Contributor, Author) commented Jan 7, 2022

Testing zstd -b5e5 enwik7 -B128K on my desktop, I indeed notice a drop from 85 MB/s to 80 MB/s when not memsetting the tag area.

To be fair, I was expecting worse, given the 150x increase in branch count.

When removing -B128K, there is no longer any difference. And on the other side, with -B16K, the difference is very small (<2%).

Which makes -B128K the odd worst case in the middle?

@felixhandte (Contributor)

> I think a sane approach would be to change the order in which the cwksp stores elements. [...]

Yeah, I guess so. The reason it was laid out this way was so that, even if the buffers require a non-round number of bytes, all of the allocations could fit in a workspace with no extra padding bytes while respecting the alignment requirements of everything. I guess we've already backed away from this, though, by aligning the tables to 64 bytes. You will probably need to add 3 more padding bytes to the workspace budget.

@Cyan4973 (Contributor, Author) commented Jan 7, 2022

> I do have a problem though... With context reuse, compression is slower when I don't memset the tag table.

This PR started as a v1.5.2 performance fix for small data in streaming mode. If we now have to balance priorities with other scenarios that would be negatively impacted, that changes the scope somewhat.

Consequently, we may need more time to analyze trade-offs and consider alternatives. In that case, this no longer seems like an item for v1.5.2.

@Cyan4973 (Contributor, Author) commented Jan 8, 2022

> I see that the branch is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of 149x.

What about:

  1. creating a mask from the index table
  2. AND-ing it with the tag mask
  3. now we have a single mask, where each position is guaranteed to be both within range and a tag match
  4. the branch can be removed

I presume the issue is that creating a mask from the index table is a complex (and costly) operation.
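A scalar sketch of the proposed combination, with illustrative names. The loop building the validity mask is exactly the cost suspected above; a SIMD compare over the index row would be the natural way to amortize it:

```c
#include <stdint.h>

/* Derive one validity bit per row entry from the index table, AND it
 * with the tag-match mask, and the per-candidate range branch is gone:
 * every set bit in the result is both in-window and a tag match. */
static uint32_t combinedMatchMask(const uint32_t* indices, int nbEntries,
                                  uint32_t tagMatchMask, uint32_t lowLimit)
{
    uint32_t validMask = 0;
    int i;
    for (i = 0; i < nbEntries; i++) {       /* the potentially costly part */
        validMask |= (uint32_t)(indices[i] >= lowLimit) << i;
    }
    return tagMatchMask & validMask;
}
```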

@Cyan4973 changed the title from "fix #2966 part 2 : do not initialize tag space" to "[WIP] fix #2966 part 2 : do not initialize tag space" on Jan 21, 2022
@Cyan4973 marked this pull request as draft on January 21, 2022
@terrelln (Contributor)

> I presume the issue is that creating a mask from the index table is a complex (and costly) operation.

It involves looping over the table. I think this idea could work, and could even reduce branches in all cases; it's definitely possible this is a speed gain in general. But it remains to be tested.

@yoniko (Contributor) commented Mar 13, 2023

A more complete solution has been merged; see #3528.

@yoniko closed this on Mar 13, 2023
@Cyan4973 deleted the fix2966_part2 branch on March 29, 2023