
Improve optimal parser performance on small data #2771

Merged Sep 14, 2021 (16 commits)

Conversation

@Cyan4973 (Contributor) commented Sep 8, 2021:

This investigation was driven by #2765.

In essence, the user's sample contains a file which, depending on early choices made by the optimal parser, ends up either compressed or not, resulting in large relative differences (+100%).

This is a case of a self-reinforcing local optimum. In order to find the shortest path (i.e. the one which leads to better compression), the algorithm needs to evaluate the "cost" of each of its choices. For this sample to compress well, the algorithm must consider small matches cheaper than literals. If it does, its weight evaluation will favor future choices of the same kind. If not, it will keep considering them too expensive.
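
To illustrate how such a parser prices its choices: conceptually, each decision is charged a number of bits derived from the running frequency statistics, roughly -log2(frequency/total) per symbol, plus the raw extra bits an offset needs. The sketch below is a hypothetical illustration of that idea only; all names are made up, and the actual price functions in lib/compress/zstd_opt.c are more involved.

```c
#include <math.h>

/* Hypothetical price model (illustration only): cost of a symbol, in bits,
 * derived from running frequency statistics. The +1 keeps unseen symbols finite. */
static double symbolCostBits(unsigned freq, unsigned total)
{
    return -log2((double)(freq + 1) / (double)(total + 1));
}

/* Cost of emitting one literal byte, given literal frequencies. */
static double literalCostBits(const unsigned litFreq[256], unsigned litSum,
                              unsigned char byte)
{
    return symbolCostBits(litFreq[byte], litSum);
}

/* Rough cost of a match: match-length code + offset code + the raw extra bits
 * needed to transmit the offset value itself. */
static double matchCostBits(unsigned mlFreq, unsigned mlSum,
                            unsigned ofFreq, unsigned ofSum,
                            unsigned offsetExtraBits)
{
    return symbolCostBits(mlFreq, mlSum)
         + symbolCostBits(ofFreq, ofSum)
         + (double)offsetExtraBits;
}
```

With a model like this, whether a 3-byte match beats three literals depends entirely on the current frequency tables, which is why the starting point matters so much on small inputs.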

One of the known issues of the optimal parser is that it doesn't know which stats to start from. Over time, the process tends to self-regulate, but in the early stages, not enough statistics have been collected yet.
The approach selected initially was to start from a blank page, with all choices essentially equivalent.
This is what this PR modifies.

Initial statistics are modified in the following ways:

  • Nudge literal lengths slightly towards fewer or no literals: this small change to initial conditions has an outsized impact, yielding > 100 KB of savings on silesia.tar cut into small blocks. It's not clear why; I suspect it merely corrects another side effect which tends to favor more literals than it should. This change, all by itself, fixes #2765 (Compression ratio example).
  • Nudge offset codes towards short distances and repeat codes: this one has a small positive effect. I was expecting more impact, but it only improves silesia.tar by a few dozen KB. That might be because, since the total cost includes the supplementary offset bits, the algorithm is probably already biased enough (in the general case) in favor of small offsets and repeat codes. Still, since the impact was consistently positive, this change was kept. It also fixes #2765 all by itself, proving there is more than one way to fix that one. (A sketch of such nudged starting frequencies follows this list.)
  • Revisit the statistics update: this point matters mostly for large files, since it impacts the transport of statistics between blocks. But it also impacts btultra2, which compresses a small block twice, the first time merely to collect statistics for the second pass. The issue is that the general update algorithm was presumed to be invoked once per 128 KB block; on small inputs, it's invoked twice within a much shorter timeframe, with no time to collect a "full block" of statistics. Since the rescaling factor is static, it squashes the statistics the same way whether they are numerous or not. In the new method, the rescaling is a bit more dynamic: it tries to maintain a certain baseline budget between consecutive blocks, and adapts the squashing to the amounts actually observed at runtime. One issue is that there is no "one size fits all": files with homogeneous entropy prefer slow adaptation, hence larger inter-block correlations, while files with heterogeneous sections prefer fast adaptation. I presume a more optimal solution would adapt the update rate depending on how likely the statistics are to be representative of the following section. That's a topic for a full paper; for the time being, the simple heuristic selected in this PR will do (see benchmarks below).
  • Revisit the early literals estimation: when starting a new frame from scratch, the algorithm takes some clues from actual statistics in the sample. It seems this wasn't such a good idea after all. Potential reasons: literals left over from LZ sequences differ substantially from raw bytes directly present in the file; the most recurrent bytes might be present too often and overwhelm the statistics unfairly; and the remaining number of bytes is sometimes too small to be compressed. In all of these cases, literals end up considered cheaper than they actually turn out to be in the final compressed block. Here too, the topic deserves a full revisit, but for this PR, the simple heuristic selected improves compression ratio enough.
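
To make the first two bullets more concrete, here is a minimal sketch of what "nudged" starting frequencies could look like. The offset-code values mirror the baseOFCfreqs table quoted later in this thread; the literal-length values and all identifiers are hypothetical stand-ins, not the actual tables in lib/compress/zstd_opt.c.

```c
#define MAX_LL_CODE  35   /* literal-length codes (assumed to match zstd's MaxLL) */
#define MAX_OFF_CODE 31   /* offset codes (assumed to match zstd's MaxOff) */

/* Hypothetical starting frequencies for literal-length codes:
 * code 0 (no literals before the match) gets a small head start,
 * so early choices lean towards fewer or no literals. */
static void initLitLengthFreqs(unsigned llFreq[MAX_LL_CODE + 1])
{
    int ll;
    for (ll = 0; ll <= MAX_LL_CODE; ll++)
        llFreq[ll] = 1;        /* every code remains reachable */
    llFreq[0] = 4;             /* favor "zero literals" */
    llFreq[1] = 2;             /* then very short literal runs */
}

/* Starting frequencies for offset codes, biased towards repeat codes and
 * short distances (values copied from the baseOFCfreqs table in this PR). */
static void initOffCodeFreqs(unsigned ofFreq[MAX_OFF_CODE + 1])
{
    static const unsigned baseOFCfreqs[MAX_OFF_CODE + 1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
    int of;
    for (of = 0; of <= MAX_OFF_CODE; of++)
        ofFreq[of] = baseOFCfreqs[of];
}
```

Whether starting values like these are close to optimal is exactly what the benchmarks below try to assess.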

So what are the benefits? Let's run some benchmarks:

`silesia.tar`, cut into 2 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 11 | silesia.tar -B2K | 95169105 | 95000131 | -168974 | -0.18% | opt mml=4 |
| 12 | silesia.tar -B2K | 93413153 | 93017601 | -395552 | -0.42% | mml=3 |
| 13 | silesia.tar -B2K | 93108397 | 92691659 | -416738 | -0.45% | ultra |
| 14 | silesia.tar -B2K | 92972963 | 92569885 | -403078 | -0.43% | |
| 15 | silesia.tar -B2K | 92950860 | 92547104 | -403756 | -0.43% | |
| 16 | silesia.tar -B2K | 92628560 | 92401081 | -227479 | -0.25% | ultra2 |
| 17 | silesia.tar -B2K | 92516103 | 92288346 | -227757 | -0.25% | |
| 18 | silesia.tar -B2K | 92513639 | 92285819 | -227820 | -0.25% | |
| 19 | silesia.tar -B2K | 92513546 | 92285727 | -227819 | -0.25% | |
| 20 | silesia.tar -B2K | 92511223 | 92284095 | -227128 | -0.25% | |
| 21 | silesia.tar -B2K | 92510965 | 92283782 | -227183 | -0.25% | |
| 22 | silesia.tar -B2K | 92510917 | 92283698 | -227219 | -0.25% | |

This round is consistently positive. It's even better when small matches of length 3 can be selected (levels 12+). The benefit is less pronounced once the ultra2 levels are reached. The generous interpretation is that ultra2 was already designed to compensate for the "early statistics" problem, so it benefits less from this new round of improvement. Still, it benefits a little, which is almost surprising, and points to a self-reinforcing effect.

OK, but maybe 2 KB is too favorable a scenario, and the new statistics were designed for this one use case? Let's try bigger 16 KB blocks then.

`silesia.tar`, cut into 16 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 11 | silesia.tar -B16K | 75619806 | 75488340 | -131466 | -0.17% | opt mml=4 |
| 12 | silesia.tar -B16K | 73648745 | 73392013 | -256732 | -0.35% | mml=3 |
| 13 | silesia.tar -B16K | 73257726 | 72967672 | -290054 | -0.40% | ultra |
| 14 | silesia.tar -B16K | 73032188 | 72730261 | -301927 | -0.41% | |
| 15 | silesia.tar -B16K | 72968198 | 72663133 | -305065 | -0.42% | |
| 16 | silesia.tar -B16K | 72800467 | 72641294 | -159173 | -0.22% | ultra2 |
| 17 | silesia.tar -B16K | 72577872 | 72413473 | -164399 | -0.23% | |
| 18 | silesia.tar -B16K | 72572119 | 72407890 | -164229 | -0.23% | |
| 19 | silesia.tar -B16K | 72571968 | 72407720 | -164248 | -0.23% | |
| 20 | silesia.tar -B16K | 72572244 | 72409779 | -162465 | -0.22% | |
| 21 | silesia.tar -B16K | 72571829 | 72409500 | -162329 | -0.22% | |
| 22 | silesia.tar -B16K | 72571211 | 72408727 | -162484 | -0.22% | |

Well, with bigger blocks, the impact seems a little less pronounced, but it's still there, proving the initial stats remain useful beyond 2 KB.

OK, fine for small blocks. What about large blocks? Full 128 KB?

`silesia.tar`, cut into 128 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 13 | silesia.tar -B128K | 65656825 | 65611779 | -45046 | -0.07% | opt mml=4 |
| 14 | silesia.tar -B128K | 63675123 | 63462802 | -212321 | -0.33% | mml=3 |
| 15 | silesia.tar -B128K | 63282894 | 63059738 | -223156 | -0.35% | |
| 16 | silesia.tar -B128K | 63023353 | 62799390 | -223963 | -0.36% | ultra |
| 17 | silesia.tar -B128K | 63007738 | 62783036 | -224702 | -0.36% | |
| 18 | silesia.tar -B128K | 63003547 | 62780536 | -223011 | -0.35% | |
| 19 | silesia.tar -B128K | 62798322 | 62743773 | -54549 | -0.09% | ultra2 |
| 20 | silesia.tar -B128K | 62789783 | 62736999 | -52784 | -0.08% | |
| 21 | silesia.tar -B128K | 62785135 | 62739262 | -45873 | -0.07% | |
| 22 | silesia.tar -B128K | 62783801 | 62736795 | -47006 | -0.07% | |

Note that, for these sizes, zstd_btopt only starts at level 13.
Anyway, the benefit is now clearly reduced. But reassuringly, it's still globally positive, so it's still worth it.

Does this change impact full-length files, and if so, how?

`silesia.tar` whole file:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 16 | silesia.tar | 55351684 | 55321640 | -30044 | -0.05% | opt mml=4 |
| 17 | silesia.tar | 54284712 | 54275326 | -9386 | -0.02% | mml=3 |
| 18 | silesia.tar | 53420100 | 53427159 | 7059 | 0.01% | ultra |
| 19 | silesia.tar | 52986703 | 52991722 | 5019 | 0.01% | ultra2 |
| 20 | silesia.tar | 52577641 | 52574471 | -3170 | -0.01% | |
| 21 | silesia.tar | 52481750 | 52480144 | -1606 | 0.00% | |
| 22 | silesia.tar | 52460817 | 52458096 | -2721 | -0.01% | |

Well, barely. This time, the size differences are barely significant, and note that they are not always positive. But whether they are or not doesn't matter much, as the compression ratio is essentially equivalent.

OK, but let's try a different file now. Maybe calgary.tar, which is a collection of small files of different types appended together, resulting in rapidly changing statistics. How does it behave with the new update policy?

`calgary.tar` whole file:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 16 | calgary.tar | 881311 | 881780 | 469 | 0.05% | opt mml=4 |
| 17 | calgary.tar | 873376 | 874742 | 1366 | 0.16% | mml=3 |
| 18 | calgary.tar | 861665 | 863272 | 1607 | 0.19% | ultra |
| 19 | calgary.tar | 859732 | 860927 | 1195 | 0.14% | ultra2 |
| 20 | calgary.tar | 859732 | 860927 | 1195 | 0.14% | |
| 21 | calgary.tar | 859598 | 860842 | 1244 | 0.14% | |
| 22 | calgary.tar | 859587 | 860789 | 1202 | 0.14% | |

Well, this time it's rather negative, though thankfully by very little.
This is a case where the statistics update would benefit from changing more aggressively, since two consecutive internal files can be very different (text, image, db, etc.). It could be fixed with a faster update rate, but alas, other use cases (like silesia.tar) would then suffer. This is a situation where selecting a "good enough" middle-ground heuristic is all we can do before moving on to some more complex algorithm.

Anyway, an important point: it's not always a win. This change primarily targets small blocks (~2 KB), for which it's generally a win. For larger files, it's less clear-cut, but the impact is also less pronounced.

Cyan4973 and others added 10 commits September 3, 2021 12:51
This is less appropriate for this mode:
benchmark is about accuracy,
it's important to read the exact values.
As a library, the default shouldn't be to write anything on console.
`cover` and `fastcover` have a `g_displayLevel` variable to control this behavior.
It's now set to 0 (no display) by default.
Setting notification to a higher level should be an explicit operation by a console application.
small general compression ratio improvement for btopt+ strategies
used to be necessary to counter-balance the fixed-weight frequency update
which has been recently changed for an adaptive rate (targeting stable starting frequency stats).
better for larger blocks,
very small inefficiency on small blocks.
better for large files, and sources with relatively "stable" entropy,
like silesia.tar.
slightly worse for files with rapidly changing entropy,
like calgary.tar.

Updated small files tests in fuzzer
notably within kernel space
Comment on the following benchmark display lines:

```c
DISPLAYLEVEL(2, "\r%70s\r", ""); /* blank line */
DISPLAYLEVEL(2, "%2s-%-17.17s : %.*f%s -> \r", marks[markNb], displayName, hr_isize.precision, hr_isize.value, hr_isize.suffix);
assert(srcSize < UINT_MAX);
DISPLAYLEVEL(2, "%2s-%-17.17s :%10u -> \r", marks[markNb], displayName, (unsigned)srcSize);
```
Contributor:

With the existing code, you can see the full sizes with -vv. Does that seem insufficient? In general, I like that the output is human-readable.

@Cyan4973 (Author) replied, Sep 8, 2021:

Ah yes, I noticed this capability.
Unfortunately it wasn't enough, as it was breaking scripts that depend on the previous format.
Moreover, and more importantly, I simply believe that this is not the right place for the condensed quantity format. Benchmarking is about accuracy. I typically always need the exact amount, in order to clearly appreciate changes, even when they are small.

The new "human readable" format is more suitable for general information status.

Contributor:

Yeah, I will say that more often than not I have found myself benchmarking something and then having to redo it once I realize I forgot -vv. At least for me, I'd always prefer to have it on.

Contributor:
Makes sense.

by employing parallel compilation of object files.
@Cyan4973 (Author) commented Sep 8, 2021:

The make benchmarking test error is a bit obscure to me. Is it a case of incorrect parsing?

`IndexError: list index out of range`

@terrelln (Contributor) commented Sep 9, 2021:

> Is it a case of incorrect parsing?

It seems so, probably a change to the benchzstd.c output format that is breaking automated_benchmarking.py.

Comment on lines +231 to +234:

```c
optPtr->litSum = ZSTD_scaleStats(optPtr->litFreq, MaxLit, 12);
optPtr->litLengthSum = ZSTD_scaleStats(optPtr->litLengthFreq, MaxLL, 11);
optPtr->matchLengthSum = ZSTD_scaleStats(optPtr->matchLengthFreq, MaxML, 11);
optPtr->offCodeSum = ZSTD_scaleStats(optPtr->offCodeFreq, MaxOff, 11);
```
Contributor:

Should the logTarget maybe be based on the block size?

I would expect smaller blocks to want to have a smaller history.

@Cyan4973 (Author) replied:

The update policy could be refined even further. But note that the topic of "block size" has many sub-cases that probably deserve separate policies.

For example, is the "small block" also the "only block"?
(I presume that's what you meant.)
In that case, the first pass of btultra2 would produce stats which are already scaled to the size of the block. If there are only a few sequences, then the stats will only contain a few elements, way below the logTarget threshold, in which case they won't be up-scaled (stats are only down-scaled when they need to be). So there is a form of built-in adaptation for small inputs in this process.
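
To make the down-scaling behaviour concrete, here is a minimal sketch of the idea, not the actual ZSTD_scaleStats implementation, under the assumption that logTarget is the log2 of the frequency budget a table is allowed to carry over: a table whose total is already under budget is left alone, and only larger histories get squashed.

```c
/* Illustration only: shrink a frequency table so that its total drops to
 * roughly the 2^logTarget budget; tables already under budget are untouched. */
static unsigned scaleStatsSketch(unsigned* freqs, unsigned lastSymbol, unsigned logTarget)
{
    unsigned total = 0, s;
    for (s = 0; s <= lastSymbol; s++) total += freqs[s];

    if (total <= (1u << logTarget))
        return total;                 /* small history: keep it as-is */

    {   /* over budget: divide everything down, keeping seen symbols alive */
        unsigned const factor = total >> logTarget;   /* >= 1 here */
        unsigned newTotal = 0;
        for (s = 0; s <= lastSymbol; s++) {
            if (freqs[s] > 0)
                freqs[s] = 1 + (freqs[s] / (factor + 1));
            newTotal += freqs[s];
        }
        return newTotal;
    }
}
```

Under that assumption, the few sequences collected by a first pass over a small block stay well below the budget, so nothing gets squashed, which is the built-in adaptation described above.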

But if the "small block" is the N-th block in a long stream, and is also expected to be followed by other blocks of variable size, the situation is different. I would guess that a "stream level" logTarget would feel more appropriate.

This can certainly be analyzed even more.
I believe the update policy is one of these places where some non-negligible gains can still be produced.
And my expectation is that a better update policy should be dynamic, reacting to the "probability" that historic statistics may or may not match the statistics of the following block.

As can be guessed though, this is a fairly complex topic, which would require a dedicated (and time-consuming) study.

@terrelln (Contributor) commented Sep 9, 2021:

I wonder if btultra2 would do better if you ran btlazy2 as the first parser instead of the optimal parser. I know we've talked about it, but I don't remember if you've tried it. That would help solve the bad literals pricing, and would maybe help favor shorter literal/match lengths. Although I know that btlazy2 does have more literals than other parsers, due to its laziness.

@Cyan4973 (Author) commented Sep 12, 2021:

> I wonder if btultra2 would do better if you ran btlazy2 as the first parser instead of the optimal parser.

It's a good question. It's probably worth an investigation.

But note that it's not clear if it's obviously better.
I hear the argument that btlazy2 doesn't depend on prior statistics, and therefore will not be pushed one way or another depending on wrong initial statistics.
But then, it introduces other issues. At the very least, I would expect statistics from btlazy2 to be somewhat off compared to those produced by a first btultra pass. For example, btlazy2 doesn't look for (and never selects) matches of length 3. It also only considers rep0, like all lazy parsers. And as you already mentioned, it tends to produce more literals. So all these cases will be incorrectly represented in the final statistics.
It's probably possible to attempt some kind of "blind fix" to partially compensate for these biases, but it won't solve all cases. For example, in #2765, all matches are length 3, so btlazy2 would find no match at all. It would be difficult to extrapolate initial statistics from there.

Extending the topic, I've been considering some form of "light" initial evaluation for btultra, for example greedy. This might help, compared to "default" initial statistics, at the cost of a new dependency. btultra2 would build from there.

As can be guessed, the most important point here is that these investigations cost time. And time is the scarcest resource there is.

This PR doesn't "terminate" this topic; it mainly brings it to light. I'm open to any complementary PR that would improve these initial results.

@Cyan4973 (Author) commented Sep 12, 2021:

> It seems so, probably a change to the benchzstd.c output format that is breaking automated_benchmarking.py.

I guess that's because benchzstd now outputs its regular results to stdout instead of stderr (which is still used for error messages). The Python script seems designed to intercept and parse stderr.

A trivial replacement of stderr with stdout doesn't do the trick though...

edit: indeed, the output is empty, so the attempt at parsing the floating-point value fails. Still, I don't get why it's still empty after replacing stderr with stdout.

edit 2: got it. It's because the script downloads and compares several versions of zstd; some output to stdout and some to stderr, so neither choice is sufficient on its own.
On to a fix.

@Cyan4973 (Author) commented:
All remaining issues fixed

Comment on lines +216 to +221:

```c
{   unsigned const baseOFCfreqs[MaxOff+1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
```
Contributor:
This is the change that makes the majority of the difference.

Successfully merging this pull request may close these issues: Compression ratio example (#2765).