
Improve optimal parser performance on small data #2771

Merged Sep 14, 2021 (16 commits)

Conversation

@Cyan4973 (Contributor) commented Sep 8, 2021:

This investigation was driven by #2765.

In essence, the user's sample contains a file which, depending on early choices made by the optimal parser, ends up either compressed or not, resulting in large relative differences (+100%).

This is a case of a self-reinforcing local optimum. In order to find the shortest path (i.e. the one which leads to better compression), the algorithm needs to evaluate the "cost" of each of its choices. For this sample to compress well, the algorithm must consider small matches cheaper than literals. If it does, its weight evaluation will favor future choices of the same kind. If not, it will keep considering them too expensive.
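
To illustrate how such a parser prices its choices: conceptually, each decision is charged a number of bits derived from the running frequency statistics, roughly -log2(frequency/total) per symbol, plus the raw extra bits an offset needs. The sketch below is a hypothetical illustration of that idea only; all names are made up, and the actual price functions in lib/compress/zstd_opt.c are more involved.

```c
#include <math.h>

/* Hypothetical price model (illustration only): cost of a symbol, in bits,
 * derived from running frequency statistics. The +1 keeps unseen symbols finite. */
static double symbolCostBits(unsigned freq, unsigned total)
{
    return -log2((double)(freq + 1) / (double)(total + 1));
}

/* Cost of emitting one literal byte, given literal frequencies. */
static double literalCostBits(const unsigned litFreq[256], unsigned litSum,
                              unsigned char byte)
{
    return symbolCostBits(litFreq[byte], litSum);
}

/* Rough cost of a match: match-length code + offset code + the raw extra bits
 * needed to transmit the offset value itself. */
static double matchCostBits(unsigned mlFreq, unsigned mlSum,
                            unsigned ofFreq, unsigned ofSum,
                            unsigned offsetExtraBits)
{
    return symbolCostBits(mlFreq, mlSum)
         + symbolCostBits(ofFreq, ofSum)
         + (double)offsetExtraBits;
}
```

With a model like this, whether a 3-byte match beats three literals depends entirely on the current frequency tables, which is why the starting point matters so much on small inputs.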

One of the known issues of the optimal parser is that it doesn't know which stats to start from. Over time, the process tends to self-regulate, but in the early stages, not enough statistics have been collected yet.
The approach selected initially was to start from a blank page, with all choices essentially equivalent.
This is what this PR modifies.

Initial statistics are modified in the following ways:

  • Nudge literal lengths slightly towards fewer or no literals: this small change to initial conditions has an outsized impact, yielding > 100 KB of savings on silesia.tar cut into small blocks. It's not clear why; I suspect it merely corrects another side effect which tends to favor more literals than it should. This change, all by itself, fixes #2765 (Compression ratio example).
  • Nudge offset codes towards short distances and repeat codes: this one has a small positive effect. I was expecting more impact, but it only improves silesia.tar by a few dozen KB. That might be because, since the total cost includes the supplementary offset bits, the algorithm is probably already biased enough (in the general case) in favor of small offsets and repeat codes. Still, since the impact was consistently positive, this change was kept. It also fixes #2765 all by itself, proving there is more than one way to fix that one. (A sketch of such nudged starting frequencies follows this list.)
  • Revisit the statistics update: this point matters mostly for large files, since it impacts the transport of statistics between blocks. But it also impacts btultra2, which compresses a small block twice, the first time merely to collect statistics for the second pass. The issue is that the general update algorithm was presumed to be invoked once per 128 KB block; on small inputs, it's invoked twice within a much shorter timeframe, with no time to collect a "full block" of statistics. Since the rescaling factor is static, it squashes the statistics the same way whether they are numerous or not. In the new method, the rescaling is a bit more dynamic: it tries to maintain a certain baseline budget between consecutive blocks, and adapts the squashing to the amounts actually observed at runtime. One issue is that there is no "one size fits all": files with homogeneous entropy prefer slow adaptation, hence larger inter-block correlations, while files with heterogeneous sections prefer fast adaptation. I presume a more optimal solution would adapt the update rate depending on how likely the statistics are to be representative of the following section. That's a topic for a full paper; for the time being, the simple heuristic selected in this PR will do (see benchmarks below).
  • Revisit the early literals estimation: when starting a new frame from scratch, the algorithm takes some clues from actual statistics in the sample. It seems this wasn't such a good idea after all. Potential reasons: literals left over from LZ sequences differ substantially from raw bytes directly present in the file; the most recurrent bytes might be present too often and overwhelm the statistics unfairly; and the remaining number of bytes is sometimes too small to be compressed. In all of these cases, literals end up considered cheaper than they actually turn out to be in the final compressed block. Here too, the topic deserves a full revisit, but for this PR, the simple heuristic selected improves compression ratio enough.
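
To make the first two bullets more concrete, here is a minimal sketch of what "nudged" starting frequencies could look like. The offset-code values mirror the baseOFCfreqs table quoted later in this thread; the literal-length values and all identifiers are hypothetical stand-ins, not the actual tables in lib/compress/zstd_opt.c.

```c
#define MAX_LL_CODE  35   /* literal-length codes (assumed to match zstd's MaxLL) */
#define MAX_OFF_CODE 31   /* offset codes (assumed to match zstd's MaxOff) */

/* Hypothetical starting frequencies for literal-length codes:
 * code 0 (no literals before the match) gets a small head start,
 * so early choices lean towards fewer or no literals. */
static void initLitLengthFreqs(unsigned llFreq[MAX_LL_CODE + 1])
{
    int ll;
    for (ll = 0; ll <= MAX_LL_CODE; ll++)
        llFreq[ll] = 1;        /* every code remains reachable */
    llFreq[0] = 4;             /* favor "zero literals" */
    llFreq[1] = 2;             /* then very short literal runs */
}

/* Starting frequencies for offset codes, biased towards repeat codes and
 * short distances (values copied from the baseOFCfreqs table in this PR). */
static void initOffCodeFreqs(unsigned ofFreq[MAX_OFF_CODE + 1])
{
    static const unsigned baseOFCfreqs[MAX_OFF_CODE + 1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
    int of;
    for (of = 0; of <= MAX_OFF_CODE; of++)
        ofFreq[of] = baseOFCfreqs[of];
}
```

Whether starting values like these are close to optimal is exactly what the benchmarks below try to assess.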

So what are the benefits? Let's run some benchmarks:

`silesia.tar`, cut into 2 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 11 | silesia.tar -B2K | 95169105 | 95000131 | -168974 | -0.18% | opt mml=4 |
| 12 | silesia.tar -B2K | 93413153 | 93017601 | -395552 | -0.42% | mml=3 |
| 13 | silesia.tar -B2K | 93108397 | 92691659 | -416738 | -0.45% | ultra |
| 14 | silesia.tar -B2K | 92972963 | 92569885 | -403078 | -0.43% | |
| 15 | silesia.tar -B2K | 92950860 | 92547104 | -403756 | -0.43% | |
| 16 | silesia.tar -B2K | 92628560 | 92401081 | -227479 | -0.25% | ultra2 |
| 17 | silesia.tar -B2K | 92516103 | 92288346 | -227757 | -0.25% | |
| 18 | silesia.tar -B2K | 92513639 | 92285819 | -227820 | -0.25% | |
| 19 | silesia.tar -B2K | 92513546 | 92285727 | -227819 | -0.25% | |
| 20 | silesia.tar -B2K | 92511223 | 92284095 | -227128 | -0.25% | |
| 21 | silesia.tar -B2K | 92510965 | 92283782 | -227183 | -0.25% | |
| 22 | silesia.tar -B2K | 92510917 | 92283698 | -227219 | -0.25% | |

This round is consistently positive. It's even better when small matches of length 3 can be selected (levels 12+). The benefit is less pronounced once the ultra2 levels are reached. The generous interpretation is that ultra2 was already designed to compensate for the "early statistics" problem, so it benefits less from this new round of improvement. Still, it benefits a little, which is almost surprising, and points to a self-reinforcing effect.

OK, but maybe 2 KB is too favorable a scenario, and the new statistics were designed for this one use case? Let's try bigger 16 KB blocks then.

`silesia.tar`, cut into 16 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 11 | silesia.tar -B16K | 75619806 | 75488340 | -131466 | -0.17% | opt mml=4 |
| 12 | silesia.tar -B16K | 73648745 | 73392013 | -256732 | -0.35% | mml=3 |
| 13 | silesia.tar -B16K | 73257726 | 72967672 | -290054 | -0.40% | ultra |
| 14 | silesia.tar -B16K | 73032188 | 72730261 | -301927 | -0.41% | |
| 15 | silesia.tar -B16K | 72968198 | 72663133 | -305065 | -0.42% | |
| 16 | silesia.tar -B16K | 72800467 | 72641294 | -159173 | -0.22% | ultra2 |
| 17 | silesia.tar -B16K | 72577872 | 72413473 | -164399 | -0.23% | |
| 18 | silesia.tar -B16K | 72572119 | 72407890 | -164229 | -0.23% | |
| 19 | silesia.tar -B16K | 72571968 | 72407720 | -164248 | -0.23% | |
| 20 | silesia.tar -B16K | 72572244 | 72409779 | -162465 | -0.22% | |
| 21 | silesia.tar -B16K | 72571829 | 72409500 | -162329 | -0.22% | |
| 22 | silesia.tar -B16K | 72571211 | 72408727 | -162484 | -0.22% | |

Well, with bigger blocks, the impact seems a little less pronounced, but it's still there, proving the initial stats remain useful beyond 2 KB.

OK, fine for small blocks. What about large blocks? Full 128 KB?

`silesia.tar`, cut into 128 KB blocks:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 13 | silesia.tar -B128K | 65656825 | 65611779 | -45046 | -0.07% | opt mml=4 |
| 14 | silesia.tar -B128K | 63675123 | 63462802 | -212321 | -0.33% | mml=3 |
| 15 | silesia.tar -B128K | 63282894 | 63059738 | -223156 | -0.35% | |
| 16 | silesia.tar -B128K | 63023353 | 62799390 | -223963 | -0.36% | ultra |
| 17 | silesia.tar -B128K | 63007738 | 62783036 | -224702 | -0.36% | |
| 18 | silesia.tar -B128K | 63003547 | 62780536 | -223011 | -0.35% | |
| 19 | silesia.tar -B128K | 62798322 | 62743773 | -54549 | -0.09% | ultra2 |
| 20 | silesia.tar -B128K | 62789783 | 62736999 | -52784 | -0.08% | |
| 21 | silesia.tar -B128K | 62785135 | 62739262 | -45873 | -0.07% | |
| 22 | silesia.tar -B128K | 62783801 | 62736795 | -47006 | -0.07% | |

Note that, for these sizes, zstd_btopt only starts at level 13.
Anyway, the benefit is now clearly reduced. But reassuringly, it's still globally positive, so it's still worth it.

Does this change impact full-length files, and if so, how?

`silesia.tar` whole file:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 16 | silesia.tar | 55351684 | 55321640 | -30044 | -0.05% | opt mml=4 |
| 17 | silesia.tar | 54284712 | 54275326 | -9386 | -0.02% | mml=3 |
| 18 | silesia.tar | 53420100 | 53427159 | 7059 | 0.01% | ultra |
| 19 | silesia.tar | 52986703 | 52991722 | 5019 | 0.01% | ultra2 |
| 20 | silesia.tar | 52577641 | 52574471 | -3170 | -0.01% | |
| 21 | silesia.tar | 52481750 | 52480144 | -1606 | 0.00% | |
| 22 | silesia.tar | 52460817 | 52458096 | -2721 | -0.01% | |

Well, barely. This time, the size differences are barely significant, and note that they are not always positive. But whether they are or not doesn't matter much, as the compression ratio is essentially equivalent.

OK, but let's try a different file now. Maybe calgary.tar, which is a collection of small files of different types appended together, resulting in rapidly changing statistics. How does it behave with the new update policy?

`calgary.tar` whole file:

| Level | File | dev | PR | diff | diff % | comment |
|---|---|---|---|---|---|---|
| 16 | calgary.tar | 881311 | 881780 | 469 | 0.05% | opt mml=4 |
| 17 | calgary.tar | 873376 | 874742 | 1366 | 0.16% | mml=3 |
| 18 | calgary.tar | 861665 | 863272 | 1607 | 0.19% | ultra |
| 19 | calgary.tar | 859732 | 860927 | 1195 | 0.14% | ultra2 |
| 20 | calgary.tar | 859732 | 860927 | 1195 | 0.14% | |
| 21 | calgary.tar | 859598 | 860842 | 1244 | 0.14% | |
| 22 | calgary.tar | 859587 | 860789 | 1202 | 0.14% | |

Well, this time it's rather negative, though thankfully by very little.
This is a case where the statistics update would benefit from changing more aggressively, since two consecutive internal files can be very different (text, image, db, etc.). It could be fixed with a faster update rate, but alas, other use cases (like silesia.tar) would then suffer. This is a situation where selecting a "good enough" middle-ground heuristic is all we can do before moving on to some more complex algorithm.

Anyway, an important point: it's not always a win. This change primarily targets small blocks (~2 KB), for which it's generally a win. For larger files, it's less clear-cut, but the impact is also less pronounced.

Cyan4973 and others added 10 commits September 3, 2021 12:51
This is less appropriate for this mode:
benchmark is about accuracy,
it's important to read the exact values.
As a library, the default shouldn't be to write anything on console.
`cover` and `fastcover` have a `g_displayLevel` variable to control this behavior.
It's now set to 0 (no display) by default.
Setting notification to a higher level should be an explicit operation by a console application.
small general compression ratio improvement for btopt+ strategies
used to be necessary to counter-balance the fixed-weight frequency update
which has been recently changed for an adaptive rate (targeting stable starting frequency stats).
better for larger blocks,
very small inefficiency on small blocks.
better for large files, and sources with relatively "stable" entropy,
like silesia.tar.
slightly worse for files with rapidly changing entropy,
like calgary.tar.

Updated small files tests in fuzzer
notably within kernel space
Comment on the following benchmark display lines:

```c
DISPLAYLEVEL(2, "\r%70s\r", ""); /* blank line */
DISPLAYLEVEL(2, "%2s-%-17.17s : %.*f%s -> \r", marks[markNb], displayName, hr_isize.precision, hr_isize.value, hr_isize.suffix);
assert(srcSize < UINT_MAX);
DISPLAYLEVEL(2, "%2s-%-17.17s :%10u -> \r", marks[markNb], displayName, (unsigned)srcSize);
```
Contributor:

With the existing code, you can see the full sizes with -vv. Does that seem insufficient? In general, I like that the output is human-readable.

@Cyan4973 (Author) replied, Sep 8, 2021:

Ah yes, I noticed this capability.
Unfortunately it wasn't enough, as it was breaking scripts that depend on the previous format.
Moreover, and more importantly, I simply believe that this is not the right place for the condensed quantity format. Benchmarking is about accuracy. I typically always need the exact amount, in order to clearly appreciate changes, even when they are small.

The new "human readable" format is more suitable for general information status.

Contributor:

Yeah, I will say that more often than not I have found myself benchmarking something and then having to redo it once I realize I forgot -vv. At least for me, I'd always prefer to have it on.

Contributor:
Makes sense.

by employing parallel compilation of object files.
@Cyan4973 (Author) commented Sep 8, 2021:

The make benchmarking test error is a bit obscure to me. Is it a case of incorrect parsing?

`IndexError: list index out of range`

@terrelln (Contributor) commented Sep 9, 2021:

> Is it a case of incorrect parsing?

It seems so, probably a change to the benchzstd.c output format that is breaking automated_benchmarking.py.

Comment on lines +231 to +234:

```c
optPtr->litSum = ZSTD_scaleStats(optPtr->litFreq, MaxLit, 12);
optPtr->litLengthSum = ZSTD_scaleStats(optPtr->litLengthFreq, MaxLL, 11);
optPtr->matchLengthSum = ZSTD_scaleStats(optPtr->matchLengthFreq, MaxML, 11);
optPtr->offCodeSum = ZSTD_scaleStats(optPtr->offCodeFreq, MaxOff, 11);
```
Contributor:

Should the logTarget maybe be based on the block size?

I would expect smaller blocks to want to have a smaller history.

@Cyan4973 (Author) replied:

The update policy could be refined even further. But note that the topic of "block size" has many sub-cases that probably deserve separate policies.

For example, is the "small block" also the "only block"?
(I presume that's what you meant.)
In that case, the first pass of btultra2 would produce stats which are already scaled to the size of the block. If there are only a few sequences, then the stats will only contain a few elements, way below the logTarget threshold, in which case they won't be up-scaled (stats are only down-scaled when they need to be). So there is a form of built-in adaptation for small inputs in this process.
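
To make the down-scaling behaviour concrete, here is a minimal sketch of the idea, not the actual ZSTD_scaleStats implementation, under the assumption that logTarget is the log2 of the frequency budget a table is allowed to carry over: a table whose total is already under budget is left alone, and only larger histories get squashed.

```c
/* Illustration only: shrink a frequency table so that its total drops to
 * roughly the 2^logTarget budget; tables already under budget are untouched. */
static unsigned scaleStatsSketch(unsigned* freqs, unsigned lastSymbol, unsigned logTarget)
{
    unsigned total = 0, s;
    for (s = 0; s <= lastSymbol; s++) total += freqs[s];

    if (total <= (1u << logTarget))
        return total;                 /* small history: keep it as-is */

    {   /* over budget: divide everything down, keeping seen symbols alive */
        unsigned const factor = total >> logTarget;   /* >= 1 here */
        unsigned newTotal = 0;
        for (s = 0; s <= lastSymbol; s++) {
            if (freqs[s] > 0)
                freqs[s] = 1 + (freqs[s] / (factor + 1));
            newTotal += freqs[s];
        }
        return newTotal;
    }
}
```

Under that assumption, the few sequences collected by a first pass over a small block stay well below the budget, so nothing gets squashed, which is the built-in adaptation described above.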

But if the "small block" is the N-th block in a long stream, and is also expected to be followed by other blocks of variable size, the situation is different. I would guess that a "stream level" logTarget would feel more appropriate.

This can certainly be analyzed even more.
I believe the update policy is one of these places where some non-negligible gains can still be produced.
And my expectation is that a better update policy should be dynamic, reacting to the "probability" that historic statistics may or may not match the statistics of the following block.

As can be guessed though, this is a fairly complex topic, which would require a dedicated (and time-consuming) study.

@terrelln (Contributor) commented Sep 9, 2021:

I wonder if btultra2 would do better if you ran btlazy2 as the first parser instead of the optimal parser. I know we've talked about it, but I don't remember if you've tried it. That would help solve the bad literals pricing, and would maybe help favor shorter literal/match lengths. Although I know that btlazy2 does have more literals than other parsers, due to its laziness.

@Cyan4973 (Author) commented Sep 12, 2021:

> I wonder if btultra2 would do better if you ran btlazy2 as the first parser instead of the optimal parser.

It's a good question. It's probably worth an investigation.

But note that it's not clear if it's obviously better.
I hear the argument that btlazy2 doesn't depend on prior statistics, and therefore will not be pushed one way or another depending on wrong initial statistics.
But then, it introduces other issues. At the very least, I would expect statistics from btlazy2 to be somewhat off compared to those produced by a first btultra pass. For example, btlazy2 doesn't look for (and never selects) matches of length 3. It also only considers rep0, like all lazy parsers. And as you already mentioned, it tends to produce more literals. So all these cases will be incorrectly represented in the final statistics.
It's probably possible to attempt some kind of "blind fix" to partially compensate for these biases, but it won't solve all cases. For example, in #2765, all matches are length 3, so btlazy2 would find no match at all. It would be difficult to extrapolate initial statistics from there.

Extending the topic, I've been considering some form of "light" initial evaluation for btultra, for example greedy. This might help, compared to "default" initial statistics, at the cost of a new dependency. btultra2 would build from there.

As can be guessed, the most important point here is that these investigations cost time. And time is the scarcest resource there is.

This PR doesn't "terminate" this topic; it mainly brings it to light. I'm open to any complementary PR that would improve these initial results.

@Cyan4973 (Author) commented Sep 12, 2021:

> It seems so, probably a change to the benchzstd.c output format that is breaking automated_benchmarking.py.

I guess that's because benchzstd now outputs its regular results to stdout instead of stderr (which is still used for error messages). The Python script seems designed to intercept and parse stderr.

A trivial replacement of stderr with stdout doesn't do the trick though...

edit: indeed, the output is empty, so the attempt at parsing the floating-point value fails. Still, I don't get why it's still empty after replacing stderr with stdout.

edit 2: got it. It's because the script downloads and compares several versions of zstd; some output to stdout and some to stderr, so neither choice is sufficient on its own.
On to a fix.

@Cyan4973 (Author) commented:
All remaining issues fixed

Comment on lines +216 to +221:

```c
{   unsigned const baseOFCfreqs[MaxOff+1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
```
Contributor:
This is the change that makes the majority of the difference.

Successfully merging this pull request may close these issues: Compression ratio example (#2765).