Pipelined Implementation of ZSTD_dfast #2774

Merged: 17 commits, Oct 13, 2021

felixhandte
Contributor

@felixhandte felixhandte commented Sep 9, 2021

This PR takes the ideas from #2749 and applies them to the double-fast implementation.

Description

We start by pulling a single-segment copy out so that we can work on it separately from the DMS implementation.

This implementation makes two changes to how the input is parsed:

  1. Instead of checking ip + 1 when we find a short match, we check ip + step. This is a pretty minimal change to the parsing behavior, since step is almost always 1.
  2. We write back ip + 1 into the hash table even when we take a long match at ip (instead of only in the short match path). It costs us basically nothing to do this because we've already hashed it. This improves compression ratio.

Unlike the fast implementation, whose pipelining includes speculative work that we might throw away, this implementation doesn't do any additional work; it just moves some of it earlier. The crucial observation is that when we do not take a long match at the current position, we are guaranteed to inspect the long-match candidate at the next position, either by taking a short match and then checking the next position for a long match, or by not taking the short match and moving on to the next position. So we can issue that hash-table load earlier.
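
A minimal sketch of the reordering described above, not the code in this PR. It assumes a fixed step of 1, no repcode check, no dictionary, and no match extension; `toy_dfast`, `toy_stats`, `HLOG`, and the hash helpers are hypothetical names used only for illustration. The point to look at is where `idxl1` is loaded: after the long check at `ip` fails, and before the short check, since both remaining paths need it.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HLOG 12              /* hypothetical table size: 2^12 slots per table */
#define NONE UINT32_MAX      /* marks an empty hash-table slot */

static uint64_t rd64(const uint8_t *p) { uint64_t v; memcpy(&v, p, 8); return v; }
static uint32_t rd32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint32_t hashL(const uint8_t *p) { return (uint32_t)((rd64(p) * 0x9E3779B185EBCA87ULL) >> (64 - HLOG)); }
static uint32_t hashS(const uint8_t *p) { return (uint32_t)((rd32(p) * 2654435761U) >> (32 - HLOG)); }

typedef struct { size_t longs, shorts; } toy_stats;

/* Counts 8-byte ("long") and 4-byte ("short") matches found; it does not compress. */
static toy_stats toy_dfast(const uint8_t *src, size_t n)
{
    uint32_t tblL[1 << HLOG], tblS[1 << HLOG];
    toy_stats st = {0, 0};
    size_t ip = 0;
    memset(tblL, 0xFF, sizeof tblL);   /* every slot starts as NONE */
    memset(tblS, 0xFF, sizeof tblS);

    /* Outer loop: one iteration per match. Requires 8 readable bytes at ip + 1. */
    while (ip + 9 <= n) {
        uint32_t hl0 = hashL(src + ip);
        uint32_t idxl0 = tblL[hl0];    /* long candidate for the first position */
        size_t mpos = 0, mlen = 0;

        /* Inner loop: one iteration per position searched. */
        for (;;) {
            uint32_t hs0 = hashS(src + ip);
            uint32_t idxs0 = tblS[hs0];
            uint32_t hl1, idxl1;
            tblL[hl0] = tblS[hs0] = (uint32_t)ip;

            hl1 = hashL(src + ip + 1); /* hashed before the outcome at ip is known */

            if (idxl0 != NONE && rd64(src + idxl0) == rd64(src + ip)) {
                tblL[hl1] = (uint32_t)(ip + 1);  /* change 2: record ip + 1 on a long match too */
                mpos = ip; mlen = 8; st.longs++;
                break;
            }

            /* No long match at ip, so the long candidate at ip + 1 will be
             * inspected on either path below: load it now. */
            idxl1 = tblL[hl1];

            if (idxs0 != NONE && rd32(src + idxs0) == rd32(src + ip)) {
                int long1 = idxl1 != NONE && rd64(src + idxl1) == rd64(src + ip + 1);
                tblL[hl1] = (uint32_t)(ip + 1);
                if (long1) { mpos = ip + 1; mlen = 8; st.longs++; }
                else       { mpos = ip;     mlen = 4; st.shorts++; }
                break;
            }

            /* No match: advance, reusing the hash and candidate already loaded. */
            ip += 1;
            hl0 = hl1;
            idxl0 = idxl1;
            if (ip + 9 > n) return st;
        }
        ip = mpos + mlen;              /* skip past the match */
    }
    return st;
}
```

In this sketch the no-match path carries `hl1` and `idxl1` into the next iteration as `hl0` and `idxl0`, so no position is hashed or looked up twice; that is the sense in which the work is moved earlier rather than added.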

Benchmarks

Silesia Results Table
file        compiler level | speed: base, PR (change) | ratio: base, PR (change)
dickens     gcc-4.8    3 |   99.9  100.3 ( +0.400%) |  2.769  2.779 ( +0.361%)
dickens     gcc-5      3 |   99.1   99.2 ( +0.101%) |  2.769  2.779 ( +0.361%)
dickens     gcc-6      3 |  101.5   99.0 ( -2.463%) |  2.769  2.779 ( +0.361%)
dickens     gcc-7      3 |  101.8   99.7 ( -2.063%) |  2.769  2.779 ( +0.361%)
dickens     gcc-8      3 |   96.5   97.9 ( +1.451%) |  2.769  2.779 ( +0.361%)
dickens     gcc-10     3 |  100.4   99.5 ( -0.896%) |  2.769  2.779 ( +0.361%)
dickens     clang-6.0  3 |  103.7  103.0 ( -0.675%) |  2.769  2.779 ( +0.361%)
dickens     clang-7    3 |  100.4   98.0 ( -2.390%) |  2.769  2.779 ( +0.361%)
dickens     clang-8    3 |  100.7   99.1 ( -1.589%) |  2.769  2.779 ( +0.361%)
dickens     clang-9    3 |  102.3   94.5 ( -7.625%) |  2.769  2.779 ( +0.361%)
dickens     clang-11   3 |  100.8  100.2 ( -0.595%) |  2.769  2.779 ( +0.361%)
dickens     clang-12   3 |  100.7   99.2 ( -1.490%) |  2.769  2.779 ( +0.361%)
dickens     gcc-4.8    4 |  101.9  101.8 ( -0.098%) |  2.827  2.841 ( +0.495%)
dickens     gcc-5      4 |   98.5   95.8 ( -2.741%) |  2.827  2.841 ( +0.495%)
dickens     gcc-6      4 |   98.9   96.9 ( -2.022%) |  2.827  2.841 ( +0.495%)
dickens     gcc-7      4 |   98.8   97.6 ( -1.215%) |  2.827  2.841 ( +0.495%)
dickens     gcc-8      4 |   98.6  101.7 ( +3.144%) |  2.827  2.841 ( +0.495%)
dickens     gcc-10     4 |   95.1  100.7 ( +5.889%) |  2.827  2.841 ( +0.495%)
dickens     clang-6.0  4 |  102.1  100.1 ( -1.959%) |  2.827  2.841 ( +0.495%)
dickens     clang-7    4 |   97.9   97.9 ( +0.000%) |  2.827  2.841 ( +0.495%)
dickens     clang-8    4 |  100.8   98.6 ( -2.183%) |  2.827  2.841 ( +0.495%)
dickens     clang-9    4 |   96.8   97.3 ( +0.517%) |  2.827  2.841 ( +0.495%)
dickens     clang-11   4 |  100.0   97.1 ( -2.900%) |  2.827  2.841 ( +0.495%)
dickens     clang-12   4 |   98.9   98.1 ( -0.809%) |  2.827  2.841 ( +0.495%)
enwik8      gcc-4.8    3 |  109.3  110.5 ( +1.098%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-5      3 |  104.8  106.3 ( +1.431%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-6      3 |  107.1  106.2 ( -0.840%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-7      3 |  105.0  106.1 ( +1.048%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-8      3 |  105.7  108.4 ( +2.554%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-10     3 |  105.1  108.1 ( +2.854%) |  2.809  2.820 ( +0.392%)
enwik8      clang-6.0  3 |  111.7  111.1 ( -0.537%) |  2.809  2.820 ( +0.392%)
enwik8      clang-7    3 |  107.6  109.6 ( +1.859%) |  2.809  2.820 ( +0.392%)
enwik8      clang-8    3 |  109.2  110.2 ( +0.916%) |  2.809  2.820 ( +0.392%)
enwik8      clang-9    3 |  109.0  109.0 ( +0.000%) |  2.809  2.820 ( +0.392%)
enwik8      clang-11   3 |  112.5  111.8 ( -0.622%) |  2.809  2.820 ( +0.392%)
enwik8      clang-12   3 |  107.5  109.9 ( +2.233%) |  2.809  2.820 ( +0.392%)
enwik8      gcc-4.8    4 |  103.1  106.5 ( +3.298%) |  2.864  2.877 ( +0.454%)
enwik8      gcc-5      4 |  100.3   99.7 ( -0.598%) |  2.864  2.877 ( +0.454%)
enwik8      gcc-6      4 |  101.9  104.4 ( +2.453%) |  2.864  2.877 ( +0.454%)
enwik8      gcc-7      4 |  102.8  100.8 ( -1.946%) |  2.864  2.877 ( +0.454%)
enwik8      gcc-8      4 |   99.7  100.9 ( +1.204%) |  2.864  2.877 ( +0.454%)
enwik8      gcc-10     4 |   99.3  104.3 ( +5.035%) |  2.864  2.877 ( +0.454%)
enwik8      clang-6.0  4 |  106.5  105.7 ( -0.751%) |  2.864  2.877 ( +0.454%)
enwik8      clang-7    4 |  102.4  103.5 ( +1.074%) |  2.864  2.877 ( +0.454%)
enwik8      clang-8    4 |  104.2  104.6 ( +0.384%) |  2.864  2.877 ( +0.454%)
enwik8      clang-9    4 |  102.4  106.4 ( +3.906%) |  2.864  2.877 ( +0.454%)
enwik8      clang-11   4 |  106.8  103.6 ( -2.996%) |  2.864  2.877 ( +0.454%)
enwik8      clang-12   4 |  105.2  104.1 ( -1.046%) |  2.864  2.877 ( +0.454%)
enwik9      gcc-4.8    3 |  121.7  117.3 ( -3.615%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-5      3 |  113.0  122.1 ( +8.053%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-6      3 |  120.3  123.2 ( +2.411%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-7      3 |  123.3  125.4 ( +1.703%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-8      3 |  124.2  121.7 ( -2.013%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-10     3 |  123.2  125.1 ( +1.542%) |  3.191  3.203 ( +0.376%)
enwik9      clang-6.0  3 |  124.8  125.1 ( +0.240%) |  3.191  3.203 ( +0.376%)
enwik9      clang-7    3 |  119.1  123.5 ( +3.694%) |  3.191  3.203 ( +0.376%)
enwik9      clang-8    3 |  119.5  119.2 ( -0.251%) |  3.191  3.203 ( +0.376%)
enwik9      clang-9    3 |  120.0  120.2 ( +0.167%) |  3.191  3.203 ( +0.376%)
enwik9      clang-11   3 |  123.0  124.6 ( +1.301%) |  3.191  3.203 ( +0.376%)
enwik9      clang-12   3 |  120.8  123.8 ( +2.483%) |  3.191  3.203 ( +0.376%)
enwik9      gcc-4.8    4 |  114.4  111.9 ( -2.185%) |  3.253  3.267 ( +0.430%)
enwik9      gcc-5      4 |  110.1  115.8 ( +5.177%) |  3.253  3.267 ( +0.430%)
enwik9      gcc-6      4 |  115.5  114.4 ( -0.952%) |  3.253  3.267 ( +0.430%)
enwik9      gcc-7      4 |  117.8  117.3 ( -0.424%) |  3.253  3.267 ( +0.430%)
enwik9      gcc-8      4 |  113.5  116.7 ( +2.819%) |  3.253  3.267 ( +0.430%)
enwik9      gcc-10     4 |  111.1  119.1 ( +7.201%) |  3.253  3.267 ( +0.430%)
enwik9      clang-6.0  4 |  117.7  120.0 ( +1.954%) |  3.253  3.267 ( +0.430%)
enwik9      clang-7    4 |  114.7  115.7 ( +0.872%) |  3.253  3.267 ( +0.430%)
enwik9      clang-8    4 |  113.8  118.9 ( +4.482%) |  3.253  3.267 ( +0.430%)
enwik9      clang-9    4 |  116.2  117.2 ( +0.861%) |  3.253  3.267 ( +0.430%)
enwik9      clang-11   4 |  117.1  118.4 ( +1.110%) |  3.253  3.267 ( +0.430%)
enwik9      clang-12   4 |  110.6  114.9 ( +3.888%) |  3.253  3.267 ( +0.430%)
mozilla     gcc-4.8    3 |  148.5  152.1 ( +2.424%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-5      3 |  147.3  152.5 ( +3.530%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-6      3 |  145.2  151.6 ( +4.408%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-7      3 |  149.7  154.8 ( +3.407%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-8      3 |  150.3  152.4 ( +1.397%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-10     3 |  147.5  154.4 ( +4.678%) |  2.768  2.771 ( +0.108%)
mozilla     clang-6.0  3 |  156.4  150.0 ( -4.092%) |  2.768  2.771 ( +0.108%)
mozilla     clang-7    3 |  147.5  153.0 ( +3.729%) |  2.768  2.771 ( +0.108%)
mozilla     clang-8    3 |  145.8  153.8 ( +5.487%) |  2.768  2.771 ( +0.108%)
mozilla     clang-9    3 |  151.1  149.0 ( -1.390%) |  2.768  2.771 ( +0.108%)
mozilla     clang-11   3 |  146.5  152.5 ( +4.096%) |  2.768  2.771 ( +0.108%)
mozilla     clang-12   3 |  145.4  151.8 ( +4.402%) |  2.768  2.771 ( +0.108%)
mozilla     gcc-4.8    4 |  136.0  139.3 ( +2.426%) |  2.798  2.801 ( +0.107%)
mozilla     gcc-5      4 |  135.2  140.3 ( +3.772%) |  2.798  2.801 ( +0.107%)
mozilla     gcc-6      4 |  129.5  139.3 ( +7.568%) |  2.798  2.801 ( +0.107%)
mozilla     gcc-7      4 |  140.2  142.4 ( +1.569%) |  2.798  2.801 ( +0.107%)
mozilla     gcc-8      4 |  135.2  140.9 ( +4.216%) |  2.798  2.801 ( +0.107%)
mozilla     gcc-10     4 |  137.0  141.9 ( +3.577%) |  2.798  2.801 ( +0.107%)
mozilla     clang-6.0  4 |  140.9  137.1 ( -2.697%) |  2.798  2.801 ( +0.107%)
mozilla     clang-7    4 |  133.5  139.8 ( +4.719%) |  2.798  2.801 ( +0.107%)
mozilla     clang-8    4 |  136.9  139.5 ( +1.899%) |  2.798  2.801 ( +0.107%)
mozilla     clang-9    4 |  137.0  137.6 ( +0.438%) |  2.798  2.801 ( +0.107%)
mozilla     clang-11   4 |  138.7  139.5 ( +0.577%) |  2.798  2.801 ( +0.107%)
mozilla     clang-12   4 |  136.6  144.4 ( +5.710%) |  2.798  2.801 ( +0.107%)
mr          gcc-4.8    3 |  117.3  115.8 ( -1.279%) |  2.811  2.810 ( -0.036%)
mr          gcc-5      3 |  115.4  117.5 ( +1.820%) |  2.811  2.810 ( -0.036%)
mr          gcc-6      3 |  118.3  115.9 ( -2.029%) |  2.811  2.810 ( -0.036%)
mr          gcc-7      3 |  120.4  119.2 ( -0.997%) |  2.811  2.810 ( -0.036%)
mr          gcc-8      3 |  118.4  119.6 ( +1.014%) |  2.811  2.810 ( -0.036%)
mr          gcc-10     3 |  116.3  116.2 ( -0.086%) |  2.811  2.810 ( -0.036%)
mr          clang-6.0  3 |  119.6  115.1 ( -3.763%) |  2.811  2.810 ( -0.036%)
mr          clang-7    3 |  112.6  115.0 ( +2.131%) |  2.811  2.810 ( -0.036%)
mr          clang-8    3 |  114.0  117.1 ( +2.719%) |  2.811  2.810 ( -0.036%)
mr          clang-9    3 |  114.6  114.1 ( -0.436%) |  2.811  2.810 ( -0.036%)
mr          clang-11   3 |  109.2  114.2 ( +4.579%) |  2.811  2.810 ( -0.036%)
mr          clang-12   3 |  115.1  114.2 ( -0.782%) |  2.811  2.810 ( -0.036%)
mr          gcc-4.8    4 |  111.8  108.7 ( -2.773%) |  2.861  2.859 ( -0.070%)
mr          gcc-5      4 |  114.0  112.5 ( -1.316%) |  2.861  2.859 ( -0.070%)
mr          gcc-6      4 |  110.2  114.1 ( +3.539%) |  2.861  2.859 ( -0.070%)
mr          gcc-7      4 |  111.2  110.0 ( -1.079%) |  2.861  2.859 ( -0.070%)
mr          gcc-8      4 |  110.8  115.3 ( +4.061%) |  2.861  2.859 ( -0.070%)
mr          gcc-10     4 |  109.4  107.0 ( -2.194%) |  2.861  2.859 ( -0.070%)
mr          clang-6.0  4 |  115.7  109.3 ( -5.532%) |  2.861  2.859 ( -0.070%)
mr          clang-7    4 |  108.9  109.8 ( +0.826%) |  2.861  2.859 ( -0.070%)
mr          clang-8    4 |  110.1  108.4 ( -1.544%) |  2.861  2.859 ( -0.070%)
mr          clang-9    4 |  109.0  108.3 ( -0.642%) |  2.861  2.859 ( -0.070%)
mr          clang-11   4 |  115.2  107.4 ( -6.771%) |  2.861  2.859 ( -0.070%)
mr          clang-12   4 |  112.0  111.9 ( -0.089%) |  2.861  2.859 ( -0.070%)
nci         gcc-4.8    3 |  419.4  412.4 ( -1.669%) | 11.740 11.800 ( +0.511%)
nci         gcc-5      3 |  409.7  415.5 ( +1.416%) | 11.740 11.800 ( +0.511%)
nci         gcc-6      3 |  413.8  415.2 ( +0.338%) | 11.740 11.800 ( +0.511%)
nci         gcc-7      3 |  417.2  413.6 ( -0.863%) | 11.740 11.800 ( +0.511%)
nci         gcc-8      3 |  410.4  413.1 ( +0.658%) | 11.740 11.800 ( +0.511%)
nci         gcc-10     3 |  416.2  408.7 ( -1.802%) | 11.740 11.800 ( +0.511%)
nci         clang-6.0  3 |  424.2  399.5 ( -5.823%) | 11.740 11.800 ( +0.511%)
nci         clang-7    3 |  419.5  422.3 ( +0.667%) | 11.740 11.800 ( +0.511%)
nci         clang-8    3 |  433.3  413.4 ( -4.593%) | 11.740 11.800 ( +0.511%)
nci         clang-9    3 |  433.1  424.2 ( -2.055%) | 11.740 11.800 ( +0.511%)
nci         clang-11   3 |  438.4  412.1 ( -5.999%) | 11.740 11.800 ( +0.511%)
nci         clang-12   3 |  426.4  423.3 ( -0.727%) | 11.740 11.800 ( +0.511%)
nci         gcc-4.8    4 |  423.8  424.0 ( +0.047%) | 11.750 11.800 ( +0.426%)
nci         gcc-5      4 |  420.2  422.8 ( +0.619%) | 11.750 11.800 ( +0.426%)
nci         gcc-6      4 |  389.7  409.8 ( +5.158%) | 11.750 11.800 ( +0.426%)
nci         gcc-7      4 |  425.4  421.3 ( -0.964%) | 11.750 11.800 ( +0.426%)
nci         gcc-8      4 |  418.1  421.9 ( +0.909%) | 11.750 11.800 ( +0.426%)
nci         gcc-10     4 |  425.0  418.0 ( -1.647%) | 11.750 11.800 ( +0.426%)
nci         clang-6.0  4 |  435.3  394.6 ( -9.350%) | 11.750 11.800 ( +0.426%)
nci         clang-7    4 |  427.3  424.4 ( -0.679%) | 11.750 11.800 ( +0.426%)
nci         clang-8    4 |  444.6  417.7 ( -6.050%) | 11.750 11.800 ( +0.426%)
nci         clang-9    4 |  441.1  419.2 ( -4.965%) | 11.750 11.800 ( +0.426%)
nci         clang-11   4 |  437.2  419.2 ( -4.117%) | 11.750 11.800 ( +0.426%)
nci         clang-12   4 |  426.8  406.0 ( -4.873%) | 11.750 11.800 ( +0.426%)
ooffice     gcc-4.8    3 |   99.6  107.4 ( +7.831%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-5      3 |   98.6  108.6 (+10.142%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-6      3 |  101.1  103.9 ( +2.770%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-7      3 |  101.9  108.6 ( +6.575%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-8      3 |  100.3  107.8 ( +7.478%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-10     3 |  101.0  109.6 ( +8.515%) |  1.956  1.957 ( +0.051%)
ooffice     clang-6.0  3 |  108.1  107.8 ( -0.278%) |  1.956  1.957 ( +0.051%)
ooffice     clang-7    3 |   93.5  105.5 (+12.834%) |  1.956  1.957 ( +0.051%)
ooffice     clang-8    3 |   92.6  104.3 (+12.635%) |  1.956  1.957 ( +0.051%)
ooffice     clang-9    3 |   97.0  107.5 (+10.825%) |  1.956  1.957 ( +0.051%)
ooffice     clang-11   3 |   96.0  108.0 (+12.500%) |  1.956  1.957 ( +0.051%)
ooffice     clang-12   3 |   98.3  104.4 ( +6.205%) |  1.956  1.957 ( +0.051%)
ooffice     gcc-4.8    4 |   94.2   98.5 ( +4.565%) |  2.003  2.004 ( +0.050%)
ooffice     gcc-5      4 |   94.2   98.6 ( +4.671%) |  2.003  2.004 ( +0.050%)
ooffice     gcc-6      4 |   93.1   96.2 ( +3.330%) |  2.003  2.004 ( +0.050%)
ooffice     gcc-7      4 |   92.8   99.7 ( +7.435%) |  2.003  2.004 ( +0.050%)
ooffice     gcc-8      4 |   90.9   98.5 ( +8.361%) |  2.003  2.004 ( +0.050%)
ooffice     gcc-10     4 |   94.1   99.6 ( +5.845%) |  2.003  2.004 ( +0.050%)
ooffice     clang-6.0  4 |   98.0   99.4 ( +1.429%) |  2.003  2.004 ( +0.050%)
ooffice     clang-7    4 |   89.1   97.7 ( +9.652%) |  2.003  2.004 ( +0.050%)
ooffice     clang-8    4 |   90.8   94.9 ( +4.515%) |  2.003  2.004 ( +0.050%)
ooffice     clang-9    4 |   90.2   97.4 ( +7.982%) |  2.003  2.004 ( +0.050%)
ooffice     clang-11   4 |   91.6   99.3 ( +8.406%) |  2.003  2.004 ( +0.050%)
ooffice     clang-12   4 |   91.0  101.3 (+11.319%) |  2.003  2.004 ( +0.050%)
osdb        gcc-4.8    3 |  142.6  145.0 ( +1.683%) |  2.867  2.876 ( +0.314%)
osdb        gcc-5      3 |  134.2  148.3 (+10.507%) |  2.867  2.876 ( +0.314%)
osdb        gcc-6      3 |  140.9  145.1 ( +2.981%) |  2.867  2.876 ( +0.314%)
osdb        gcc-7      3 |  138.7  142.1 ( +2.451%) |  2.867  2.876 ( +0.314%)
osdb        gcc-8      3 |  136.4  143.4 ( +5.132%) |  2.867  2.876 ( +0.314%)
osdb        gcc-10     3 |  136.8  145.7 ( +6.506%) |  2.867  2.876 ( +0.314%)
osdb        clang-6.0  3 |  141.3  145.4 ( +2.902%) |  2.867  2.876 ( +0.314%)
osdb        clang-7    3 |  137.9  150.4 ( +9.065%) |  2.867  2.876 ( +0.314%)
osdb        clang-8    3 |  132.5  147.7 (+11.472%) |  2.867  2.876 ( +0.314%)
osdb        clang-9    3 |  135.6  139.3 ( +2.729%) |  2.867  2.876 ( +0.314%)
osdb        clang-11   3 |  134.9  151.0 (+11.935%) |  2.867  2.876 ( +0.314%)
osdb        clang-12   3 |  129.2  141.1 ( +9.211%) |  2.867  2.876 ( +0.314%)
osdb        gcc-4.8    4 |  127.3  132.4 ( +4.006%) |  2.885  2.895 ( +0.347%)
osdb        gcc-5      4 |  123.3  135.7 (+10.057%) |  2.885  2.895 ( +0.347%)
osdb        gcc-6      4 |  124.5  133.6 ( +7.309%) |  2.885  2.895 ( +0.347%)
osdb        gcc-7      4 |  125.1  133.7 ( +6.875%) |  2.885  2.895 ( +0.347%)
osdb        gcc-8      4 |  121.4  136.8 (+12.685%) |  2.885  2.895 ( +0.347%)
osdb        gcc-10     4 |  124.8  142.6 (+14.263%) |  2.885  2.895 ( +0.347%)
osdb        clang-6.0  4 |  132.6  135.4 ( +2.112%) |  2.885  2.895 ( +0.347%)
osdb        clang-7    4 |  129.4  134.9 ( +4.250%) |  2.885  2.895 ( +0.347%)
osdb        clang-8    4 |  130.9  135.2 ( +3.285%) |  2.885  2.895 ( +0.347%)
osdb        clang-9    4 |  120.0  132.5 (+10.417%) |  2.885  2.895 ( +0.347%)
osdb        clang-11   4 |  129.3  138.6 ( +7.193%) |  2.885  2.895 ( +0.347%)
osdb        clang-12   4 |  122.2  131.8 ( +7.856%) |  2.885  2.895 ( +0.347%)
reymont     gcc-4.8    3 |  127.7  117.1 ( -8.301%) |  3.392  3.413 ( +0.619%)
reymont     gcc-5      3 |  123.5  124.1 ( +0.486%) |  3.392  3.413 ( +0.619%)
reymont     gcc-6      3 |  130.6  131.0 ( +0.306%) |  3.392  3.413 ( +0.619%)
reymont     gcc-7      3 |  127.5  129.9 ( +1.882%) |  3.392  3.413 ( +0.619%)
reymont     gcc-8      3 |  127.1  122.1 ( -3.934%) |  3.392  3.413 ( +0.619%)
reymont     gcc-10     3 |  124.3  126.0 ( +1.368%) |  3.392  3.413 ( +0.619%)
reymont     clang-6.0  3 |  127.6  127.6 ( +0.000%) |  3.392  3.413 ( +0.619%)
reymont     clang-7    3 |  125.3  126.8 ( +1.197%) |  3.392  3.413 ( +0.619%)
reymont     clang-8    3 |  127.1  126.7 ( -0.315%) |  3.392  3.413 ( +0.619%)
reymont     clang-9    3 |  126.1  124.5 ( -1.269%) |  3.392  3.413 ( +0.619%)
reymont     clang-11   3 |  124.5  125.5 ( +0.803%) |  3.392  3.413 ( +0.619%)
reymont     clang-12   3 |  122.8  125.9 ( +2.524%) |  3.392  3.413 ( +0.619%)
reymont     gcc-4.8    4 |  127.7  119.0 ( -6.813%) |  3.429  3.453 ( +0.700%)
reymont     gcc-5      4 |  123.6  125.6 ( +1.618%) |  3.429  3.453 ( +0.700%)
reymont     gcc-6      4 |  128.9  135.8 ( +5.353%) |  3.429  3.453 ( +0.700%)
reymont     gcc-7      4 |  128.7  130.0 ( +1.010%) |  3.429  3.453 ( +0.700%)
reymont     gcc-8      4 |  133.3  119.9 (-10.053%) |  3.429  3.453 ( +0.700%)
reymont     gcc-10     4 |  124.7  124.4 ( -0.241%) |  3.429  3.453 ( +0.700%)
reymont     clang-6.0  4 |  130.1  129.6 ( -0.384%) |  3.429  3.453 ( +0.700%)
reymont     clang-7    4 |  128.6  126.2 ( -1.866%) |  3.429  3.453 ( +0.700%)
reymont     clang-8    4 |  129.0  127.8 ( -0.930%) |  3.429  3.453 ( +0.700%)
reymont     clang-9    4 |  129.6  122.3 ( -5.633%) |  3.429  3.453 ( +0.700%)
reymont     clang-11   4 |  127.9  127.1 ( -0.625%) |  3.429  3.453 ( +0.700%)
reymont     clang-12   4 |  125.7  126.8 ( +0.875%) |  3.429  3.453 ( +0.700%)
samba       gcc-4.8    3 |  201.9  206.8 ( +2.427%) |  4.320  4.342 ( +0.509%)
samba       gcc-5      3 |  201.4  211.6 ( +5.065%) |  4.320  4.342 ( +0.509%)
samba       gcc-6      3 |  205.6  208.4 ( +1.362%) |  4.320  4.342 ( +0.509%)
samba       gcc-7      3 |  205.3  205.6 ( +0.146%) |  4.320  4.342 ( +0.509%)
samba       gcc-8      3 |  205.7  210.4 ( +2.285%) |  4.320  4.342 ( +0.509%)
samba       gcc-10     3 |  204.8  202.8 ( -0.977%) |  4.320  4.342 ( +0.509%)
samba       clang-6.0  3 |  209.9  201.9 ( -3.811%) |  4.320  4.342 ( +0.509%)
samba       clang-7    3 |  201.3  207.8 ( +3.229%) |  4.320  4.342 ( +0.509%)
samba       clang-8    3 |  196.2  200.8 ( +2.345%) |  4.320  4.342 ( +0.509%)
samba       clang-9    3 |  200.8  204.5 ( +1.843%) |  4.320  4.342 ( +0.509%)
samba       clang-11   3 |  202.7  207.8 ( +2.516%) |  4.320  4.342 ( +0.509%)
samba       clang-12   3 |  202.0  200.6 ( -0.693%) |  4.320  4.342 ( +0.509%)
samba       gcc-4.8    4 |  194.5  200.7 ( +3.188%) |  4.349  4.373 ( +0.552%)
samba       gcc-5      4 |  190.0  206.1 ( +8.474%) |  4.349  4.373 ( +0.552%)
samba       gcc-6      4 |  198.1  193.5 ( -2.322%) |  4.349  4.373 ( +0.552%)
samba       gcc-7      4 |  199.3  187.9 ( -5.720%) |  4.349  4.373 ( +0.552%)
samba       gcc-8      4 |  190.6  192.5 ( +0.997%) |  4.349  4.373 ( +0.552%)
samba       gcc-10     4 |  194.9  193.9 ( -0.513%) |  4.349  4.373 ( +0.552%)
samba       clang-6.0  4 |  196.0  188.5 ( -3.827%) |  4.349  4.373 ( +0.552%)
samba       clang-7    4 |  188.4  196.8 ( +4.459%) |  4.349  4.373 ( +0.552%)
samba       clang-8    4 |  180.4  192.2 ( +6.541%) |  4.349  4.373 ( +0.552%)
samba       clang-9    4 |  195.8  194.1 ( -0.868%) |  4.349  4.373 ( +0.552%)
samba       clang-11   4 |  196.2  195.2 ( -0.510%) |  4.349  4.373 ( +0.552%)
samba       clang-12   4 |  192.1  195.2 ( +1.614%) |  4.349  4.373 ( +0.552%)
sao         gcc-4.8    3 |   75.3   84.7 (+12.483%) |  1.306  1.306 ( +0.000%)
sao         gcc-5      3 |   76.7   86.6 (+12.907%) |  1.306  1.306 ( +0.000%)
sao         gcc-6      3 |   76.4   81.4 ( +6.545%) |  1.306  1.306 ( +0.000%)
sao         gcc-7      3 |   73.7   85.8 (+16.418%) |  1.306  1.306 ( +0.000%)
sao         gcc-8      3 |   74.8   81.2 ( +8.556%) |  1.306  1.306 ( +0.000%)
sao         gcc-10     3 |   73.8   78.6 ( +6.504%) |  1.306  1.306 ( +0.000%)
sao         clang-6.0  3 |   81.4   84.4 ( +3.686%) |  1.306  1.306 ( +0.000%)
sao         clang-7    3 |   71.7   84.7 (+18.131%) |  1.306  1.306 ( +0.000%)
sao         clang-8    3 |   71.3   83.1 (+16.550%) |  1.306  1.306 ( +0.000%)
sao         clang-9    3 |   72.5   84.0 (+15.862%) |  1.306  1.306 ( +0.000%)
sao         clang-11   3 |   73.4   86.9 (+18.392%) |  1.306  1.306 ( +0.000%)
sao         clang-12   3 |   72.9   85.9 (+17.833%) |  1.306  1.306 ( +0.000%)
sao         gcc-4.8    4 |   69.7   77.9 (+11.765%) |  1.337  1.337 ( +0.000%)
sao         gcc-5      4 |   69.6   77.4 (+11.207%) |  1.337  1.337 ( +0.000%)
sao         gcc-6      4 |   70.5   74.8 ( +6.099%) |  1.337  1.337 ( +0.000%)
sao         gcc-7      4 |   68.7   75.1 ( +9.316%) |  1.337  1.337 ( +0.000%)
sao         gcc-8      4 |   69.0   74.3 ( +7.681%) |  1.337  1.337 ( +0.000%)
sao         gcc-10     4 |   66.8   71.8 ( +7.485%) |  1.337  1.337 ( +0.000%)
sao         clang-6.0  4 |   73.8   74.7 ( +1.220%) |  1.337  1.337 ( +0.000%)
sao         clang-7    4 |   65.4   74.8 (+14.373%) |  1.337  1.337 ( +0.000%)
sao         clang-8    4 |   64.8   73.0 (+12.654%) |  1.337  1.337 ( +0.000%)
sao         clang-9    4 |   62.0   77.4 (+24.839%) |  1.337  1.337 ( +0.000%)
sao         clang-11   4 |   67.7   77.3 (+14.180%) |  1.337  1.337 ( +0.000%)
sao         clang-12   4 |   68.2   76.0 (+11.437%) |  1.337  1.337 ( +0.000%)
webster     gcc-4.8    3 |  126.6  126.8 ( +0.158%) |  3.403  3.420 ( +0.500%)
webster     gcc-5      3 |  117.1  124.2 ( +6.063%) |  3.403  3.420 ( +0.500%)
webster     gcc-6      3 |  122.9  124.0 ( +0.895%) |  3.403  3.420 ( +0.500%)
webster     gcc-7      3 |  126.3  125.5 ( -0.633%) |  3.403  3.420 ( +0.500%)
webster     gcc-8      3 |  127.6  124.4 ( -2.508%) |  3.403  3.420 ( +0.500%)
webster     gcc-10     3 |  123.5  124.7 ( +0.972%) |  3.403  3.420 ( +0.500%)
webster     clang-6.0  3 |  128.3  122.5 ( -4.521%) |  3.403  3.420 ( +0.500%)
webster     clang-7    3 |  124.3  125.2 ( +0.724%) |  3.403  3.420 ( +0.500%)
webster     clang-8    3 |  121.0  120.2 ( -0.661%) |  3.403  3.420 ( +0.500%)
webster     clang-9    3 |  123.1  122.3 ( -0.650%) |  3.403  3.420 ( +0.500%)
webster     clang-11   3 |  120.7  118.8 ( -1.574%) |  3.403  3.420 ( +0.500%)
webster     clang-12   3 |  117.9  122.1 ( +3.562%) |  3.403  3.420 ( +0.500%)
webster     gcc-4.8    4 |  124.4  122.4 ( -1.608%) |  3.455  3.475 ( +0.579%)
webster     gcc-5      4 |  114.5  121.4 ( +6.026%) |  3.455  3.475 ( +0.579%)
webster     gcc-6      4 |  118.2  118.1 ( -0.085%) |  3.455  3.475 ( +0.579%)
webster     gcc-7      4 |  120.9  119.8 ( -0.910%) |  3.455  3.475 ( +0.579%)
webster     gcc-8      4 |  121.0  122.4 ( +1.157%) |  3.455  3.475 ( +0.579%)
webster     gcc-10     4 |  124.3  119.6 ( -3.781%) |  3.455  3.475 ( +0.579%)
webster     clang-6.0  4 |  124.6  120.1 ( -3.612%) |  3.455  3.475 ( +0.579%)
webster     clang-7    4 |  122.6  119.1 ( -2.855%) |  3.455  3.475 ( +0.579%)
webster     clang-8    4 |  120.2  118.0 ( -1.830%) |  3.455  3.475 ( +0.579%)
webster     clang-9    4 |  118.9  121.1 ( +1.850%) |  3.455  3.475 ( +0.579%)
webster     clang-11   4 |  116.2  120.7 ( +3.873%) |  3.455  3.475 ( +0.579%)
webster     clang-12   4 |  121.4  121.2 ( -0.165%) |  3.455  3.475 ( +0.579%)
xml         gcc-4.8    3 |  313.2  308.7 ( -1.437%) |  8.357  8.363 ( +0.072%)
xml         gcc-5      3 |  306.9  313.1 ( +2.020%) |  8.357  8.363 ( +0.072%)
xml         gcc-6      3 |  300.0  304.7 ( +1.567%) |  8.357  8.363 ( +0.072%)
xml         gcc-7      3 |  313.5  312.9 ( -0.191%) |  8.357  8.363 ( +0.072%)
xml         gcc-8      3 |  315.3  315.4 ( +0.032%) |  8.357  8.363 ( +0.072%)
xml         gcc-10     3 |  308.6  310.1 ( +0.486%) |  8.357  8.363 ( +0.072%)
xml         clang-6.0  3 |  318.1  311.3 ( -2.138%) |  8.357  8.363 ( +0.072%)
xml         clang-7    3 |  309.8  308.6 ( -0.387%) |  8.357  8.363 ( +0.072%)
xml         clang-8    3 |  311.6  310.6 ( -0.321%) |  8.357  8.363 ( +0.072%)
xml         clang-9    3 |  313.1  312.9 ( -0.064%) |  8.357  8.363 ( +0.072%)
xml         clang-11   3 |  321.2  318.7 ( -0.778%) |  8.357  8.363 ( +0.072%)
xml         clang-12   3 |  315.9  315.1 ( -0.253%) |  8.357  8.363 ( +0.072%)
xml         gcc-4.8    4 |  313.2  317.0 ( +1.213%) |  8.384  8.390 ( +0.072%)
xml         gcc-5      4 |  305.4  313.7 ( +2.718%) |  8.384  8.390 ( +0.072%)
xml         gcc-6      4 |  292.7  311.8 ( +6.525%) |  8.384  8.390 ( +0.072%)
xml         gcc-7      4 |  310.0  312.3 ( +0.742%) |  8.384  8.390 ( +0.072%)
xml         gcc-8      4 |  316.9  313.2 ( -1.168%) |  8.384  8.390 ( +0.072%)
xml         gcc-10     4 |  310.3  308.9 ( -0.451%) |  8.384  8.390 ( +0.072%)
xml         clang-6.0  4 |  319.9  315.7 ( -1.313%) |  8.384  8.390 ( +0.072%)
xml         clang-7    4 |  310.3  309.6 ( -0.226%) |  8.384  8.390 ( +0.072%)
xml         clang-8    4 |  316.3  292.8 ( -7.430%) |  8.384  8.390 ( +0.072%)
xml         clang-9    4 |  319.4  317.2 ( -0.689%) |  8.384  8.390 ( +0.072%)
xml         clang-11   4 |  331.8  317.2 ( -4.400%) |  8.384  8.390 ( +0.072%)
xml         clang-12   4 |  315.3  319.7 ( +1.395%) |  8.384  8.390 ( +0.072%)
x-ray       gcc-4.8    3 |   71.3   77.0 ( +7.994%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-5      3 |   69.8   77.5 (+11.032%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-6      3 |   70.6   77.1 ( +9.207%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-7      3 |   68.6   75.6 (+10.204%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-8      3 |   67.4   74.4 (+10.386%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-10     3 |   68.6   72.3 ( +5.394%) |  1.393  1.393 ( +0.000%)
x-ray       clang-6.0  3 |   75.7   76.3 ( +0.793%) |  1.393  1.393 ( +0.000%)
x-ray       clang-7    3 |   64.9   79.3 (+22.188%) |  1.393  1.393 ( +0.000%)
x-ray       clang-8    3 |   70.9   77.4 ( +9.168%) |  1.393  1.393 ( +0.000%)
x-ray       clang-9    3 |   66.1   74.8 (+13.162%) |  1.393  1.393 ( +0.000%)
x-ray       clang-11   3 |   68.2   75.3 (+10.411%) |  1.393  1.393 ( +0.000%)
x-ray       clang-12   3 |   66.3   76.8 (+15.837%) |  1.393  1.393 ( +0.000%)
x-ray       gcc-4.8    4 |   65.8   68.4 ( +3.951%) |  1.484  1.484 ( +0.000%)
x-ray       gcc-5      4 |   65.2   68.0 ( +4.294%) |  1.484  1.484 ( +0.000%)
x-ray       gcc-6      4 |   64.9   67.0 ( +3.236%) |  1.484  1.484 ( +0.000%)
x-ray       gcc-7      4 |   62.0   64.9 ( +4.677%) |  1.484  1.484 ( +0.000%)
x-ray       gcc-8      4 |   62.4   66.3 ( +6.250%) |  1.484  1.484 ( +0.000%)
x-ray       gcc-10     4 |   62.7   67.0 ( +6.858%) |  1.484  1.484 ( +0.000%)
x-ray       clang-6.0  4 |   67.9   65.3 ( -3.829%) |  1.484  1.484 ( +0.000%)
x-ray       clang-7    4 |   64.5   69.0 ( +6.977%) |  1.484  1.484 ( +0.000%)
x-ray       clang-8    4 |   66.1   70.8 ( +7.110%) |  1.484  1.484 ( +0.000%)
x-ray       clang-9    4 |   61.9   67.7 ( +9.370%) |  1.484  1.484 ( +0.000%)
x-ray       clang-11   4 |   64.7   67.4 ( +4.173%) |  1.484  1.484 ( +0.000%)
x-ray       clang-12   4 |   62.4   67.0 ( +7.372%) |  1.484  1.484 ( +0.000%)
silesia.tar gcc-4.8    3 |  146.2  149.0 ( +1.915%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-5      3 |  142.0  139.1 ( -2.042%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-6      3 |  146.6  150.0 ( +2.319%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-7      3 |  143.5  147.6 ( +2.857%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-8      3 |  144.8  145.5 ( +0.483%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-10     3 |  143.1  146.2 ( +2.166%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-6.0  3 |  147.7  147.3 ( -0.271%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-7    3 |  142.6  148.0 ( +3.787%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-8    3 |  141.3  150.4 ( +6.440%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-9    3 |  143.6  150.2 ( +4.596%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-11   3 |  143.7  149.5 ( +4.036%) |  3.179  3.187 ( +0.252%)
silesia.tar clang-12   3 |  142.8  149.4 ( +4.622%) |  3.179  3.187 ( +0.252%)
silesia.tar gcc-4.8    4 |  135.9  139.8 ( +2.870%) |  3.237  3.246 ( +0.278%)
silesia.tar gcc-5      4 |  134.4  138.1 ( +2.753%) |  3.237  3.246 ( +0.278%)
silesia.tar gcc-6      4 |  134.2  139.7 ( +4.098%) |  3.237  3.246 ( +0.278%)
silesia.tar gcc-7      4 |  135.5  140.1 ( +3.395%) |  3.237  3.246 ( +0.278%)
silesia.tar gcc-8      4 |  137.0  140.0 ( +2.190%) |  3.237  3.246 ( +0.278%)
silesia.tar gcc-10     4 |  138.3  136.0 ( -1.663%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-6.0  4 |  141.4  138.2 ( -2.263%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-7    4 |  137.9  142.2 ( +3.118%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-8    4 |  134.8  140.7 ( +4.377%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-9    4 |  140.0  140.7 ( +0.500%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-11   4 |  135.9  138.3 ( +1.766%) |  3.237  3.246 ( +0.278%)
silesia.tar clang-12   4 |  138.2  132.2 ( -4.342%) |  3.237  3.246 ( +0.278%)

Benchmarked on, as usual, an Intel Xeon E5-2680 v4 @ 2.40GHz.

On the whole we see improvements in ratio, and improvements in speed on less-compressible inputs. Very compressible inputs seem to be neutral on speed, or maybe even slightly slower.
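
For reading the table, the parenthesized figures appear to be the relative change of the second value against the first. A small sketch of that arithmetic, assuming this interpretation (`pct_change` is a hypothetical helper, not part of the benchmark tooling):

```c
/* Relative change of `after` versus `before`, in percent.
 * Assumed to be how the parenthesized columns were derived. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, the first row's speed figures (99.9 and 100.3) give about +0.400%, and its ratio figures (2.769 and 2.779) give about +0.361%, matching the table.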

Status

This PR is believed to be speed-positive, ratio-positive, and correct.

To-Do:

  • Correctness.
  • Match or improve ratio.
  • Match or improve speed.
  • Simplify DFast DMS implementation.
  • Benchmark.

if (offset_1 > maxRep) offsetSaved = offset_1, offset_1 = 0;
}

_start:
Contributor

@Cyan4973 Cyan4973 Sep 15, 2021

Could it be written as a loop, rather than a goto?

Note: this may impact some variables' scope.

Contributor Author

Yeah, I thought about it. To a rough approximation, the original implementation looks like this:

size_t dfast() {
  while (ip < ilimit) {
    if (search()) goto _match;
    continue;
_match:
    store();
  }
  return;
}

This PR changes it to something like this:

size_t dfast() {
_start:
  if (ip >= ilimit) goto _cleanup;
  init();
  do {
    if (search()) goto _match;
  } while (ip < ilimit);
_cleanup:
  return;
_match:
  store();
  goto _start;
}

Admittedly, this abuses gotos a bunch. It is quite close to the assembly that gets generated though, which helps me think about the flow.

I believe it could be rewritten to instead look like this:

size_t dfast() {
  if (ip < ilimit) {
    init();
    do {
      if (search()) goto _match;
      continue;
_match:
      store();
      init();
    } while (ip < ilimit);
  }
  return;
}

This solves the abuse of goto, but has two init() blocks, which I don't like. I'm also somewhat attracted to moving the match code outside of the tight loop because it's then clearer what the hot loop is.

Alternatively, it could maybe be structured like this, which has only one init() block but nests two loops. I think this would also help with the variable scoping concerns.

size_t dfast() {
  while (ip < ilimit) {
    init();
    do {
      if (search()) goto _match;
    } while (ip < ilimit);
    break;
_match:
    store();
  }
  return;
}
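To make the comparison concrete, here is a small runnable sketch that checks the goto-based shape and the nested-loop shape above against each other on a toy input. Everything in it is hypothetical scaffolding (`Toy`, `toy_init`, `toy_search`, `toy_store`, `compare_shapes`); it models only the control flow, not the actual double-fast match finder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the real work; none of these names are zstd APIs. */
typedef struct {
    const int *data;      /* data[i] > 0: a match of that length starts at i */
    size_t ip, ilimit;
    size_t matches[8];    /* positions where toy_store() ran */
    size_t nbMatches;
    size_t nbInits;       /* how many times toy_init() ran */
} Toy;

static void toy_init(Toy *t) { t->nbInits++; }

/* Returns 1 on a match at ip; otherwise advances ip by one. */
static int toy_search(Toy *t)
{
    if (t->data[t->ip] > 0) return 1;
    t->ip++;
    return 0;
}

/* Records the match position and skips over the matched bytes. */
static void toy_store(Toy *t)
{
    t->matches[t->nbMatches++] = t->ip;
    t->ip += (size_t)t->data[t->ip];
}

/* Goto-based shape: match handling lives outside the hot loop. */
static void dfast_goto(Toy *t)
{
_start:
    if (t->ip >= t->ilimit) goto _cleanup;
    toy_init(t);
    do {
        if (toy_search(t)) goto _match;
    } while (t->ip < t->ilimit);
_cleanup:
    return;
_match:
    toy_store(t);
    goto _start;
}

/* Nested-loop shape: one init block, the outer loop replaces _start. */
static void dfast_nested(Toy *t)
{
    while (t->ip < t->ilimit) {
        toy_init(t);
        do {
            if (toy_search(t)) goto _match;
        } while (t->ip < t->ilimit);
        break;
_match:
        toy_store(t);
    }
}

/* Runs both shapes on the same input and checks they behave identically. */
static void compare_shapes(void)
{
    static const int data[8] = { 0, 0, 3, 0, 0, 0, 2, 0 };
    Toy a, b;
    memset(&a, 0, sizeof a);
    a.data = data;
    a.ilimit = 8;
    b = a;
    dfast_goto(&a);
    dfast_nested(&b);
    assert(a.nbMatches == 2 && a.matches[0] == 2 && a.matches[1] == 6);
    assert(a.nbInits == 2);
    assert(b.nbMatches == a.nbMatches && b.nbInits == a.nbInits);
    assert(memcmp(a.matches, b.matches, sizeof a.matches) == 0);
}
```

On this input both shapes record matches at positions 2 and 6 and run the init step twice. The sketch says nothing about the code each compiler generates, which is where the two shapes may still differ.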

Contributor Author

I have a version of this last approach here. Performance looks good on gcc-10, but it's slower on clang-12. :/

Contributor

Without modifying the general structure of the current code,
could the _start: / goto _start pair (and just this one) be converted into a loop?

It seems it wouldn't impact the code structure, hence should be essentially equivalent for the compiler,
yet it would reduce the number of gotos
and allow a (slight) reduction in scope of several variables
that don't need to retain their values between loop iterations.

@Cyan4973
Contributor

The algorithm itself looks fine.
I'll go on later trying to benchmark it to confirm the improvements.

In terms of coding style, there is a heavy reliance on goto statements.
I'm not "firmly opposed" to goto, they have their use, but that doesn't mean I like to see many of them around.
This situation seems to have consequences on variable lifetimes,
which are then extended to the entire function,
making it more difficult to track their usage and role.

I wonder if that's always necessary.
Whenever the logic can also be expressed easily as a loop with a well-defined scope,
I believe that's preferable for maintenance.

An improvement could be to convert "some" of these goto into loops,
whenever it feels easy enough to convert.

@felixhandte felixhandte changed the title [WIP] Pipelined Implementation of ZSTD_dfast Pipelined Implementation of ZSTD_dfast Sep 28, 2021

Aside from maybe a latency win in the loop, this means that when we find a
short match, we've already done the hash we need to check the next long match.

Since we're now hashing the position ahead even if we find a long match and
don't search that next position, we can write it back into the hashtable even
in long matches. This seems to cost us no speed, and improves compression
ratio slightly!

This lookup can be advanced to before the short match check because either way
we will use it (in the next loop iter or in `_search_next_long`).

This test depended on `_extDict` and `_noDict` compressing identically, which
is not a guarantee we make, AFAIK.
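
The hash-ahead and write-back ideas in the commit notes above can be sketched with a deliberately tiny match finder. This is an illustration under heavy assumptions, not zstd's code: it uses a single table instead of the long and short tables, an exact 2-byte key instead of a real hash, and hypothetical names throughout (`toy_hash`, `toy_count`, `toy_parse`, `toy_demo`). The input is hand-picked so that the write-back helps; it says nothing about compression ratio in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MINMATCH 3
#define TOY_EMPTY    UINT32_MAX

/* Exact 2-byte key, so the toy has no hash collisions to reason about. */
static uint32_t toy_hash(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

/* Length of the common prefix of src+a and src+b, with a < b. */
static size_t toy_count(const uint8_t *src, size_t n, size_t a, size_t b)
{
    size_t len = 0;
    while (b + len < n && src[a + len] == src[b + len]) len++;
    return len;
}

/* Returns the total number of bytes covered by matches.
 * writeBackNext != 0 also inserts ip+1 when a match is taken at ip. */
static size_t toy_parse(const uint8_t *src, size_t n, int writeBackNext)
{
    static uint32_t table[1u << 16];
    size_t ip = 0, matched = 0, i;
    uint32_t h0;

    for (i = 0; i < (1u << 16); i++) table[i] = TOY_EMPTY;
    if (n < TOY_MINMATCH) return 0;

    h0 = toy_hash(src);
    while (ip + TOY_MINMATCH <= n) {
        /* Hash ip+1 up front. The no-match path reuses it as the next h0;
         * the match path can write it back without hashing again. */
        uint32_t const h1 = toy_hash(src + ip + 1);
        uint32_t const cand = table[h0];
        table[h0] = (uint32_t)ip;

        if (cand != TOY_EMPTY) {
            size_t const len = toy_count(src, n, cand, ip);
            if (len >= TOY_MINMATCH) {
                if (writeBackNext) table[h1] = (uint32_t)(ip + 1);
                matched += len;
                ip += len;
                if (ip + TOY_MINMATCH <= n) h0 = toy_hash(src + ip);
                continue;
            }
        }
        ip++;
        h0 = h1;   /* already computed: no extra hashing on this path */
    }
    return matched;
}

/* Same input, with and without the write-back of ip+1. */
static void toy_demo(void)
{
    static const uint8_t src[] = "abcXabcdefYbcdef";   /* 16 bytes + NUL */
    assert(toy_parse(src, 16, 0) == 6);   /* finds "abc" and "def"   */
    assert(toy_parse(src, 16, 1) == 8);   /* finds "abc" and "bcdef" */
}
```

In this input, taking the "abc" match at position 4 skips over position 5. Without the write-back, the later "bcdef" only finds the stale "bcX" candidate; with it, the entry for position 5 is present and the longer match is found.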
@Cyan4973
Contributor

Cyan4973 commented Oct 7, 2021

Benchmark feedback:
I confirm that this PR improves both the speed and the compression ratio of the dfast strategy.
Surprisingly (to me), the compression ratio is improved more than I expected, while the compression speed is improved less than I anticipated. Maybe some of the speed gain is consumed by extra search or parsing work that leads to the compression ratio gain? (It's not clear; I need to look at the code in more detail.)

Anyway, both impacts are fairly small, and both are positive. So, from a measurement perspective, it looks like a pure improvement.

@Cyan4973
Contributor

It's a pity that you could not use the new loop to reduce the scope of some local variables, but since you mentioned that it negatively impacts performance, I guess we'll leave it there.

Another small comment: I noticed in results.csv that, for "smaller" data (~100 KB), there are several samples where the compression ratio ends up being worse. The impact seems to remain small, so it's not a deal breaker, but this is a good reminder that the "extra work" of updating hash tables more often doesn't necessarily translate into a better compression ratio. The consequences are fuzzier than that.

@felixhandte felixhandte merged commit 23c1a2d into facebook:dev Oct 13, 2021