Stagger Stepping in Negative Levels #2921

Merged 6 commits on Dec 14, 2021

Conversation

felixhandte (Contributor) commented Dec 10, 2021

#2749 returned the negative levels to their pre-#1562 behavior, i.e., skipping `stepSize` positions at every iteration. This PR emulates #1562-like stepping by applying the `stepSize` skip only at every other position.

This PR addresses #2827.
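For intuition, here is a minimal standalone sketch (illustrative only, not zstd's actual match-finder loop) that prints the positions each scheme examines. The pair stride here follows the `ip[N] = ip[N-2] + step` rule described in the commits below:

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative only: print the positions each scheme visits,
 * for a skip of `step` over `n` input positions. */
int main(void)
{
    const size_t step = 3, n = 12;
    size_t pos;

    /* Pre-#1562 / #2749 behavior: every position advances by `step`. */
    printf("skip at every position:    ");
    for (pos = 0; pos < n; pos += step)
        printf("%zu ", pos);               /* 0 3 6 9 */
    printf("\n");

    /* This PR: positions come in adjacent pairs; the skip applies
     * only between pairs (pair stride = step). */
    printf("skip every other position: ");
    for (pos = 0; pos + 1 < n; pos += step)
        printf("%zu %zu ", pos, pos + 1);  /* 0 1 3 4 6 7 9 10 */
    printf("\n");

    return 0;
}
```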

Benchmarks

silesia.tar on gcc-10:

[benchmark screenshot, 2021-12-13]

silesia.tar on clang-13:

[benchmark screenshot, 2021-12-13]

Status

I believe this PR is ready for merge.

This replicates the behavior of @terrelln's `ZSTD_fast` implementation. That is, it always looks at adjacent pairs of positions, and only applies the acceleration every other position. This produces finer-grained acceleration.

This avoids an additional addition, at the cost of an additional variable.
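As a rough sketch of that trade-off, with hypothetical names (not the actual zstd loop), the pair's second position can either be recomputed each pass or carried in its own variable:

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Variant A: derive the pair's second position every iteration.
 * One extra addition per pass, one fewer live variable. */
static void scan_derived(const BYTE* ip0, const BYTE* ilimit, size_t step)
{
    while (ip0 + 1 < ilimit) {
        const BYTE* const ip1 = ip0 + 1;   /* recomputed each pass */
        (void)ip1;  /* ... hash and search at ip0 and ip1 (elided) ... */
        ip0 += step;
    }
}

/* Variant B (this commit's shape): carry ip1 in its own variable.
 * One extra live variable, no per-pass addition. */
static void scan_carried(const BYTE* ip0, const BYTE* ilimit, size_t step)
{
    const BYTE* ip1 = ip0 + 1;             /* spacing set up once */
    while (ip1 < ilimit) {
        /* ... hash and search at ip0 and ip1 (elided) ... */
        ip0 += step;
        ip1 += step;
    }
}
```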
felixhandte (Contributor, Author) commented Dec 10, 2021

Old Benchmark:

`silesia.tar` on GCC-10:

[benchmark screenshot, 2021-12-10]

The position updates are rewritten from `ip[N] = ip[N-1] + step` to `ip[N] = ip[N-2] + step`. This lets us deal with the asymmetric spacing of the gaps only at setup; after that, we only have to keep a single `step` variable.

This seems to work quite well on GCC and Clang!
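A hedged before/after sketch of just that update step (helper names are hypothetical; the real loop also hashes, searches, and handles acceleration):

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Before: ip[N] = ip[N-1] + step. Each pointer chains off the position
 * immediately before it, so the loop alternates two gap sizes:
 * 1 within a pair and (step - 1) between pairs. */
static void advance_chained(const BYTE** ip0, const BYTE** ip1, size_t step)
{
    *ip0 = *ip1 + (step - 1);   /* large inter-pair gap */
    *ip1 = *ip0 + 1;            /* small intra-pair gap */
}

/* After: ip[N] = ip[N-2] + step. Each pointer advances from its own
 * previous value, so the 1-byte spacing is established once at setup
 * and a single uniform `step` remains. */
static void advance_leapfrog(const BYTE** ip0, const BYTE** ip1, size_t step)
{
    *ip0 += step;
    *ip1 += step;
}
```

Given `ip1 = ip0 + 1` at setup, both variants visit the same positions; the leapfrog form just never has to juggle two gap sizes inside the loop.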
felixhandte (Contributor, Author) commented Dec 13, 2021

Old Benchmark (details collapsed)

The new version shows the following performance:

gcc-10:

[benchmark screenshot, 2021-12-13]

clang-13:

[benchmark screenshot, 2021-12-13]

The templating adds ~10KB to the library size.

I couldn't find a good way to spread `ip0` and `ip1` apart when we accelerate
due to incompressible inputs. (The methods I tried slowed things down quite a
bit.)

Since we aren't splaying ip0 and ip1 apart (which would be like `0_1_2_3_`, as opposed to the `01__23__` we were actually doing), it's a bit ambitious to increment `step` by 2. Instead, let's increment it by 1, which has the benefit of sliiightly improving compression. Speed remains pretty much unchanged.
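For illustration, a sketch of how such an acceleration bump might look; all identifiers and the checkpoint mechanism here are hypothetical, not zstd's actual ones:

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Hypothetical constant: how many input bytes between acceleration bumps.
 * (Not zstd's actual tuning.) */
#define K_STEP_INCR 256

/* Sketch: grow `step` by 1 once the scan passes the next checkpoint.
 * Because the pair stays adjacent (01__23__ rather than the splayed
 * 0_1_2_3_), each +1 only widens the inter-pair gap. */
static void maybe_accelerate(const BYTE* ip0,
                             const BYTE** nextStep,
                             size_t* step)
{
    if (ip0 >= *nextStep) {
        *step += 1;                 /* rather than the more aggressive += 2 */
        *nextStep += K_STEP_INCR;   /* schedule the following bump */
    }
}
```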
Successfully merging this pull request may close these issues:

Compression ratio for --fast=2 and higher became significantly worse. Expected?