-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
huff0: decompress directly into output #577
huff0: decompress directly into output #577
Conversation
The Go implementation uses an auxiliary buffer to avoid extensive bound checks. In the asm implementation we know in advance the output buffer is big enough, so it's possible to write directly to that buffer.
@WojciechMula We can just abandon #743 I fixed out-of-bounds writes, since we didn't check if we exceeded the write buffer. Please check out/upgrade to: https://gist.github.com/klauspost/306ed219fb5e275d7c9abb87c9a7b142 |
@klauspost I agree, this change is quite small, there's no point in having to PRs. Thanks for the fixes, will incorporate them. (Added you as a collabolator to my fork, so you don't have to play with gists. Sorry, I should have done it earlier.) |
Regarding the build tags: seems that avo invoked for Go 1.18 does not include the old build tags. Yesterday I tried to figure it out in the avo sources but didn't find where it's checked. I'll recheck that in the evening. |
Aside from the crash from above fuzz tests are running fine 👍🏼 |
FWIW I tried reading 56 bits and doing decode 4 and 6 values for Reading 6 instead of 4 values is significantly slower (~20% deg). The only gain is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
benchmark old ns/op new ns/op delta
BenchmarkDecompress4XNoTable/digits/100-32 334 336 +0.66%
BenchmarkDecompress4XNoTable/digits/10000-32 10835 9562 -11.75%
BenchmarkDecompress4XNoTable/digits/262143-32 303585 270811 -10.80%
BenchmarkDecompress4XNoTable/gettysburg/100-32 285 286 +0.56%
BenchmarkDecompress4XNoTable/gettysburg/10000-32 11393 10268 -9.87%
BenchmarkDecompress4XNoTable/gettysburg/262143-32 327973 289561 -11.71%
BenchmarkDecompress4XNoTable/twain/100-32 331 330 -0.27%
BenchmarkDecompress4XNoTable/twain/10000-32 11458 10235 -10.67%
BenchmarkDecompress4XNoTable/twain/262143-32 374970 345400 -7.89%
BenchmarkDecompress4XNoTable/low-ent.10k/100-32 367 371 +1.01%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32 10812 9398 -13.08%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32 256684 221666 -13.64%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32 256839 221322 -13.83%
BenchmarkDecompress4XNoTable/case1/100-32 318 322 +1.23%
BenchmarkDecompress4XNoTable/case1/10000-32 10803 9562 -11.49%
BenchmarkDecompress4XNoTable/case1/262143-32 277377 242147 -12.70%
BenchmarkDecompress4XNoTable/case2/100-32 345 340 -1.62%
BenchmarkDecompress4XNoTable/case2/10000-32 10659 9473 -11.13%
BenchmarkDecompress4XNoTable/case2/262143-32 268723 236376 -12.04%
BenchmarkDecompress4XNoTable/case3/100-32 333 336 +0.99%
BenchmarkDecompress4XNoTable/case3/10000-32 10737 9357 -12.85%
BenchmarkDecompress4XNoTable/case3/262143-32 272268 239011 -12.21%
BenchmarkDecompress4XNoTable/pngdata.001/100-32 361 350 -2.99%
BenchmarkDecompress4XNoTable/pngdata.001/10000-32 11583 10740 -7.28%
BenchmarkDecompress4XNoTable/pngdata.001/262143-32 306257 279850 -8.62%
BenchmarkDecompress4XNoTable/normcount2/100-32 287 289 +0.80%
BenchmarkDecompress4XNoTable/normcount2/10000-32 10832 9696 -10.49%
BenchmarkDecompress4XNoTable/normcount2/262143-32 279908 247688 -11.51%
BenchmarkDecompress4XNoTableTableLog8/digits-32 107990 96969 -10.21%
Co-authored-by: Klaus Post <[email protected]>
Your benchmarks look really good! And match expectations... I'm still puzzled by the regressions observed on my Ice Lake machine. |
Co-authored-by: Klaus Post <[email protected]>
Co-authored-by: Klaus Post <[email protected]>
Co-authored-by: Klaus Post <[email protected]>
Presence of this tag depends on the Go version on which `go generate` was run, see mmcloughlin/avo#183.
@klauspost mmcloughlin/avo#183. It would be great if these settings may be overridden in runtime. |
Can you add this to the |
Adding in #582 |
Solves #576.
Benchmarks on Skylake and Ice Lake machines didn't show any boost. (On my very old laptop there's an improvement). It's reasonable, as
perf report
for tests shows that 95% of the time is spent in decompress routines. But benchmarks ofzstd
show some improvement.I'm benchmarking on AWS machines and they are not that reliable for microbenchmarks. They can show a few per cent differences for two subsequent runs of the same code.