Bleach/unbleach files
"A Tale Of Two Optimisations" by Greg Foletta.
See also my blog post about my thought process.
on [Greg's] Intel i7-8650U laptop, running over a file that's cached in memory and outputting to /dev/null, the encoding / decoding process runs at 258MiB/s.
In comparison, on my ancient laptop with a Core 2 Duo T9900, I get:
C:\...\wscoder (speedup-attempts) > timethis "wse < test.data | wsd > NUL"
TimeThis : Command Line : wse < test.data | wsd > NUL
TimeThis : Elapsed Time : 00:00:02.831
test.data is a 256 MiB file containing random bytes. In the pipeline above, wse reads 256 MiB and feeds 1 GiB to wsd, which writes 256 MiB to NUL (Windows' equivalent of /dev/null). I am not quite sure how to measure that but, conservatively, let's say the pipeline is shuffling 1.5 GiB of data in 2.831 seconds. That corresponds to a throughput of about 543 MiB/s.
With a warm cache, I get
C:\...\wscoder (speedup-attempts) > timethis "wse < test.data > NUL"
TimeThis : Command Line : wse < test.data > NUL
TimeThis : Elapsed Time : 00:00:02.187
when encoding. This reads 256 MiB and writes 1,024 MiB to NUL for a throughput of approximately 585 MiB/s.
Conversely, when decoding with a warm cache:
C:\Users\sinan\src\wscoder (speedup-attempts) > timethis "wsd < test.encoded > NUL"
TimeThis : Command Line : wsd < test.encoded > NUL
TimeThis : Elapsed Time : 00:00:01.852
This reads 1,024 MiB and writes 256 MiB to NUL for a throughput of approximately 691 MiB/s.
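The encoding and decoding figures above count bytes read plus bytes written and divide by the elapsed time. A minimal sketch of that arithmetic follows; the function name `throughput_mib_per_s` is made up for illustration and is not part of wscoder.

```cpp
#include <cassert>

// Throughput in MiB/s, counting bytes read plus bytes written.
// The name and the read-plus-write convention are illustrative; they
// mirror how the figures in the text appear to be computed.
double throughput_mib_per_s(double mib_read, double mib_written, double seconds) {
    assert(seconds > 0.0);
    return (mib_read + mib_written) / seconds;
}
```

For example, `throughput_mib_per_s(256, 1024, 2.187)` gives the encoding figure and `throughput_mib_per_s(1024, 256, 1.852)` gives the decoding figure.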
One way to improve performance is to take advantage of multiple cores by partitioning the buffers. I decided to give that a shot. To my dismay, I found out the most recent version of Visual Studio does not yet support the optional C11 threading library, so I converted both the encoder and decoder to Franken-C++.
I chose to partition the parts of the buffers processed by each thread in an interleaved manner, rather than partitioning into contiguous blocks, because I assumed (but did not verify) that this would reduce cache thrashing.
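To make the interleaved scheme concrete, here is a rough sketch rather than the actual wscoder code. It assumes each input byte expands to four whitespace characters (two bits each, consistent with 256 MiB becoming 1 GiB), uses a hypothetical alphabet `kAlphabet`, and interleaves at a caller-chosen `chunk` granularity. All of these names and choices are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical 2-bit alphabet: each input byte becomes four whitespace characters.
static const char kAlphabet[4] = {' ', '\t', '\n', '\r'};

// Encode bytes [begin, end) of `in` into out[4*begin .. 4*end).
static void encode_range(const std::uint8_t* in, char* out,
                         std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        for (int k = 0; k < 4; ++k) {
            // Most significant bit pair first.
            out[4 * i + k] = kAlphabet[(in[i] >> (6 - 2 * k)) & 3];
        }
    }
}

// Interleaved partitioning: thread t handles chunks t, t + n, t + 2n, ...
// Each thread writes a disjoint part of `out`, so no locking is needed.
std::vector<char> encode_interleaved(const std::vector<std::uint8_t>& in,
                                     unsigned n_threads, std::size_t chunk) {
    assert(n_threads > 0 && chunk > 0);
    std::vector<char> out(in.size() * 4);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&in, &out, t, n_threads, chunk] {
            for (std::size_t begin = t * chunk; begin < in.size();
                 begin += n_threads * chunk) {
                std::size_t end =
                    begin + chunk < in.size() ? begin + chunk : in.size();
                encode_range(in.data(), out.data(), begin, end);
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

A block partition would instead give thread t the single contiguous range starting at `t * in.size() / n_threads`. Whether interleaving actually helps the cache depends on the chunk size and the hardware, which this sketch does not measure.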
With two threads on the T9900, both encoding and decoding speed improved:
TimeThis : Command Line : wse < test.data > NUL
TimeThis : Elapsed Time : 00:00:01.404
This represents an approximately 35% improvement in time and a 55% improvement in encoding throughput.
TimeThis : Command Line : wsd < test.encoded > NUL
TimeThis : Elapsed Time : 00:00:01.505
This corresponds to about a 19% improvement in time and a 23% improvement in throughput. Finally, looking at the round-trip pipeline:
TimeThis : Command Line : wse < test.data | wsd >NUL
TimeThis : Elapsed Time : 00:00:03.009
we see that it now executes about 6% slower, presumably because we are running four threads on a dual-core machine. Regardless, with an encoder/decoder combination, the common use case is NOT to run a round-trip pipeline, so I am OK with that.
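The time and throughput percentages quoted for the threaded runs differ because they are measured against different baselines. The sketch below spells out both ratios for a fixed amount of data; the helper names are hypothetical.

```cpp
#include <cassert>

// Fractional reduction in elapsed time: 1 - new/old.
double time_improvement(double old_s, double new_s) {
    assert(old_s > 0.0 && new_s > 0.0);
    return 1.0 - new_s / old_s;
}

// Fractional gain in throughput for the same amount of data: old/new - 1.
double throughput_improvement(double old_s, double new_s) {
    assert(old_s > 0.0 && new_s > 0.0);
    return old_s / new_s - 1.0;
}
```

With the encoding times above (2.187 s single-threaded, 1.404 s with two threads), the first function returns a bit under 0.36 and the second a bit under 0.56. For the round-trip pipeline (2.831 s versus 3.009 s), `time_improvement` is negative, which is the roughly 6% slowdown.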
Eventually, I decided that a more straightforward optimization mentioned on HN made the most sense, and I incorporated that along with the threads. With both in place, on the same T9900, I get:
TimeThis : Command Line : wse < test.data > NUL
TimeThis : Elapsed Time : 00:00:00.933
and
TimeThis : Command Line : wsd < test.encoded > NUL
TimeThis : Elapsed Time : 00:00:01.457
Counting bytes read plus bytes written (1,280 MiB in each case), as above, these correspond to roughly 1.3 GiB/s encoding and 0.86 GiB/s decoding performance.