Compress documentation uploaded to S3 #379

Closed
pietroalbini opened this issue Jul 18, 2019 · 20 comments · Fixed by #780

Comments

@pietroalbini
Member

pietroalbini commented Jul 18, 2019

At the moment we don't compress the documentation uploaded to S3, which wastes a lot of space. While money for S3 isn't an issue right now, the lack of compression could hurt docs.rs's sustainability in the future.

I ran some very rough benchmarks on reqwest 0.9.3, compressing each .html file separately:

Bad benchmark

| Algorithm | Size | Compression time | Decompression time | Options |
|---|---|---|---|---|
| Plaintext | 33.9 MB | - | - | - |
| Gzip | 12.0 MB | 3.2s | 3.4s | -9 (best) |
| Gzip | 12.8 MB | 2.5s | 3.4s | -1 (fast) |
| Zstd | 11.7 MB | 7.8s | 2.4s | -19 (best) |
| Zstd | 12.5 MB | 2.5s | 2.4s | -1 (fast) |
| Brotli | 11.5 MB | 5.5s | 2.3s | -9 (best) |
| Brotli | 13.0 MB | 2.3s | 2.3s | -0 (fast) |

Looking at the numbers, if we compress the uploaded docs we're going to save about 63% of storage space on average, which is great from a sustainability point of view. I think we should compress all the uploaded docs going forward, and try to compress (part of) the initial import as well.

For the algorithm choice, I'd say we can go with gzip: there isn't much difference between the resulting sizes, and the compression time delta between gzip's fast and best modes is the smallest. We can compress the initial import with -1 to speed it up, and all the new crates with -9.
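
To make the fast/best trade-off concrete, here is a minimal sketch (an addition, not from the thread) of gzipping a page at the two levels; it assumes the flate2 crate, which the issue doesn't actually name:

```rust
// Minimal sketch: illustrates the fast (-1) vs best (-9) gzip levels discussed
// above, using the flate2 crate (an assumption; no library is named in the issue).
use flate2::{read::GzDecoder, write::GzEncoder, Compression};
use std::io::{Read, Write};

fn gzip(data: &[u8], level: Compression) -> std::io::Result<Vec<u8>> {
    let mut encoder = GzEncoder::new(Vec::new(), level);
    encoder.write_all(data)?;
    encoder.finish()
}

fn gunzip(data: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    GzDecoder::new(data).read_to_end(&mut out)?;
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let html = b"<html><body>example rustdoc page</body></html>".repeat(100);
    let fast = gzip(&html, Compression::fast())?; // roughly the `-1` runs above
    let best = gzip(&html, Compression::best())?; // roughly the `-9` runs above
    assert_eq!(gunzip(&best)?, html);
    println!("plain={} fast={} best={}", html.len(), fast.len(), best.len());
    Ok(())
}
```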

cc @Mark-Simulacrum @QuietMisdreavus

Benchmark method

Installed compression tools on Ubuntu 18.04 LTS:

$ sudo apt install gzip brotli zstd

Downloaded locally the reqwest documentation:

$ aws s3 cp --recursive s3://rust-docs-rs/rustdoc/reqwest/0.9.3/ .

Compressed every .html file with find:

$ time find <dir> -name "*.html" -exec <command> {} \;
@Mark-Simulacrum
Member

Could you populate the decompression times as well? In some sense those are considerably more important for deciding whether this is viable. The compression times for all of these are not great, though :/

@GuillaumeGomez
Member

@Mark-Simulacrum Why would we need decompression? You can send compressed files as-is (I think only gzip compression is supported by web browsers, though?).

@Mark-Simulacrum
Member

That's true in general, certainly. I guess presumably ~all clients send Accept-Encoding: gzip, or whatever the header value is?

I do think this shouldn't be that hard to add -- presumably we'd add a "compressed" column to the files table and, if it's set, access files in S3 at file.gzip or something along those lines.
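
A hypothetical sketch of that storage convention (the "compressed" column and the `file.gzip` naming are just the suggestion above, not the eventual implementation; the helper name is made up):

```rust
/// Hypothetical helper: if the `compressed` flag from the `files` table is set,
/// the object is looked up at `<path>.gz` in S3, otherwise at `<path>`.
fn s3_key(path: &str, compressed: bool) -> String {
    if compressed {
        format!("{}.gz", path)
    } else {
        path.to_owned()
    }
}

fn main() {
    assert_eq!(
        s3_key("rustdoc/reqwest/0.9.3/reqwest/index.html", true),
        "rustdoc/reqwest/0.9.3/reqwest/index.html.gz"
    );
}
```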

@pietroalbini
Member Author

Added decompression times: gzip is slower than zstd and brotli, and those two are roughly the same. Of course production decompression performance is going to be way better, as you don't have to write to my hard disk (:sweat_smile:), everything is in memory, the binary is not restarted every time, and we will decompress way fewer files.

Considering the decompression time, brotli is probably the best choice?

> Why would we need decompression? You can send compressed files as-is (I think only gzip compression is supported by web browsers, though?).

No, because docs.rs needs to tweak those HTML files (for example to add the top bar).

@Mark-Simulacrum
Member

Given those decompression times, this seems problematic. Even if they were an order of magnitude lower, that's still ~250ms/file we read - and slowing down all requests to docs.rs by that much seems poor. In very unscientific benchmarking (web developer tools), it looks like we respond in about ~60ms on js/css content today and ~150ms on HTML content -- this would significantly increase that.

@pietroalbini
Member Author

Let me create a proper benchmark.

@pietroalbini
Member Author

> Given those decompression times, this seems problematic. Even if they were an order of magnitude lower, that's still ~250ms/file we read

By the way, the numbers are for compressing and decompressing 2040 files, not a single file.

@Mark-Simulacrum
Member

Ah, yeah, that makes sense. I did think the numbers were awfully large :)

We probably want single-file timing information, as that's what's important for the primary use case of serving them.

@pietroalbini
Member Author

Ok, scratch that: I wrote a proper benchmark, and the results are way more accurate:

| Algorithm | Level | Size | Comp. | Comp. (one) | Dec. | Dec. (one) |
|---|---|---|---|---|---|---|
| plain | - | 25.2 MB | - | - | - | - |
| gzip | 9 | 3.8 MB | 426.9ms | 194.4µs | 70.1ms | 31.9µs |
| gzip | 5 | 3.9 MB | 259.7ms | 118.3µs | 71.0ms | 32.3µs |
| gzip | 1 | 5.0 MB | 119.0ms | 54.2µs | 82.3ms | 37.5µs |
| zstd | 9 | 3.6 MB | 348.2ms | 158.6µs | 28.8ms | 13.1µs |
| zstd | 5 | 3.8 MB | 151.9ms | 69.2µs | 24.4ms | 11.1µs |
| zstd | 1 | 4.2 MB | 68.3ms | 31.1µs | 24.9ms | 11.3µs |
| brotli | 9 | 3.2 MB | 3.6s | 1.7ms | 49.7ms | 22.6µs |
| brotli | 5 | 3.3 MB | 389.4ms | 177.3µs | 49.8ms | 22.7µs |
| brotli | 1 | 4.4 MB | 97.2ms | 44.2µs | 51.3ms | 23.4µs |
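
For reference, a rough sketch of how per-file numbers like "Comp. (one)" and "Dec. (one)" can be derived (this is not the benchmark code used for the table above; it assumes the zstd crate and a local `docs/` directory of rustdoc HTML):

```rust
// Rough sketch: compress and decompress every .html file, then divide the total
// wall-clock time by the file count. The `docs/` directory and the choice of
// zstd level 9 are assumptions made for illustration.
use std::{fs, time::Instant};

fn main() -> std::io::Result<()> {
    let mut files = Vec::new();
    for entry in fs::read_dir("docs")? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "html") {
            files.push(fs::read(&path)?);
        }
    }

    let start = Instant::now();
    let compressed = files
        .iter()
        .map(|f| zstd::encode_all(&f[..], 9)) // level 9, as in the table
        .collect::<std::io::Result<Vec<_>>>()?;
    let comp = start.elapsed();

    let start = Instant::now();
    for c in &compressed {
        zstd::decode_all(&c[..])?;
    }
    let dec = start.elapsed();

    println!(
        "comp: {:?} total, {:?} per file; dec: {:?} total, {:?} per file",
        comp,
        comp / files.len() as u32,
        dec,
        dec / files.len() as u32
    );
    Ok(())
}
```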

@Mark-Simulacrum
Member

Okay, those numbers look great! We can definitely afford microseconds of decompression time.

I'll investigate doing this legwork, though I'll keep the upload going in the meantime. I think re-compressing existing files and such can happen in parallel and over time if needed, not sure.

@pietroalbini
Member Author

Did some more brotli benchmarks:

| Algorithm | Level | Size | Comp. | Comp. (one) | Dec. | Dec. (one) |
|---|---|---|---|---|---|---|
| plain | - | 25.2 MB | - | - | - | - |
| brotli | 6 | 3.3 MB | 409.2ms | 186.4µs | 49.8ms | 22.7µs |
| brotli | 5 | 3.3 MB | 383.8ms | 174.8µs | 49.9ms | 22.7µs |
| brotli | 4 | 3.6 MB | 264.0ms | 120.2µs | 50.1ms | 22.8µs |
| brotli | 3 | 3.9 MB | 178.0ms | 81.0µs | 47.6ms | 21.7µs |
| brotli | 2 | 4.1 MB | 145.6ms | 66.3µs | 52.3ms | 23.8µs |
| brotli | 1 | 4.4 MB | 97.1ms | 44.2µs | 51.1ms | 23.3µs |

Based on that, I think the two options we should consider are zstd 9 and brotli 5:

  • Both take roughly the same time to compress, though zstd 9 is slightly faster
  • zstd 9 takes half the time of brotli 5 to decompress
  • brotli 5 uses 10% less storage

I'm leaning towards brotli 5, as the saved storage is nice, but I don't feel too strongly about it.

@najamelan

najamelan commented Sep 18, 2019

Putting the cached content in an iframe avoids decompression and post-processing. It would require an extra HTTP request, but since the content is cached it's possible to give it a unique name and Cache-Control: immutable so browsers never try to reload it if it's already in cache. That might well make up for the extra request.

@namibj

namibj commented Apr 2, 2020

zstd supports dictionary compression, where you pre-create the dictionary for a collection of files. This is generally beneficial in the <1MB size range, and should IMO be used for this application.
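
A minimal sketch of that approach, assuming the zstd crate's dictionary APIs (`dict::from_samples`, `Encoder::with_dictionary`, `Decoder::with_dictionary`); the synthetic samples and the 16 KiB dictionary size are stand-ins for real rustdoc HTML:

```rust
// Minimal sketch: train one shared dictionary for a collection of files, then
// compress and decompress each file individually against it. In practice the
// samples would be real rustdoc HTML; training can fail if the sample set is
// too small or too uniform.
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    // Stand-in training set.
    let samples: Vec<Vec<u8>> = (0..1000)
        .map(|i| {
            format!(
                "<html><head><title>crate{}</title></head><body>\
                 <h1>Struct Example{}</h1><p>Documentation for item {}.</p></body></html>",
                i, i, i
            )
            .repeat(20)
            .into_bytes()
        })
        .collect();

    // Train one shared dictionary for the whole collection.
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // Compress a single file against the shared dictionary.
    let page = b"<html><body><h1>Struct Example42</h1></body></html>";
    let mut encoder = zstd::stream::Encoder::with_dictionary(Vec::new(), 9, &dict)?;
    encoder.write_all(page)?;
    let compressed = encoder.finish()?;

    // Decompression needs the exact same dictionary, which is why a dictionary
    // has to be kept for as long as any data compressed with it.
    let mut decoder = zstd::stream::Decoder::with_dictionary(&compressed[..], &dict)?;
    let mut out = Vec::new();
    decoder.read_to_end(&mut out)?;
    assert_eq!(&out[..], &page[..]);

    println!("plain={} compressed={}", page.len(), compressed.len());
    Ok(())
}
```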

@namibj

namibj commented Apr 13, 2020

Preliminary results on just a few hundred crates (specifically, cargo doc on declarative-dataflow) suggest average compression ratios of >20, with ~0.6 B/cycle (Broadwell, single-threaded) decompression speed for the HTML files, each handled individually, using a (precisely) 110 KiB dictionary shared between the ~59 MiB (uncompressed) of HTML files. Testing recorded at https://asciinema.org/a/kpxnXvV9fk6d7jagcIo3NvKZf

@jyn514
Member

jyn514 commented May 28, 2020

> Putting the cached content in an iframe avoids decompression and post-processing. It would require an extra HTTP request, but since the content is cached it's possible to give it a unique name and Cache-Control: immutable so browsers never try to reload it if it's already in cache. That might well make up for the extra request.

Unfortunately it isn't that simple; see #679 (comment) for why iframes aren't a good option.

@Nemo157
Member

Nemo157 commented May 28, 2020

Benchmarks based on those @pietroalbini linked above (code here), adding in "zstd custom", which uses a 10 MB dictionary I generated from ~4 GB of docs that @namibj provided, compressing the winapi-0.3.8 docs:

| Algorithm | Level | Size | Comp. | Comp. (one) | Dec. | Dec. (one) |
|---|---|---|---|---|---|---|
| plain | - | 946.8 MB | - | - | - | - |
| gzip | 9 | 244.1 MB | 12.9s | 88.3µs | 2.6s | 17.6µs |
| gzip | 5 | 244.4 MB | 10.1s | 69.6µs | 2.6s | 17.7µs |
| gzip | 1 | 283.3 MB | 5.2s | 35.7µs | 2.9s | 19.9µs |
| zstd | 9 | 247.7 MB | 32.2s | 220.9µs | 1.2s | 8.4µs |
| zstd | 5 | 251.8 MB | 15.0s | 102.8µs | 1.2s | 8.6µs |
| zstd | 1 | 271.2 MB | 2.9s | 19.7µs | 1.2s | 8.4µs |
| zstd custom | 9 | 50.5 MB | 6.7s | 46.1µs | 246.3ms | 1.7µs |
| zstd custom | 5 | 61.7 MB | 3.0s | 20.4µs | 271.6ms | 1.9µs |
| zstd custom | 1 | 84.3 MB | 928.5ms | 6.4µs | 348.5ms | 2.4µs |
| brotli | 9 | 204.2 MB | 21.6s | 148.5µs | 1.9s | 12.7µs |
| brotli | 5 | 205.7 MB | 14.0s | 96.0µs | 1.9s | 13.0µs |
| brotli | 1 | 278.2 MB | 3.7s | 25.2µs | 2.0s | 14.0µs |

@Kixiron
Member

Kixiron commented May 28, 2020

What's the timing of generating dictionaries and how often does it have to happen?

@Nemo157
Member

Nemo157 commented May 29, 2020

That dictionary took about a minute to train. Though if we can get a dictionary trainer that can handle more than 4 GB of data, so it can load the entire archive @namibj made, I assume it'll take longer.

Preferably we would never generate new dictionaries: as soon as one is used to compress some data, it needs to remain part of docs.rs forever. Maybe if rustdoc completely changes how it generates documentation in the future it'd be worth regenerating the dictionary, but for minor changes the learnt data should hopefully still be relevant (one idea might be to include docs from a range of old rustdoc versions, to reduce overfitting on how the latest version encodes its docs).

@jyn514
Member

jyn514 commented May 29, 2020

I think we should also train on more crates than winapi, which is a little special. Maybe we could add an embedded crate like stm32f0 and a smaller crate like hexponent.

@Nemo157
Member

Nemo157 commented May 29, 2020

winapi isn't actually in the training set I'm using; it was just used to compare the results. But yes, some time should be spent finding a good training set.
