
API design #6

Open
6 of 21 tasks
Piezoid opened this issue May 6, 2019 · 1 comment
Piezoid commented May 6, 2019

Most gzip libraries provide an easy-to-use API of the form:

decompress(void* state, const char* input, size_t in_len, char* output, size_t out_len)

It decompresses the input stream into the output buffer until it runs out of either input data or output buffer space. The user refills the input, drains the output, and makes another call, until the end of the file is reached.
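The calling convention above can be sketched as follows. This is a minimal illustration with hypothetical names: the `decompress` stand-in just copies bytes rather than inflating, but it keeps the classic contract of stopping when input or output space runs out and reporting how much of each was consumed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the decompressor: a real one would inflate,
// this one copies bytes, but the loop structure is the same.
struct State {};

void decompress(State*, const char* input, std::size_t in_len, std::size_t* in_used,
                char* output, std::size_t out_len, std::size_t* out_used) {
    std::size_t n = in_len < out_len ? in_len : out_len;
    std::memcpy(output, input, n);
    *in_used = n;
    *out_used = n;
}

// The user drives the synchronous loop: refill input, drain output, repeat
// until end of file.
std::string pull_all(const std::string& input) {
    State state;
    std::string result;
    char out[8]; // deliberately tiny to force several calls
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t in_used = 0, out_used = 0;
        decompress(&state, input.data() + pos, input.size() - pos, &in_used,
                   out, sizeof(out), &out_used);
        result.append(out, out_used);
        pos += in_used;
    }
    return result;
}
```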

In a multi-threaded implementation this synchronous interface is no longer possible. The code consuming the decompressed stream must instead be called back from each decompressor thread when a buffer is ready, and not in any particular order.

A gzip file is sliced into sections, processed sequentially, which are themselves split into chunks, processed in parallel, one per thread. The first chunk is decompressed normally in CPU cache and yields ~32KB decompressed buffers to the user callback. The other chunks are decompressed into larger buffers (~100MB) and require a second pass of back-reference translation. To do this in a cache-friendly manner, we propose to translate the buffer in segments that fit in the L1 cache and to invoke the callback after each cache fill.
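The segment-wise second pass can be sketched as below. This is an illustrative model, not the real wire format: unresolved back-references are modeled as indices into a lookup table, and the segment size would be ~32KiB in practice. The structure shows the idea of resolving one cache-sized segment at a time and handing it to the callback while it is still hot in L1.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative second pass: resolve references segment by segment, invoking
// the callback after each segment instead of after the whole buffer.
void translate_by_segments(const std::vector<std::size_t>& untranslated,
                           const std::string& lookup,
                           std::size_t segment_size,
                           const std::function<void(const std::string&)>& on_segment) {
    for (std::size_t start = 0; start < untranslated.size(); start += segment_size) {
        std::size_t end = std::min(start + segment_size, untranslated.size());
        std::string segment;
        for (std::size_t i = start; i < end; ++i)
            segment += lookup[untranslated[i]]; // resolve the reference
        on_segment(segment); // consumed while the data is cache-resident
    }
}
```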

There are other design matters, like the possibility of running decompression from your own custom thread pool, C FFI compatibility, etc. See the full discussion below.

Constraints:

  1. Translating the back-references on the whole buffer before yielding it to the client is costly, since it triggers unnecessary TLB and last-level cache misses over a large memory region. We need a cache-efficient translation/consumption step.

  2. The user callbacks will be called at arbitrary times from arbitrary threads, with decompressed content from arbitrary positions in the stream. The user should be able to reorder them, either synchronously (e.g. blocking reordering for writing to stdout) or asynchronously (e.g. a parser that carries unparsed rests between buffers).

  3. Support different threading implementations (OpenMP, custom work stealing, etc.).

  4. Static library with minimal headers. Smallish machine code blob. Support multiple languages, with C as the common denominator.

  5. Able to generate and load custom indexes.

Implementation:

  • The sequential-access thread yields data in ~32KiB pieces directly from its context window.
  • For the random-access (n-1) threads, the low-level interface will yield untranslated buffers + lookup tables.
  • Adaptors provide ways to translate the buffer by steps of 32KiB.
    • Something more flexible than fixed intervals for parsing, like requesting overlaps with previous buffer. We'll see how it turns out with FASTQ parsing.
  • Possibility to specify an unconsumed buffer size: "I didn't consume the last x bytes, please resend them at the beginning of the next buffer".
  • Full-buffer translation for the simplest-to-use interface (slow).
  1. We need to provide:
  • a total order over content parts,
    • needs polishing: currently (#section, #chunk, [#window flush])
    • could add or replace that with positions in the decompressed stream
  • a reordering for outputting parts in the correct order (currently std::mutex + std::condition_variable),
  • pin user data to threads (or maybe thread_local is good enough for that?)
  2. Support different threading mechanisms.
  • basic implementation launching n threads with std::thread.
  • API for allocating local decompressor data for each thread, bridge them together and start them from user's threads.
  • Move away from the header-only distribution. It takes too long to parse and generates too much code if multiple compilation units include it (without LTO).
  • Hide the implementation behind a Pimpl class + a virtual interface (or a struct of callbacks).
  • C interface for FFI interoperability: wrapping function + struct of callbacks.
  • Provide an easy-to-use CMake subproject + sample project.
  • An API yielding (stream position, content position, context) is enough to generate an index. (overlaps with 2.)
  • API to decompress from a sync point with an externally provided context. (may overlap with 3.)
  • Generic index format and high-level API to use it. As noted by Heng Li, the 32KB contexts can make for a heavy index if we want fine granularity. Large granularity is sufficient for faster parallel decompression, but not for efficient random access.
    • We could remove unused characters by re-indexing the back-references.
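The total order over content parts mentioned above, currently (#section, #chunk, [#window flush]), can be expressed as a small key type whose lexicographic comparison drives reassembly. Field names are illustrative, and std::tie provides the comparison.

```cpp
#include <cstddef>
#include <tuple>

// A content part is identified by (#section, #chunk, #window-flush);
// lexicographic comparison over the triple gives the stream order.
struct PartId {
    std::size_t section, chunk, window_flush;
};

bool operator<(const PartId& a, const PartId& b) {
    return std::tie(a.section, a.chunk, a.window_flush)
         < std::tie(b.section, b.chunk, b.window_flush);
}
```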
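The C interface idea (wrapping function + struct of callbacks) could take the shape below. All names here, including `pugz_callbacks`, are hypothetical, and the wrapper body is a stub: a real one would drive the decompressor and invoke the callback once per buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical C-compatible surface: an opaque user_data pointer plus a
// struct of function pointers, callable from any language with a C FFI.
extern "C" {
typedef struct {
    void* user_data;
    void (*on_chunk)(void* user_data, std::size_t index,
                     const char* data, std::size_t len);
} pugz_callbacks; /* hypothetical name */
}

// Stub wrapper: a real implementation would run the decompressor and call
// on_chunk for each decompressed buffer; here we emit a single chunk.
void run_decompression(const pugz_callbacks* cbs) {
    const char msg[] = "hi";
    cbs->on_chunk(cbs->user_data, 0, msg, 2);
}
```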
@Piezoid Piezoid added the design label May 6, 2019
@Piezoid Piezoid modified the milestones: 0.1, 1.0 May 6, 2019
@ZekunYin

Hi,
Is it possible to provide a streaming API to decompress a file chunk-by-chunk using a single thread, just like gzread in zlib? In my tests, I found that pugz is much faster even when using only a single thread.

Best,
Zekun
