
API design #6

Open
6 of 21 tasks
Piezoid opened this issue May 6, 2019 · 1 comment
Piezoid commented May 6, 2019

Most gzip libraries provide an easy-to-use API of the form:

decompress(void* state, const char* input, size_t in_len, char* output, size_t out_len)

It decompresses the input stream into the output buffer until it runs out of either input data or output buffer space. The user refills the input, drains the output, and makes another call, until the end of the file is reached.
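The calling convention above can be sketched as follows. This is a minimal illustration with hypothetical names: the `decompress` stand-in just copies bytes rather than inflating, but it keeps the classic contract of stopping when input or output space runs out and reporting how much of each was consumed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the decompressor: a real one would inflate,
// this one copies bytes, but the loop structure is the same.
struct State {};

void decompress(State*, const char* input, std::size_t in_len, std::size_t* in_used,
                char* output, std::size_t out_len, std::size_t* out_used) {
    std::size_t n = in_len < out_len ? in_len : out_len;
    std::memcpy(output, input, n);
    *in_used = n;
    *out_used = n;
}

// The user drives the synchronous loop: refill input, drain output, repeat
// until end of file.
std::string pull_all(const std::string& input) {
    State state;
    std::string result;
    char out[8]; // deliberately tiny to force several calls
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t in_used = 0, out_used = 0;
        decompress(&state, input.data() + pos, input.size() - pos, &in_used,
                   out, sizeof(out), &out_used);
        result.append(out, out_used);
        pos += in_used;
    }
    return result;
}
```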

In a multi-threaded implementation this synchronous interface is no longer possible. The code consuming the decompressed stream must instead be called back from each decompressor thread when a buffer is ready, and not in any particular order.

A gzip file is sliced into sections, processed sequentially, which are themselves split into chunks, processed in parallel, one per thread. The first chunk is decompressed normally in CPU cache and yields ~32KB decompressed buffers to the user callback. The other chunks are decompressed into larger buffers (~100MB) and require a second pass of back-reference translation. To do this in a cache-friendly manner, we propose to translate the buffer in segments that fit in the L1 cache and to invoke the callback after each cache fill.
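The segment-wise second pass can be sketched as below. This is an illustrative model, not the real wire format: unresolved back-references are modeled as indices into a lookup table, and the segment size would be ~32KiB in practice. The structure shows the idea of resolving one cache-sized segment at a time and handing it to the callback while it is still hot in L1.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative second pass: resolve references segment by segment, invoking
// the callback after each segment instead of after the whole buffer.
void translate_by_segments(const std::vector<std::size_t>& untranslated,
                           const std::string& lookup,
                           std::size_t segment_size,
                           const std::function<void(const std::string&)>& on_segment) {
    for (std::size_t start = 0; start < untranslated.size(); start += segment_size) {
        std::size_t end = std::min(start + segment_size, untranslated.size());
        std::string segment;
        for (std::size_t i = start; i < end; ++i)
            segment += lookup[untranslated[i]]; // resolve the reference
        on_segment(segment); // consumed while the data is cache-resident
    }
}
```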

There are other design matters, like the possibility of running decompression from your own custom thread pool, C FFI compatibility, etc. See the full discussion below.

Constraints:

  1. Translating the back-references on the whole buffer before yielding it to the client is costly, since it triggers unnecessary TLB and last-level cache misses over a large memory region. We need a cache-efficient translation/consumption step.

  2. The user callbacks will be called at arbitrary times from arbitrary threads, with decompressed content from arbitrary positions in the stream. The user should be able to reorder them, either synchronously (e.g. blocking reordering for writing to stdout) or asynchronously (e.g. a parser that carries unparsed rests between buffers).

  3. Support different threading implementations (OpenMP, custom work stealing, etc.).

  4. Static library with minimal headers. Smallish machine code blob. Support multiple languages, with C as the common denominator.

  5. Able to generate and load custom indexes.

Implementation:

  • The sequential-access thread yields data in ~32KiB pieces directly from its context window.
  • For the random-access (n-1) threads, the low-level interface will yield untranslated buffers + lookup tables.
  • Adaptors provide ways to translate the buffer by steps of 32KiB.
    • Something more flexible than fixed intervals for parsing, like requesting overlaps with previous buffer. We'll see how it turns out with FASTQ parsing.
  • Possibility to specify an unconsumed buffer size: "I didn't consume the last x bytes, please resend them at the beginning of the next buffer".
  • Full-buffer translation for the simplest-to-use interface (slow).
  1. We need to provide:
  • a total order over content parts,
    • needs polishing: currently (#section, #chunk, [#window flush])
    • could add or replace that with positions in the decompressed stream
  • a reordering for outputting parts in the correct order (currently std::mutex + std::condition_variable),
  • pin user data to threads (or maybe thread_local is good enough for that?)
  2. Support different threading mechanisms.
  • basic implementation launching n threads with std::thread.
  • API for allocating local decompressor data for each thread, bridge them together and start them from user's threads.
  • Move away from the header-only distribution. It takes too long to parse and generates too much code if multiple compilation units include it (without LTO).
  • Hide the implementation behind a Pimpl class + a virtual interface (or a struct of callbacks).
  • C interface for FFI interoperability: wrapping function + struct of callbacks.
  • Provide an easy-to-use CMake subproject + sample project.
  • An API yielding (stream position, content position, context) is enough to generate an index. (overlaps with 2.)
  • API to decompress from a sync point with an externally provided context. (may overlap with 3.)
  • Generic index format and high-level API to use it. As noted by Heng Li, the 32KB contexts can make for a heavy index if we want fine granularity. Large granularity is sufficient for faster parallel decompression, but not for efficient random access.
    • We could remove unused characters by re-indexing the back-references.
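The total order over content parts mentioned above, currently (#section, #chunk, [#window flush]), can be expressed as a small key type whose lexicographic comparison drives reassembly. Field names are illustrative, and std::tie provides the comparison.

```cpp
#include <cstddef>
#include <tuple>

// A content part is identified by (#section, #chunk, #window-flush);
// lexicographic comparison over the triple gives the stream order.
struct PartId {
    std::size_t section, chunk, window_flush;
};

bool operator<(const PartId& a, const PartId& b) {
    return std::tie(a.section, a.chunk, a.window_flush)
         < std::tie(b.section, b.chunk, b.window_flush);
}
```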
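The C interface idea (wrapping function + struct of callbacks) could take the shape below. All names here, including `pugz_callbacks`, are hypothetical, and the wrapper body is a stub: a real one would drive the decompressor and invoke the callback once per buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical C-compatible surface: an opaque user_data pointer plus a
// struct of function pointers, callable from any language with a C FFI.
extern "C" {
typedef struct {
    void* user_data;
    void (*on_chunk)(void* user_data, std::size_t index,
                     const char* data, std::size_t len);
} pugz_callbacks; /* hypothetical name */
}

// Stub wrapper: a real implementation would run the decompressor and call
// on_chunk for each decompressed buffer; here we emit a single chunk.
void run_decompression(const pugz_callbacks* cbs) {
    const char msg[] = "hi";
    cbs->on_chunk(cbs->user_data, 0, msg, 2);
}
```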
@Piezoid Piezoid added the design label May 6, 2019
@Piezoid Piezoid modified the milestones: 0.1, 1.0 May 6, 2019
@ZekunYin

Hi,
Is it possible to provide a streaming API to decompress a file chunk-by-chunk using a single thread, just like gzread in zlib? In my tests, I found that pugz is much faster even when using only a single thread.

Best,
Zekun
