Writing an archive can be inefficient because of seeks #367

QrczakMK · 2023-03-14T11:01:08Z

Description
A new entry in an archive is written by:

writing a preliminary directory entry
writing file contents
seeking backward before the directory entry
writing the real directory entry
seeking forward to skip the file

This can be quite inefficient for an archive consisting of many small files if the archive’s zip_source_t flushes its buffers on every seek and each flush has high latency, e.g. when working with a remote file abstraction.
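To make the cost concrete, here is a minimal sketch that only counts flushes for the write pattern listed above. It is a toy model, not libzip code: `sink_t`, its helpers, and the assumption that every seek away from the current position flushes the buffer are all hypothetical, and the entry header is reduced to the 30-byte fixed part of a zip local file header (file names and extra fields are ignored). The buffer capacity must be greater than zero.

```c
#include <stddef.h>

/* Hypothetical model of a buffered sink whose flushes are expensive.
 * This is not libzip code; it only counts flushes. */
typedef struct {
    size_t pos;       /* current write position */
    size_t buffered;  /* bytes currently held in the buffer */
    size_t capacity;  /* buffer capacity in bytes (must be > 0) */
    unsigned flushes; /* number of non-empty flushes so far */
} sink_t;

static void sink_flush(sink_t *s) {
    if (s->buffered > 0) {
        s->buffered = 0;
        s->flushes++;
    }
}

static void sink_write(sink_t *s, size_t len) {
    while (len > 0) {
        size_t room = s->capacity - s->buffered;
        size_t n = len < room ? len : room;
        s->buffered += n;
        s->pos += n;
        len -= n;
        if (s->buffered == s->capacity)
            sink_flush(s);
    }
}

/* Assumption: any seek away from the current position flushes first. */
static void sink_seek(sink_t *s, size_t target) {
    if (target != s->pos) {
        sink_flush(s);
        s->pos = target;
    }
}

enum { HEADER_LEN = 30 }; /* fixed part of a zip local file header */

/* Pattern from the description: header, data, seek back, header, seek forward. */
unsigned flushes_with_seeks(size_t nfiles, size_t file_len, size_t capacity) {
    sink_t s = {0, 0, capacity, 0};
    for (size_t i = 0; i < nfiles; i++) {
        size_t start = s.pos;
        sink_write(&s, HEADER_LEN); /* preliminary header */
        sink_write(&s, file_len);   /* file contents */
        size_t end = s.pos;
        sink_seek(&s, start);
        sink_write(&s, HEADER_LEN); /* real header */
        sink_seek(&s, end);
    }
    sink_flush(&s);
    return s.flushes;
}

/* Seek-free pattern: the header written first is already final. */
unsigned flushes_without_seeks(size_t nfiles, size_t file_len, size_t capacity) {
    sink_t s = {0, 0, capacity, 0};
    for (size_t i = 0; i < nfiles; i++) {
        sink_write(&s, HEADER_LEN);
        sink_write(&s, file_len);
    }
    sink_flush(&s);
    return s.flushes;
}
```

In this model, 100 files of 70 bytes each with a 4096-byte buffer cost 200 flushes with seeks and 3 without. Real numbers depend entirely on how the source implements buffering.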

We should be able to do better if ZIP_SOURCE_STAT provides enough information to write the correct directory entry immediately.

Even if libzip does not trust the crc provided by the entry’s zip_source_t, it can write the preliminary entry with the provided crc, and later check if the directory entry needs to be updated or not.

Solution
Avoid write seeks when possible (seeking to the current position is OK), i.e. if ZIP_SOURCE_STAT provides enough information to write the correct directory entry before writing file contents.
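A rough sketch of the check this would need, for a stored (uncompressed) entry: write the header from the values the source claims, compute the real values while writing the data, and seek back only if they differ. `entry_info_t` and `header_needs_rewrite()` are hypothetical names for illustration, not libzip API; in libzip the claimed values would come from the source's ZIP_SOURCE_STAT answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields a source reports up front. */
typedef struct {
    uint32_t crc;
    uint64_t comp_size;
    uint64_t uncomp_size;
} entry_info_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), as used by the zip format. */
static uint32_t crc32_bytes(const unsigned char *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 if the header written before the data must be rewritten
 * (so a backward seek is needed), 0 if it was already correct.
 * Models a stored entry, where compressed size equals uncompressed size. */
int header_needs_rewrite(const entry_info_t *claimed,
                         const unsigned char *data, size_t len) {
    entry_info_t actual;
    actual.crc = crc32_bytes(data, len);
    actual.comp_size = len;
    actual.uncomp_size = len;
    return claimed->crc != actual.crc
        || claimed->comp_size != actual.comp_size
        || claimed->uncomp_size != actual.uncomp_size;
}
```

When the source reports correct values, the function returns 0 and the entry can be written without any seek; a wrong CRC or size falls back to the existing seek-and-rewrite path.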


QrczakMK · 2023-03-16T09:29:52Z

Update:

It is sometimes possible for seeks near the current position to be performed within the buffer without flushing it. I made such a change in google/riegeli@1d03ff7.

But this does not help when the entry crosses a buffer boundary. Also, POSIX fseek() is specified to flush buffered data to the file, so this does not help when ZIP_SOURCE_SEEK_WRITE maps directly to fseek().
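The following toy model illustrates both points: a seek whose target lies inside the buffered range just moves the cursor, while a seek outside it flushes. It is a sketch under my own assumptions, not the riegeli or libzip implementation; `bufwriter_t` and its helpers are hypothetical, and only non-empty flushes are counted.

```c
#include <stddef.h>

/* Hypothetical buffered writer that can seek inside its own buffer. */
typedef struct {
    size_t buf_start; /* file offset of the first buffered byte */
    size_t buf_len;   /* number of valid bytes in the buffer */
    size_t cursor;    /* current position as a file offset */
    size_t capacity;  /* buffer capacity in bytes (must be > 0) */
    unsigned flushes; /* number of non-empty flushes so far */
} bufwriter_t;

static void bw_flush(bufwriter_t *w) {
    if (w->buf_len > 0)
        w->flushes++;
    w->buf_start = w->cursor;
    w->buf_len = 0;
}

static void bw_write(bufwriter_t *w, size_t len) {
    while (len > 0) {
        size_t offset = w->cursor - w->buf_start;
        if (offset == w->capacity) { /* buffer full: flush and start over */
            bw_flush(w);
            offset = 0;
        }
        size_t room = w->capacity - offset;
        size_t n = len < room ? len : room;
        w->cursor += n;
        if (w->cursor - w->buf_start > w->buf_len)
            w->buf_len = w->cursor - w->buf_start;
        len -= n;
    }
}

static void bw_seek(bufwriter_t *w, size_t target) {
    if (target >= w->buf_start && target <= w->buf_start + w->buf_len) {
        w->cursor = target; /* inside the buffer: no flush needed */
    } else {
        bw_flush(w);        /* outside the buffer: flush, then move */
        w->cursor = target;
        w->buf_start = target;
    }
}

enum { ENTRY_HEADER_LEN = 30 }; /* fixed part of a zip local file header */

/* Same write pattern as in the description, on the in-buffer-seek writer. */
unsigned flushes_with_buffered_seeks(size_t nfiles, size_t file_len,
                                     size_t capacity) {
    bufwriter_t w = {0, 0, 0, capacity, 0};
    for (size_t i = 0; i < nfiles; i++) {
        size_t start = w.cursor;
        bw_write(&w, ENTRY_HEADER_LEN);
        bw_write(&w, file_len);
        size_t end = w.cursor;
        bw_seek(&w, start);
        bw_write(&w, ENTRY_HEADER_LEN);
        bw_seek(&w, end);
    }
    bw_flush(&w);
    return w.flushes;
}
```

For 100 files of 70 bytes and a 4096-byte buffer, this model needs 7 flushes: each of the two entries that cross a buffer boundary costs three flushes, plus one final flush. Entries that fit entirely in the buffer cost nothing extra.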

dillof · 2023-03-29T08:32:47Z

Comparing the dirent we wrote before writing the file data and only seeking and re-writing it if it changed sounds like a good optimisation. We'll implement it when we get to it, or you could give it a try and submit a pull request.

Seeking within the file buffer is beyond what we are willing to do.

QrczakMK · 2023-03-29T10:58:27Z

I do not expect libzip to seek within the buffer, since buffering is applied by the implementation of the source, i.e. in general by the client of libzip. For zip_source_file_create(), buffering is delegated to stdio, where seeking within the buffer is infeasible: fseek() is documented to always flush the buffer as a side effect.

I mentioned this as a partial mitigation which can be performed by the source. This mitigation is useful while libzip still seeks, and also for cases where the seek cannot be eliminated because the initial dirent does not contain all information (e.g. crc). The mitigation is partial because it is infeasible if the entry crosses buffer boundaries.

The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.

dillof · 2023-03-29T14:37:23Z

> The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.

Yes, that's the plan.

kleisauke · 2023-05-04T11:11:08Z

Alternatively, one can write a 'streamed' ZIP, in which case seeking wouldn't be necessary. See #378.

QrczakMK added the enhancement Request a new feature. label Mar 14, 2023
