Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing an archive can be inefficient because of seeks #367

Open
QrczakMK opened this issue Mar 14, 2023 · 5 comments
Open

Writing an archive can be inefficient because of seeks #367

QrczakMK opened this issue Mar 14, 2023 · 5 comments
Labels
enhancement Request a new feature.

Comments

@QrczakMK
Copy link

Description
A new entries in an archive is written by:

  • writing a preliminary directory entry
  • writing file contents
  • seeking backward before the directory entry
  • writing the real directory entry
  • seeking forward to skip the file

This can be quite inefficient for an archive consisting of many small files if the archive’s zip_source_t flushes its buffers for seeking and has high latency of flushing the buffer, e.g. when working with a remote file abstraction.

We should be able to do better if ZIP_SOURCE_STAT provides enough information to write the correct directory entry immediately.

Even if libzip does not trust the crc provided by the entry’s zip_source_t, it can write the preliminary entry with the provided crc, and later check if the directory entry needs to be updated or not.

Solution
Avoid write seeks when possible (seeking to the current position is OK), i.e. if ZIP_SOURCE_STAT provides enough information to write the correct directory entry before writing file contents.

@QrczakMK QrczakMK added the enhancement Request a new feature. label Mar 14, 2023
@QrczakMK
Copy link
Author

Update:

It is sometimes possible for seeks near the current position to be performed within the buffer, and not to cause flushing of the buffer. I made such change in google/riegeli@1d03ff7.

But this does not help when the entry crosses a buffer boundary. Also, POSIX fseek() is specified to flush buffered data to the file, so this does not help when ZIP_SOURCE_SEEK_WRITE maps directly to fseek().

@dillof
Copy link
Member

dillof commented Mar 29, 2023

Comparing the dirent we wrote before writing the file data and only seeking and re-writing it if it changed sounds like a good optimisation. We'll implement it when we get to it, or you could give it a try and submit a pull request.

Seeking within the file buffer is beyond what we are willing to do.

@QrczakMK
Copy link
Author

QrczakMK commented Mar 29, 2023

I do not expect libzip to seek within the buffer, since buffering is applied by the implementation of the source, i.e. in general by the client of libzip. For zip_source_file_create() buffering is delegated to stdio and this is infeasible: fseek() is documented to always flush the buffer as a side effect.

I mentioned this as a partial mitigation which can be performed by the source. This mitigation is useful while libzip does seeks, and also for cases where the seek cannot be eliminated because the initial dirent does not contain all information (e.g. crc). This mitigation is partial because it is infeasible if the file crossed buffer boundaries.

The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.

@dillof
Copy link
Member

dillof commented Mar 29, 2023

The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.

Yes, that's the plan.

@kleisauke
Copy link

Alternatively, one can write a 'streamed' ZIP, then seeking wouldn't be necessary. See #378.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Request a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants