Adds zip support to the zlib module #45651

Open · wants to merge 2 commits into base: main

arcanis
Contributor

@arcanis arcanis commented Nov 28, 2022

Ref #45434

This PR adds a new ZipArchive class to the zlib module, which can be used to read and write content from zip archives. Its current API looks like this:

const fs = require(`fs`);
const {ZipArchive} = require(`zlib`);

// Creates a new in-memory archive
const zip = new ZipArchive();
zip.addFile(`hello`, fs.readFileSync(__filename));
const data = zip.digest();

fs.writeFileSync(`./archive.zip`, data);

// The data obtained from `digest` can also be reopened
const zip2 = new ZipArchive(data);
console.log(zip2.getEntries());
console.log(zip2.getEntries({withFileTypes: true}));
const content = zip2.readEntry(0);

console.log(content);

Maintenance cost

I kept the feature scope narrow: wide enough to cover most use cases, but without increasing the maintenance or build cost. A few things that libzip would allow have been cut:

  • Opening files directly from the filesystem isn't supported, because it would bypass node:fs. The current API only works with memory buffers (according to my tests it doesn't have any negative impact even when compared to the wasm API which went through file descriptors).

  • Encryption isn't supported, because it's unclear how it should integrate with node:crypto. There's room for follow-up, but it didn't seem a required feature for the first iteration.

Performance

Keep in mind that raw performance isn't the main reason why zip support is important to have as a native feature. The speedup is nice, the simplified garbage collection is very nice, but the real benefit is having a stable, cross-platform way to bundle files. It will be useful for cache mechanisms, transfer algorithms, user CLI generation, and more.

Still, I made some reasonable checks to make sure that no use case regressed. Size of the binary before / after:

before 89604801 85.45MB
after  89780257 85.62MB (+171KB)
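
As a quick sanity check on the figures above, the delta can be recomputed from the two byte counts. This is only arithmetic on the numbers quoted here; the variable names are made up for illustration.

```javascript
// Byte counts copied from the before/after measurements above.
const before = 89604801;
const after = 89780257;

const deltaBytes = after - before;                 // 175456
const deltaKiB = deltaBytes / 1024;                // roughly 171 KiB
const growthPercent = (deltaBytes / before) * 100; // relative growth of the binary

console.log(
  `+${deltaBytes} bytes (+${deltaKiB.toFixed(1)} KiB, +${growthPercent.toFixed(2)}%)`
);
```

The result matches the +171KB quoted above and puts the growth at well under one percent of the binary.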

Performance-wise, using Yarn as a benchmark, the results show native being ~2x faster than wasm (keep in mind that the wasm implementation isn't the most popular zip library; projects using jszip will see significantly larger differences):

YARN_EXPERIMENT_NATIVE_ZIPFS=0 PKG=gatsby
➤ YN0000: └ Completed in 15s 806ms
YARN_EXPERIMENT_NATIVE_ZIPFS=1 PKG=gatsby
➤ YN0000: └ Completed in 7s 404ms

YARN_EXPERIMENT_NATIVE_ZIPFS=0 PKG=typescript
➤ YN0000: └ Completed in 13s 676ms
YARN_EXPERIMENT_NATIVE_ZIPFS=1 PKG=typescript
➤ YN0000: └ Completed in 5s 351ms

YARN_EXPERIMENT_NATIVE_ZIPFS=0 PKG=next
➤ YN0000: └ Completed in 5s 923ms
YARN_EXPERIMENT_NATIVE_ZIPFS=1 PKG=next
➤ YN0000: └ Completed in 3s 512ms
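
To make the "~2x" claim concrete, the speedup per package can be derived from the timings above. The sketch assumes that `YARN_EXPERIMENT_NATIVE_ZIPFS=0` is the wasm run and `=1` is the native run; the object layout is just for illustration.

```javascript
// Timings in milliseconds, copied from the Yarn runs above.
// Assumption: the first value is the wasm run (flag = 0), the second the native run (flag = 1).
const runs = {
  gatsby: [15806, 7404],
  typescript: [13676, 5351],
  next: [5923, 3512],
};

const speedups = {};
for (const [pkg, [wasmMs, nativeMs]] of Object.entries(runs)) {
  // Speedup = how many times faster the native run finished.
  speedups[pkg] = wasmMs / nativeMs;
  console.log(`${pkg}: ${speedups[pkg].toFixed(2)}x`);
}
```

The ratios land between roughly 1.7x and 2.6x, so "~2x" is a fair summary of these three runs.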

To Do

  • Improve the documentation
  • Add more regression tests
  • Benchmark against the WASM libzip
  • API Bikeshedding

@nodejs-github-bot nodejs-github-bot added the build, dependencies, meta, needs-ci labels Nov 28, 2022
@arcanis arcanis mentioned this pull request Nov 28, 2022
Contributor

@mscdex mscdex left a comment

-1 I know this is an unpopular opinion, but I'm not convinced this should live in node core. Sure it's a common file format, but then again so are tar, rar, 7z, zstd, xz, and others, which also don't belong in node core.

I feel like adding such modules to node core is further leading to feature creep. I get that other platforms like PHP and such may have zip modules, but they also include a ton of other modules that make them "kitchen sink" platforms, which I would hate to see node.js become.

@GeoffreyBooth
Member

@mscdex Could the discussion of whether we should do this stay in #45434? And this PR discussion can focus on the implementation.

Member

@addaleax addaleax left a comment

> I'd appreciate some early review to let me know places where the code isn't compatible with Node.js' standards.

Might be jumping the gun here since yeah, there should be a discussion about whether this should happen at all, but, sure, gave it a first look. I do concur with @mscdex's concerns, fwiw.

src/node_zip.cc Outdated
}

void ZipArchive::MemoryInfo(MemoryTracker* tracker) const {
}
Member

(might want to fill this out)

Contributor Author

How precise does it have to be? I updated the code to track the size of the input buffer + the buffer of any file that gets added later, but it doesn't include the small-ish libzip overhead, and gets confused if the same file is modified multiple times. If the value is indicative it might be fine?

Member

It doesn’t have to be super precise, but it should be usable for debugging. You’ll probably want to track buf_ here via tracker->TrackField("buf", buf_);. If you can’t track or estimate memory owned by libzip (including memory for added entries?) then it’s probably fine to omit it, rather than to give numbers that are potentially very inaccurate (e.g. after repeated AddEntry() + DeleteEntry() calls).

src/node_zip.cc Outdated
}

zip_int64_t file_index = zip_file_add(zip->zip_, *path, file_source, ZIP_FL_OVERWRITE | ZIP_FL_ENC_UTF_8);
CHECK_GE(file_index, 0);
Member

What if this call fails? Likely also applies elsewhere.

Contributor Author

For now the code is overly strict: if any function fails, the program aborts on the CHECK_GE call. I have to replace most of these calls with something that throws instead.
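
For illustration, the difference between aborting and throwing can be sketched from the JavaScript side. Everything here is hypothetical: `fakeBinding`, `addFileOrThrow`, and the error code are made-up names, not part of this PR or of Node.js. The only detail taken from libzip is that `zip_file_add` returns a negative value on failure.

```javascript
// Hypothetical stand-in for the native binding: like zip_file_add, it
// returns the entry index on success and a negative number on failure.
const fakeBinding = {
  addFile(path, data) {
    return path.length === 0 ? -1 : 0;
  },
};

// Hypothetical wrapper: turn a failure code into a catchable error
// instead of letting the process abort.
function addFileOrThrow(binding, path, data) {
  const index = binding.addFile(path, data);
  if (index < 0) {
    const err = new Error(`Failed to add "${path}" to the archive`);
    err.code = 'ERR_ZIP_ADD_FAILED'; // made-up code, not a real Node.js error code
    throw err;
  }
  return index;
}

try {
  addFileOrThrow(fakeBinding, '', Buffer.alloc(0));
} catch (err) {
  console.log(err.code); // ERR_ZIP_ADD_FAILED
}
```

With this shape, a failing call surfaces as a regular exception that user code can catch, rather than taking the whole process down.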

@anonrig
Member

anonrig commented Nov 28, 2022

Apart from the comments from other reviewers, my main concern is the use of the buffer module. I strongly believe that the public API should consume native typed arrays (TypedArray) instead of Node.js buffers.

@arcanis
Contributor Author

arcanis commented Nov 28, 2022

I used Buffer since that's what the other main Node APIs tended to use (fs, crypto, zlib) - wouldn't it be surprising for users to return a regular typed array in just this API?

@anonrig
Member

anonrig commented Nov 28, 2022

> I used Buffer since that's what the other main Node APIs tended to use (fs, crypto, zlib) - wouldn't it be surprising for users to return a regular typed array in just this API?

There is an initiative to use native buffers in the new public APIs (referencing my personal talks with @addaleax and @jasnell). A @nodejs/tsc member can clarify if this is still a thing.

@addaleax
Member

@anonrig I don’t know if that’s the best way forward here, but since this feels like a very broad conversation (“Should new Node.js APIs return Uint8Array or should they return Buffer?”), maybe it’s also best to handle that separately from this specific PR?

@jasnell
Member

jasnell commented Nov 29, 2022

In this case, I think Buffer is fine given that it is consistent with the rest of the zlib module. I can see us eventually making a call on avoiding Buffer in the future (or standardizing on it) but this is not the place to decide that
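
For context on the Buffer versus Uint8Array question, here is a small sketch of how the existing zlib functions behave today. It uses only long-standing `zlib` and `Buffer` APIs; nothing in it depends on this PR.

```javascript
const zlib = require('zlib');

// zlib accepts a plain Uint8Array as input...
const input = new Uint8Array([104, 101, 108, 108, 111]); // "hello"
const compressed = zlib.deflateSync(input);

// ...and hands back a Buffer, which is itself a Uint8Array subclass.
console.log(Buffer.isBuffer(compressed));      // true
console.log(compressed instanceof Uint8Array); // true

const restored = zlib.inflateSync(compressed);
console.log(restored.toString('utf8'));        // hello

// A Uint8Array can also be viewed as a Buffer without copying the bytes.
const view = Buffer.from(input.buffer, input.byteOffset, input.byteLength);
console.log(view.buffer === input.buffer);     // true
```

Since a Buffer is already a Uint8Array, callers who prefer typed arrays can consume a Buffer-returning API as is; the reverse direction needs the explicit `Buffer.from` view shown above.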

@jasnell
Member

jasnell commented Nov 29, 2022

Is a new top level module what we want here? As opposed to adding this to zlib? I know it's not based on zlib but neither is brotli.
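
As a reminder of the precedent mentioned here, Brotli is already exposed through the `zlib` module even though it is a separate library. A minimal round trip using the existing synchronous API looks like this (the sample payload is arbitrary):

```javascript
const zlib = require('zlib');

// Arbitrary, highly repetitive payload so that compression clearly helps.
const original = Buffer.from('hello '.repeat(100));

// Brotli functions live next to the deflate/gzip ones in the zlib module.
const packed = zlib.brotliCompressSync(original);
const unpacked = zlib.brotliDecompressSync(packed);

console.log(unpacked.equals(original));       // true
console.log(packed.length < original.length); // true
```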

@arcanis
Contributor Author

arcanis commented Nov 29, 2022

I think @GeoffreyBooth had the same feedback. I don't have a strong opinion there, perhaps zlib would indeed better match user expectations.

@GeoffreyBooth
Member

I would put it in zlib. In the future we could consider a friendlier name as an alias for zlib, like node:compression or something, but that's for later.

@tniessen
Member

tniessen commented Dec 5, 2022

> There is an initiative to use native buffers in the new public APIs

Just leaving this reference here: #41588

@tniessen
Member

tniessen commented Dec 5, 2022

> In the future we could consider a friendlier name as an alias for zlib, like node:compression or something, but that's for later.

That might make some sense for zip, which inherently supports compression, but if we do add other archive formats that don't, it won't fit. Also, except for zip, compression and archive formats are orthogonal even if related topics.

@arcanis arcanis changed the title from "Adds prototype zip module" to "Adds prototype archive module" Dec 18, 2022
@arcanis arcanis changed the title from "Adds prototype archive module" to "Adds zip support to the zlib module" Jan 6, 2023
@arcanis arcanis marked this pull request as ready for review January 6, 2023 21:30
Comment on lines +257 to +268
## Compressing multiple files together

<!-- YAML
added: REPLACEME
-->

The `zlib` library provides ways to compress individual objects, but not to
aggregate multiple ones into a single file suitable for redistribution (what
is often called archival).

To this end, `node:zip` provides the `ZipArchive` class which allows to create,
read, and modify zip archives:
Member

I assume this section needs updating? Because of references to node:zip etc.

Contributor Author

Indeed, it's a typo, the archive is now part of node:zlib (would it make sense to have a check in lint-md that all node:something identifiers must be valid?)
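
A rough sketch of what such a check could look like, assuming it runs as a standalone script rather than inside lint-md. The helper name and the regular expression are made up for illustration; `isBuiltin` is the existing function from `node:module`, available in recent Node.js releases.

```javascript
const { isBuiltin } = require('module');

// Hypothetical helper: collect every `node:...` specifier in a piece of
// Markdown and return those that Node.js does not recognize as built-in.
function findUnknownNodeSpecifiers(markdown) {
  const matches = markdown.match(/\bnode:[a-z_][a-z0-9_\/]*/g) ?? [];
  // Deduplicate, then keep only the specifiers that are not built-in modules.
  return [...new Set(matches)].filter((id) => !isBuiltin(id));
}

const sample =
  'To this end, `node:zip` provides `ZipArchive`; see also `node:zlib` and `node:fs/promises`.';

console.log(findUnknownNodeSpecifiers(sample)); // [ 'node:zip' ]
```

One caveat with this approach is that the result depends on the Node.js version running the check, so documentation for a module added in the same PR would need the check to run against the freshly built binary.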

10 participants