Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wanted: Reimplement WHATWG URL Parser in WASM #38708

Closed
jasnell opened this issue May 17, 2021 · 34 comments
Closed

Wanted: Reimplement WHATWG URL Parser in WASM #38708

jasnell opened this issue May 17, 2021 · 34 comments
Labels
good first issue Issues that are suitable for first-time contributors. wasm Issues and PRs related to WebAssembly. whatwg-url Issues and PRs related to the WHATWG URL implementation.

Comments

@jasnell
Copy link
Member

jasnell commented May 17, 2021

The current WHATWG URL parser implementation is written in C/C++ and incurs a fairly significant cost crossing the JS/C++ boundary. It should be possible to realize a significant performance improvement by porting the implementation to WASM (similar to how the https://github.com/nodejs/undici project has seen a massive performance boost out of moving llhttp parser to wasm).

If someone wanted to pick up this effort, I am available to help mentor through it.

It will require c/c++ experience. What I would imagine would be best is setting it up as a separate project that we would vendor in as a dependency. Done right it would have no breaking API changes.

@jasnell jasnell added good first issue Issues that are suitable for first-time contributors. wasm Issues and PRs related to WebAssembly. whatwg-url Issues and PRs related to the WHATWG URL implementation. labels May 17, 2021
@devsnek
Copy link
Member

devsnek commented May 17, 2021

Can we use rust?

@jasnell
Copy link
Member Author

jasnell commented May 17, 2021

As a vendored in dependency I personally wouldn't care. But... need to consider the impact on the prerequisites that are necessary to build node.js. If vendoring the dependency requires someone to set up a whole rust development environment to be able to build, then that may be problematic. If, alternatively, what we vendor in is just the wasm files and some glue code such that we only need the wasm compiler, then it really doesn't matter what the source language is.

@akshay1992kalbhor
Copy link

How much C/C++ experience would be needed? I have submitted small patches to Gecko before

@voxpelli
Copy link

The Servo WHATWG URL parser in Rust has CI set up for WASM. Compiling that and publishing the resulting WASM to eg. npm could maybe even be something they themselves would be interested in? https://github.com/servo/rust-url/blob/d673c4d5e22b3a8ac91b7f52faa45dc32a275f75/.github/workflows/main.yml

@legendecas
Copy link
Member

Since the practice already exists in undici, can we first try to work this on an npm module?

@jasnell
Copy link
Member Author

jasnell commented May 17, 2021

I'll be happy with whatever works and keeps us spec compliant. It would be good to ensure that we can still quickly respond to changes in the base spec, so we'll need to make sure that whatever the implementation is it stays maintained.

@arsalanhub
Copy link

@jasnell Has anyone taken the issue? If not then I would like to work on this.

@puzpuzpuz
Copy link
Member

t should be possible to realize a significant performance improvement by porting the implementation to WASM (similar to how the https://github.com/nodejs/undici project has seen a massive performance boost out of moving llhttp parser to wasm).

It tried to find any benchmark results for the mentioned undici migration to wasm and found only the following message: nodejs/undici#575

Probably that's the not right place to look into since I don't see a significant performance improvement in the above message. Most likely that's a wrong place to look at, so could someone tell me where to find more information on existing experiments?

@voxpelli
Copy link

So, simplified, in essence what it would take is to swap the internalBinding('url') in internal/url.js for a WASM-import of the same stuff:

node/lib/internal/url.js

Lines 79 to 102 in 910efc2

const {
domainToASCII: _domainToASCII,
domainToUnicode: _domainToUnicode,
encodeAuth,
toUSVString: _toUSVString,
parse,
setURLConstructor,
URL_FLAGS_CANNOT_BE_BASE,
URL_FLAGS_HAS_FRAGMENT,
URL_FLAGS_HAS_HOST,
URL_FLAGS_HAS_PASSWORD,
URL_FLAGS_HAS_PATH,
URL_FLAGS_HAS_QUERY,
URL_FLAGS_HAS_USERNAME,
URL_FLAGS_IS_DEFAULT_SCHEME_PORT,
URL_FLAGS_SPECIAL,
kFragment,
kHost,
kHostname,
kPathStart,
kPort,
kQuery,
kSchemeStart
} = internalBinding('url');

Similar to how the import of llhttp in undici works.

Considering that, I guess it can easier to compile the existing c++ code to WebAssembly than to set up some wasm-pack / wasm-bindgen flow that compiles JS-bindings for servo/rust-url using something like wasm-pack build --target=nodejs

I did however open servo/rust-url#712 to see if they would be interested in publishing such bindings themselves.

@Flarna
Copy link
Member

Flarna commented May 19, 2021

Hope my question is not too stupid. But may I ask how WASM is debugged? JS and C++ are quite easy to debug but I never debugged WASM till now.

@devsnek
Copy link
Member

devsnek commented May 19, 2021

In this context, you can use chrome devtools. it supports dwarf sections and whatnot so it should "just work"

@jasnell
Copy link
Member Author

jasnell commented May 19, 2021

@voxpelli ... yes, that's essentially it. The one complicating bit there are the domainTo... bits that rely on functionality in ICU, which is not exposed to WASM yet.

@WenheLI
Copy link

WenheLI commented Jun 10, 2021

@jasnell - This sounds interesting, just wondering if there is any starting point for this issue?

@voxpelli
Copy link

Maybe https://www.assemblyscript.org/ could be one way to achieve this?

@anonrig
Copy link
Member

anonrig commented Feb 12, 2022

Hi!

Upon talking with @jasnell, I wanted to give a heads up that I'm currently working on this issue. I've started writing WHATWG URL parser in Rust (as a graduation project for my masters) with a goal of replacing the existing implementation if the performance and the implementation justifies it. It's quite early to estimate the performance impact but if you want to help or just take a look at my progress feel free to visit github.com/anonrig/url.

I'm in the process of URL encoding/decoding the inputs and have a working copy of URLSearchParams written in Rust. My current solution is written in Rust, and later compiled into Wasm using wasm-bindgen and serde-wasm-bindgen. This allows us to generate/create a typescript definition included Node module without writing any JavaScript and simplifies the build process.

If you're new to Rust but want to contribute to my process of solving this issue, you can reference the Github Workflow I've created which creates, builds, tests and later deploys the Rust package to npm under url-wasm (which will be available asap)

My primarily goal is to have a spec compliant URL and URLSearchParams implementation and later focus on the performance impact/benchmarking it. (I'm also an active member in the Node.js Slack channel under my name and surname, if anybody wants to say hello and talk about URL)

@nebarf
Copy link

nebarf commented Feb 18, 2022

@anonrig What's your roadmap? What are next steps in terms of development?
It would be great having those steps categorized under github issues just to be able to pick up some of them.

@bnoordhuis
Copy link
Member

One downside that hasn't been mentioned yet is lack of code sharing. The C++ URL parser gets shared between processes (think shared library) but with WASM, if you have a cluster of let's say 32 workers, you have 32 WASM blobs taking up memory.

I don't think V8 supports precompiling WASM, like it does for regular JS (the startup snapshot), but even if it did, bigger snapshots means slower startup times. Snapshots must be deserialized before they can be used.

The overhead might be negligible for something like a URL parser, though.

@anonrig
Copy link
Member

anonrig commented Feb 20, 2022

Thanks for your patience @nebarf. Upon request I've created an issue on URL-WASM repository and added the missing features. If you have any questions or want to contribute, let's discuss it on the repository issues. anonrig/url#1

@anonrig
Copy link
Member

anonrig commented Feb 20, 2022

Upon my initial benchmarking I've realized that WebAssembly is primarly written to support long-running CPU intensive tasks, which does minimal data sharing between JS and WASM. Due to the nature of URL parser, and being bound to TextEncoder and TextDecoder serialization requirement of WebAssembly, there's a 20% reduced performance. My benchmark was primarily focused on native URL parser and the Rust one I've working on. I'd be happy to add the benchmarks to the repository when I've more information about the performance impact of wasm-bindgen.

@devsnek
Copy link
Member

devsnek commented Feb 20, 2022

This could be an interesting opportunity to experiment with v8's work on wasm interface types & fast calls, which should help with the performance.

@anonrig
Copy link
Member

anonrig commented Mar 1, 2022

Upon implementing a PoC using Rust and WebAssembly, and talking back and forth with @jasnell, I found out that WebAssembly is not suitable for this particular task. I'll start implementing the URL state machine using JavaScript under https://github.com/anonrig/url-js.

Here's my experience and learnings using WASM and Rust: https://www.yagiz.co/implementing-node-js-url-parser-in-webassembly-with-rust/

@RaisinTen
Copy link
Contributor

I'll start implementing the URL state machine using JavaScript under https://github.com/anonrig/url-js.

How would it be different from https://github.com/jsdom/whatwg-url?

@anonrig
Copy link
Member

anonrig commented Mar 2, 2022

How would it be different from https://github.com/jsdom/whatwg-url?

Since URLSearchParams is written in JavaScript, there's no need for me to focus on reimplementing it in JavaScript. The only bottleneck and the point of this task is to create a URL state machine in JS (which is also implemented in whatwg-url). Personally speaking, I did not like the implementation in whatwg-url. It is not well documented, it's not well tested, and even for a guy who read the URL specification couple of times, it's really hard to read/understand the url-state machine implementation. That's why I don't want to re-use the code from the library. But that's again, is just one opinion.

@XadillaX
Copy link
Contributor

XadillaX commented Apr 1, 2022

I've extacted the C++ logic for URL. Maybe I can try to compile it to wasm.

https://github.com/XadillaX/libwhatwgurl

@anonrig
Copy link
Member

anonrig commented May 7, 2022

Status Update:

I've written and open-sourced url-state-machine both available on NPM and Github. The package is written in JavaScript due to the TextEncoding/TextDecoding bottleneck in WebAssembly (For more information read this article).

Currently, url-state-machine is faster than other JavaScript based implementations, but still not as fast as C++. I came to a state where I've squeezed as much as performance from JavaScript with the help of @jasnell, but the outcome is not as good as it should be (especially for long strings).

I'd like to ask the community for some feedback especially on the url-state-machine in case I've missed something important. If not, I believe that this task is destined to be open for as long as v8 provides more performance boosts to the JavaScript user-land.

@devsnek
Copy link
Member

devsnek commented May 7, 2022

interface types should hopefully help with this case, allowing the wasm to tell v8 it wants to deal with strings, instead of you having to manually encode and copy them into the wasm's memory.

@XadillaX
Copy link
Contributor

XadillaX commented May 16, 2022

I've written the C++ version of WHATWG URL and passed all the WPT of 3 May, 2022 (Living Standard — Last Updated 3 May 2022) and already used in our own Web-interoperable Runtime. And I initially commit the C++ code in https://github.com/xadillax/libwhatwgurl. The lib and binding directories are the code of Node.js binding demos.

I think we can continue working on this.

@RaisinTen
Copy link
Contributor

@XadillaX did you compile it to wasm? How performant was it?

@XadillaX
Copy link
Contributor

@XadillaX did you compile it to wasm? How performant was it?

Not yet. I'm currently using it via v8's binding. If necessary, I will try to.

@RaisinTen
Copy link
Contributor

Is it worth moving the current implementation into a separate project outside the core repo if it's also written in C++ and not compiled to wasm? I'm not sure what the advantage would be. A disadvantage I can think of is that it would probably delay the process of landing changes made to the parser in the core repo.

@XadillaX
Copy link
Contributor

Is it worth moving the current implementation into a separate project outside the core repo if it's also written in C++ and not compiled to wasm? I'm not sure what the advantage would be. A disadvantage I can think of is that it would probably delay the process of landing changes made to the parser in the core repo.

Behind this disadvantage (maybe), I think this can:

  1. The core repo may be decoupled;
  2. The libwhatwgurl repo may be maintained / contributed by not only Node.js ecosystem, it also can be maintained by other projects who need this feature.
  3. The libwhatwgurl can focus on catching the new spec and passing the latest WPT as soon as possible, and also can focus on perfomance, etc.

@RaisinTen
Copy link
Contributor

RaisinTen commented May 17, 2022

The core repo may be decoupled

Is maintaining the parser in the core repo problematic?

The libwhatwgurl repo may be maintained / contributed by not only Node.js ecosystem, it also can be maintained by other projects who need this feature.

Why would anyone use this outside of Node.js? Are we going to treat security vulnerabilities from the parser the same way as the ones coming from projects like OpenSSL?

The libwhatwgurl can focus on catching the new spec and passing the latest WPT as soon as possible, and also can focus on perfomance, etc.

That can be done in the core repo too, right?

@voxpelli
Copy link

@XadillaX did you compile it to wasm? How performant was it?

Not yet. I'm currently using it via v8's binding. If necessary, I will try to.

This issue is about investigations whether a WASM-version of the code could become faster than the current C++ version, so for the sake of this issue: It's needed 🙂

Is it worth moving the current implementation into a separate project outside the core repo if it's also written in C++ and not compiled to wasm?

I think it would have to be a separate issue from this one anyhow and one that would make little sense to go ahead with until this one has reached a conclusion.

@jasnell
Copy link
Member Author

jasnell commented Jul 14, 2022

Given the excellent investigation that @anonrig did on this, it's clear that a WASM implementation is not going to bring any performance improvements whatsoever. We can close this issue and pursue other approaches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Issues that are suitable for first-time contributors. wasm Issues and PRs related to WebAssembly. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

No branches or pull requests