Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Rust implementation of the Perl preprocessing logic. #279

Merged
merged 30 commits into from
Oct 23, 2023

Conversation

jeremyroman
Copy link
Contributor

This implementation uses a full HTML parser (Servo's, html5ever) and operates on a DOM. It ends up being somewhat more verbose than Perl, but hopefully also more extensible/maintainable.

@jeremyroman
Copy link
Contributor Author

Currently, this only builds with Rust if PROCESS_WITH_RUST=1 exists in the environment. It also uses cargo run to invoke the binary, rather than assuming it always exists and is up to date (as we could perhaps do on the server).

Still, it works. It's verbose because it's hard to beat PCRE operating on text for conciseness when operating on an actual DOM tree.

@domenic @domfarolino

@domenic
Copy link
Member

domenic commented Aug 2, 2023

This is great!! I didn't realize there was so much preprocessing in those scripts...

I'm honestly not going to be able to provide great Rust review. Maye @domfarolino can help with that. But really, even not-well-reviewed Rust with unit tests is better than untested Perl (or Pascal). So we should bias toward merging this.

It seems like, given how you've set this up, there are no blockers to merging this. (Although maybe we should update the README to document things a bit?) Things to do before we ship this:

  • Update the build.whatwg.org server so that people who don't have Rust installed, have an alternate path. Given how the script you've created here takes as input the whole source dir (because it needs the demos and other "boilerplate"), probably the right path is to just allow the entire html-build script to be run server-side, given a compressed zip of the source dir.

  • Optional: stop having Wattsi use nonstandard void <ref> tags, so we can rip out some of the extra code you had to add here.

I hope to tackle these things myself.

domenic added a commit to whatwg/wattsi that referenced this pull request Aug 2, 2023
Instead of a nonstandard void element <ref spec=FOO>, use the text content: <ref>FOO</ref>.

This allows us to use standard HTML parser/serializer tooling with the HTML source, e.g. for whatwg/html-build#279.
Copy link
Member

@domfarolino domfarolino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only made it through a little of this PR so far but it's looking really great, especially for a Rust noob like me :)

}
NodeData::Comment { ref contents } if contents.starts_with("REPRESENTS ") => {
self.placeholders
.push((node.clone(), contents.subtendril(11, contents.len32() - 11)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw this line means that <!-- REPRESENTS body--> would not work, right? (Because of the extra space before "REPRESENTS"). Is this how the current Perl behaves too? I suppose it doesn't matter because all usages of this in the spec are uniform, but I'm just double checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current regex is <!--REPRESENTS ([^>]+)-->, which also does not accept a space there. In theory we could try to define a more consistent syntax if we wanted to.

build.sh Show resolved Hide resolved
domenic added a commit to whatwg/wattsi that referenced this pull request Aug 6, 2023
Instead of a nonstandard void element <ref spec=FOO>, use the text content: <ref>FOO</ref>.

This allows us to use standard HTML parser/serializer tooling with the HTML source, e.g. for whatwg/html-build#279.
jeremyroman and others added 3 commits August 6, 2023 19:44
This implementation uses a full HTML parser (Servo's, html5ever) and
operates on a DOM. It ends up being somewhat more verbose than Perl, but
hopefully also more extensible/maintainable.
@domenic
Copy link
Member

domenic commented Aug 6, 2023

I pushed a commit that removed the the custom <ref> stuff. Please take a look, and especially if there's more we could remove?

Also, I'm wondering how necessary parser.rs is. Could we just use read from stdio using non-async IO? Or could we use an existing package like https://crates.io/crates/html5ever-stream/0.1.0 ?

domenic and others added 2 commits August 7, 2023 20:42
It was only different from html5ever::driver::Parser in that it used the
filtered tokenizer. With that gone, the ordinary Parser struct works.
@jeremyroman
Copy link
Contributor Author

There was indeed more that can be removed; I've done so. All that's left that could be replaced by such a crate is basically the 20-line read loop (which is basically the contents of Parser::read_from).

It could also be replaced by just doing this synchronously, which would be fine in practice. I generally like the idea of being async because it's a little more annoying to retrofit async than to do it up-front, and it's not a lot of code to do here. In principle I like the idea of the build server being able to one day do this processing in a single processing without needing to spawn a separate thread per request, too, even though in practice I realize the build server is not that busy.

Copy link
Member

@domfarolino domfarolino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it through a bit more of this PR, I'll try to review the rest of it over the next couple of days.

src/main.rs Outdated

// Because parsing can jump around the tree a little, it's most reasonable
// to just parse the whole document before doing any processing. Even for
// the HTML5 specification, this doesn't take too long.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe just "HTML Standard" instead of "HTML5 specification"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/parser.rs Show resolved Hide resolved
tendril_sink.process(tendril);
break;
}
Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in thinking that this loop may just repeat N times hitting this "interrupted" condition, and we just keep it going until it falls into either the Ok(n) or Ok(0)? Do you know why this is the case? It's not really obvious to me, but I might just be dumb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common pattern with I/O APIs on POSIX platforms. When a sync syscall is interrupted by a signal handler before any data is read from certain file descriptors, it may return EINTR, which signifies that the operation was interrupted and it's left to userspace to reattempt. See, e.g., https://man7.org/linux/man-pages/man2/read.2.html#:~:text=EINTR%20%20The%20call%20was%20interrupted%20by%20a%20signal%20before%20any%20data%20was%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20read%3B%20see%20signal(7).

This is presumably just a reflection of that API design choice into the Rust std::io module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh yeah, no I've definitely seen this before in Chromium as well: https://source.chromium.org/chromium/chromium/src/+/main:base/posix/eintr_wrapper.h;l=28;drc=e4622aaeccea84652488d1822c28c78b7115684f. I should've caught that :)

}

impl NodeHandleExt for Handle {
fn parent_node(&self) -> Option<Handle> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the need for this helper? On the surface it seems to just be a more complicated version of the .parent getter, but I'm sure I must be missing something? Is there a benefit to setting the parent to a weak version of itself, and returning the strong ref here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.parent is a Cell<Option<Weak<Node>>>: https://docs.rs/markup5ever_rcdom/latest/markup5ever_rcdom/struct.Node.html#structfield.parent

See this for a more detailed discussion of interior mutability with cells: https://doc.rust-lang.org/std/cell/index.html#cellt

Since Weak<> isn't Copy, neither is Option<Weak<>>. Thus the only way to get the value of the cell is to move it out of the cell (here, done with take).

Once we have that weak handle, we can obtain the strong handle that callers generally want to use (because callers here are assuming that the parent does, in fact, exist). Then, since we took the weak handle out, we need to put it back to avoid changing the state of the node.

So, uh, yes, it's a more complicated version of a getter to satisfy the static and dynamic safety checks that apply to interior mutability cells.

jeremyroman and others added 10 commits September 13, 2023 02:41
This implementation uses a full HTML parser (Servo's, html5ever) and
operates on a DOM. It ends up being somewhat more verbose than Perl, but
hopefully also more extensible/maintainable.
It was only different from html5ever::driver::Parser in that it used the
filtered tokenizer. With that gone, the ordinary Parser struct works.
tokio = { version = "1", features = ["full"] }
html5ever = "*"
markup5ever_rcdom = "*"
tempfile = "3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like tempfile should be a dev-dependency, technically. (Although it doesn't really matter since this isn't a library.)

Cargo.toml Outdated
[dependencies]
tokio = { version = "1", features = ["full"] }
html5ever = "*"
markup5ever_rcdom = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably pin these to their current versions? We can hopefully rely on dependabot to update them over time.

let string = tokio::fs::read_to_string(path).await?;
Ok(StrTendril::from(string).into_send())
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these two functions move to a utils.rs, maybe?

Here and elsewhere, I'm not really sure; I can understand keeping things more encapsulated/local, but it would also be a shame if someone in the future realized they needed something like this for another processor and didn't know about the helpers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to an io_utils.rs.


// Find the paths we need.
let cache_dir = path_from_env("HTML_CACHE", ".cache");
let source_dir = path_from_env("HTML_SOURCE", "../html");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should not have defaults here, and only specify the defaults in build.sh. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selfishly, this makes quick uses of cargo run a little easier.

// conflicts between them.
boilerplate.apply().await?;
represents.apply()?;
annotate_attributes.apply().await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth trying to make concurrent the two async ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically they already do a fair amount of spawned work concurrently (for example, opening files in boilerplate.visit()), which we need to wait for to finish. I don't want the order of these changes to be nondeterministic, though.

src/represents.rs Show resolved Hide resolved
src/tag_omission.rs Outdated Show resolved Hide resolved
// demand.
NodeData::Comment { ref contents } if contents.starts_with("BOILERPLATE ") => {
let path = Path::new(contents[12..].trim());
if is_safe_path(path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have this is_safe_path check? If we omitted it, what bad thing would happen? Could we assert it, instead of silently skipping the comment if it fails? (Here and below.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes it harder for someone to submit an HTML source file and use it to read back the contents of a file on the build server, which might contain some data we'd rather not have exfiltrated.

Made it an error.

Copy link
Member

@domfarolino domfarolino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the big delay here! Basically looks fine to my somewhat-untrained eyes, but have a few questions here.

// <dd><span data-x="attr-a-href">href</span></dd>
// <dd><span data-x="attr-a-someattribute">someattribute</span></dd>
// ...
fn is_content_attribute_dt(dt: &Handle) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a separate function btw? It never seems to be called elsewhere, except right below this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically because it looked more awkward (to me) to write as a boolean expression, since early-returning would do something a little confusing, and writing it without early returns would result in some strange nesting.

};

// These will be strings like "attr-input-maxlength", which identify particular element-attribute pairs.
let data_x = QualName::new(None, ns!(), LocalName::from("data-x"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: might be good to document /*prefix=*/None here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, though I haven't seen that as a strong convention elsewhere and the parameter genuinely isn't super interesting here.

.iter()
.filter_map(|c| c.get_attribute(&data_x).filter(|v| !v.is_empty()))
{
// Find the <!-- or: --> comment, if one exists, and extract its contents.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any comments that match <!-- or: in the source file, yet there appear to be tests for this. Am I missing something in the preprocessing flow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is exactly one, but it doesn't have a space which is probably why you didn't fine it. It's on the name attribute as applied to various form elements, and reads:

<!--or: Name of the element to use in the <code data-x="dom-form-elements">form.elements</code> API. -->

src/interface_index.rs Show resolved Hide resolved
src/interface_index.rs Show resolved Hide resolved
Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, let's do this.

@domenic domenic merged commit b5866ef into whatwg:main Oct 23, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants