Parser: Remove attrs from the block matcher regex #11522
Conversation
This seems to be a general improvement on parsing speed but not significantly so.
I'm not going to stand in the way, but I'm hesitant of how we're breaking out of the RegExp tokenizer and going back in and feel like it introduces more complexity than the benefit it brings is worth. For example, when we realize we need to adjust the way that the block comments are parsed or stored this seems like it will introduce two steps to the maintenance where currently one exists (editing a RegExp and editing a second-tier sub-parser).
I was hoping this would have a more noticeable speedup but I think the reason it's not more pronounced is because of how performant the PCRE engine is behind the scenes. I'll try and benchmark this on PHP 5.6 or older since I ran the tests on 7.x - that may make a difference yet.
What are your thoughts?
Looks like this needs a decision. I personally don't know enough of the details here to give an opinion. I'm moving this out of 4.3 for now.
I'm very concerned about the trade-offs we're making here. I think we're ending up with a more fragile solution (see inline comment on how to break parsing), opening the door for more ad hoc editing of this parser, in exchange for performance improvements that don't seem to make up for the costs.
```php
// We know where the attrs start, now search until the close of the block comment.
$attrs_start      = $started_at + $length;
$closing_position = strpos( $this->document, '-->', $attrs_start );
```
This will break if anything in the JSON matches it, e.g.

```html
<!-- wp:foo {"id":824,"foo":"look at --> this"} -->
<figure class="wp-block-image"><img class="wp-image-824" src="foobar.png" alt="Foo" /></figure>
<!-- /wp:foo -->
```
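To illustrate the failure mode, here is a minimal JavaScript sketch (not Gutenberg code; the names are hypothetical) of a naive close-finder that mirrors the PHP `strpos( $this->document, '-->', $attrs_start )` call above:

```javascript
// Hypothetical sketch: a naive close-finder that mirrors the PHP
// strpos() approach, returning the index of the first '-->' after
// the attrs start.
function naiveAttrsEnd( document, attrsStart ) {
	return document.indexOf( '-->', attrsStart );
}

const post =
	'<!-- wp:foo {"id":824,"foo":"look at --> this"} -->\n' +
	'<figure class="wp-block-image"></figure>\n' +
	'<!-- /wp:foo -->';

// The attrs begin right after '<!-- wp:foo '.
const attrsStart = '<!-- wp:foo '.length;
const end = naiveAttrsEnd( post, attrsStart );

// The first '-->' found is the one INSIDE the JSON string, so the
// extracted attrs are truncated mid-value and are no longer valid JSON.
console.log( post.slice( attrsStart, end ) );
// → '{"id":824,"foo":"look at '
```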
For what it's worth, the editor serializer already "handles" this:
gutenberg/packages/blocks/src/api/serializer.js, lines 172 to 195 at 660e46e:
```js
/**
 * Given an attributes object, returns a string in the serialized attributes
 * format prepared for post content.
 *
 * @param {Object} attributes Attributes object.
 *
 * @return {string} Serialized attributes.
 */
export function serializeAttributes( attributes ) {
	return JSON.stringify( attributes )
		// Don't break HTML comments.
		.replace( /--/g, '\\u002d\\u002d' )
		// Don't break non-standard-compliant tools.
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		// Bypass server stripslashes behavior which would unescape stringify's
		// escaping of quotation mark.
		//
		// See: https://developer.wordpress.org/reference/functions/wp_kses_stripslashes/
		.replace( /\\"/g, '\\u0022' );
}
```
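For illustration, the effect of this escaping is that an attribute value containing `-->` never produces that literal sequence in the serialized output, and round-trips cleanly through `JSON.parse`. A minimal standalone sketch of the same replacements (`escapeAttrs` is a hypothetical name, not the Gutenberg export):

```javascript
// Standalone sketch of the replacements performed by serializeAttributes
// above (not an import of the Gutenberg function itself).
function escapeAttrs( attributes ) {
	return JSON.stringify( attributes )
		.replace( /--/g, '\\u002d\\u002d' )
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		.replace( /\\"/g, '\\u0022' );
}

const serialized = escapeAttrs( { foo: 'look at --> this' } );
console.log( serialized );
// → {"foo":"look at \u002d\u002d\u003e this"}

// The dangerous '-->' sequence is gone, so the string is safe inside an
// HTML comment, and JSON.parse still recovers the original value because
// \u002d, \u003e, etc. are ordinary JSON unicode escapes.
console.log( JSON.parse( serialized ).foo );
// → look at --> this
```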
I'd think that, as far as any HTML parser is concerned, the comment would be considered complete once it encounters `<!-- wp:foo {"id":824,"foo":"look at -->`?
`-->` is fundamentally disallowed inside a comment, so I wouldn't block any changes on the presumption that it could be there:

1. it breaks block structure
2. it breaks the HTML, and the page will break entirely
I understand both your points, but I wonder how easily this could break in practice: for example, if some plugin allows block attributes to be set from the server, whether via the block API or via direct database interactions (imagine a bulk-migration tool that fails to escape values properly). Shouldn't the parser be robust enough?
In this case I don't think it's a "robust-enough" problem. If some block API or server-side plugin adds `-->` inside of an HTML comment, it will be the entire post that's broken. I'm trying to understand a situation where the parser could resolve this without massively breaking other things.

We could include a JSON parser inside of the post parser; I'm pretty sure this would slow down parsing, but it also wouldn't solve the problem that we would now be allowing extremely bad and broken HTML in a post. I don't want to give any impression that it's okay to put `-->` inside an HTML comment, and I'd rather let developers find out immediately than bury the failure deeper.
No, you're right, I'm convinced we shouldn't be absorbing this concern.
In theory, it's always nice to gather better measurements of potential gains, though my gut tells me it won't be worth benchmarking.
Realistically, how much memory are we talking about? How pressing is this? Having better answers to these questions should drive any parsing work at this stage. I'd also point out that we have all the pieces in place so that any site admin can switch parser implementations if they so wish; this should be a rare occurrence, but I'd argue that specific sites dealing with abnormally large data sets to parse should consider the highly optimised gutenberg-parser-rs, which provides a PHP extension.
Setting review status to Changes requested so that I can better filter my list of open PRs.
Feel free to either continue work here or close. 🙇
@mcsf: The issue isn't memory, it's that #11369 increased the … I think this PR needs a fair bit more work to make it less fragile and more performant, to justify the increased complexity. For now, I'm going to close it; we can revisit it if we start getting reports of more than 1MB of data being stored in …
This PR is inspired by @dmsnell's comment, where he noted that "we know more about our inputs than the RegExp does", and "PEGs don't backtrack".

For the issue that #11369 addresses, the problem is that we need to extract the `attrs` from the block delimiter comment, where the definition of `attrs` is "everything until we reach the block end marker `-->`". Unfortunately, regexes are fairly unsuited to this particular use case; #11369 makes some improvements, but we're ultimately limited by the amount PCRE needs to backtrack while it's trying to find `-->`.

This PR attempts to avoid the issue entirely, by only matching the start of the block delimiter comment as a regex, then using string functions to find the `-->`, meaning that `attrs` can be extracted from the middle.
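The two-step idea can be sketched as follows. This is an illustrative JavaScript translation (the actual PR changes PHP code, and the regex and function names here are hypothetical simplifications, ignoring void blocks and namespacing edge cases):

```javascript
// Illustrative sketch of the approach described above.
// Step 1: a cheap regex matches only the start of the block opener,
// with no attempt to capture the attrs themselves.
const OPENER = /<!--\s+wp:([a-z][a-z0-9_-]*\/)?[a-z][a-z0-9_-]*\s+/g;

function nextBlockAttrs( document, from ) {
	OPENER.lastIndex = from;
	const match = OPENER.exec( document );
	if ( ! match ) {
		return null;
	}
	// Step 2: a plain string search finds the comment close, so
	// PCRE-style backtracking over the attrs JSON never happens.
	const attrsStart = match.index + match[ 0 ].length;
	const close = document.indexOf( '-->', attrsStart );
	if ( close === -1 ) {
		return null;
	}
	return document.slice( attrsStart, close ).trim();
}

const post =
	'<!-- wp:paragraph {"align":"right"} --><p>Hi</p><!-- /wp:paragraph -->';
console.log( nextBlockAttrs( post, 0 ) );
// → '{"align":"right"}'
```

Note that the trade-off debated above is visible here: step 2 is fast precisely because it trusts that `-->` never appears inside the attrs JSON, which is what the editor's serializer escaping guarantees.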