
Parser: Remove attrs from the block matcher regex #11522

Closed
wants to merge 1 commit

Conversation

pento
Member

@pento pento commented Nov 6, 2018

This PR is inspired by @dmsnell's comment, where he noted that "we know more about our inputs than the RegExp does", and that "PEGs don't backtrack".

For the issue that #11369 addresses, the problem is that we need to extract the attrs from the block delimiter comment, where the definition of attrs is "everything until we reach the block end marker -->". Unfortunately, regexes are fairly unsuited to this particular use case. #11369 makes some improvements, but we're ultimately limited by how far PCRE needs to backtrack when it's trying to find -->.

This PR attempts to avoid the issue entirely by only matching the start of the block delimiter comment with a regex, then using string functions to find the -->, so that attrs can be extracted from the middle.
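The approach can be sketched as follows. This is an illustrative JavaScript translation of the idea only (the actual PR changes the PHP parser); the function name and the simplified opener regex here are assumptions for the sketch, not the PR's code:

```javascript
// Sketch of the approach: a regex matches only the opening delimiter, then a
// plain string search locates the closing "-->", so the attrs JSON in the
// middle never has to be matched by the regex at all.
const opener = /<!--\s+wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+/;

function matchBlockDelimiter( document, offset ) {
	const match = opener.exec( document.slice( offset ) );
	if ( ! match ) {
		return null;
	}
	const attrsStart = offset + match.index + match[ 0 ].length;
	// String search instead of regex backtracking to find the close marker.
	const closingPosition = document.indexOf( '-->', attrsStart );
	if ( closingPosition === -1 ) {
		return null;
	}
	return {
		blockName: ( match[ 1 ] || 'core/' ) + match[ 2 ],
		attrs: document.slice( attrsStart, closingPosition ).trim(),
	};
}
```

For example, `matchBlockDelimiter( '<!-- wp:paragraph {"align":"right"} -->', 0 )` yields block name `core/paragraph` and attrs `{"align":"right"}` without the regex engine ever scanning the JSON.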

@pento pento added the [Feature] Parsing and [Type] Performance labels Nov 6, 2018
@pento pento added this to the 4.3 milestone Nov 6, 2018
@pento pento requested review from dmsnell and a team November 6, 2018 01:27
@pento pento changed the title Remove the attrs parsing the from the regex, do using string function… Parser: Remove attrs from the block matcher regex Nov 6, 2018
@dmsnell
Member

dmsnell commented Nov 6, 2018

Looks like this had a generally positive but insignificant change in the normal case.

[screenshot: benchmark comparison]

After adding a new document with a huge JSON payload it made a more noticeable difference.

[screenshot: benchmark comparison, 2018-11-06]

Member

@dmsnell dmsnell left a comment


This seems to be a general improvement on parsing speed but not significantly so.

I'm not going to stand in the way, but I'm hesitant about how we're breaking out of the RegExp tokenizer and going back in, and I feel like it introduces more complexity than the benefit it brings is worth. For example, when we realize we need to adjust the way that the block comments are parsed or stored, this seems like it will turn one maintenance step into two (editing a RegExp and editing a second-tier sub-parser).

I was hoping this would have a more noticeable speedup but I think the reason it's not more pronounced is because of how performant the PCRE engine is behind the scenes. I'll try and benchmark this on PHP 5.6 or older since I ran the tests on 7.x - that may make a difference yet.

What are your thoughts?

@youknowriad
Contributor

Looks like this needs a decision. I personally don't know the details here to give an opinion. I'm moving out of 4.3 for now.

@youknowriad youknowriad modified the milestones: 4.3, 4.4 Nov 9, 2018
Contributor

@mcsf mcsf left a comment


I'm very concerned about the trade-offs we're making here. I think we're ending up with a more fragile solution (see inline comment on how to break parsing), opening the door for more ad hoc editing of this parser, in exchange for performance improvements that don't seem to make up for the costs.


// We know where the attrs start, now search until the close of the block comment.
$attrs_start = $started_at + $length;
$closing_position = strpos( $this->document, '-->', $attrs_start );
Contributor


This will break if anything in the JSON matches it, e.g.

<!-- wp:foo {"id":824,"foo":"look at --> this"} -->
<figure class="wp-block-image"><img class="wp-image-824" src="foobar.png" alt="Foo" /></figure>
<!-- /wp:foo -->

Member


This will break if anything in the JSON matches it, e.g.

<!-- wp:foo {"id":824,"foo":"look at --> this"} -->
<figure class="wp-block-image"><img class="wp-image-824" src="foobar.png" alt="Foo" /></figure>
<!-- /wp:foo -->

For what it's worth, the editor serializer already "handles" this:

/**
 * Given an attributes object, returns a string in the serialized attributes
 * format prepared for post content.
 *
 * @param {Object} attributes Attributes object.
 *
 * @return {string} Serialized attributes.
 */
export function serializeAttributes( attributes ) {
	return JSON.stringify( attributes )
		// Don't break HTML comments.
		.replace( /--/g, '\\u002d\\u002d' )
		// Don't break non-standard-compliant tools.
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		// Bypass server stripslashes behavior which would unescape stringify's
		// escaping of quotation mark.
		//
		// See: https://developer.wordpress.org/reference/functions/wp_kses_stripslashes/
		.replace( /\\"/g, '\\u0022' );
}
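For illustration, running the breaking payload above through the same replacement chain (copied here, minus the comments, so the snippet runs standalone) shows that the serialized attributes can no longer contain `-->` and so can't terminate the comment early:

```javascript
// Same replacement chain as the editor's serializeAttributes, reproduced
// here so the example is self-contained.
function serializeAttributes( attributes ) {
	return JSON.stringify( attributes )
		.replace( /--/g, '\\u002d\\u002d' )
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		.replace( /\\"/g, '\\u0022' );
}

const serialized = serializeAttributes( { id: 824, foo: 'look at --> this' } );
// The "--" and ">" are escaped as \u002d\u002d and \u003e, so the serialized
// string contains no "-->", and JSON.parse still recovers the original value
// because \uXXXX is a valid JSON escape.
```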

Member


I'd think as far as any HTML parser goes, it would consider the comment as complete once it encounters <!-- wp:foo {"id":824,"foo":"look at --> ?

Member


--> is fundamentally disallowed inside a comment, so I wouldn't block any changes on the presumption that it could be there.

number 1: it breaks block structure
number 2: it breaks the HTML and the page will break entirely

Contributor


I understand both your points, but I wonder how easily this could break, e.g. if some plugin allows block attributes to be set from the server, whether via the block API or via direct database interactions (imagine a bulk-migration tool that fails to escape values properly). Shouldn't the parser be robust enough?

Member


in this case I don't think it's a "robust-enough" problem. if some block API or server-side plugin adds --> inside of an HTML comment it will be the entire post that's broken.

I'm trying to understand a situation where the parser could resolve this without massively breaking other things.

we could include a JSON parser inside of the post parser; this would slow down parsing I'm pretty sure, but it wouldn't solve the problem that now we're allowing extremely bad and broken HTML in a post. I don't want to give any impression that it's okay to put --> inside an HTML comment and I'd rather let developers find out immediately rather than burying it deeper.

Contributor


No, you're right, I'm convinced we shouldn't be absorbing this concern.

@mcsf
Contributor

mcsf commented Nov 9, 2018

I'll try and benchmark this on PHP 5.6 or older since I ran the tests on 7.x - that may make a difference yet.

What are your thoughts?

In theory, it's always nice to gather better measurements of potential gains, though my gut tells me it won't be worth benchmarking.

but we're ultimately limited by the amount PCRE needs to backtrack when it's trying to find -->.

Realistically, how much memory are we talking about? How pressing is this? Having better answers to this should drive any parsing work at this stage. I'd also point out that we have all the pieces set so that any site admin can switch parser implementations if they so wish; this should be a rare occurrence, but I'd argue that specific sites dealing with abnormally large data sets to parse should consider the highly optimised gutenberg-parser-rs, which provides a PHP extension.

mcsf
mcsf previously requested changes Nov 9, 2018
Contributor

@mcsf mcsf left a comment


Setting review status to Changes requested so that I can better filter my list of open PRs.

Feel free to either continue work here or close. 🙇

@mcsf mcsf dismissed their stale review November 9, 2018 18:52

Changed my mind.

@pento
Member Author

pento commented Nov 10, 2018

@mcsf: The issue isn't memory, it's the pcre.backtrack_limit setting, which most installs leave at a default of 1,000,000.

#11369 increased the args limit to "more than a million" characters before it hits this limit. It's kind of a squishy limit, because how big args can get depends on what characters are in them.

I think this PR needs a fair bit more work to make it less fragile and more performant, to justify the increased complexity. For now, I'm going to close it; we can revisit it if we start getting reports of more than 1MB of data being stored in args.

@pento pento closed this Nov 10, 2018
@pento pento deleted the try/parsing-attrs-without-regex branch November 10, 2018 02:01
@pento pento removed this from the 4.4 milestone Nov 10, 2018