
Parser: Remove attrs from the block matcher regex #11522

Closed
wants to merge 1 commit

Conversation

pento
Member

@pento pento commented Nov 6, 2018

This PR is inspired by @dmsnell's comment, where he noted that "we know more about our inputs than the RegExp does", and that "PEGs don't backtrack".

For the issue that #11369 addresses, the problem is that we need to extract the attrs from the block delimiter comment, where the definition of attrs is "everything until we reach the block end marker -->". Unfortunately, regexes are fairly unsuited to this particular use case. #11369 makes some improvements, but we're ultimately limited by how far PCRE needs to backtrack when it's trying to find -->.

This PR attempts to avoid the issue entirely by only matching the start of the block delimiter comment with a regex, then using string functions to find the -->, so that attrs can be extracted from the middle.
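The approach can be sketched as follows. This is an illustrative JavaScript translation of the idea only (the actual PR changes the PHP parser); the function name and the simplified opener regex here are assumptions for the sketch, not the PR's code:

```javascript
// Sketch of the approach: a regex matches only the opening delimiter, then a
// plain string search locates the closing "-->", so the attrs JSON in the
// middle never has to be matched by the regex at all.
const opener = /<!--\s+wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+/;

function matchBlockDelimiter( document, offset ) {
	const match = opener.exec( document.slice( offset ) );
	if ( ! match ) {
		return null;
	}
	const attrsStart = offset + match.index + match[ 0 ].length;
	// String search instead of regex backtracking to find the close marker.
	const closingPosition = document.indexOf( '-->', attrsStart );
	if ( closingPosition === -1 ) {
		return null;
	}
	return {
		blockName: ( match[ 1 ] || 'core/' ) + match[ 2 ],
		attrs: document.slice( attrsStart, closingPosition ).trim(),
	};
}
```

For example, `matchBlockDelimiter( '<!-- wp:paragraph {"align":"right"} -->', 0 )` yields block name `core/paragraph` and attrs `{"align":"right"}` without the regex engine ever scanning the JSON.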

@pento pento added the [Feature] Parsing and [Type] Performance labels Nov 6, 2018
@pento pento added this to the 4.3 milestone Nov 6, 2018
@pento pento requested review from dmsnell and a team November 6, 2018 01:27
@pento pento changed the title Remove the attrs parsing the from the regex, do using string function… Parser: Remove attrs from the block matcher regex Nov 6, 2018
@dmsnell
Member

dmsnell commented Nov 6, 2018

Looks like this had a generally positive but insignificant change in the normal case.

[screenshot: benchmark comparison]

After adding a new document with a huge JSON payload it made a more noticeable difference.

[screenshot: benchmark comparison, 2018-11-06]

Member

@dmsnell dmsnell left a comment


This seems to be a general improvement on parsing speed but not significantly so.

I'm not going to stand in the way, but I'm hesitant about how we're breaking out of the RegExp tokenizer and going back in, and I feel like it introduces more complexity than the benefit it brings is worth. For example, when we realize we need to adjust the way that the block comments are parsed or stored, this seems like it will turn one maintenance step into two (editing a RegExp and editing a second-tier sub-parser).

I was hoping this would have a more noticeable speedup but I think the reason it's not more pronounced is because of how performant the PCRE engine is behind the scenes. I'll try and benchmark this on PHP 5.6 or older since I ran the tests on 7.x - that may make a difference yet.

What are your thoughts?

@youknowriad
Contributor

Looks like this needs a decision. I personally don't know the details here to give an opinion. I'm moving out of 4.3 for now.

@youknowriad youknowriad modified the milestones: 4.3, 4.4 Nov 9, 2018
Contributor

@mcsf mcsf left a comment


I'm very concerned about the trade-offs we're making here. I think we're ending up with a more fragile solution (see inline comment on how to break parsing), opening the door for more ad hoc editing of this parser, in exchange for performance improvements that don't seem to make up for the costs.


// We know where the attrs start, now search until the close of the block comment.
$attrs_start = $started_at + $length;
$closing_position = strpos( $this->document, '-->', $attrs_start );
Contributor


This will break if anything in the JSON matches it, e.g.

<!-- wp:foo {"id":824,"foo":"look at --> this"} -->
<figure class="wp-block-image"><img class="wp-image-824" src="foobar.png" alt="Foo" /></figure>
<!-- /wp:foo -->

Member


This will break if anything in the JSON matches it, e.g.

<!-- wp:foo {"id":824,"foo":"look at --> this"} -->
<figure class="wp-block-image"><img class="wp-image-824" src="foobar.png" alt="Foo" /></figure>
<!-- /wp:foo -->

For what it's worth, the editor serializer already "handles" this:

/**
 * Given an attributes object, returns a string in the serialized attributes
 * format prepared for post content.
 *
 * @param {Object} attributes Attributes object.
 *
 * @return {string} Serialized attributes.
 */
export function serializeAttributes( attributes ) {
	return JSON.stringify( attributes )
		// Don't break HTML comments.
		.replace( /--/g, '\\u002d\\u002d' )
		// Don't break non-standard-compliant tools.
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		// Bypass server stripslashes behavior which would unescape stringify's
		// escaping of quotation mark.
		//
		// See: https://developer.wordpress.org/reference/functions/wp_kses_stripslashes/
		.replace( /\\"/g, '\\u0022' );
}
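For illustration, running the breaking payload above through the same replacement chain (copied here, minus the comments, so the snippet runs standalone) shows that the serialized attributes can no longer contain `-->` and so can't terminate the comment early:

```javascript
// Same replacement chain as the editor's serializeAttributes, reproduced
// here so the example is self-contained.
function serializeAttributes( attributes ) {
	return JSON.stringify( attributes )
		.replace( /--/g, '\\u002d\\u002d' )
		.replace( /</g, '\\u003c' )
		.replace( />/g, '\\u003e' )
		.replace( /&/g, '\\u0026' )
		.replace( /\\"/g, '\\u0022' );
}

const serialized = serializeAttributes( { id: 824, foo: 'look at --> this' } );
// The "--" and ">" are escaped as \u002d\u002d and \u003e, so the serialized
// string contains no "-->", and JSON.parse still recovers the original value
// because \uXXXX is a valid JSON escape.
```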

Member


I'd think as far as any HTML parser goes, it would consider the comment as complete once it encounters <!-- wp:foo {"id":824,"foo":"look at --> ?

Member


--> is fundamentally disallowed inside a comment, so I wouldn't block any changes on the presumption that it could be there.

number 1: it breaks block structure
number 2: it breaks the HTML and the page will break entirely

Contributor


I understand both your points, but I wonder how easily this could break, e.g. if some plugin allows block attributes to be set from the server, whether via the block API or via direct database interactions (imagine a bulk-migration tool that fails to escape values properly). Shouldn't the parser be robust enough?

Member


in this case I don't think it's a "robust-enough" problem. if some block API or server-side plugin adds --> inside of an HTML comment it will be the entire post that's broken.

I'm trying to understand a situation where the parser could resolve this without massively breaking other things.

we could include a JSON parser inside of the post parser; this would slow down parsing I'm pretty sure, but it wouldn't solve the problem that now we're allowing extremely bad and broken HTML in a post. I don't want to give any impression that it's okay to put --> inside an HTML comment and I'd rather let developers find out immediately rather than burying it deeper.

Contributor


No, you're right, I'm convinced we shouldn't be absorbing this concern.

@mcsf
Contributor

mcsf commented Nov 9, 2018

I'll try and benchmark this on PHP 5.6 or older since I ran the tests on 7.x - that may make a difference yet.

What are your thoughts?

In theory, it's always nice to gather better measurements of potential gains, though my gut tells me it won't be worth benchmarking.

but we're ultimately limited by the amount PCRE needs to backtrack when it's trying to find -->.

Realistically, how much memory are we talking about? How pressing is this? Having better answers to this should drive any parsing work at this stage. I'd also point out that we have all the pieces set so that any site admin can switch parser implementations if they so wish; this should be a rare occurrence, but I'd argue that specific sites dealing with abnormally large data sets to parse should consider the highly optimised gutenberg-parser-rs, which provides a PHP extension.

mcsf
mcsf previously requested changes Nov 9, 2018
Contributor

@mcsf mcsf left a comment


Setting review status to Changes requested so that I can better filter my list of open PRs.

Feel free to either continue work here or close. 🙇

@mcsf mcsf dismissed their stale review November 9, 2018 18:52

Changed my mind.

@pento
Member Author

pento commented Nov 10, 2018

@mcsf: The issue isn't memory, it's the pcre.backtrack_limit setting, which most installs leave at a default of 1,000,000.

#11369 increased the args limit to "more than a million" characters before it hits this limit. It's kind of a squishy limit, because how big args can get depends on what characters are in them.

I think this PR needs a fair bit more work to make it less fragile and more performant, to justify the increased complexity. For now, I'm going to close it; we can revisit it if we start getting reports of more than 1MB of data being stored in args.

@pento pento closed this Nov 10, 2018
@pento pento deleted the try/parsing-attrs-without-regex branch November 10, 2018 02:01
@pento pento removed this from the 4.4 milestone Nov 10, 2018