Update post grammar to include markers for inner blocks #11082

dmsnell · 2018-10-25T20:51:34Z

Motivated by #8760
Necessary for #10463, #10108

Abstract

Add new property to parsed block object indicating where in the innerHTML each innerBlock was found.

Summary

Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in post_content in place as normal blocks which exist in between another block's comment delimiters.

<!-- wp:outerBlock -->
Check out my
<!-- wp:voidInnerBlock /-->
and my other
<!-- wp:innerBlock -->
with its own content.
<!-- /wp:innerBlock -->
<!-- /wp:outerBlock -->

The way this gets parsed leaves us in a quandary: we cannot reconstruct the original post_content after parsing because we lose the origin location information for each inner block since they are only passed as an array of inner blocks.

{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"innerHTML": "\nCheck out my\n\nand my other\n\n"
}

At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them but we have lost our reverse transformation.

In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the post_content and first stage of parsing. If we can tear apart a Gutenberg post and reassemble then it will let us to structurally-informed processing of the posts without needing to be aware of all the block JavaScript.

The proposed mechanism is a form of tombstone or blockMarkers that specify the index into innerHTML where the innerBlocks were found.

{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"blockMarkers": [ 14, 28 ],
	"innerHTML": "\nCheck out my\n\nand my other\n\n"
}

The block markers represent the ~~UCS-2 index~~ UTF-8 byte-length into the innerHTML where the block was found and where it should be replaced if we reserialize. ~~UCS-2 has its own quirks that become important to recognize when dealing with Unicode strings.~~ That is, we get the value by taking the portion of innerHTML preceding that inner block and then compute how many bytes are required to represent it in UTF-8, then that count is our index.

The array of markers is of the same length as the array of innerBlocks and each location index corresponds to the block at the same array index in innerBlocks.

Simplified, aa[1]bb[2]cc would correspond to text: "aabbcc", blocks: [1,2], markers: [2, 4 ]

georgestephanis · 2018-10-25T21:21:53Z

That'd work for my needs, but part of me wonders if it would make more sense to inject some sort of comment into the html to be str_replaced with the generated block markup instead of the index of a string that could get more easily mucked up if anything passes through a filter that tweaks it

Hywan · 2018-10-26T08:01:00Z

Thank for you for addressing this issue.

My personal point of view is that the solution is very hacky, and closes the door to alternative parsers that are able to produce better outputs easily, like a real AST (Abstract Syntax Tree). For instance, with the Rust Gutenberg parser, the AST is defined as:

pub enum Node<'a> {
    Block {
        name: (Input<'a>, Input<'a>),
        attributes: Option<Input<'a>>,
        children: Vec<Node<'a>>
    },
    Phrase(Input<'a>)
}

And thus the following input:

<!-- wp:outer-block -->
Check out my
<!-- wp:void-inner-block /-->
and my other
<!-- wp:inner-block -->
with its own content.
<!-- /wp:inner-block -->
<!-- /wp:outer-block -->

produces the following output:

[
    /* Block */ {
        "name": "core/outer-block",
        "attributes": null,
        "children": [
            /* Phrase */ {
                "phrase": "\nCheck out my\n"
            },
            /* Block */ {
                "name": "core/void-inner-block",
                "attributes": null,
                "children": []
            },
            /* Phrase */ {
                "phrase": "\nand my other\n"
            },
            /* Block */ {
                "name": "core/inner-block",
                "attributes": null,
                "children": [
                    /* Phrase */ {
                        "phrase": "\nwith its own content.\n"
                    }
                ]
            },
            /* Phrase */ {
                "phrase": "\n"
            }
        ]
    }
]

The AST is clean and it reflects the document structure: A Block has children that are either Block or Phrase instances. It is a 1:1 mapping of the Gutenberg format.

Keeping the new blockMarkers hack aside, in order for the Gutenberg Rust parser to be compatible with the default Gutenberg parser output, the Block and Phrase objects must be defined as:

class Block {
    constructor(name, attributes, children) {
        this.blockName = name;
        this.attrs = attributes;
        this.innerBlocks = [];
        this.innerHTML = '';

        for (let child of children) {
            if (child instanceof Block) {
                this.innerBlocks.push(child);
            } else if (child instanceof Phrase) {
                this.innerHTML += child.innerHTML;
            }
        }
    }
}

class Phrase {
    constructor(phrase) {
        this.attrs = {};
        this.innerHTML = phrase;
    }
}

The way Block transforms children into innerHTML and innerBlocks shows that the children positions are lost, which is… absurd :-).

What I'm trying to say is: Instead of using the same innerHTML and innerBlocks inherited hacks/workarounds, it would be better to come with a clean AST definition. It's more natural for other parsers to produce a well-structured AST than hacks like innerHTML, innerBlocks, and now —with this proposal— a blockMarkers. It makes no sense to me, and feels like a last-minute solution.

:-)

Hywan

See my comment #11082 (comment).

georgestephanis · 2018-10-29T16:40:13Z

The failing tests deal with a lot of Error: Call to undefined function iconv() -- cc: @dmsnell

Looks like all Core usages of iconv -- https://github.com/WordPress/WordPress/search?q=iconv -- have a function_exists check on it as the extension may not be in place. The ID3 library in Core does have a fallback though --

https://github.com/WordPress/WordPress/blob/6fd8080e7ee7599b36d4528f72a8ced612130b8c/wp-includes/ID3/getid3.lib.php#L960-L1018

mcsf · 2018-10-30T13:33:06Z

Thanks for all the work here. Some topics of review:

`children`, `innerBlocks`, `innerHTML`

I agree with Ivan's suggestion that a parser output that collects all nested fragments — block or freeform — into a block's children property makes more sense. By trivially deriving innerBlocks and innerHTML from it, we'd preserve back-compat, but any party interested in reconstructing nested blocks would consume children directly. I understand this doesn't map directly with the current implementation of block-serialization-default-parser. Furthermore, it would spare us…

String encodings and lengths in PHP

Playing with UCS-2 and conversions worries me. Both because of runtime implications (requiring an environment with iconv or iconv_fallback, plus dealing with potential conversion failure) and the (in)fallibility (which I can't assess) of the length calculation. Bonus question: are the multibyte string functions of any use at all here? My understanding was that they were helpful in these scenarios.

The blockMarkers unit tests for the JS parser are good to have. Could we have that in PHP, and make sure the lengths are the same between JS and PHP?

Beta

The following isn't a closed decision, but an open question.

Given our timelines and my fear of subtly breaking anything in the middle of the 5.0 Beta cycle, is it conceivable that a blockMarker-aware or children-aware version of the parser could be offered separately, as an enhancement of the default parser?

We've already made the parser pluggable both (by way of hooks) in PHP and (b.w.o. WP scripts) in JS.

Any plugins interested in this kind of introspection could pull it in explicitly (Jetpack et al.?), or:
Considering that the Gutenberg plugin will live on after the 5.0 merge as a sort of rolling Beta of phase 2 of Gutenberg development, it would also be the appropriate place to add this enhanced parser. Then, any third-party plugin interested in manipulating nested blocks on the server could require the Gutenberg plugin to be installed.

If we ignore the ucs2length and the children aspects, the actual diff reads well to me and doesn't raise further comments.

Minor: the partition functions in the spec parser could probably be renamed since the predicate argument was dropped.

dmsnell · 2018-10-30T15:15:58Z

@mcsf I'm trying to make a change that doesn't break existing compatibility here so I think that altering the parse format is out of the question.

Given our timelines and my fear of subtly breaking anything in the middle of the 5.0 Beta cycle, is it conceivable that a blockMarker-aware or children-aware version of the parser could be offered separately, as an enhancement of the default parser?

Here I'm going to go back to the motivation for these changes: we broke isomorphism when we decided to create innerBlocks and innerHTML as two independent things. This PR is making it possible to restore that so that the server can process blocks and use the full parser when doing things like rendering a page.

Right now dynamic blocks are still unaware of their nested blocks because the server isn't fully parsing posts.

Playing with UCS-2 and conversions worries me.

let's talk about your worries: "potential conversion failure" is a very unlikely case I think because the transform is well-defined and vetted.

are the multibyte string functions of any use at all here?

sadly no because PHP and JavaScript have different ideas of how many things long a multi-byte character is.

strlen( '𐀀' ) === 4
mb_strlen( "𐀀" ) === 1

strlen( '💩' ) === 4
mb_strlen( '💩' ) === 1

'𐀀'.length === 2
[ ...'𐀀' ].length === 1

'💩'.length === 2
[ ...'💩' ].length === 1

also don't those depend on having the multi-byte PHP extensions loaded just as the iconv functions do?

we can move the conversion to JavaScript but I'm not exactly sure how. it's possible that using the string-cloning-by-iterating method will get a match at the cost of performance.

if it's running in WordPress then iconv_fallback() should be there, right? we don't have to worry about old versions either since this will be in core in 5.0… for environments where WordPress is absent then I think it's reasonable to have those platforms provide iconv_fallback or ucs2length

what can you say about these changes right now for these goals:

changes don't break any existing behaviors or expectations
changes help us process on the server and render using the full parser
changes help us provide inner blocks to dynamic block render callback

dmsnell · 2018-10-30T15:50:19Z

@mcsf we could also switch to something like "bytes into document" which PHP handles quite well with strlen and which we can force in JavaScript if we write a function to count "bytes needed to represent this string in UTF8" which I think isn't too hard of a function

dmsnell · 2018-10-30T15:57:21Z

for my own reference…

call	result
`strlen('a𐀀𠀀❤️😀💩')`	23
`mb_strlen('a𐀀𠀀❤️😀💩')`	7
`'a𐀀𠀀❤️😀💩'.length`	11
`[ ...'a𐀀𠀀❤️😀💩' ].length`	7

I found a function to compute UTF8 byte length in JS and it appears to be right from a code-audit standpoint and from a quick test. it iterates over the string but doesn't clone it.

dmsnell · 2018-11-02T17:07:10Z

Closing in favor of #11334

Attempt three at including positional information from the parse to enable isomorphic reconstruction of the source `post_content` after parsing. See alternate attempts: #11082, #11309 Motivated by: #7247, #8760, Automattic/jetpack#10256 Enables: #10463, #10108 ## Abstract Add new `innerContent` property to each block in parser output indicating where in the innerHTML each innerBlock was found. ## Status - will update fixtures after design review indicates this is the desired approach - all parsers passing new tests for fragment behavior ## Summary Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in `post_content` in place as normal blocks which exist in between another block's comment delimiters. ```html  Check out my  and my other  with its own content.   ``` The way this gets parsed leaves us in a quandary: we cannot reconstruct the original `post_content` after parsing because we lose the origin location information for each inner block since they are only passed as an array of inner blocks. ```json { "blockName": "core/outerBlock", "attrs": {}, "innerBlocks": [ { "blockName": "core/voidInnerBlock", "attrs": {}, "innerBlocks": [], "innerHTML": "" }, { "blockName": "core/innerBlock", "attrs": {}, "innerBlocks": [], "innerHTML": "\nwith its own content.\n" } ], "innerHTML": "\nCheck out my\n\nand my other\n\n" } ``` At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them but we have lost our reverse transformation. In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the `post_content` and first stage of parsing. If we can tear apart a Gutenberg post and reassemble then it will let us to structurally-informed processing of the posts without needing to be aware of all the block JavaScript. The proposed mechanism is a new property as a **list of HTML fragments with `null` values interspersed between those fragments where the blocks were found**. ```json { "blockName": "core/outerBlock", "attrs": {}, "innerBlocks": [ { "blockName": "core/voidInnerBlock", "attrs": {}, "innerBlocks": [], "blockMarkers": [], "innerHTML": "" }, { "blockName": "core/innerBlock", "attrs": {}, "innerBlocks": [], "blockMarkers": [], "innerHTML": "\nwith its own content.\n" } ], "innerHTML": "\nCheck out my\n\nand my other\n\n", "innerContent": [ "\nCheck out my\n", null, "\n and my other\n", null, "\n" ], } ``` Doing this allows us to replace those `null` values with their associated block (sequentially) from `innerBlocks`. ## Questions - Why not use a string token instead of an array? - See #11309. The fundamental problem with the token is that it could be valid content input from a person and so there's a probability that we would fail to split the content accurately. - Why add the `null` instead of leaving basic array splits like `[ 'before', 'after' ]`? - By inspection we can see that without an explicit marker we don't know if the block came before or after or between array elements. We could add empty strings `''` and say that blocks exist only _between_ array elements but the parser code would have to be more complicated to make sure we appropriately add those empty strings. The empty strings are a bit odd anyway. - Why add a new property? - Code already depends on `innerHTML` and `innerBlocks`; I don't want to break any existing behaviors and adding is less risky than changing.

dmsnell requested review from mcsf, pento, mtias, gziolo and aduth October 25, 2018 20:51

dmsnell added the [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f label Oct 25, 2018

dmsnell changed the title ~~Update post grammar to include markers for inner blocks~~ WIP: Update post grammar to include markers for inner blocks Oct 25, 2018

Hywan suggested changes Oct 26, 2018

View reviewed changes

georgestephanis mentioned this pull request Oct 26, 2018

Contact Form Gutenblock Automattic/jetpack#10256

Closed

5 tasks

dmsnell changed the title ~~WIP: Update post grammar to include markers for inner blocks~~ Update post grammar to include markers for inner blocks Oct 26, 2018

dmsnell mentioned this pull request Oct 27, 2018

Parsing: Use full parser in do_blocks with nested block support #11141

Merged

georgestephanis mentioned this pull request Oct 28, 2018

Parser: Replace dynamic-block regex in do_blocks #10463

Closed

7 tasks

georgestephanis mentioned this pull request Oct 30, 2018

Parser/do dynamic blocks with inner blocks #11227

Closed

4 tasks

dmsnell force-pushed the parser/add-markers branch 2 times, most recently from 35e4377 to bff5592 Compare October 30, 2018 15:37

reset branch on master

a5f5afb

dmsnell force-pushed the parser/add-markers branch from 13cdfdd to a5f5afb Compare October 31, 2018 22:16

dmsnell mentioned this pull request Nov 2, 2018

Parser: Add new list of HTML fragments to parse output #11334

Merged

dmsnell closed this Nov 2, 2018

aduth deleted the parser/add-markers branch January 25, 2019 21:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update post grammar to include markers for inner blocks #11082

Update post grammar to include markers for inner blocks #11082

dmsnell commented Oct 25, 2018 •

edited

Loading

georgestephanis commented Oct 25, 2018

Hywan commented Oct 26, 2018 •

edited

Loading

Hywan left a comment

georgestephanis commented Oct 29, 2018 •

edited

Loading

mcsf commented Oct 30, 2018

dmsnell commented Oct 30, 2018

dmsnell commented Oct 30, 2018

dmsnell commented Oct 30, 2018 •

edited

Loading

dmsnell commented Nov 2, 2018

Update post grammar to include markers for inner blocks #11082

Update post grammar to include markers for inner blocks #11082

Conversation

dmsnell commented Oct 25, 2018 • edited Loading

Abstract

Summary

georgestephanis commented Oct 25, 2018

Hywan commented Oct 26, 2018 • edited Loading

Hywan left a comment

Choose a reason for hiding this comment

georgestephanis commented Oct 29, 2018 • edited Loading

mcsf commented Oct 30, 2018

children, innerBlocks, innerHTML

String encodings and lengths in PHP

Beta

dmsnell commented Oct 30, 2018

dmsnell commented Oct 30, 2018

dmsnell commented Oct 30, 2018 • edited Loading

dmsnell commented Nov 2, 2018

dmsnell commented Oct 25, 2018 •

edited

Loading

Hywan commented Oct 26, 2018 •

edited

Loading

georgestephanis commented Oct 29, 2018 •

edited

Loading

`children`, `innerBlocks`, `innerHTML`

dmsnell commented Oct 30, 2018 •

edited

Loading