Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update post grammar to include markers for inner blocks #11082

Closed
wants to merge 1 commit into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Oct 25, 2018

Motivated by #8760
Necessary for #10463, #10108

Abstract

Add new property to parsed block object indicating where in the innerHTML each innerBlock was found.

Summary

Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in post_content in place as normal blocks which exist in between another block's comment delimiters.

<!-- wp:outerBlock -->
Check out my
<!-- wp:voidInnerBlock /-->
and my other
<!-- wp:innerBlock -->
with its own content.
<!-- /wp:innerBlock -->
<!-- /wp:outerBlock -->

The way this gets parsed leaves us in a quandary: we cannot reconstruct the original post_content after parsing because we lose the origin location information for each inner block since they are only passed as an array of inner blocks.

{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"innerHTML": "\nCheck out my\n\nand my other\n\n"
}

At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them but we have lost our reverse transformation.

In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the post_content and first stage of parsing. If we can tear apart a Gutenberg post and reassemble then it will let us to structurally-informed processing of the posts without needing to be aware of all the block JavaScript.

The proposed mechanism is a form of tombstone or blockMarkers that specify the index into innerHTML where the innerBlocks were found.

{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"blockMarkers": [ 14, 28 ],
	"innerHTML": "\nCheck out my\n\nand my other\n\n"
}

The block markers represent the UCS-2 index UTF-8 byte-length into the innerHTML where the block was found and where it should be replaced if we reserialize. UCS-2 has its own quirks that become important to recognize when dealing with Unicode strings. That is, we get the value by taking the portion of innerHTML preceding that inner block and then compute how many bytes are required to represent it in UTF-8, then that count is our index.

The array of markers is of the same length as the array of innerBlocks and each location index corresponds to the block at the same array index in innerBlocks.

Simplified, aa[1]bb[2]cc would correspond to text: "aabbcc", blocks: [1,2], markers: [2, 4 ]

@dmsnell dmsnell added the [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f label Oct 25, 2018
@georgestephanis
Copy link
Contributor

That'd work for my needs, but part of me wonders if it would make more sense to inject some sort of comment into the html to be str_replaced with the generated block markup instead of the index of a string that could get more easily mucked up if anything passes through a filter that tweaks it

@dmsnell dmsnell changed the title Update post grammar to include markers for inner blocks WIP: Update post grammar to include markers for inner blocks Oct 25, 2018
@Hywan
Copy link

Hywan commented Oct 26, 2018

Thank for you for addressing this issue.

My personal point of view is that the solution is very hacky, and closes the door to alternative parsers that are able to produce better outputs easily, like a real AST (Abstract Syntax Tree). For instance, with the Rust Gutenberg parser, the AST is defined as:

pub enum Node<'a> {
    Block {
        name: (Input<'a>, Input<'a>),
        attributes: Option<Input<'a>>,
        children: Vec<Node<'a>>
    },
    Phrase(Input<'a>)
}

And thus the following input:

<!-- wp:outer-block -->
Check out my
<!-- wp:void-inner-block /-->
and my other
<!-- wp:inner-block -->
with its own content.
<!-- /wp:inner-block -->
<!-- /wp:outer-block -->

produces the following output:

[
    /* Block */ {
        "name": "core/outer-block",
        "attributes": null,
        "children": [
            /* Phrase */ {
                "phrase": "\nCheck out my\n"
            },
            /* Block */ {
                "name": "core/void-inner-block",
                "attributes": null,
                "children": []
            },
            /* Phrase */ {
                "phrase": "\nand my other\n"
            },
            /* Block */ {
                "name": "core/inner-block",
                "attributes": null,
                "children": [
                    /* Phrase */ {
                        "phrase": "\nwith its own content.\n"
                    }
                ]
            },
            /* Phrase */ {
                "phrase": "\n"
            }
        ]
    }
]

The AST is clean and it reflects the document structure: A Block has children that are either Block or Phrase instances. It is a 1:1 mapping of the Gutenberg format.

Keeping the new blockMarkers hack aside, in order for the Gutenberg Rust parser to be compatible with the default Gutenberg parser output, the Block and Phrase objects must be defined as:

class Block {
    constructor(name, attributes, children) {
        this.blockName = name;
        this.attrs = attributes;
        this.innerBlocks = [];
        this.innerHTML = '';

        for (let child of children) {
            if (child instanceof Block) {
                this.innerBlocks.push(child);
            } else if (child instanceof Phrase) {
                this.innerHTML += child.innerHTML;
            }
        }
    }
}

class Phrase {
    constructor(phrase) {
        this.attrs = {};
        this.innerHTML = phrase;
    }
}

The way Block transforms children into innerHTML and innerBlocks shows that the children positions are lost, which is… absurd :-).

What I'm trying to say is: Instead of using the same innerHTML and innerBlocks inherited hacks/workarounds, it would be better to come with a clean AST definition. It's more natural for other parsers to produce a well-structured AST than hacks like innerHTML, innerBlocks, and now —with this proposal— a blockMarkers. It makes no sense to me, and feels like a last-minute solution.

:-)

Copy link

@Hywan Hywan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment #11082 (comment).

@dmsnell dmsnell changed the title WIP: Update post grammar to include markers for inner blocks Update post grammar to include markers for inner blocks Oct 26, 2018
@georgestephanis
Copy link
Contributor

georgestephanis commented Oct 29, 2018

The failing tests deal with a lot of Error: Call to undefined function iconv() -- cc: @dmsnell

Looks like all Core usages of iconv -- https://github.com/WordPress/WordPress/search?q=iconv -- have a function_exists check on it as the extension may not be in place. The ID3 library in Core does have a fallback though --

https://github.com/WordPress/WordPress/blob/6fd8080e7ee7599b36d4528f72a8ced612130b8c/wp-includes/ID3/getid3.lib.php#L960-L1018

@mcsf
Copy link
Contributor

mcsf commented Oct 30, 2018

Thanks for all the work here. Some topics of review:

children, innerBlocks, innerHTML

I agree with Ivan's suggestion that a parser output that collects all nested fragments — block or freeform — into a block's children property makes more sense. By trivially deriving innerBlocks and innerHTML from it, we'd preserve back-compat, but any party interested in reconstructing nested blocks would consume children directly. I understand this doesn't map directly with the current implementation of block-serialization-default-parser. Furthermore, it would spare us…

String encodings and lengths in PHP

Playing with UCS-2 and conversions worries me. Both because of runtime implications (requiring an environment with iconv or iconv_fallback, plus dealing with potential conversion failure) and the (in)fallibility (which I can't assess) of the length calculation. Bonus question: are the multibyte string functions of any use at all here? My understanding was that they were helpful in these scenarios.

The blockMarkers unit tests for the JS parser are good to have. Could we have that in PHP, and make sure the lengths are the same between JS and PHP?

Beta

The following isn't a closed decision, but an open question.

Given our timelines and my fear of subtly breaking anything in the middle of the 5.0 Beta cycle, is it conceivable that a blockMarker-aware or children-aware version of the parser could be offered separately, as an enhancement of the default parser?

We've already made the parser pluggable both (by way of hooks) in PHP and (b.w.o. WP scripts) in JS.

  • Any plugins interested in this kind of introspection could pull it in explicitly (Jetpack et al.?), or:
  • Considering that the Gutenberg plugin will live on after the 5.0 merge as a sort of rolling Beta of phase 2 of Gutenberg development, it would also be the appropriate place to add this enhanced parser. Then, any third-party plugin interested in manipulating nested blocks on the server could require the Gutenberg plugin to be installed.

If we ignore the ucs2length and the children aspects, the actual diff reads well to me and doesn't raise further comments.

Minor: the partition functions in the spec parser could probably be renamed since the predicate argument was dropped.

@dmsnell
Copy link
Member Author

dmsnell commented Oct 30, 2018

@mcsf I'm trying to make a change that doesn't break existing compatibility here so I think that altering the parse format is out of the question.

Given our timelines and my fear of subtly breaking anything in the middle of the 5.0 Beta cycle, is it conceivable that a blockMarker-aware or children-aware version of the parser could be offered separately, as an enhancement of the default parser?

Here I'm going to go back to the motivation for these changes: we broke isomorphism when we decided to create innerBlocks and innerHTML as two independent things. This PR is making it possible to restore that so that the server can process blocks and use the full parser when doing things like rendering a page.

Right now dynamic blocks are still unaware of their nested blocks because the server isn't fully parsing posts.

Playing with UCS-2 and conversions worries me.

let's talk about your worries: "potential conversion failure" is a very unlikely case I think because the transform is well-defined and vetted.

are the multibyte string functions of any use at all here?

sadly no because PHP and JavaScript have different ideas of how many things long a multi-byte character is.

strlen( '𐀀' ) === 4
mb_strlen( "𐀀" ) === 1

strlen( '💩' ) === 4
mb_strlen( '💩' ) === 1
'𐀀'.length === 2
[ ...'𐀀' ].length === 1

'💩'.length === 2
[ ...'💩' ].length === 1

also don't those depend on having the multi-byte PHP extensions loaded just as the iconv functions do?

we can move the conversion to JavaScript but I'm not exactly sure how. it's possible that using the string-cloning-by-iterating method will get a match at the cost of performance.

if it's running in WordPress then iconv_fallback() should be there, right? we don't have to worry about old versions either since this will be in core in 5.0… for environments where WordPress is absent then I think it's reasonable to have those platforms provide iconv_fallback or ucs2length


what can you say about these changes right now for these goals:

  • changes don't break any existing behaviors or expectations
  • changes help us process on the server and render using the full parser
  • changes help us provide inner blocks to dynamic block render callback

@dmsnell dmsnell force-pushed the parser/add-markers branch 2 times, most recently from 35e4377 to bff5592 Compare October 30, 2018 15:37
@dmsnell
Copy link
Member Author

dmsnell commented Oct 30, 2018

@mcsf we could also switch to something like "bytes into document" which PHP handles quite well with strlen and which we can force in JavaScript if we write a function to count "bytes needed to represent this string in UTF8" which I think isn't too hard of a function

@dmsnell
Copy link
Member Author

dmsnell commented Oct 30, 2018

for my own reference…

call result
strlen('a𐀀𠀀❤️😀💩') 23
mb_strlen('a𐀀𠀀❤️😀💩') 7
'a𐀀𠀀❤️😀💩'.length 11
[ ...'a𐀀𠀀❤️😀💩' ].length 7

I found a function to compute UTF8 byte length in JS and it appears to be right from a code-audit standpoint and from a quick test. it iterates over the string but doesn't clone it.

@dmsnell
Copy link
Member Author

dmsnell commented Nov 2, 2018

Closing in favor of #11334

dmsnell added a commit that referenced this pull request Nov 7, 2018
Attempt three at including positional information from the parse to enable isomorphic reconstruction of the source `post_content` after parsing.

See alternate attempts: #11082, #11309
Motivated by: #7247, #8760, Automattic/jetpack#10256
Enables: #10463, #10108

## Abstract

Add new `innerContent` property to each block in parser output indicating where in the innerHTML each innerBlock was found.

## Status

 - will update fixtures after design review indicates this is the desired approach
 - all parsers passing new tests for fragment behavior

## Summary

Inner blocks, or nested blocks, or blocks-within-blocks, can exist in Gutenberg posts. They are serialized in `post_content` in place as normal blocks which exist in between another block's comment delimiters.

```html
<!-- wp:outerBlock -->
Check out my
<!-- wp:voidInnerBlock /-->
and my other
<!-- wp:innerBlock -->
with its own content.
<!-- /wp:innerBlock -->
<!-- /wp:outerBlock -->
```

The way this gets parsed leaves us in a quandary: we cannot reconstruct the original `post_content` after parsing because we lose the origin location information for each inner block since they are only passed as an array of inner blocks.

```json
{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"innerHTML": "\nCheck out my\n\nand my other\n\n"
}
```

At this point we have parsed the blocks and prepared them for attaching into the JavaScript block code that interprets them but we have lost our reverse transformation.

In this PR I'd like to introduce a new mechanism which shouldn't break existing functionality but which will enable us to go back and forth isomorphically between the `post_content` and first stage of parsing. If we can tear apart a Gutenberg post and reassemble then it will let us to structurally-informed processing of the posts without needing to be aware of all the block JavaScript.

The proposed mechanism is a new property as a **list of HTML fragments with `null` values interspersed between those fragments where the blocks were found**.

```json
{
	"blockName": "core/outerBlock",
	"attrs": {},
	"innerBlocks": [
		{
			"blockName": "core/voidInnerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": ""
		},
		{
			"blockName": "core/innerBlock",
			"attrs": {},
			"innerBlocks": [],
			"blockMarkers": [],
			"innerHTML": "\nwith its own content.\n"
		}
	],
	"innerHTML": "\nCheck out my\n\nand my other\n\n",
	"innerContent": [ "\nCheck out my\n", null, "\n and my other\n", null, "\n" ],
}
```

Doing this allows us to replace those `null` values with their associated block (sequentially) from `innerBlocks`.

## Questions

 - Why not use a string token instead of an array?
    - See #11309. The fundamental problem with the token is that it could be valid content input from a person and so there's a probability that we would fail to split the content accurately.

 - Why add the `null` instead of leaving basic array splits like `[ 'before', 'after' ]`?
    - By inspection we can see that without an explicit marker we don't know if the block came before or after or between array elements. We could add empty strings `''` and say that blocks exist only _between_ array elements but the parser code would have to be more complicated to make sure we appropriately add those empty strings. The empty strings are a bit odd anyway.

 - Why add a new property?
    - Code already depends on `innerHTML` and `innerBlocks`; I don't want to break any existing behaviors and adding is less risky than changing.
@aduth aduth deleted the parser/add-markers branch January 25, 2019 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants