comment it
dmsnell committed Aug 23, 2018
1 parent a9e9d01 commit 396f748
Showing 1 changed file with 69 additions and 0 deletions.
69 changes: 69 additions & 0 deletions lib/parser-rd-trampoline.php
@@ -1,5 +1,74 @@
<?php

/**
* Implements the formal specification for parsing Gutenberg documents
* serialized into HTML (nominally in `post_content` of a WordPress post)
*
* @see https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-spec-parser
*
* ## What is different about this one from the spec-parser?
*
* This is a recursive-descent parser that scans linearly once through the input document.
* Instead of recursing directly, it uses a trampoline mechanism to prevent stack overflow.
* To minimize copying and passing data around, it's built as a class that keeps its state in
* class properties.
* Between every token (a block comment delimiter) we can instrument the parser and intervene.
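*
* To make the trampoline idea concrete, here is a minimal sketch. It is not this parser's
* code: `sketch_trampoline()` and `sketch_count_down()` are hypothetical names. Each step
* returns the next step instead of calling it, and a plain loop drives the steps:
*
```php
<?php
// Hypothetical sketch of a trampoline; not the parser's actual code.
// A "step" is a callable that returns the next step, or null when done.
function sketch_trampoline( callable $step ) {
	while ( null !== $step ) {
		// Run one step and pick up the next one. Because every step
		// returns before the next begins, the call stack never grows.
		$step = $step();
	}
}

// Example step factory: records $n, then hands back the step for $n - 1.
function sketch_count_down( $n, array &$seen ) {
	return function () use ( $n, &$seen ) {
		$seen[] = $n;
		return $n > 0 ? sketch_count_down( $n - 1, $seen ) : null;
	};
}

$seen = array();
sketch_trampoline( sketch_count_down( 3, $seen ) );
// $seen now holds 3, 2, 1, 0 in that order.
```
*
* The same loop handles a very deep countdown without deepening the call stack, which is
* the property a directly recursive version would lack.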
*
* The spec parser is defined via a _Parsing Expression Grammar_ (PEG), which inherently
* answers many questions that we must answer explicitly in this parser. The goal for this
* implementation is to match the characteristics of the PEG closely enough to be a drop-in
* replacement whose only differences are better runtime performance and memory usage.
*
* ## How does it work?
*
* It's pretty self-explanatory...haha
*
* Every Gutenberg document is nominally an HTML document which in addition to normal HTML may
* also contain specially designed HTML comments - the block comment delimiters - which separate
* and isolate the blocks which are serialized in the document.
*
* This parser attempts to create a kind of state-machine around the transitions triggered from
* those delimiters - the "tokens" of the grammar. Every time we find one we should only be doing
* one of a small set of actions:
*
* - enter a new block
* - exit out of a block
*
* Those actions have different effects depending on the context; for instance, when we exit a
* block we either need to add it to the output block list _or_ we need to append it as the
* next `innerBlock` on the parent block below it in the block stack (the place where we track
* open blocks). The details are documented below.
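*
* The two actions and the block stack can be sketched roughly as follows. This assumes
* the delimiters have already been reduced to a flat list of `[ type, name ]` pairs;
* `sketch_nest_blocks()` is a hypothetical helper, not part of this parser:
*
```php
<?php
// Hypothetical sketch: nest blocks from a flat list of delimiter tokens.
// Each token is array( 'opener' | 'closer', block name ).
function sketch_nest_blocks( array $tokens ) {
	$output = array(); // finished top-level blocks
	$stack  = array(); // currently open blocks, innermost last

	foreach ( $tokens as $token ) {
		list( $type, $name ) = $token;

		if ( 'opener' === $type ) {
			// Enter a new block: keep it on the stack until it closes.
			$stack[] = array( 'blockName' => $name, 'innerBlocks' => array() );
			continue;
		}

		// Exit a block: pop it off the stack.
		$block = array_pop( $stack );
		if ( null === $block ) {
			continue; // stray closer; simply ignored in this sketch
		}

		if ( empty( $stack ) ) {
			// No parent is open, so this is a top-level block.
			$output[] = $block;
		} else {
			// Otherwise it becomes the next inner block of its parent.
			$stack[ count( $stack ) - 1 ]['innerBlocks'][] = $block;
		}
	}

	return $output;
}
```
*
* This sketch leaves out `innerHTML` and any handling of blocks that are still open when
* the input ends.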
*
* The biggest challenge in this parser is correctly accounting for the indices required
* to construct the `innerHTML` values for each block at every level of nesting depth. We take
* a simple approach:
*
* - start each newly-opened block with an empty `innerHTML`
* - whenever we push a first block into the `innerBlocks` list then add the content from
* where the content of the parent block started to where this inner block starts
* - whenever we push another block into the `innerBlocks` list then add the content from
* where the previous inner block ended to where this inner block starts
* - when we close out an open block we add the content from where the last inner block
* ended to where the closing block delimiter starts
* - if there are no inner blocks then we take the entire content between the opening and
* closing block comment delimiters as the `innerHTML`
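*
* The accounting in the list above can be sketched as one function over offsets. The name
* `sketch_inner_html()` and its parameters are assumptions made for illustration; the
* parser itself tracks these positions incrementally as it goes:
*
```php
<?php
// Hypothetical sketch of the index accounting; names are illustrative.
// $content_start: offset just past the opening delimiter
// $content_end:   offset where the closing delimiter starts
// $inner_spans:   list of array( start, end ) offsets, one per inner block,
//                 each covering that inner block including its delimiters
function sketch_inner_html( $document, $content_start, $content_end, array $inner_spans ) {
	$html   = '';             // every block starts with an empty innerHTML
	$cursor = $content_start; // where the not-yet-copied content begins

	foreach ( $inner_spans as $span ) {
		list( $inner_start, $inner_end ) = $span;
		// Add the content from the cursor up to where this inner block starts.
		$html  .= substr( $document, $cursor, $inner_start - $cursor );
		// Skip over the inner block itself.
		$cursor = $inner_end;
	}

	// Add what remains up to the closing delimiter. With no inner blocks
	// this is the entire content between the two delimiters.
	return $html . substr( $document, $cursor, $content_end - $cursor );
}
```
*
* For example, with the stand-in document `<o>a<i>x</i>b<j>y</j>c</o>`, content offsets 3
* and 22, and inner spans `[4, 12]` and `[13, 21]`, the result is `abc`.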
*
* ## I meant, how does it perform?
*
* This parser operates much faster than the generated parser from the specification.
* Because we know more about the parsing than the PEG does we can take advantage of several
* tricks to improve our speed and memory usage:
*
* - we only have one or two distinct tokens, depending on how you look at it, and they are
* all readily matched via a regular expression. Instead of parsing on a
* character-per-character basis we can let the PCRE RegExp engine skip over large swaths
* of the document for us in order to find those tokens.
* - since `preg_match()` takes an `offset` parameter we can crawl through the input
* without passing copies of the input text on every step. We can track our position
* in the string and only pass a number instead.
* - not copying all those strings means that we'll also skip many memory allocations
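*
* A minimal sketch of crawling with an offset is shown below. The pattern is a deliberate
* simplification (it ignores JSON attributes and treats void blocks as openers), and
* `sketch_find_tokens()` is a hypothetical name rather than anything in this parser:
*
```php
<?php
// Hypothetical sketch: list delimiter tokens by advancing an integer offset.
// The pattern is simplified and is not the one the parser uses.
function sketch_find_tokens( $document ) {
	$pattern = '~<!--\s+(/?)wp:([a-z][a-z0-9_/-]*)\s+/?-->~';
	$tokens  = array();
	$offset  = 0;

	// preg_match() begins its search at $offset, so the document string is
	// never sliced or copied; only the integer position changes.
	while ( 1 === preg_match( $pattern, $document, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
		list( $text, $start ) = $m[0];

		$tokens[] = array(
			'type'   => '/' === $m[1][0] ? 'closer' : 'opener',
			'name'   => $m[2][0],
			'start'  => $start,
			'length' => strlen( $text ),
		);

		// Resume the search just past the token we found.
		$offset = $start + strlen( $text );
	}

	return $tokens;
}
```
*
* Each token records where it starts and how long it is, which is the kind of position
* data the `innerHTML` accounting needs.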
*
*/

function rdt_parse( $document ) {
static $parser;
