
Parser: Add a Jest test to build and compare language-agnostic parsers #6030

Closed
wants to merge 5 commits

Conversation

@youknowriad (Contributor)

The idea of this PR is to provide the tools (validation and performance measurement) needed to build block parsers.

This adds a Jest test that accepts a binary (via an environment variable) as its entry point, parses a list of fixtures using the given binary, and compares the results with snapshots.

To test this, you can generate the PEG.js-based binary by running `npm run build-parser` and then run the test like this:

```sh
BINARY=test/parser/bin/peg-parser.js npm run test-parser
```

To build and test your own custom parser, you need to build a binary (using any language: PHP, JS, ...). This binary should accept two arguments: the first is the input file (an HTML file) and the second is the file where the JSON output should be written. Something like this: `mybinary myfixture.html myfixture.json`.

And then you can test your parser like this:

```sh
BINARY=mybinary npm run test-parser
```

Right now I've only added a small file as a fixture; we'd have to add a bunch of examples, including long posts, to compare performance properly.
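To make the two-argument contract above concrete, here is a minimal Node stub (illustrative only; the output is a placeholder, not a spec-compliant block parse):

```js
#!/usr/bin/env node
// Hypothetical stub: read the HTML fixture passed as the first
// argument and write a JSON result to the path passed as the second.
const fs = require( 'fs' );

const [ inputFile, outputFile ] = process.argv.slice( 2 );
const html = fs.readFileSync( inputFile, 'utf8' );

// A real parser would produce the block list here.
const result = [ { placeholder: true, length: html.length } ];

fs.writeFileSync( outputFile, JSON.stringify( result ) );
```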

@youknowriad added the [Feature] Parsing label Apr 6, 2018
@youknowriad self-assigned this Apr 6, 2018
@dmsnell (Member) left a comment


Thanks for jumping on this!

`binary` is somewhat misleading. We're not asking for a binary, and we'll rarely provide one. What we're asking for is an executable path, and the goal is to run a parser, right? Why not call it something like that?

Second, I guess there's value in running this against our fixtures, but I find the need to store the results of the parse in those files a bit dubious. These tests are for comparing parser output, so maybe we should just run both parsers each time we test and forget about saving the results; the provided parser need only accept the input document on stdin and produce the parse on stdout. Note that if we rely on the snapshots, we add needless obstacles for anyone who wants to change the parser while making sure they can replicate the change in their home-built parser: we'd have to update the snapshot every time we change the official PEG.
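As a sketch of that stdin/stdout contract (hypothetical code, not part of this PR), a provided parser could be as small as:

```js
#!/usr/bin/env node
// Hypothetical sketch: read the document from stdin, write the parse
// as JSON to stdout.
let input = '';
process.stdin.setEncoding( 'utf8' );
process.stdin.on( 'data', ( chunk ) => {
	input += chunk;
} );
process.stdin.on( 'end', () => {
	const blocks = []; // a real parser would populate this from `input`
	process.stdout.write( JSON.stringify( blocks ) );
} );
```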

To that end we can eliminate the new `test/parser/bin/peg-parser.js` file and the Jest test, or at least simplify it (no file access, no rimraf, no snapshots, etc.).

Further, with snapshot testing we're not testing that the parsers produce equal output under the kind of equality we care about; we're testing for a byte-for-byte equality that doesn't matter. We don't care if the output objects for each block have keys in differing orders, or if we produce an empty `attrs` object vs. omitting it.

I propose that we focus on testing the semantic equality of the parsers, so that we don't break trust in the tests.
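For instance (a sketch of one possible approach, not code from this PR), both parses could be canonicalized before a deep-equality check, so that key order and empty-vs-omitted `attrs` no longer cause spurious failures:

```js
// Recursively sort object keys and drop empty `attrs` objects so that
// two semantically equal parses compare as equal.
function canonicalize( value ) {
	if ( Array.isArray( value ) ) {
		return value.map( canonicalize );
	}
	if ( value && typeof value === 'object' ) {
		return Object.keys( value )
			.sort()
			.filter( ( key ) => {
				const v = value[ key ];
				// Treat an empty attrs object as if it were omitted.
				return ! (
					key === 'attrs' &&
					v &&
					typeof v === 'object' &&
					Object.keys( v ).length === 0
				);
			} )
			.reduce( ( acc, key ) => {
				acc[ key ] = canonicalize( value[ key ] );
				return acc;
			}, {} );
	}
	return value;
}

// e.g. expect( canonicalize( candidate ) ).toEqual( canonicalize( reference ) );
```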

Did we consider providing this outside of the Gutenberg repo? I can see several benefits to doing that, in terms of keeping the parser development cycle lean and the tests easy to understand and discover. Ideally, I think we could expose this as a web service where someone pastes in their code, we execute it, and we provide an easily navigable results panel. Think of something like REPL.it with a diff output.

Anyway, I think a web service is down the road, but having a separate repo could be valuable now.

thanks!

@youknowriad (Contributor, Author)

> `binary` is somewhat misleading. We're not asking for a binary, and we'll rarely provide one. What we're asking for is an executable path, and the goal is to run a parser, right? Why not call it something like that?

Agreed, maybe I can change it to `PARSER`.

> Note that if we rely on the snapshots, we add needless obstacles for anyone who wants to change the parser while making sure they can replicate the change in their home-built parser

They don't need to replicate anything; the idea is that they'll run the tests with the tool provided here, and it will automatically check against the provided snapshots.

> we'd have to update the snapshot every time we change the official PEG.

Yes, and that's fine, because we want to validate that the output is similar across all parsers. I guess the alternative would be to treat the PEG parser as the source of truth instead of the snapshots. I don't think that's a great idea: we should look at the PEG parser as just another parser, and avoid running it during other parsers' test runs because it can have an impact on performance (misleading the results).

> Further, with snapshot testing we're not testing that the parsers produce equal output under the kind of equality we care about; we're testing for a byte-for-byte equality that doesn't matter. We don't care if the output objects for each block have keys in differing orders, or if we produce an empty `attrs` object vs. omitting it.

Running `JSON.parse` before comparing against the snapshots helps eliminate formatting differences, but it's true that the order of the keys may change. We could tweak the test to order the JSON keys, but I don't think it's critical.

> Did we consider providing this outside of the Gutenberg repo? I can see several benefits to doing that, in terms of keeping the parser development cycle lean and the tests easy to understand and discover. Ideally, I think we could expose this as a web service where someone pastes in their code, we execute it, and we provide an easily navigable results panel. Think of something like REPL.it with a diff output.

Sounds like a good idea, though let's not create another repository right now. Let's focus on getting the tool right; we can extract it later if needed (the same way we did for other packages).

@dmsnell (Member) commented Apr 6, 2018

> They don't need to replicate anything; the idea is that they'll run the tests with the tool provided here, and it will automatically check against the provided snapshots.

Yeah, the point is that I don't see any reason to make this as complicated as we have. We're not incorporating other parsers into this project's master branch; we're just providing a tool to answer the question "is my parser (or my version of the parser) identical in its output to the official parser?"

> I guess the alternative would be to treat the PEG parser as the source of truth instead of the snapshots

I think we have to make the PEG parser the source of truth, because we have no other specification. That parser is the specification right now, whereas snapshot tests not only don't specify anything, they also don't present themselves in a readable way. They can change unexpectedly, too, and we'd be none the wiser unless we caught it in review. If the parser itself were changed, we would all know right away when the app broke.
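Concretely (a hypothetical harness sketch, not code from this PR; the paths and the `PARSER` variable name are illustrative), that would mean running both parsers on each fixture and comparing them directly, with no stored snapshots:

```js
const { execFileSync } = require( 'child_process' );
const fs = require( 'fs' );
const assert = require( 'assert' );

// Run an executable parser, feeding the fixture on stdin and reading
// the JSON parse from stdout.
function parseWith( executable, inputFile ) {
	const stdout = execFileSync( executable, {
		input: fs.readFileSync( inputFile ),
	} );
	return JSON.parse( stdout.toString() );
}

const fixture = 'test/parser/fixtures/demo-post.html'; // illustrative path
const reference = parseWith( 'test/parser/bin/peg-parser.js', fixture );
const candidate = parseWith( process.env.PARSER, fixture );

// The official PEG parser acts as the specification.
assert.deepStrictEqual( candidate, reference );
```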

@youknowriad (Contributor, Author)

The advantage of keeping the snapshots is that they also serve as tests for the PEG parser, but I'm open to change. Let's see what the others think.

@dmsnell (Member) commented Apr 6, 2018

> The advantage of keeping the snapshots is that they also serve as tests for the PEG parser, but I'm open to change. Let's see what the others think.

Yeah, that seems fine, but it also seems like something for the project itself, to catch unanticipated regressions; not an implicit and poorly-defined specification, and not a way of supporting third-party parsers. I think those are separate things.

@mcsf (Contributor) commented Apr 9, 2018

> `BINARY=mybinary npm run test-parser`

Worth noting that the environment variable isn't properly propagated, likely because the npm script defers to npx:

`npm run test-parser` → `npx jest test/parser/test.js`

Without changing the script, what works for me is exporting the variable:

```sh
export BINARY=test/parser/bin/peg-parser.js
npm run test-parser
```

(Secondly, I'd also prefer a term like parser or executable over binary.)

@youknowriad (Contributor, Author)

> Worth noting that the environment variable isn't properly propagated, likely because the npm script defers to npx:

Weird; I just tested and it works for me in both cases (npx and npm run).

@youknowriad (Contributor, Author)

I removed the snapshots and added the Gutenberg demo post as a fixture. Feel free to add other long posts as fixtures if needed.

@dmsnell mentioned this pull request Apr 16, 2018
@dmsnell (Member) commented Apr 23, 2018

Related: I built an in-browser test that accepts three inputs…

  • parser A
  • parser B
  • list of URLs of posts to parse to compare

The parsers are JS source code that will be `eval()`'d to generate the parser. That is, they should be source code that, when evaluated, returns a function that, when called, parses content:

```js
// length parser - not entirely spec-compliant to gutenberg
;( function() {
	return function( content ) {
		return content.length;
	};
} )();
```

The post list is a list of URLs, one per line, from which the `post_content` of a Gutenberg post may be downloaded.

The page will run randomized trials of each parser against the input documents and update the report as it goes. It will continue to iterate until told to stop.

https://dmsnell.github.io/gutenberg-parser-comparator/
repo

Hywan added a commit to Hywan/gutenberg-parser-rs that referenced this pull request May 7, 2018
```js
const inputFile = path.resolve( fixturesDir, f + '.html' );
const start = Date.now();
exec( `cat ${ inputFile } | ${ parserExecutable }`, ( err, parserOutput ) => {
	const end = Date.now();
```

A review comment on this hunk:

This approach counts the boot time of the `parserExecutable` in the measurement, and that time dominates. For a script written with Node.js, whatever you use inside the file (a native module, plain JS, or WASM) makes almost no difference.

A difference only becomes visible when you use another kind of executable, a native Rust binary for instance.

I think these numbers can mislead the user. Should we remove them and come up with a more robust in-browser approach?
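One way to take boot time out of the numbers (a hypothetical sketch, not part of this PR) would be for each parser to measure and report its own parse time, keeping the timing on stderr and the JSON parse on stdout:

```js
// Hypothetical: time only the parse inside the child process, so that
// interpreter/VM boot time is excluded from the reported number.
const fs = require( 'fs' );

const input = fs.readFileSync( 0, 'utf8' ); // read the document from stdin
const parse = ( html ) => []; // stand-in for the parser under test

const start = process.hrtime();
const blocks = parse( input );
const [ s, ns ] = process.hrtime( start );

process.stderr.write( `parse-ms: ${ s * 1000 + ns / 1e6 }\n` );
process.stdout.write( JSON.stringify( blocks ) );
```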

@dmsnell (Member) commented Jun 15, 2018

Rebased to merge in updates from master

@dmsnell (Member) commented Jun 17, 2018

While working on this I came across some issues that I felt may not work out well with this approach. Although they're quite incomplete, I have written down my thoughts on what one solution to our concerns could look like. Given that, do we still think this is worth merging?

@youknowriad (Contributor, Author)

Closing as superseded by other efforts

@youknowriad closed this Sep 7, 2018
@youknowriad deleted the add/parser-tools branch September 7, 2018 12:06