
Parser: Add a Jest test to build and compare language-agnostic parsers #6030

Closed
wants to merge 5 commits

Conversation

@youknowriad (Contributor)

The idea of this PR is to provide the tools (validation and performance measurement) needed to build block parsers.

This adds a Jest test that accepts a binary (via an environment variable) as its entry point, parses a list of fixtures using the given binary, and compares the results with snapshots.

To test this, you can generate the PEG.js-based binary by running `npm run build-parser` and then run the test like this:

```sh
BINARY=test/parser/bin/peg-parser.js npm run test-parser
```

To build and test your own custom parser, you need to build a binary (using any language: PHP, JS, ...). This binary should accept two arguments: the first is the input file (an HTML file) and the second is the file where the JSON output should be written. Something like this: `mybinary myfixture.html myfixture.json`.

And then you can test your parser like this:

```sh
BINARY=mybinary npm run test-parser
```

Right now I've only added a small file as a fixture; we'd have to add a bunch of examples, including long posts, to compare performance properly.
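To make the two-argument contract above concrete, here is a minimal Node stub (illustrative only; the output is a placeholder, not a spec-compliant block parse):

```js
#!/usr/bin/env node
// Hypothetical stub: read the HTML fixture passed as the first
// argument and write a JSON result to the path passed as the second.
const fs = require( 'fs' );

const [ inputFile, outputFile ] = process.argv.slice( 2 );
const html = fs.readFileSync( inputFile, 'utf8' );

// A real parser would produce the block list here.
const result = [ { placeholder: true, length: html.length } ];

fs.writeFileSync( outputFile, JSON.stringify( result ) );
```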

@youknowriad added the [Feature] Parsing label Apr 6, 2018
@youknowriad self-assigned this Apr 6, 2018
@dmsnell (Member) left a comment


Thanks for jumping on this!

`binary` is somewhat misleading. We're not asking for a binary, and we'll rarely provide one. What we're asking for is an executable path, and the goal is to run a parser, right? Why not call it something like that?

Second, I guess there's value in running this against our fixtures, but I find the need to store the results of the parse in those files a bit dubious. These tests are for comparing parser output, so maybe we should just run both parsers each time we test and forget about saving the results; the provided parser need only accept the input document on stdin and produce the parse on stdout. Note that if we rely on the snapshots, we add needless obstacles for anyone who wants to change the parser while making sure they can replicate the change in their home-built parser: we'd have to update the snapshot every time we change the official PEG.
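As a sketch of that stdin/stdout contract (hypothetical code, not part of this PR), a provided parser could be as small as:

```js
#!/usr/bin/env node
// Hypothetical sketch: read the document from stdin, write the parse
// as JSON to stdout.
let input = '';
process.stdin.setEncoding( 'utf8' );
process.stdin.on( 'data', ( chunk ) => {
	input += chunk;
} );
process.stdin.on( 'end', () => {
	const blocks = []; // a real parser would populate this from `input`
	process.stdout.write( JSON.stringify( blocks ) );
} );
```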

To that end we can eliminate the new `test/parser/bin/peg-parser.js` file and the Jest test, or at least simplify it (no file access, no rimraf, no snapshots, etc.).

Further, with snapshot testing we're not testing that the parsers produce equal output under the kind of equality we care about; we're testing for a byte-for-byte equality that doesn't matter. We don't care if the output objects for each block have keys in differing orders, or if we produce an empty `attrs` object vs. omitting it.

I propose that we focus on testing the semantic equality of the parsers, so that we don't break trust in the tests.
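For instance (a sketch of one possible approach, not code from this PR), both parses could be canonicalized before a deep-equality check, so that key order and empty-vs-omitted `attrs` no longer cause spurious failures:

```js
// Recursively sort object keys and drop empty `attrs` objects so that
// two semantically equal parses compare as equal.
function canonicalize( value ) {
	if ( Array.isArray( value ) ) {
		return value.map( canonicalize );
	}
	if ( value && typeof value === 'object' ) {
		return Object.keys( value )
			.sort()
			.filter( ( key ) => {
				const v = value[ key ];
				// Treat an empty attrs object as if it were omitted.
				return ! (
					key === 'attrs' &&
					v &&
					typeof v === 'object' &&
					Object.keys( v ).length === 0
				);
			} )
			.reduce( ( acc, key ) => {
				acc[ key ] = canonicalize( value[ key ] );
				return acc;
			}, {} );
	}
	return value;
}

// e.g. expect( canonicalize( candidate ) ).toEqual( canonicalize( reference ) );
```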

Did we consider providing this outside of the Gutenberg repo? I can see several benefits to doing that, in terms of keeping the parser development cycle lean and the tests easy to understand and discover. Ideally, I think we could expose this as a web service where someone pastes in their code, we execute it, and we provide an easily navigable results panel. Think of something like REPL.it with a diff output.

Anyway, I think a web service is down the road, but having a separate repo could be valuable now.

thanks!

@youknowriad (Contributor, Author)

> `binary` is somewhat misleading. We're not asking for a binary, and we'll rarely provide one. What we're asking for is an executable path, and the goal is to run a parser, right? Why not call it something like that?

Agreed, maybe I can change it to `PARSER`.

> Note that if we rely on the snapshots, we add needless obstacles for anyone who wants to change the parser while making sure they can replicate the change in their home-built parser

They don't need to replicate anything; the idea is that they'll run the tests with the tool provided here, and it will automatically check against the provided snapshots.

> we'd have to update the snapshot every time we change the official PEG.

Yes, and that's fine, because we want to validate that the output is similar across all parsers. I guess the alternative would be to treat the PEG parser as the source of truth instead of the snapshots. I don't think that's a great idea: we should look at the PEG parser as just another parser, and avoid running it during other parsers' test runs because it can have an impact on performance (misleading the results).

> Further, with snapshot testing we're not testing that the parsers produce equal output under the kind of equality we care about; we're testing for a byte-for-byte equality that doesn't matter. We don't care if the output objects for each block have keys in differing orders, or if we produce an empty `attrs` object vs. omitting it.

Running `JSON.parse` before comparing against the snapshots helps eliminate formatting differences, but it's true that the order of the keys may change. We could tweak the test to order the JSON keys, but I don't think it's critical.

> Did we consider providing this outside of the Gutenberg repo? I can see several benefits to doing that, in terms of keeping the parser development cycle lean and the tests easy to understand and discover. Ideally, I think we could expose this as a web service where someone pastes in their code, we execute it, and we provide an easily navigable results panel. Think of something like REPL.it with a diff output.

Sounds like a good idea, though let's not create another repository right now. Let's focus on getting the tool right; we can extract it later if needed (the same way we did for other packages).

@dmsnell (Member) commented Apr 6, 2018

> They don't need to replicate anything; the idea is that they'll run the tests with the tool provided here, and it will automatically check against the provided snapshots.

Yeah, the point is that I don't see any reason to make this as complicated as we have. We're not incorporating other parsers into this project's master branch; we're just providing a tool to answer the question "is my parser (or my version of the parser) identical in its output to the official parser?"

> I guess the alternative would be to treat the PEG parser as the source of truth instead of the snapshots

I think we have to make the PEG parser the source of truth, because we have no other specification. That parser is the specification right now, whereas snapshot tests not only don't specify anything, they also don't present themselves in a readable way. They can change unexpectedly, too, and we'd be none the wiser unless we caught it in review. If the parser itself were changed, we would all know right away when the app broke.
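Concretely (a hypothetical harness sketch, not code from this PR; the paths and the `PARSER` variable name are illustrative), that would mean running both parsers on each fixture and comparing them directly, with no stored snapshots:

```js
const { execFileSync } = require( 'child_process' );
const fs = require( 'fs' );
const assert = require( 'assert' );

// Run an executable parser, feeding the fixture on stdin and reading
// the JSON parse from stdout.
function parseWith( executable, inputFile ) {
	const stdout = execFileSync( executable, {
		input: fs.readFileSync( inputFile ),
	} );
	return JSON.parse( stdout.toString() );
}

const fixture = 'test/parser/fixtures/demo-post.html'; // illustrative path
const reference = parseWith( 'test/parser/bin/peg-parser.js', fixture );
const candidate = parseWith( process.env.PARSER, fixture );

// The official PEG parser acts as the specification.
assert.deepStrictEqual( candidate, reference );
```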

@youknowriad (Contributor, Author)

The advantage of keeping the snapshots is that they also serve as tests for the PEG parser, but I'm open to change. Let's see what the others think.

@dmsnell (Member) commented Apr 6, 2018

> The advantage of keeping the snapshots is that they also serve as tests for the PEG parser, but I'm open to change. Let's see what the others think.

Yeah, that seems fine, but it also seems like something for the project itself, to catch unanticipated regressions; not an implicit and poorly-defined specification, and not a way of supporting third-party parsers. I think those are separate things.

@mcsf (Contributor) commented Apr 9, 2018

> `BINARY=mybinary npm run test-parser`

Worth noting that the environment variable isn't properly propagated, likely because the npm script defers to npx:

`npm run test-parser` → `npx jest test/parser/test.js`

Without changing the script, what works for me is exporting the variable:

```sh
export BINARY=test/parser/bin/peg-parser.js
npm run test-parser
```

(Secondly, I'd also prefer a term like parser or executable over binary.)

@youknowriad (Contributor, Author)

> Worth noting that the environment variable isn't properly propagated, likely because the npm script defers to npx:

Weird; I just tested and it works for me in both cases (npx and npm run).

@youknowriad (Contributor, Author)

I removed the snapshots and added the Gutenberg demo post as a fixture. Feel free to add other long posts as fixtures if needed.

@dmsnell mentioned this pull request Apr 16, 2018
@dmsnell (Member) commented Apr 23, 2018

Related: I built an in-browser test that accepts three inputs…

  • parser A
  • parser B
  • list of URLs of posts to parse to compare

The parsers are JS source code that will be `eval()`'d to generate the parser. That is, they should be source code that, when evaluated, returns a function that, when called, parses content:

```js
// length parser - not entirely spec-compliant to gutenberg
;( function() {
	return function( content ) {
		return content.length;
	};
} )();
```

The post list is a list of URLs, one per line, from which the `post_content` of a Gutenberg post may be downloaded.

The page will run randomized trials of each parser against the input documents and update the report as it goes. It will continue to iterate until told to stop.

https://dmsnell.github.io/gutenberg-parser-comparator/
repo

Hywan added a commit to Hywan/gutenberg-parser-rs that referenced this pull request May 7, 2018
```js
const inputFile = path.resolve( fixturesDir, f + '.html' );
const start = Date.now();
exec( `cat ${ inputFile } | ${ parserExecutable }`, ( err, parserOutput ) => {
	const end = Date.now();
```

A review comment on this hunk:

This approach counts the boot time of the `parserExecutable` in the measurement, and that time dominates. For a script written with Node.js, whatever you use inside the file (a native module, plain JS, or WASM) makes almost no difference.

A difference only becomes visible when you use another kind of executable, a native Rust binary for instance.

I think these numbers can mislead the user. Should we remove them and come up with a more robust in-browser approach?
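One way to take boot time out of the numbers (a hypothetical sketch, not part of this PR) would be for each parser to measure and report its own parse time, keeping the timing on stderr and the JSON parse on stdout:

```js
// Hypothetical: time only the parse inside the child process, so that
// interpreter/VM boot time is excluded from the reported number.
const fs = require( 'fs' );

const input = fs.readFileSync( 0, 'utf8' ); // read the document from stdin
const parse = ( html ) => []; // stand-in for the parser under test

const start = process.hrtime();
const blocks = parse( input );
const [ s, ns ] = process.hrtime( start );

process.stderr.write( `parse-ms: ${ s * 1000 + ns / 1e6 }\n` );
process.stdout.write( JSON.stringify( blocks ) );
```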

@dmsnell (Member) commented Jun 15, 2018

Rebased to merge in updates from master

@dmsnell (Member) commented Jun 17, 2018

While working on this I came across some issues that I felt may not work out well with this approach. Although they're quite incomplete, I have written down my thoughts on what one solution to our concerns could look like. Given that, do we still think this is worth merging?

@youknowriad (Contributor, Author)

Closing as superseded by other efforts

@youknowriad closed this Sep 7, 2018
@youknowriad deleted the add/parser-tools branch September 7, 2018 12:06