Refactor to remove regular expressions #96

Closed · wants to merge 1 commit from feature/algorithmic

Conversation

@wooorm (Member) commented on Nov 26, 2015

Refactor to remove regular expressions

...in favour of an algorithmic approach.

New features:

  • Add support for silent mode on tokenisers, which detects
    whether a given value would match that tokeniser without
    eating any actual content (see the sketch after this
    description).
  • Plugging into the parser is significantly improved.

Fixes:

  • Better handling of mismatched parentheses in links;
  • Continued block-quotes in non-GFM-mode;
  • Block-quotes followed by lazy definitions in commonmark-mode;
  • Malformed HTML block elements are no longer supported;
  • Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
  • Bug where mailto: literal URLs were not properly stripped
    of their protocol.

Todo:

  • A lot of places support escapes (a backslash followed by another
    character). Instead of allowing any character to be escaped, only
    certain characters should be supported (e.g., all ASCII characters
    in commonmark mode, and several separate subsets in other
    flavours).
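To illustrate the silent-mode idea, here is a minimal sketch of a tokeniser (hypothetical names and node shape, not the final API):

```js
// A hard-break tokeniser: two spaces followed by a line feed.
// `eat` consumes matched content and returns a node builder; in
// `silent` mode we only report whether `value` would match, and
// eat nothing.
function tokenizeBreak(eat, value, silent) {
  var matches = value.charAt(0) === ' ' &&
    value.charAt(1) === ' ' &&
    value.charAt(2) === '\n';

  if (!matches) {
    return false;
  }

  if (silent) {
    return true; // Would match here, but consume nothing.
  }

  return eat('  \n')({type: 'break'});
}
```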

@@ -84,6 +135,35 @@ var INLINE_CODE = 'inlineCode';
var BREAK = 'break';
var ROOT = 'root';

/** TODO */
function isWhiteSpace(character) {
@eush77 (Contributor) commented:

This is equivalent to /\s/.test, right? I measured it on Node (nearly the same results; the RegExp version is slightly faster), on Firefox (the RegExp version is 2× faster), and on (older) Chromium (isWhiteSpace is faster by ~25%). I also created a jsperf for this.
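For reference, a minimal sketch of the two forms being compared (abridged; the real \s class also matches Unicode white-space):

```js
// A RegExp character class…
var WHITE_SPACE = /\s/;

function isWhiteSpaceRegExp(character) {
  return WHITE_SPACE.test(character);
}

// …versus an explicit chain over the common white-space characters.
function isWhiteSpaceChain(character) {
  return character === ' ' ||
    character === '\t' ||
    character === '\n' ||
    character === '\r' ||
    character === '\f' ||
    character === '\v';
}
```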

It is not clear to me why the isWhiteSpace version should be faster. JavaScript engines adopted JIT compilation for regular expressions a long time ago. The JITed code will be basically equivalent to this if-else chain; it can only be better, and shared with other parts of the program. Expanding RegExp character classes manually into if-then chains like this is hardly an algorithmic improvement.

I just hope the main reason behind this version of isWhiteSpace is that it is more readable and maintainable this way. It could be.

@wooorm (Member, Author) replied:

I indeed hope the new "algorithmic" approach performs better, but mostly that it will provide composability (fenced code blocks in commonmark). Also, by adding this in version 3 (entities and escapes), I hope to ease the change to full commonmark support for emphasis / strong (version 4?).

I'm hoping this will also fix the Safari issues!

@eush77 (Contributor) replied:

> I'm hoping this will also fix the Safari issues!

I just ran the benchmark on JavaScriptCore and it showed that the RegExp version (/\s/.test) is faster than isWhiteSpace (5 times faster, actually). The version of JSC is rather old, though.

> I indeed hope the new "algorithmic" approach performs better, but mostly that it will provide composability

That is true, and I agree that an algorithmic approach is better than parsing with RegExps! But I think simple one-character RegExps (such as /\s/ or /[0-9a-fA-F]/) can still be used when they make the code simpler and more concise. There is nothing more "algorithmic" in

if (
    char === '0' ||
    char === '1' ||
    char === '2' ||
// ...

compared to /[0-9a-fA-F]/. It is exactly the same, except that such an explicit form is also prone to typos and missing cases. At the same time, I understand that for more complex character classes there is no short RegExp equivalent.

@wooorm (Member, Author) replied:

I can imagine, thanks for worrying, haha! I think you're right; we'll see. I might just revert that change, but we'll have to see where things go once everything is transformed to the new "algorithmic" approach!

First I'm going overboard with this, then we'll see which cases can return to \s-like regexes! 😄

@eush77 (Contributor) replied:

👍 That's really not a big deal. I just wanted to share some thoughts. This effort on rewriting the parser is way more important for the project!

@wooorm (Member, Author) replied:

@eush77 Thought: maybe switching from character comparison to character-code comparison might speed things up.
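Something like this, sketched (assuming we pass codes from charCodeAt instead of characters from charAt):

```js
// White-space by character code: 9 = tab, 10 = line feed,
// 11 = vertical tab, 12 = form feed, 13 = carriage return,
// 32 = space. Avoids string comparisons entirely.
function isWhiteSpaceCode(code) {
  return code === 32 || (code >= 9 && code <= 13);
}

// Usage: isWhiteSpaceCode(value.charCodeAt(index))
```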

@wooorm force-pushed the feature/algorithmic branch 6 times, most recently from 43caf0d to ea51a1e on December 2, 2015 at 17:48

wooorm commented Dec 2, 2015

Whoop whoop! 50% there 😄

@wooorm force-pushed the feature/algorithmic branch 6 times, most recently from 63fd815 to f18ea5b on December 5, 2015 at 12:51

wooorm commented Dec 5, 2015

75%! 👌

The coming algorithms are a bit harder, as they use negative look-aheads for other patterns: a loose block-quote stops when a definition is found, and I've no idea how to handle that yet.
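One possible direction, building on the new silent mode (hypothetical names; nothing settled yet):

```js
// Decide whether a line interrupts a loose block-quote by asking the
// definition tokeniser, in silent mode, whether it would match there.
// Passing `true` as the third argument checks without consuming.
function interruptsBlockquote(tokenizeDefinition, eat, rest) {
  return tokenizeDefinition(eat, rest, true) === true;
}
```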


wooorm commented Dec 7, 2015

92.85714286% 👌


wooorm commented Dec 9, 2015

I've just rewritten the list-parsing algorithms, and they're such a mess. I mean, the pedantic and gfm variants are horrible.

I also found several spots where the algorithms weren't sufficient yet, and marked them in the code for future changes.

Also: I'll still have to check the coverage and investigate whether any lines are never executed.

@wooorm force-pushed the feature/algorithmic branch 5 times, most recently from 9ade401 to 63c150b on December 12, 2015 at 18:29

wooorm commented Dec 12, 2015

The parser is completely rewritten, and some fixes ensued. 100% coverage is back, meaning I'm close to pushing this. I just need to ensure the tests pass by updating the mdast plug-ins.

@wooorm changed the title from "[WIP] Refactor to remove regular expressions" to "Refactor to remove regular expressions" on Dec 12, 2015
wooorm added a commit that referenced this pull request Dec 12, 2015
...in favour of an algorithmic approach.

New features:

*   Add support for `silent` mode on tokenisers, which will
    detect whether a given value would match that tokeniser,
    but without eating any actual content.
*   Plugging into the parser is significantly improved.
*   Although not a feature, performance is increased by 50%.

Fixes:

*   Better handling of mismatched parentheses in links;
*   Continued block-quotes in non-GFM-mode;
*   Block-quotes followed by lazy definitions in commonmark-mode;
*   Malformed HTML block elements are no longer supported;
*   Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
*   Bug where `mailto:` literal URLs were not properly stripped
    of their protocol.

Closes GH-96.

[ci skip]
@wooorm force-pushed the feature/algorithmic branch 2 times, most recently from ba379bb to 2b005b7 on December 13, 2015 at 19:20
wooorm added a commit that referenced this pull request Dec 13, 2015
...in favour of an algorithmic approach.

New features:

*   Add support for `silent` mode on tokenisers, which will
    detect whether a given value would match that tokeniser,
    but without eating any actual content.
*   Plugging into the parser is significantly improved;
*   Searching for upcoming inline nodes is improved by allowing
    `locator`s to search for possible places where nodes start
    (a sketch of a locator follows this commit message).

Fixes:

*   Better handling of mismatched parentheses in links;
*   Continued block-quotes in non-GFM-mode;
*   Block-quotes followed by lazy definitions in commonmark-mode;
*   Malformed HTML block elements are no longer supported;
*   Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
*   Bug where `mailto:` literal URLs were not properly stripped
    of their protocol.

Todo:

*   A lot of places support escapes (a backslash followed by another
    character). Instead of allowing any character to be escaped, only
    certain characters should be supported (e.g., all ASCII characters
    in commonmark mode, and several separate subsets in other
    flavours).

Closes GH-96.

[ci skip]
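To illustrate the `locator` idea from that commit message (an illustrative sketch; actual locators would live next to their tokenisers):

```js
// A locator returns the index at which its tokeniser could start
// matching (or -1), so the parser can jump straight to candidate
// positions instead of trying every tokeniser at every character.
function locateInlineCode(value, fromIndex) {
  return value.indexOf('`', fromIndex);
}
```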
@wooorm force-pushed the feature/algorithmic branch 2 times, most recently from 4185591 to 6548415 on December 14, 2015 at 17:49

wooorm commented Dec 14, 2015

Whoop 🎉 Successfully built on the next branch by updating all mdast packages 😄

@wooorm self-assigned this on Dec 14, 2015
@wooorm added this to the 3.0.0 milestone on Dec 14, 2015

wooorm commented Dec 24, 2015

Closed by 25a26f2.

@wooorm closed this on Dec 24, 2015
@wooorm deleted the feature/algorithmic branch on December 27, 2015
@wooorm added the remark and 🦋 type/enhancement ("This is great to have") labels on Jan 10, 2016