Refactor to remove regular expressions #96

Closed · wants to merge 1 commit from feature/algorithmic

Conversation

@wooorm (Member) commented on Nov 26, 2015

Refactor to remove regular expressions

...in favour of an algorithmic approach.

New features:

  • Add support for silent mode on tokenisers, which detects
    whether a given value would match that tokeniser without
    eating any actual content (see the sketch after this
    description).
  • Plugging into the parser is significantly improved.

Fixes:

  • Better handling of mismatched parentheses in links;
  • Continued block-quotes in non-GFM-mode;
  • Block-quotes followed by lazy definitions in commonmark-mode;
  • Malformed HTML block elements are no longer supported;
  • Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
  • Bug where mailto: literal URLs were not properly stripped
    of their protocol.

Todo:

  • A lot of places support escapes (a backslash followed by another
    character). Instead of allowing any character to be escaped, only
    certain characters should be supported (e.g., all ASCII characters
    in commonmark mode, and several separate subsets in other
    flavours).
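To illustrate the silent-mode idea, here is a minimal sketch of a tokeniser (hypothetical names and node shape, not the final API):

```js
// A hard-break tokeniser: two spaces followed by a line feed.
// `eat` consumes matched content and returns a node builder; in
// `silent` mode we only report whether `value` would match, and
// eat nothing.
function tokenizeBreak(eat, value, silent) {
  var matches = value.charAt(0) === ' ' &&
    value.charAt(1) === ' ' &&
    value.charAt(2) === '\n';

  if (!matches) {
    return false;
  }

  if (silent) {
    return true; // Would match here, but consume nothing.
  }

  return eat('  \n')({type: 'break'});
}
```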

@@ -84,6 +135,35 @@ var INLINE_CODE = 'inlineCode';
var BREAK = 'break';
var ROOT = 'root';

/** TODO */
function isWhiteSpace(character) {
@eush77 (Contributor) commented:

This is equivalent to /\s/.test, right? I measured it on Node (nearly the same results; the RegExp version is slightly faster), on Firefox (the RegExp version is 2× faster), and on (older) Chromium (isWhiteSpace is faster by ~25%). I also created a jsperf for this.
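For reference, a minimal sketch of the two forms being compared (abridged; the real \s class also matches Unicode white-space):

```js
// A RegExp character class…
var WHITE_SPACE = /\s/;

function isWhiteSpaceRegExp(character) {
  return WHITE_SPACE.test(character);
}

// …versus an explicit chain over the common white-space characters.
function isWhiteSpaceChain(character) {
  return character === ' ' ||
    character === '\t' ||
    character === '\n' ||
    character === '\r' ||
    character === '\f' ||
    character === '\v';
}
```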

It is not clear to me why the isWhiteSpace version should be faster. JavaScript engines adopted JIT compilation for regular expressions a long time ago. The JITed code will be basically equivalent to this if-else chain; it can only be better, and shared with other parts of the program. Expanding RegExp character classes manually into if-then chains like this is hardly an algorithmic improvement.

I just hope the main reason behind this version of isWhiteSpace is that it is more readable and maintainable this way. It could be.

@wooorm (Member, Author) replied:

I indeed hope the new "algorithmic" approach performs better, but mostly that it will provide composability (fenced code blocks in commonmark). Also, by adding this in version 3 (entities and escapes), I hope to ease the change to full commonmark support for emphasis / strong (version 4?).

I'm hoping this will also fix the Safari issues!

@eush77 (Contributor) replied:

> I'm hoping this will also fix the Safari issues!

I just ran the benchmark on JavaScriptCore and it showed that the RegExp version (/\s/.test) is faster than isWhiteSpace (5 times faster, actually). The version of JSC is rather old, though.

> I indeed hope the new "algorithmic" approach performs better, but mostly that it will provide composability

That is true, and I agree that an algorithmic approach is better than parsing with RegExps! But I think simple one-character RegExps (such as /\s/ or /[0-9a-fA-F]/) can still be used when they make the code simpler and more concise. There is nothing more "algorithmic" in

if (
    char === '0' ||
    char === '1' ||
    char === '2' ||
// ...

compared to /[0-9a-fA-F]/. It is exactly the same, except that such an explicit form is also prone to typos and missing cases. At the same time, I understand that for more complex character classes there is no short RegExp equivalent.

@wooorm (Member, Author) replied:

I can imagine, thanks for worrying, haha! I think you're right; we'll see. I might just revert that change, but we'll have to see where things go once everything is transformed to the new "algorithmic" approach!

First I'm going overboard with this, then we'll see which cases can return to \s-like regexes! 😄

@eush77 (Contributor) replied:

👍 That's really not a big deal. I just wanted to share some thoughts. This effort on rewriting the parser is way more important for the project!

@wooorm (Member, Author) replied:

@eush77 Thought: maybe switching from character comparison to character-code comparison might speed things up.
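Something like this, sketched (assuming we pass codes from charCodeAt instead of characters from charAt):

```js
// White-space by character code: 9 = tab, 10 = line feed,
// 11 = vertical tab, 12 = form feed, 13 = carriage return,
// 32 = space. Avoids string comparisons entirely.
function isWhiteSpaceCode(code) {
  return code === 32 || (code >= 9 && code <= 13);
}

// Usage: isWhiteSpaceCode(value.charCodeAt(index))
```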

@wooorm force-pushed the feature/algorithmic branch 6 times, most recently from 43caf0d to ea51a1e on December 2, 2015 at 17:48

wooorm commented Dec 2, 2015

Whoop whoop! 50% there 😄

@wooorm force-pushed the feature/algorithmic branch 6 times, most recently from 63fd815 to f18ea5b on December 5, 2015 at 12:51

wooorm commented Dec 5, 2015

75%! 👌

The coming algorithms are a bit harder, as they use negative look-aheads for other patterns: a loose block-quote stops when a definition is found, and I've no idea how to handle that yet.
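One possible direction, building on the new silent mode (hypothetical names; nothing settled yet):

```js
// Decide whether a line interrupts a loose block-quote by asking the
// definition tokeniser, in silent mode, whether it would match there.
// Passing `true` as the third argument checks without consuming.
function interruptsBlockquote(tokenizeDefinition, eat, rest) {
  return tokenizeDefinition(eat, rest, true) === true;
}
```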


wooorm commented Dec 7, 2015

92.85714286% 👌


wooorm commented Dec 9, 2015

I've just rewritten the list-parsing algorithms, and they're such a mess. I mean, the pedantic and gfm variants are horrible.

I also found several spots where the algorithms weren't sufficient yet, and marked them in the code for future changes.

Also: I'll still have to check the coverage and investigate whether any lines are never executed.

@wooorm force-pushed the feature/algorithmic branch 5 times, most recently from 9ade401 to 63c150b on December 12, 2015 at 18:29

wooorm commented Dec 12, 2015

The parser is completely rewritten, and some fixes ensued. 100% coverage is back, meaning I'm close to pushing this. I just need to ensure the tests pass by updating the mdast plug-ins.

@wooorm changed the title from "[WIP] Refactor to remove regular expressions" to "Refactor to remove regular expressions" on Dec 12, 2015
wooorm added a commit that referenced this pull request Dec 12, 2015
...in favour of an algorithmic approach.

New features:

*   Add support for `silent` mode on tokenisers, which will
    detect whether a given value would match that tokeniser,
    but without eating any actual content.
*   Plugging into the parser is significantly improved.
*   Although not a feature, performance is increased by 50%.

Fixes:

*   Better handling of mismatched parentheses in links;
*   Continued block-quotes in non-GFM-mode;
*   Block-quotes followed by lazy definitions in commonmark-mode;
*   Malformed HTML block elements are no longer supported;
*   Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
*   Bug where `mailto:` literal URLs were not properly stripped
    of their protocol.

Closes GH-96.

[ci skip]
@wooorm force-pushed the feature/algorithmic branch 2 times, most recently from ba379bb to 2b005b7 on December 13, 2015 at 19:20
wooorm added a commit that referenced this pull request Dec 13, 2015
...in favour of an algorithmic approach.

New features:

*   Add support for `silent` mode on tokenisers, which will
    detect whether a given value would match that tokeniser,
    but without eating any actual content.
*   Plugging into the parser is significantly improved;
*   Searching for upcoming inline nodes is improved by allowing
    `locator`s to search for possible places where nodes start
    (a sketch of a locator follows this commit message).

Fixes:

*   Better handling of mismatched parentheses in links;
*   Continued block-quotes in non-GFM-mode;
*   Block-quotes followed by lazy definitions in commonmark-mode;
*   Malformed HTML block elements are no longer supported;
*   Bug where GFM’s literal URL detection could detect e-mail addresses
    without an at-character;
*   Bug where `mailto:` literal URLs were not properly stripped
    of their protocol.

Todo:

*   A lot of places support escapes (a backslash followed by another
    character). Instead of allowing any character to be escaped, only
    certain characters should be supported (e.g., all ASCII characters
    in commonmark mode, and several separate subsets in other
    flavours).

Closes GH-96.

[ci skip]
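To illustrate the `locator` idea from that commit message (an illustrative sketch; actual locators would live next to their tokenisers):

```js
// A locator returns the index at which its tokeniser could start
// matching (or -1), so the parser can jump straight to candidate
// positions instead of trying every tokeniser at every character.
function locateInlineCode(value, fromIndex) {
  return value.indexOf('`', fromIndex);
}
```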
@wooorm force-pushed the feature/algorithmic branch 2 times, most recently from 4185591 to 6548415 on December 14, 2015 at 17:49

wooorm commented Dec 14, 2015

Whoop 🎉 Successfully built on the next branch by updating all mdast packages 😄

@wooorm self-assigned this on Dec 14, 2015
@wooorm added this to the 3.0.0 milestone on Dec 14, 2015

wooorm commented Dec 24, 2015

Closed by 25a26f2.

@wooorm closed this on Dec 24, 2015
@wooorm deleted the feature/algorithmic branch on December 27, 2015
@wooorm added the remark and 🦋 type/enhancement ("This is great to have") labels on Jan 10, 2016