Refactor to remove regular expressions #96
Conversation
```diff
@@ -84,6 +135,35 @@ var INLINE_CODE = 'inlineCode';
 var BREAK = 'break';
 var ROOT = 'root';

+/** TODO */
+function isWhiteSpace(character) {
```
This is equivalent to `/\s/.test`, right? I measured it on Node (nearly the same results, the regexp version is slightly faster), on Firefox (the regexp version is 2× faster), and on an (older) Chromium (`isWhiteSpace` is faster by ~25%). I also created a jsperf for this.

It is not clear to me why the `isWhiteSpace` version should be faster. JavaScript engines adopted JIT compilation for regular expressions a long time ago; the JITed code is basically equivalent to this if-else chain, except it can only be better and shared with other parts of the program. Expanding RegExp character classes manually into if-else chains like this is hardly an algorithmic improvement.

I just hope the main reason behind this version of `isWhiteSpace` is that it is more readable and maintainable this way. It could be.
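A minimal sketch of the two candidates being benchmarked; `isWhiteSpace` follows the diff above, the rest is illustrative. Note that `/\s/` also matches Unicode whitespace such as U+00A0, so the two are only strictly equivalent for ASCII input.

```javascript
var EXPRESSION_WHITE_SPACE = /\s/;

/* Explicit comparison chain, as in the patch (ASCII whitespace only). */
function isWhiteSpace(character) {
  return character === ' ' ||
    character === '\t' ||
    character === '\n' ||
    character === '\r' ||
    character === '\f' ||
    character === '\v';
}

/* The regular-expression version being compared against. */
function isWhiteSpaceRegExp(character) {
  return EXPRESSION_WHITE_SPACE.test(character);
}
```

A jsperf-style micro-benchmark would call both in a tight loop over the same input and compare operations per second; results differ per engine, as measured above.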
I indeed hope the new “algorithmic” approach performs better, but mostly that it will provide composability (fenced code blocks in commonmark). Also, by adding this in version 3 (entities and escapes), I hope to ease the change to full commonmark support for emphasis / strong (version 4?).

I’m hoping this will also fix the safari issues!
> I’m hoping this will also fix the safari issues!

I just ran the benchmark on JavaScriptCore and it showed that the RegExp version (`/\s/.test`) is faster than `isWhiteSpace` (5 times faster, actually). The version of JSC is rather old, though.

> I indeed hope the new “algorithmic” approach performs better, but mostly that it will provide composability

That is true, and I agree that the algorithmic approach is better than parsing with RegExps! But I think that simple one-character RegExps (such as `/\s/` or `/[0-9a-fA-F]/`) can still be used when they make the code simpler and more concise. There is nothing more “algorithmic” in

```js
if (
  char === '0' ||
  char === '1' ||
  char === '2' ||
  // ...
```

compared to `/[0-9a-fA-F]/`. It is just the same, except that when written in such an explicit form it is also prone to typos and missing cases. At the same time, I understand that for more complex character classes there is no short equivalent.
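For single-character classes like `/[0-9a-fA-F]/`, a middle ground between the regex and a literal-per-character chain is a range check on character codes. This sketch is illustrative, not code from the patch:

```javascript
var EXPRESSION_HEX = /[0-9a-fA-F]/;

/* The concise regular-expression form. */
function isHexDigitRegExp(character) {
  return EXPRESSION_HEX.test(character);
}

/* The explicit alternative: range checks on character codes, which
 * avoids the typo-prone one-literal-per-character chain. */
function isHexDigit(character) {
  var code = character.charCodeAt(0);
  return (code >= 48 && code <= 57) ||  /* 0-9 */
    (code >= 97 && code <= 102) ||      /* a-f */
    (code >= 65 && code <= 70);         /* A-F */
}
```

Both accept exactly the sixteen hexadecimal digits in either case; the range form keeps the explicit style without enumerating every character.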
I can imagine, thanks for worrying haha! I think you’re right, but we’ll have to see where things go from here. I might just revert that change once everything is transformed to the new “algorithmic” approach!

First I’m going overboard with this, then we’ll see which cases can return to `\s`-like regexes! 😄
👍 That's really not a big deal. I just wanted to share some thoughts. This effort on rewriting the parser is way more important for the project!
@eush77 Thought: maybe switching from character comparison to character-code comparison might speed things up.
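A hypothetical sketch of that suggestion: hoist the character codes into named constants and compare integers instead of one-character strings. The names are invented for illustration, not taken from the codebase:

```javascript
var TAB = 9;              /* '\t' */
var LINE_FEED = 10;       /* '\n' */
var CARRIAGE_RETURN = 13; /* '\r' */
var SPACE = 32;           /* ' '  */

/* Compare integer character codes rather than strings; the caller
 * passes `value.charCodeAt(index)` for the position being inspected. */
function isWhiteSpaceCode(code) {
  return code === SPACE ||
    code === TAB ||
    code === LINE_FEED ||
    code === CARRIAGE_RETURN;
}
```

Whether this actually beats string comparison is engine-dependent, as the benchmarks in this thread show, so it would need the same jsperf treatment.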
Force-pushed from 43caf0d to ea51a1e
Whoop whoop! 50% there 😄
Force-pushed from 63fd815 to f18ea5b
75%! 👌 The coming algorithms are a bit harder, as they use negative look-aheads for other patterns: a loose blockquote stops when a definition is found, and I’ve no idea how to handle that yet.
Force-pushed from 417dd75 to aae11c3

Force-pushed from aae11c3 to 9b5ae3c
92.85714286% 👌
Force-pushed from 9b5ae3c to c56b122
I’ve just rewritten the list-parsing algorithms, and they’re such a mess. I mean, pedantic and gfm are horrible. I also found several spots where the algorithms weren’t sufficient yet, and marked them in the code for future changes. Also: I still have to check the coverage and investigate whether any lines are never used.
Force-pushed from 9ade401 to 63c150b
The parser is completely rewritten, and some fixes ensued. 100% coverage is back, meaning I’m close to pushing this. I just need to ensure the tests will pass by updating the mdast plug-ins.
Force-pushed from 63c150b to 546d074
...in favour of an algorithmic approach.

New features:

* Add support for `silent` mode on tokenisers, which will detect whether a given value would match that tokeniser, but without eating any actual content.
* Plugging into the parser is significantly improved.
* Although not a feature, performance is increased by 50%.

Fixes:

* Better handling of mismatched parentheses in links;
* Continued block-quotes in non-GFM-mode;
* Block-quotes followed by lazy definitions in commonmark-mode;
* Malformed HTML block elements are no longer supported;
* Bug where GFM’s literal URL detection could detect e-mail addresses without an at-character;
* Bug where `mailto:` literal URLs were not properly stripped of their protocol.

Closes GH-96.

[ci skip]
Force-pushed from ba379bb to 2b005b7
...in favour of an algorithmic approach.

New features:

* Add support for `silent` mode on tokenisers, which will detect whether a given value would match that tokeniser, but without eating any actual content;
* Plugging into the parser is significantly improved;
* Searching for upcoming inline nodes is improved by allowing `locator`s to find possible places where nodes start.

Fixes:

* Better handling of mismatched parentheses in links;
* Continued block-quotes in non-GFM-mode;
* Block-quotes followed by lazy definitions in commonmark-mode;
* Malformed HTML block elements are no longer supported;
* Bug where GFM’s literal URL detection could detect e-mail addresses without an at-character;
* Bug where `mailto:` literal URLs were not properly stripped of their protocol.

Todo:

* A lot of places support escapes (a backslash followed by another character); instead of allowing any character to be escaped, only certain characters should be supported (e.g., all ASCII characters in commonmark-mode, and several separate subsets in other flavours).

Closes GH-96.

[ci skip]
Force-pushed from 4185591 to 6548415
Force-pushed from 6548415 to 6d13eb5
Whoop 🎉 Successfully built on the
Closed by 25a26f2.
Refactor to remove regular expressions

...in favour of an algorithmic approach.

New features:

* `silent` mode on tokenisers, which will detect whether a given value would match that tokeniser, but without eating any actual content;
* Plugging into the parser is significantly improved;
* Searching for upcoming inline nodes is improved by allowing `locator`s to find possible places where nodes start.

Fixes:

* Better handling of mismatched parentheses in links;
* Continued block-quotes in non-GFM-mode;
* Block-quotes followed by lazy definitions in commonmark-mode;
* Malformed HTML block elements are no longer supported;
* Bug where GFM’s literal URL detection could detect e-mail addresses without an at-character;
* Bug where `mailto:` literal URLs were not properly stripped of their protocol.

Todo:

* A lot of places support escapes (a backslash followed by another character); instead of allowing any character to be escaped, only certain characters should be supported (e.g., all ASCII characters in commonmark-mode, and several separate subsets in other flavours).
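The `silent` tokeniser mode described above could look roughly like this. The tokeniser name, signature, and node shape here are hypothetical illustrations, not remark’s actual API: the point is only that a silent call answers “would this match?” without consuming input, which is what makes look-ahead (e.g., a blockquote stopping at a definition) composable.

```javascript
/* Hypothetical tokeniser sketch: when `silent` is true, only report
 * whether the value would match; eat nothing and create no node. */
function tokenizeBlockquote(value, silent) {
  var matches = value.charAt(0) === '>';

  if (silent) {
    return matches; /* detect only: no content is eaten */
  }

  if (!matches) {
    return undefined;
  }

  /* A real tokeniser would eat the marker and recurse into the
   * quoted content; here we just build a trivial node. */
  return {type: 'blockquote', value: value.slice(1).trim()};
}
```

Another tokeniser can then call `tokenizeBlockquote(rest, true)` as a cheap look-ahead before deciding whether to continue its own match.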