Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix source positions for inlines. #298

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

chriszielinski
Copy link

@chriszielinski chriszielinski commented Apr 26, 2019

This PR fixes the source positions for:

  • Inlines inside inconsistently indented blocks.
  • Links/images.
  • Autolinks.
  • Escaped newlines.
  • Non-matched backticks.

Fixing the source positions for inlines inside inconsistently indented blocks is accomplished by maintaining the leading trivia when constructing the block (see add_line in blocks.c), and removing it during the inline parsing stage (see changes in inlines.c). The three source position tests added demonstrate the current implementation's shortfalls.

These changes are minimally invasive, and thus, slightly degrade performance—due to stripping the leading whitespace during the inline parsing phase. This can be avoided for normal parse/render operations by wrapping the functionality introduced in this PR with the conditional if (CMARK_OPT_SOURCEPOS & options). However, this will lead to the output of invalid source positions when the CMARK_OPT_SOURCEPOS option is specified for a rendering operation (i.e. cmark_render_xml) but not the original parsing operation (i.e. cmark_parse_document).

@jgm jgm mentioned this pull request Apr 28, 2019
@chriszielinski
Copy link
Author

The performance penalty encountered when running the benchmark 100 times is ~14ms to the mean execution time (and ~10ms to the median). Given that benchinput.md is 110.6 MB containing 1,484,151 lines (including blank lines), this is a minuscule penalty.

That being said, I still advocate only using this functionality when the CMARK_OPT_SOURCEPOS option is specified for the original parsing operation (as described above).

As an additional thought, aside from complexity, the overhead of introducing some other means of tracking the starting columns for each input line #296 (comment) may have a worse penalty than just stripping whitespace—thoughts?

System:

  • OS: macOS Mojave (10.14.4)
  • Model Identifier: MacBookPro12,1
  • Processor: 2.7 GHz Intel Core i5
  • Memory: 8 GB 1867 MHz DDR3

This PR

sudo renice -10 $$; make bench
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5747, median = 1.5700, stdev = 0.0258

cmark master (a61c4902f07789d40a717daef710da29e7899485)

sudo renice -10 $$; make bench
Password:
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5607, median = 1.5600, stdev = 0.0261

@jgm
Copy link
Member

jgm commented May 10, 2019

It's a small enough performance difference that I don't think we need to worry about it.

There's still the issue whether to make the parser, as well as the renderer, sensitive to CMARK_OPT_SOURCEPOS. Is there any reason, besides performance, to do this?

@chriszielinski
Copy link
Author

Not ostensibly. The only other reason supporting sensitivity is the isolation it affords from any potentially introduced side effects of the functionality. That being said, if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

sandratatarevicova pushed a commit to orchitech/commonmarker that referenced this pull request Apr 8, 2020
@martincizek
Copy link

martincizek commented Apr 13, 2020

if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

@jgm @chriszielinski There is high confidence that the patch breaks nothing while improving a lot. Some minor sourcepos issues likely remain, but the current status is just broken. Will explain below together with some more rationale why this patch is critical for some users and good for cmark's reputation.

Why the confidence:

  • The existing tests passes before and after applying the patch - for the sake of completeness.
  • When I try to hardcode SOURCEPOS option value on parser only (cmark_parser_new_with_mem), the tests pass both with this flag asserted and cleared. Which does not mean much more than the previous point, as this patch does not evaluate it. It means the already existing code sourcepos is not test-covered. :)
  • We had made an integration test on around ~250k Markdown texts, where we tried to locate all node delimiters using sourcepos. It is based on patched cmark-gfm, but this patch was included without changes. See the results in cmark-gfm-sourcepos-histogram.md.txt. My interpretation:
    • Cmark parsing is likely as expected even on large scale, because the delimiter chars located using sourcepos are as expected.
    • This mass test actually cross-checks both parsing and sourcepos generating and would not be even possible without correct sourcepos.
    • The sourcepos implementation for links and footnote definitions should be checked. But this does not worsen the current situation.

I think the histogram speaks for itself and I was actually impressed by the resulting accuracy, especially because most of the texts contain UTF-8 outside US-ASCII.

Why to merge this patch: It allows valuable use cases as our GfmEditor. This tool actually serves for migrating from Redcarpet :). Currently part of a Redmine plugin, but we can make it a separate tool for users and the uses cases can go beyond Redcarpet migration. You can check its description here.

Why not to wait (my 2 cents):

  • The confidence in not breaking things has more sources than unit tests and leak check.
  • As agreed above, the perfomance penalty is not an issue.
  • Sourcepos is a game-changing feature for some users, but it is currently just broken
  • And I totally understand that CMARK_OPT_SOURCEPOS should be more considered as a parser option and it should be done subsequently, but ironically it is a reason not to wait, as the option can't even be communicated to the users atm. The current situation is:
    • When I do not enable CMARK_OPT_SOURCEPOS on parser, sourcepos is broken on all inlines.
    • When I do enable CMARK_OPT_SOURCEPOS on parser, sourcepos gets improved on code, but remains broken on all inlines.

Hope it won't be turned down because our work was made on the cmark-gfm fork. But everything except extensions applies to this repo as well.

Thank you for your work and let me know if I can help with something. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants