Fix source positions for inlines. #298

chriszielinski · 2019-04-26T18:53:36Z

This PR fixes the source positions for:

Inlines inside inconsistently indented blocks.
Links/images.
Autolinks.
Escaped newlines.
Non-matched backticks.

Fixing the source positions for inlines inside inconsistently indented blocks is accomplished by maintaining the leading trivia when constructing the block (see add_line in blocks.c), and removing it during the inline parsing stage (see changes in inlines.c). The three source position tests added demonstrate the current implementation's shortfalls.

These changes are minimally invasive, and thus, slightly degrade performance—due to stripping the leading whitespace during the inline parsing phase. This can be avoided for normal parse/render operations by wrapping the functionality introduced in this PR with the conditional if (CMARK_OPT_SOURCEPOS & options). However, this will lead to the output of invalid source positions when the CMARK_OPT_SOURCEPOS option is specified for a rendering operation (i.e. cmark_render_xml) but not the original parsing operation (i.e. cmark_parse_document).

…urce position test case.

chriszielinski · 2019-05-08T20:52:36Z

The performance penalty encountered when running the benchmark 100 times is ~14ms to the mean execution time (and ~10ms to the median). Given that benchinput.md is 110.6 MB containing 1,484,151 lines (including blank lines), this is a minuscule penalty.

That being said, I still advocate only using this functionality when the CMARK_OPT_SOURCEPOS option is specified for the original parsing operation (as described above).

As an additional thought, aside from complexity, the overhead of introducing some other means of tracking the starting columns for each input line #296 (comment) may have a worse penalty than just stripping whitespace—thoughts?

System:

OS: macOS Mojave (10.14.4)
Model Identifier: MacBookPro12,1
Processor: 2.7 GHz Intel Core i5
Memory: 8 GB 1867 MHz DDR3

This PR

sudo renice -10 $$; make bench
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5747, median = 1.5700, stdev = 0.0258

cmark master (a61c4902f07789d40a717daef710da29e7899485)

sudo renice -10 $$; make bench
Password:
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5607, median = 1.5600, stdev = 0.0261

jgm · 2019-05-10T03:28:16Z

It's a small enough performance difference that I don't think we need to worry about it.

There's still the issue whether to make the parser, as well as the renderer, sensitive to CMARK_OPT_SOURCEPOS. Is there any reason, besides performance, to do this?

chriszielinski · 2019-05-14T19:07:26Z

Not ostensibly. The only other reason supporting sensitivity is the isolation it affords from any potentially introduced side effects of the functionality. That being said, if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

…end condition.

Applied patch from commonmark/cmark#298

martincizek · 2020-04-13T13:07:01Z

if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

@jgm @chriszielinski There is high confidence that the patch breaks nothing while improving a lot. Some minor sourcepos issues likely remain, but the current status is just broken. Will explain below together with some more rationale why this patch is critical for some users and good for cmark's reputation.

Why the confidence:

The existing tests passes before and after applying the patch - for the sake of completeness.
When I try to hardcode SOURCEPOS option value on parser only (cmark_parser_new_with_mem), the tests pass both with this flag asserted and cleared. Which does not mean much more than the previous point, as this patch does not evaluate it. It means the already existing code sourcepos is not test-covered. :)
We had made an integration test on around ~250k Markdown texts, where we tried to locate all node delimiters using sourcepos. It is based on patched cmark-gfm, but this patch was included without changes. See the results in cmark-gfm-sourcepos-histogram.md.txt. My interpretation:
- Cmark parsing is likely as expected even on large scale, because the delimiter chars located using sourcepos are as expected.
- This mass test actually cross-checks both parsing and sourcepos generating and would not be even possible without correct sourcepos.
- The sourcepos implementation for links and footnote definitions should be checked. But this does not worsen the current situation.

I think the histogram speaks for itself and I was actually impressed by the resulting accuracy, especially because most of the texts contain UTF-8 outside US-ASCII.

Why to merge this patch: It allows valuable use cases as our GfmEditor. This tool actually serves for migrating from Redcarpet :). Currently part of a Redmine plugin, but we can make it a separate tool for users and the uses cases can go beyond Redcarpet migration. You can check its description here.

Why not to wait (my 2 cents):

The confidence in not breaking things has more sources than unit tests and leak check.
As agreed above, the perfomance penalty is not an issue.
Sourcepos is a game-changing feature for some users, but it is currently just broken
And I totally understand that CMARK_OPT_SOURCEPOS should be more considered as a parser option and it should be done subsequently, but ironically it is a reason not to wait, as the option can't even be communicated to the users atm. The current situation is:
- When I do not enable CMARK_OPT_SOURCEPOS on parser, sourcepos is broken on all inlines.
- When I do enable CMARK_OPT_SOURCEPOS on parser, sourcepos gets improved on code, but remains broken on all inlines.

Hope it won't be turned down because our work was made on the cmark-gfm fork. But everything except extensions applies to this repo as well.

Thank you for your work and let me know if I can help with something. :)

chriszielinski added 4 commits April 26, 2019 10:21

Add cmark_strbuf_remove that removes a subrange of characters.

8d6189a

Fix source positions for inlines inside inconsistently indented blocks.

e3d0624

Add three additional source position tests.

9e40723

Fix outdated expected test result.

01dd296

jgm mentioned this pull request Apr 28, 2019

Source position seems off #296

Open

chriszielinski added 14 commits April 30, 2019 19:37

Fix autolink source position.

0df0155

Fix expected autolink test fixture, and add an additional autolink so…

a0b8f1e

…urce position test case.

Fix source position of setext headings.

923d781

Add setext heading test case.

8da9153

Fix source position for ATX-style headings.

7bc130c

Add ATX-style heading source position test case.

8566ec6

Fix HTMl block source position.

d16c3d0

Add HTML block source position test case.

33f3451

Fix thematic break source position.

52b5ec8

Add thematic break source position test case.

3b10dbc

Fix ending source position for lists and list items.

3b20016

Add list/list item source position test case.

d15b5d8

Remove commented out code.

7125696

Correct list source position.

a8ba567

chriszielinski added 3 commits May 15, 2019 08:22

Fix source position for HTML blocks without a matching end condition.

62f4c3f

Add test case for source position of a HTML block without a matching …

b37b674

…end condition.

Fix line length.

405fee1

sandratatarevicova pushed a commit to orchitech/commonmarker that referenced this pull request Apr 8, 2020

Fix source positions for inlines

8e1a5b8

Applied patch from commonmark/cmark#298

kivikakk mentioned this pull request Sep 21, 2022

[BUG] Incorrect position of one-line HTML comment tags kivikakk/cmark-gfm-hs#23

Open

shp7724 mentioned this pull request Jun 23, 2024

Strikethrough range incorrectly detected for multiline text swiftlang/swift-markdown#183

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix source positions for inlines. #298

Fix source positions for inlines. #298

chriszielinski commented Apr 26, 2019 •

edited

Loading

chriszielinski commented May 8, 2019

jgm commented May 10, 2019

chriszielinski commented May 14, 2019

martincizek commented Apr 13, 2020 •

edited

Loading

Fix source positions for inlines. #298

Are you sure you want to change the base?

Fix source positions for inlines. #298

Conversation

chriszielinski commented Apr 26, 2019 • edited Loading

chriszielinski commented May 8, 2019

System:

This PR

cmark master (a61c4902f07789d40a717daef710da29e7899485)

jgm commented May 10, 2019

chriszielinski commented May 14, 2019

martincizek commented Apr 13, 2020 • edited Loading

chriszielinski commented Apr 26, 2019 •

edited

Loading

martincizek commented Apr 13, 2020 •

edited

Loading