fix some emphasis parsing #269

tatchi · 2022-05-13T05:19:16Z

Not ready for review yet. Opening it as a draft to give an overview of what I'm currently working on.

tatchi · 2022-05-19T19:35:03Z

Okay, I think I managed to fix all the broken tests mentioned in #244. At least tests are passing now and nothing else broke 😁

It's the first time I'm working on a parser, and that was not easy 🥵😅

Based on the expected outcome, I first tried to guess what the solution should be, until I told myself that it would probably be a good idea to read the spec. It's the first time I've read a spec (or rather a part of one), and that was very helpful! 😁

There are 3 main commits:

b52495a that fixes 410, 411, 414 and 428
cceeb2c that fixes 415 and 416
7e6a712 that fixes 468 and 469

I'm not very happy with the code of the first and third commits, so I'll see if I can come up with something better. I'm planning on creating a separate PR for each of them so we can have a dedicated discussion for each. The fixes are different anyway, so I think it makes sense.

sonologico · 2022-05-23T16:53:40Z

This is great. Thank you. Splitting PRs would be good.

shonfeder · 2022-05-23T20:28:06Z

Yeah, this is really great work!

I agree that refactoring it into independent PRs could help keep things very clean.

One heads-up: we should be aware of the potential for conflicts with #266. I'll work on trying to get that merged in the next day or two, which should help clear up that risk one way or another :)

shonfeder · 2022-05-24T01:47:44Z

I would be fine with reviewing this in the current PR. While I appreciate the intention to keep the PRs clean and focused, this group of spec violations was already grouped into a single issue, and the changeset here is small enough, and closely enough related, that I think it makes sense to review it in one PR.

tests/extract_tests.ml

shonfeder

Left some initial questions and suggestions.

I think your additions generally read OK and fit with the existing code around them.

My biggest ask is that you provide comments to explain the logic you're introducing. This is something we are sorely lacking in the existing parsing code, and it would be awesome to start sharing the knowledge while it is fresh in people's minds after they have to reconstruct it from the code.

All of my other suggestions here are negotiable if you have strong feelings that point in another direction.

shonfeder · 2022-05-24T01:58:40Z

src/parser.ml

+                is_opener x
+                && (n1 + n2) mod 3 = 0
+                && n1 mod 3 != 0
+                && n2 mod 3 != 0


Could you either add a comment explaining the logic for this conditional, or define a predicate that provides a "self-documenting" explanation of what is being tested in this condition?

tatchi

I added an is_match function with a description coming from the spec: https://github.com/ocaml/omd/pull/269/files#diff-2f56b48efe4c0b849c5b527b223a6c8ab85abd599e10f8b465f5c9801449a468R921

Not convinced by the function's name though but I couldn't come up with anything better. Open to any better alternative :)
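
For readers following along, the spec rule behind this check (CommonMark's "multiple of 3" rule for emphasis delimiters) can be sketched as a standalone predicate. This is only an illustration: the name can_match and the either_both_flanking flag are made up here and are not the is_match function from the PR. Under that assumption, the conditional quoted above appears to be the negation of this predicate for the case where one delimiter can both open and close.

```ocaml
(* Hypothetical helper, NOT omd's actual [is_match].
   CommonMark: if one of the delimiters can both open and close emphasis,
   the sum of the lengths of the two delimiter runs must not be a multiple
   of 3, unless both lengths are multiples of 3. *)
let can_match ~either_both_flanking opener_len closer_len =
  if not either_both_flanking then true
  else
    (opener_len + closer_len) mod 3 <> 0
    || (opener_len mod 3 = 0 && closer_len mod 3 = 0)

let () =
  (* lengths 1 and 2: the sum is 3, so no match *)
  assert (not (can_match ~either_both_flanking:true 1 2));
  (* lengths 2 and 2: the sum is 4, so they can match *)
  assert (can_match ~either_both_flanking:true 2 2);
  (* lengths 3 and 3: the sum is 6, but both are multiples of 3 *)
  assert (can_match ~either_both_flanking:true 3 3);
  (* the restriction only applies when a delimiter can both open and close *)
  assert (can_match ~either_both_flanking:false 1 2)
```

The sketch uses `<>` (structural inequality) rather than `!=`; for integers the two behave the same, but `<>` is the more conventional choice in OCaml.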

shonfeder · 2022-05-24T02:07:18Z

src/parser.ml

+                let xs =
+                  if n1 >= 2 && n2 >= 2 then
+                    if n2 > 2 then Emph (Other, post, q2, n2 - 2) :: xs else xs
+                  else if n2 > 1 then Emph (Punct, post, q2, n2 - 1) :: xs
+                  else xs
+                in
+                let r =
+                  let il = concat (List.map to_r (parse_emph (List.rev acc))) in
+                  if n1 >= 2 && n2 >= 2 then R (Strong ([], il)) :: xs
+                  else R (Emph ([], il)) :: xs
+                in
+                let r =
+                  if n1 >= 2 && n2 >= 2 then
+                    if n1 > 2 then Emph (pre, Other, q1, n1 - 2) :: r else r
+                  else if n1 > 1 then Emph (pre, Punct, q1, n1 - 1) :: r
+                  else r
+                in
+                parse_emph r


I know you didn't add this code, but if you happen to have reasoned out what it does and can help illuminate it with some comments, that would be a big help. I find it pretty inscrutable! If not, no worries :)

shonfeder · 2022-05-24T02:14:10Z

src/parser.ml

+              if not is_next_closer_same then loop (x :: acc) xs1
+              else
+                let xs' = parse_emph xs in
+                if xs' = xs then loop (x :: acc) xs1 else loop acc xs'


I haven't puzzled out this function enough. Do you happen to know why we are doing a (polymorphic) equality comparison on the lists here?

tatchi

I have no idea 😬 I could investigate, but that comparison doesn't seem to be needed, as the following code successfully passes all the tests.

diff --git a/src/parser.ml b/src/parser.ml
index e90b96d..46f37b6 100644
--- a/src/parser.ml
+++ b/src/parser.ml
@@ -959,9 +959,7 @@ module Pre = struct
                 | Some (_, _, q3, _) -> q2 = q3
               in
               if not is_next_closer_same then loop (x :: acc) xs1
-              else
-                let xs' = parse_emph xs in
-                if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
+              else loop acc (parse_emph xs)
           | x :: xs -> loop (x :: acc) xs
           | [] -> x :: List.rev acc
         in

shonfeder

That's already a good investigation! I'll make a note to follow up here with a change that removes this bit. Thanks!

tatchi

I tried removing that equality comparison in master:

diff --git a/src/parser.ml b/src/parser.ml
index 6342989..053ca50 100644
--- a/src/parser.ml
+++ b/src/parser.ml
@@ -926,9 +926,8 @@ module Pre = struct
                   else r
                 in
                 parse_emph r
-          | (Emph _ as x) :: xs1 as xs when is_opener x ->
-              let xs' = parse_emph xs in
-              if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
+          | (Emph _ as x) :: _ as xs when is_opener x ->
+              loop acc (parse_emph xs)
           | x :: xs -> loop (x :: acc) xs
           | [] -> x :: List.rev acc
         in

When running the tests, it gets stuck at 99%. I assume it's because of an infinite loop. So it appears that this code is useful in master but becomes unnecessary with this PR.
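
The hang described above is consistent with a loop that keeps re-running a rewrite on input the rewrite cannot change. As a minimal sketch of that idea, here is a toy version in which the equality check acts as a progress guard. Everything in it (the tok type, rewrite, loop, normalize) is invented for illustration and is not omd's parse_emph; whether this is the actual reason for the comparison in master is an assumption.

```ocaml
(* Toy token type; NOT omd's real delimiter representation. *)
type tok = Star | Text of string | Em of string

(* Hypothetical rewrite step: turns a leading [Star; Text s; Star] into
   [Em s], and returns its input unchanged when nothing matches. *)
let rewrite = function
  | Star :: Text s :: Star :: rest -> Em s :: rest
  | toks -> toks

let rec loop acc = function
  | [] -> List.rev acc
  | Star :: rest as toks ->
      let toks' = rewrite toks in
      (* Progress guard: if the rewrite changed nothing, keep the star as
         plain content and move on. Calling [loop acc (rewrite toks)]
         unconditionally would recurse forever on such input. *)
      if toks' = toks then loop (Star :: acc) rest else loop acc toks'
  | t :: rest -> loop (t :: acc) rest

let normalize toks = loop [] toks

let () =
  (* The first pair of stars matches; the trailing star has no partner. *)
  assert (normalize [ Star; Text "a"; Star; Star; Text "b" ]
          = [ Em "a"; Star; Text "b" ]);
  (* An unmatched star terminates thanks to the guard. *)
  assert (normalize [ Star; Text "a" ] = [ Star; Text "a" ])
```

Polymorphic equality is safe in this toy because the tokens contain no functions or cyclic values; it still walks both lists, so a dedicated "did anything change" flag would be cheaper on long inputs.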

tatchi · 2022-05-24T20:47:56Z

Thanks for the review! 😊 I would have preferred to split that PR into 3 smaller ones, but now that you started the review process, let's continue here :)

I will address your comments and try to do my best to document my changes.

tatchi · 2022-05-28T16:41:30Z

I think I've addressed your comments. I also added some comments to document the code I added. That wasn't super easy to explain, so hopefully it's more or less understandable 😅

I'm wondering if the find_next_emph functions I introduced are ideal. Instead of looking at the "future" (i.e., what we haven't read yet) to make a decision, I'm wondering if we could keep reading in any case but store some info in the acc (e.g., the last delimiter we read) and make the decision later on.

I'm not sure if that makes sense and if that's feasible. I'll try and see how far I can go.

shonfeder

I have a few follow-up suggestions, but this looks excellent to me.

Thank you very much for your careful work here. The comments are a huge help in understanding what's going on in this code and, moreover and more importantly, set a good example for the rest of us to follow as we continue maintaining the parser.

shonfeder · 2022-05-29T14:55:57Z

src/parser.ml

+    (*
+      - *foo**bar**baz*
+
+            *foo** -> the second delimiter ** is both an opening and closing delimiter. 
+                      The sum of the length of both delimiters is 3, so they can't be matched.
+
+            **bar** -> they are both opening and closing delimiters.
+                       Their sum is 4 which is not a multiple of 3 so they can be matched to produce <strong>bar</strong>
+
+            The end result is: <em>foo<strong>bar</strong>baz</em>
+
+      - *foo***bar**baz*
+
+            *foo*** -> *** is both an opening and closing delimiter. 
+                       Their sum is 4 so they can be matched to produce: <em>foo</em>**
+
+            **bar** -> they are both opening and closing delimiters.
+                       Their sum is 4 which is not a multiple of 3 so they can be matched to produce <strong>bar</strong>
+
+            The end result is: <em>foo</em><strong>bar</strong>baz*
+
+      - ***foo***bar**baz*
+
+            ***foo*** -> the second delimiter *** is both an opening and closing delimiter. 
+                         Their sum is 6 which is a multiple of 3. However, both lengths are multiples of 3
+                         so they can be matched to produce: <em><strong>foo</strong></em>
+
+            bar**baz* -> ** is both an opening and closing delimiter.
+                         Their sum is 3 so they can't be matched
+
+            The end result is: <em><strong>foo</strong></em>bar**baz*
+      *)


This is wonderful documentation. Thank you very much!

shonfeder · 2022-05-29T15:04:09Z

src/parser.ml

+                   *foo**bar*baz*  The second delimiter that's both an opener/closer ( ** before bar)
+                                   doesn't match with the next delimiter ( * after bar). **bar will be
+                                   considered as regular text. The end result will be: <em>foo**bar</em>baz*


Dang, this is so subtle. Having the comment explaining really helps.

shonfeder · 2022-05-29T15:06:13Z

src/parser.ml

+              if not is_next_closer_same then loop (x :: acc) xs1
+              else
+                let xs' = parse_emph xs in
+                if xs' = xs then loop (x :: acc) xs1 else loop acc xs'


That's already a good investigation! I'll make a note to follow up here with a change that removes this bit. Thanks!

shonfeder · 2022-05-29T15:11:34Z

> I would have preferred to split that PR into 3 smaller ones, but now that you started the review process, let's continue here :)

I'm sorry about that! I didn't mean to push you into doing just one PR. I'll refrain from doing a review before you indicate you're ready in the future.

As soon as you are happy with this one, please convert it from a draft PR. All that's left to be done there is to add an entry into the changelog (I'll do that if you haven't beaten me to it by the time this is converted from a draft).

Thanks again for your excellent work here on fixing the parser! :)

shonfeder · 2022-07-24T22:27:56Z

Hi, @tatchi! Just checking here. If you're happy with the state of things, I'd be happy to make a small add here to update the changelog and merge this in. But I don't want to step on your toes in case you had been planning to do more work on this.

tatchi · 2022-07-25T16:15:10Z

> Hi, @tatchi! Just checking here. If you're happy with the state of things, I'd be happy to make a small add here to update the changelog and merge this in. But I don't want to step on your toes in case you had been planning to do more work on this.

Hi @shonfeder, everything looks good to me except these two find_next_* functions that I introduced. I'm actually not sure if that matters much, but I have the feeling that it would be better not to use such functions that look ahead but instead store info about what we've parsed so far to make future decisions.

I'm not sure if that will be feasible but I would like at least to give it a try. Good news is that I'll have time to look at it this week, so expect an outcome by the end of the week 😁

tatchi · 2022-07-28T15:44:32Z

I discovered an emphasis parsing bug that was not covered by the conformance tests. I added an extra test in 46d6a61 with the fix in 6e17dd3

For the rest, I haven't found a way to remove these two find_next_* functions, but I don't think it's important.

I marked the PR as ready for review so it can be merged now 😁

shonfeder · 2022-08-01T21:21:48Z

Many thanks for the great work here, @tatchi!

@cuihtlauac

CHANGES:

- Expose the HTML escape function `htmlentities` (ocaml-community/omd#295 @cuihtlauac)
- Support generation of identifiers in headers (ocaml-community/omd#294, @tatchi)
- Support GitHub-Flavoured Markdown tables (ocaml-community/omd#292, @bobatkey)
- Update parser to support CommonMark Spec 0.30 (ocaml-community/omd#266, @SquidDev)
- Preserve the order of input files in the HTML output to stdout (ocaml-community/omd#258, @patricoferris)
- Fix all deviations from CommonMark Spec 0.30 (ocaml-community/omd#284, ocaml-community/omd#283, ocaml-community/omd#278, ocaml-community/omd#277, ocaml-community/omd#269, @tatchi)

tatchi changed the title ~~fix some emphasis parsing~~ [DRAFT] fix some emphasis parsing May 13, 2022

shonfeder mentioned this pull request May 24, 2022

Update spec to 0.30 #266

Merged

shonfeder reviewed May 24, 2022

tests/extract_tests.ml (outdated)

shonfeder reviewed May 24, 2022

tatchi added 4 commits May 24, 2022 06:58

fix 410,411, 414 and 428

b386319

fix 415 and 416

3a80780

fmt

3f7cdca

fix 468 and 469

72ea961

tatchi force-pushed the fix-some-emph-parsing branch from 7e6a712 to 72ea961 Compare May 24, 2022 05:29

tatchi added 4 commits May 24, 2022 08:43

move find_next_emph to the outer scope

d51b798

move find_next_closer_emph to the outer scope

0b6885d

extract is_match function

ed8a673

refactor

5bc6c7f

tatchi added 2 commits May 24, 2022 22:53

better consistency in naming

8db3786

add comments

d67a859

shonfeder reviewed May 29, 2022

tmattio mentioned this pull request Jun 20, 2022

Port 1.3 to dune #273

Merged

address comments

6e8971c

tatchi added 3 commits July 27, 2022 19:20

make some variable names clearer

7128e69

add a failing test

46d6a61

fix extra test

6e17dd3

tatchi marked this pull request as ready for review July 28, 2022 15:40

tatchi changed the title ~~[DRAFT] fix some emphasis parsing~~ fix some emphasis parsing Jul 28, 2022

Merge branch 'master' into fix-some-emph-parsing

dbe7961

shonfeder enabled auto-merge (rebase) August 1, 2022 21:20

shonfeder disabled auto-merge August 1, 2022 21:21

shonfeder enabled auto-merge (squash) August 1, 2022 21:23

shonfeder merged commit 3077325 into ocaml-community:master Aug 1, 2022

shonfeder mentioned this pull request Dec 12, 2022

[new release] omd (2.0.0.alpha3) ocaml/opam-repository#22654

Merged
