Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix some emphasis parsing #269

Merged
merged 15 commits into from
Aug 1, 2022

Conversation

tatchi
Copy link
Collaborator

@tatchi tatchi commented May 13, 2022

Not ready for review yet. Opening it as a draft to give an overview of what I'm currently working on.

@tatchi tatchi changed the title fix some emphasis parsing [DRAFT] fix some emphasis parsing May 13, 2022
@tatchi
Copy link
Collaborator Author

tatchi commented May 19, 2022

Okay, I think I managed to fix all the broken tests mentioned in #244. At least tests are passing now and nothing else broke 😁

It's the first time I'm working on a parser, and that was not easy 🥵😅

Based on the expected outcome, I first tried to guess what should be the solution until I told myself that it should probably be a good idea to read the spec. First time that I'm reading a spec (or rather a part of it), and that was very helpful! 😁

There are 3 main commits:

  1. b52495a that fixes 410,411, 414 and 428
  2. cceeb2c that fixes 415 and 416
  3. 7e6a712 that fixes 468 and 469

I'm not very happy with the code of 1 and 3 so I'll see if I come up with something better. I'm planning on creating a separate PR for each of them so we can have a dedicated discussion for each. The fix is different anyway so I think it makes sense.

@sonologico
Copy link
Collaborator

This is great. Thank you. Splitting PRs would be good.

@shonfeder
Copy link
Collaborator

Yeah, this is really great work!

I agree that refactoring it into independent PRs could help keep things very clean.

One heads up is should we be aware of the potential with conflicts with #266. I'll work on trying to get that merged in the next day or two, which should help clear up that risk one way or another :)

@shonfeder shonfeder mentioned this pull request May 24, 2022
@shonfeder
Copy link
Collaborator

I would be fine with review this from the current PR. While I appreciate the intention to keep the PRs clean and focused, this group of spec violations was already grouped into a single issue, and the changeset is small enough here, and close enough related, that I think it makes sense to review in one PR.

tests/extract_tests.ml Outdated Show resolved Hide resolved
Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some initial questions and suggestions.

I think your additions generally read ok and fits with the existing code around it.

My biggest ask is that you provide comments to explain the logic you're introducing. This is something we are sorely lacking in the existing parsing code, and it would be awesome to start sharing the knowledge while it is fresh in people's minds after they have to reconstruct it from the code.

All of my other suggestions here are negotiable if you have strong feelings that point in another direction.

src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated
Comment on lines 973 to 916
is_opener x
&& (n1 + n2) mod 3 = 0
&& n1 mod 3 != 0
&& n2 mod 3 != 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you either add a comment explaining the logic for this conditional, or define a predicate that provided a "self-documenting" explanation of what is being tested in this condition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an is_match function with a description coming from the spec: https://github.com/ocaml/omd/pull/269/files#diff-2f56b48efe4c0b849c5b527b223a6c8ab85abd599e10f8b465f5c9801449a468R921

Not convinced by the function's name though but I couldn't come up with anything better. Open to any better alternative :)

src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated
Comment on lines 993 to 950
let xs =
if n1 >= 2 && n2 >= 2 then
if n2 > 2 then Emph (Other, post, q2, n2 - 2) :: xs else xs
else if n2 > 1 then Emph (Punct, post, q2, n2 - 1) :: xs
else xs
in
let r =
let il = concat (List.map to_r (parse_emph (List.rev acc))) in
if n1 >= 2 && n2 >= 2 then R (Strong ([], il)) :: xs
else R (Emph ([], il)) :: xs
in
let r =
if n1 >= 2 && n2 >= 2 then
if n1 > 2 then Emph (pre, Other, q1, n1 - 2) :: r else r
else if n1 > 1 then Emph (pre, Punct, q1, n1 - 1) :: r
else r
in
parse_emph r
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you didn't add this code, but if you happened to have reasoned out what it does and can help illuminate with some comments, that would be a big help. I find it pretty inscrutable! If not, not worries :)

src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated
if not is_next_closer_same then loop (x :: acc) xs1
else
let xs' = parse_emph xs in
if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't puzzled out this function enough, do you happen to know why we are doing a (polymorphic) equality comparison on the lists here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea 😬 I could investigate but that comparison doesn't seem to be needed as the following code successfully pass all the tests.

diff --git a/src/parser.ml b/src/parser.ml
index e90b96d..46f37b6 100644
--- a/src/parser.ml
+++ b/src/parser.ml
@@ -959,9 +959,7 @@ module Pre = struct
                 | Some (_, _, q3, _) -> q2 = q3
               in
               if not is_next_closer_same then loop (x :: acc) xs1
-              else
-                let xs' = parse_emph xs in
-                if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
+              else loop acc (parse_emph xs)
           | x :: xs -> loop (x :: acc) xs
           | [] -> x :: List.rev acc
         in

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already a good investigation! I'll make note to followup here with a change that removes this bit. Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing that equality comparison in master

diff --git a/src/parser.ml b/src/parser.ml
index 6342989..053ca50 100644
--- a/src/parser.ml
+++ b/src/parser.ml
@@ -926,9 +926,8 @@ module Pre = struct
                 else r
               in
               parse_emph r
-          | (Emph _ as x) :: xs1 as xs when is_opener x ->
-              let xs' = parse_emph xs in
-              if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
+          | (Emph _ as x) :: _ as xs when is_opener x ->
+              loop acc (parse_emph xs)
           | x :: xs -> loop (x :: acc) xs
           | [] -> x :: List.rev acc
         in

When running the tests, it gets stuck at 99%. I assume it's because of an infinite loop. So it appears that this code is useful in master but becomes unnecessary with that PR.

@tatchi
Copy link
Collaborator Author

tatchi commented May 24, 2022

Thanks for the review! 😊 I would have preferred to split that PR into 3 smaller ones, but now that you started the review process, let's continue here :)

I will address your comments and try to do my best to document my changes.

@tatchi
Copy link
Collaborator Author

tatchi commented May 28, 2022

I should have addressed your comments. I also added some comments to document the code I added. That wasn't super easy to explain so hopefully it's more or less understandable 😅

I'm wondering if the functions find_next_emph I introduced are ideal. Instead of looking at the "future" (i.e what we haven't read yet) to make a decision, I'm wondering if we couldn't continue to read in any case but store some info in the acc (i.e what was the last delimiter we read) to make a decision later on.

I'm not sure if that makes sense and if that's feasible. I'll try and see how far I can go.

Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few followup suggestions, but looks excellent to me.

Thank you very much for your careful work here. The comments are a huge help in understanding what's going on in this code and, moreover and more importantly, set a good example for the rest of us to follow as we continue maintaining the parser.

Comment on lines +922 to +953
(*
- *foo**bar**baz*

*foo** -> the second delimiter ** is both an opening and closing delimiter.
The sum of the length of both delimiters is 3, so they can't be matched.

**bar** -> they are both opening and closing delemiters.
Their sum is 4 which is not a multiple of 3 so they can be matched to produce <strong>bar</strong>

The end result is: <em>foo<strong>bar</strong>baz</em>

- *foo***bar**baz*

*foo*** -> *** is both an opening and closing delimiter.
Their sum is 4 so they can be matched to produce: <em>foo</em>**

**bar** -> they are both opening and closing delemiters.
Their sum is 4 which is not a multiple of 3 so they can be matched to produce <strong>bar</strong>

The end result is: <em>foo</em><strong>bar</strong>baz*

- ***foo***bar**baz*

***foo*** -> the second delimiter *** is both an opening and closing delimiter.
Their sum is 6 which is a multiple of 3. However, both lengths are multiples of 3
so they can be matched to produce: <em><strong>foo</strong></em>

bar**baz* -> ** is both an opening and closing delimiter.
Their sum is 3 so they can't be matched

The end result is: <em><strong>foo</strong></em>bar**baz*
*)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wonderful documentation. Thank you very much!

src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated Show resolved Hide resolved
src/parser.ml Outdated Show resolved Hide resolved
Comment on lines +973 to +975
*foo**bar*baz* The second delimiter that's both an opener/closer ( ** before bar)
doesn't match with the next delimiter ( * after bar). **bar will be
considered as regular text. The end result will be: <em>foo**bar</em>baz*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang, this is so subtle. Having the comment explaining really helps.

src/parser.ml Outdated
if not is_next_closer_same then loop (x :: acc) xs1
else
let xs' = parse_emph xs in
if xs' = xs then loop (x :: acc) xs1 else loop acc xs'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already a good investigation! I'll make note to followup here with a change that removes this bit. Thanks!

@shonfeder
Copy link
Collaborator

I would have preferred to split that PR into 3 smaller ones, but now that you started the review process, let's continue here :)

I'm sorry about that! I didn't mean to push you into doing just one PR. I'll refrain from doing a review before you indicate you're ready in the future.

As soon as you are happy with this one, please convert it from a draft PR. All that's left to be done there is to add an entry into the changelog (I'll do that if you haven't beaten me to it by the time this is converted from a draft).

Thanks again for your excellent work here on fixing the parser! :)

@tmattio tmattio mentioned this pull request Jun 20, 2022
@shonfeder
Copy link
Collaborator

Hi, @tatchi! Just checking here. If you're happy with the state of things, I'd be happy to make a small add here to update the changelog and merge this in. But I don't want to step on your toes in case you had been planning to do more work on this.

@tatchi
Copy link
Collaborator Author

tatchi commented Jul 25, 2022

Hi, @tatchi! Just checking here. If you're happy with the state of things, I'd be happy to make a small add here to update the changelog and merge this in. But I don't want to step on your toes in case you had been planning to do more work on this.

Hi @shonfeder, everything looks good to me except these two find_next_* functions that I introduced. I'm actually not sure if that matters much, but I have the feeling that it would be better not to use such functions that look ahead but instead store info about what we've parsed so far to make future decisions.

I'm not sure if that will be feasible but I would like at least to give it a try. Good news is that I'll have time to look at it this week, so expect an outcome by the end of the week 😁

@tatchi tatchi marked this pull request as ready for review July 28, 2022 15:40
@tatchi tatchi changed the title [DRAFT] fix some emphasis parsing fix some emphasis parsing Jul 28, 2022
@tatchi
Copy link
Collaborator Author

tatchi commented Jul 28, 2022

I discovered an emphasis parsing bug that was not covered by the conformance tests. I added an extra test in 46d6a61 with the fix in 6e17dd3

For the rest, I haven't found a way to remove these two find_next_* functions, but I don't think it's important.

I marked the PR as ready for review so it can be merged now 😁

@shonfeder shonfeder enabled auto-merge (rebase) August 1, 2022 21:20
@shonfeder shonfeder disabled auto-merge August 1, 2022 21:21
@shonfeder
Copy link
Collaborator

Many thanks for the great work here, @tatchi!

@shonfeder shonfeder enabled auto-merge (squash) August 1, 2022 21:23
@shonfeder shonfeder merged commit 3077325 into ocaml-community:master Aug 1, 2022
mseri pushed a commit to ocaml/opam-repository that referenced this pull request Dec 13, 2022
CHANGES:

- Expose the HTML escape function `htmlentities` (ocaml-community/omd#295 @cuihtlauac)

- Support generation of identifiers in headers (ocaml-community/omd#294, @tatchi)

- Support GitHub-Flavoured Markdown tables (ocaml-community/omd#292, @bobatkey)

- Update parser to support CommonMark Spec 0.30 (ocaml-community/omd#266, @SquidDev)

- Preserve the order of input files in the HTML output to stdout (ocaml-community/omd#258,
  @patricoferris)

- Fix all deviations from CommonMark Spec 0.30 (ocaml-community/omd#284, ocaml-community/omd#283, ocaml-community/omd#278, ocaml-community/omd#277, ocaml-community/omd#269,
  @tatchi)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants