[WIP] add a CST structure #306

tatchi · 2023-05-01T13:47:25Z

This is a WIP to add a CST structure that preserves some details of the original markdown syntax. This is a draft PR and still needs a lot of work. I'm opening it now to start a discussion and get early feedback.

The first two commits are from @patricoferris' work on https://github.com/patricoferris/omd/tree/omd-print. The few commits that follow them are not very relevant.

a3a8fae is where the CST structure is added. This is a very basic implementation: I basically copied a lot of the current AST code and implemented functions that go from AST to CST. I'm not sure I understood the discussion in #223 regarding the CST structure and how it should be implemented. I'm pretty sure there's a much better solution than what I have here.

Subsequent commits add details to the CST so that information is not lost when trying to print the structure back to string.
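To make the idea concrete, here is a minimal sketch of what "adding details to the CST" could look like for headings. None of these types are Omd's actual definitions: `ast_block`, `cst_block`, the `Atx`/`Setext` constructors, and the function names are all hypothetical. Only the name `heading_type` comes from one of the commit messages.

```ocaml
(* Hypothetical, simplified block types; not Omd's real AST/CST. *)
type ast_block =
  | A_paragraph of string
  | A_heading of int * string

(* Syntax detail the AST drops: "# title" vs. an underlined title. *)
type heading_type = Atx | Setext

type cst_block =
  | C_paragraph of string
  | C_heading of heading_type * int * string

(* AST -> CST: the syntax detail is unknown, so pick a default. *)
let cst_of_ast = function
  | A_paragraph s -> C_paragraph s
  | A_heading (lvl, s) -> C_heading (Atx, lvl, s)

(* CST -> AST: forget the syntax detail. *)
let ast_of_cst = function
  | C_paragraph s -> A_paragraph s
  | C_heading (_, lvl, s) -> A_heading (lvl, s)

(* Print a CST block back to markdown.
   Setext headings only exist for levels 1 and 2 in CommonMark;
   this sketch treats any level other than 1 as level 2. *)
let print_block = function
  | C_paragraph s -> s
  | C_heading (Atx, lvl, s) -> String.make lvl '#' ^ " " ^ s
  | C_heading (Setext, lvl, s) ->
      let underline = if lvl = 1 then '=' else '-' in
      s ^ "\n" ^ String.make (String.length s) underline
```

With the extra field, a Setext heading parsed into the CST can be printed back in its original form, while the AST view stays unchanged after `ast_of_cst`.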

What I realized is that there is some information that we need to keep in order not to change the "meaning" of the markdown.

This is the case with:

```
\## hello
```

On master, parsing the above markdown into an AST correctly yields regular text rather than a heading, thanks to the escape character \, but the escape character itself is not preserved in the AST:

```ocaml
# Omd.of_string "\\## hello";;
- : Omd.doc = [Omd.Paragraph ([], Omd__.Ast_inline.Text ([], "## hello"))]
```

So when we print the AST back to a string and parse the result, it becomes a heading, which is obviously incorrect and needs to be fixed.
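A rough sketch of how a CST node could keep the escape so that printing is faithful. The `inline` type and the `Escaped` constructor here are hypothetical and much simpler than Omd's real inline type; they only illustrate the idea.

```ocaml
(* Hypothetical, simplified inline CST; not Omd's actual types. *)
type inline =
  | Text of string
  | Escaped of char (* a backslash-escaped character, e.g. \# *)
  | Concat of inline list

(* Print the CST back to markdown, re-emitting the backslash. *)
let rec to_markdown = function
  | Text s -> s
  | Escaped c -> Printf.sprintf "\\%c" c
  | Concat l -> String.concat "" (List.map to_markdown l)

(* Drop the escapes to recover the plain text an AST would hold. *)
let rec to_plain = function
  | Text s -> s
  | Escaped c -> String.make 1 c
  | Concat l -> String.concat "" (List.map to_plain l)

(* The CST for the input \## hello *)
let example = Concat [ Escaped '#'; Text "# hello" ]
```

Here `to_markdown example` gives back the original `\## hello`, whereas printing `to_plain example` would produce `## hello`, which parses as a heading.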

Besides that, there are other missing pieces of information that make the string we generate different from the original, but don't change the "meaning" of the markdown. That's the case with the emphasis character, for example.

```ocaml
# Omd.of_string "__hello__";;
- : Omd.doc =
[Omd.Paragraph ([],
  Omd__.Ast_inline.Strong ([], Omd__.Ast_inline.Text ([], "hello")))]
```

We don't store in the AST whether the emphasis character is _ or *. In the end, though, we can choose whichever we want when we print the AST back to a string: it won't change the "meaning", and the HTML will be the same. In fact, Pandoc doesn't keep this information either:

```
printf "__hello__" | pandoc --from commonmark --to json | pandoc --from json --to commonmark
**hello**
```
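The two options can be sketched side by side. The commit messages mention an `emph_style` kept in the CST; its definition below (`Star | Underscore`) and the rest of the types and functions are assumptions made for illustration, not the code in this PR.

```ocaml
(* Assumed definition; the PR only tells us the name emph_style. *)
type emph_style = Star | Underscore

(* Hypothetical, simplified inline CST. *)
type inline =
  | Text of string
  | Emph of emph_style * inline
  | Strong of emph_style * inline

(* Build a delimiter run of n characters in the given style. *)
let delim style n =
  String.make n (match style with Star -> '*' | Underscore -> '_')

(* Faithful printer: reuses the delimiter that was parsed. *)
let rec to_markdown = function
  | Text s -> s
  | Emph (st, i) -> delim st 1 ^ to_markdown i ^ delim st 1
  | Strong (st, i) -> delim st 2 ^ to_markdown i ^ delim st 2

(* Normalising pass, similar in effect to the pandoc round trip:
   every emphasis delimiter becomes '*'. *)
let rec normalise = function
  | Text s -> Text s
  | Emph (_, i) -> Emph (Star, normalise i)
  | Strong (_, i) -> Strong (Star, normalise i)
```

For `Strong (Underscore, Text "hello")`, `to_markdown` returns `__hello__`, and applying `normalise` first returns `**hello**`. Both render to the same HTML, so which one we want depends on the answer to the question below.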

I'm wondering what we're aiming for in our case? Do we strictly want to print back the exact same string we parsed, or is it fine as long as the markdown result/HTML output is the same?

patricoferris and others added 12 commits April 23, 2023 13:45

initial implementation

789eb5c

wip

1166e7c

fmt

4ab669a

disable auto identifiers for print tests

815e9a3

disable table and def list tests

8b9ac0e

explicitely list all failing tests

2580623

add a basic cst structure

a3a8fae

keep strong emph_style in cst

fcf5de6

keep emph emph_style in cst

2690062

add heading_type in cst

3406908

keep escape character \ in cst

bfee91a

keep link type in cst

dd13afe
