[WIP] add a CST structure #306

Draft: wants to merge 12 commits into base: master

@tatchi tatchi commented May 1, 2023

This is a WIP to add a CST structure that preserves some details of the original markdown syntax. This is a draft PR and still needs a lot of work. I'm opening it to start a discussion and get early feedback.

The first two commits are from @patricoferris' work on https://github.com/patricoferris/omd/tree/omd-print. The few commits after those are not very relevant.

a3a8fae is where the CST structure is added. This is a very basic implementation: I copied much of the current AST code and implemented functions that go from AST to CST. I'm not sure I understood the discussion in #223 regarding the CST structure and how it should be implemented, and I'm pretty sure there's a much better solution than what I have here.

Subsequent commits add details to the CST so that information is not lost when printing the structure back to a string.

What I realized is that there is some information that we need to keep in order not to change the "meaning" of the markdown.

This is the case with:

\## hello

On master, parsing the above markdown into an AST correctly yields regular text rather than a heading, thanks to the escape character \, but the escape character itself is not preserved in the AST:

# Omd.of_string "\\## hello";;
- : Omd.doc = [Omd.Paragraph ([], Omd__.Ast_inline.Text ([], "## hello"))]

So when we print the AST and parse the result again, it becomes a heading, which is obviously incorrect and needs to be fixed.
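
To make the problem concrete, here is a minimal, self-contained sketch of how a CST node could carry the escape. The types and function names (`cst_inline`, `Cst_escaped`, `print_cst`, and so on) are hypothetical and much simpler than omd's real AST or the CST in this PR; they only illustrate the idea.

```ocaml
(* Hypothetical, simplified types; NOT omd's actual AST or CST. *)
type ast_inline = Text of string

type cst_inline =
  | Cst_text of string   (* literal characters *)
  | Cst_escaped of char  (* a backslash-escaped character, e.g. \# *)

(* Printing the AST emits the text verbatim, so the escape is lost. *)
let print_ast (Text s) = s

(* Printing the CST re-emits the backslash before each escaped character. *)
let print_cst (nodes : cst_inline list) : string =
  nodes
  |> List.map (function
       | Cst_text s -> s
       | Cst_escaped c -> Printf.sprintf "\\%c" c)
  |> String.concat ""

let () =
  (* Both values model the source text: \## hello *)
  let ast = Text "## hello" in
  let cst = [ Cst_escaped '#'; Cst_text "# hello" ] in
  (* The AST output would be re-parsed as a heading. *)
  assert (print_ast ast = "## hello");
  (* The CST output matches the original source. *)
  assert (print_cst cst = "\\## hello")
```

An alternative that avoids a dedicated node would be for the printer to re-escape any text that would otherwise start a block construct; that is a different design and is not what this sketch shows.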

Besides that, there are other missing pieces of information that make the string we generate different from the original, but don't change the "meaning" of the markdown. That's the case with the emphasis character, for example.

# Omd.of_string "__hello__";;
- : Omd.doc =
[Omd.Paragraph ([],
  Omd__.Ast_inline.Strong ([], Omd__.Ast_inline.Text ([], "hello")))]

We don't store in the AST whether the emphasis character is _ or *. But in the end, we can choose whichever we want when we print the AST back to a string: it won't change the "meaning", and the HTML will be the same. Actually, Pandoc doesn't keep this information either:

printf "__hello__" | pandoc --from commonmark --to json | pandoc --from json --to commonmark
**hello**
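
As a rough illustration of what keeping the delimiter would look like, the sketch below tags emphasis nodes with the character used in the source. The `delim` type, the constructors, and `normalize` are assumptions made for this example, not omd's API; omd's real `Strong` node also carries attributes, which are left out here.

```ocaml
(* Hypothetical, simplified types; NOT omd's actual API. *)
type delim = Star | Underscore

type inline =
  | Text of string
  | Emph of delim * inline
  | Strong of delim * inline

let delim_char = function Star -> '*' | Underscore -> '_'

(* Print using the delimiter recorded at parse time. *)
let rec to_string = function
  | Text s -> s
  | Emph (d, i) ->
      let m = String.make 1 (delim_char d) in
      m ^ to_string i ^ m
  | Strong (d, i) ->
      let m = String.make 2 (delim_char d) in
      m ^ to_string i ^ m

(* Forget the original delimiter and always use '*', which mirrors the
   Pandoc behaviour shown above. *)
let rec normalize = function
  | Text s -> Text s
  | Emph (_, i) -> Emph (Star, normalize i)
  | Strong (_, i) -> Strong (Star, normalize i)

let () =
  let doc = Strong (Underscore, Text "hello") in
  assert (to_string doc = "__hello__");
  assert (to_string (normalize doc) = "**hello**")
```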

I'm wondering what we're aiming for in our case. Do we strictly want to print back the exact string we parsed, or is it fine as long as the rendered HTML output is the same?
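
The two possible goals can be stated as two different round-trip properties. The sketch below uses a toy `parse`/`print`/`to_html` trio that only understands strong emphasis; these functions are stand-ins invented for the example and have nothing to do with omd's real parser or printer.

```ocaml
(* Toy document model, purely illustrative. *)
type doc = Strong of string | Plain of string

(* Toy parser: recognises only __x__ and **x** as strong emphasis. *)
let parse s =
  let n = String.length s in
  let wrapped m =
    n >= 4 && String.sub s 0 2 = m && String.sub s (n - 2) 2 = m
  in
  if wrapped "__" || wrapped "**" then Strong (String.sub s 2 (n - 4))
  else Plain s

(* Toy printer: always picks '*', like an AST without delimiter info. *)
let print = function Strong t -> "**" ^ t ^ "**" | Plain t -> t

let to_html = function
  | Strong t -> "<strong>" ^ t ^ "</strong>"
  | Plain t -> "<p>" ^ t ^ "</p>"

(* Strict goal: printing gives back exactly the source string. *)
let exact_roundtrip src = print (parse src) = src

(* Weaker goal: printing and re-parsing gives the same HTML. *)
let semantic_roundtrip src =
  to_html (parse (print (parse src))) = to_html (parse src)

let () =
  assert (not (exact_roundtrip "__hello__"));
  assert (semantic_roundtrip "__hello__");
  assert (exact_roundtrip "**hello**")
```

Under the weaker property the emphasis case is already fine, while the escaped `\##` case fails both properties, which is why it has to be fixed whichever goal is chosen.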
