Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pluggable parsing #113

Closed
wants to merge 3 commits into from
Closed

Conversation

bhagany
Copy link
Contributor

@bhagany bhagany commented Nov 26, 2016

This is a proposal for a pluggable content parsing system that can be extended beyond only markdown, both in user code and in perun itself.

Changes:

  • Abstracted the logic of markdown into the more general content task. It takes a vector of keywords that name markup languages to parse, which defaults to [:markdown]. Options to each language parser can be passed through using a map that looks like, for example {:markdown {:smarts true} :asciidoc {"attributes" {"backend" "docbook"}}
  • Defined the content-parser multimethod as an extension point for plugging into the content task
  • Reimplemented markdown's functionality by implementing a content-parser method for :markdown, so that by default (content) is equivalent to (markdown)
  • Added an :asciidoc method for content-parser, to demonstrate the addition of new parsers

I'm aware of one issue with the asciidoc parser: if the yaml metadata block at the beginning of a file, and the asciidoc contains any ---- delimited blocks, an error will be thrown. I'd appreciate some input on this.

Let me know what you think!

A pluggable content system is necessary to facilitate flexible content generation. The new `content` task is intended to allow easy extension of the content system so that it can grow beyond only markdown.
@martinklepsch
Copy link
Contributor

Will take a closer look later but in the meantime have you seen #50?

@bhagany
Copy link
Contributor Author

bhagany commented Nov 26, 2016

I hadn't! I'll take a look.

@bhagany
Copy link
Contributor Author

bhagany commented Nov 26, 2016

The asciidoc implementation in this PR certainly leaves something to be desired compared to the one in #100. However, I only picked asciidoc as an example because it's the first thing listed here: https://en.wikipedia.org/wiki/Lightweight_markup_language :). The real feature here is a unified, user-pluggable content pipeline. If this idea gets good feedback, I would advocate for removing asciidoc from this PR.

@martinklepsch
Copy link
Contributor

@bhagany so one general thing: instead of the multimethod we could just invoke the task multiple times (as often as you you have different content types) and pass the "spec" that the multimethods return to the task as a task option. For some supported content types we could provide these "specs" as part of the std-lib.

@martinklepsch
Copy link
Contributor

@bhagany also it might be nice to differentiate which content transformation was used for a given entry in the fileset.

On that note: I'm still thinking it wouldn't hurt to use specific keys for the results like :perun.content/markdown & :perun.content/asciidoctor. If you then want to render all with the same render-fn you'd need some (or (:perun.content/markdown e) (:perun.content/asciidoctor e)) but it's more explicit and clear overall imo. (This would also alleviate some of the confusion around :body and :content and whatnot I guess.)

@bhagany
Copy link
Contributor Author

bhagany commented Nov 27, 2016

It's interesting that you bring up the keyword thing, because that was the starting point of the thought process that led to this PR. I eventually came to the conclusion that markup-language-specific metadata keys aren't all that useful. Some points about that:

  • Language-specific keys unnecessarily couple your rendering code to your choice of input language, and you don't really gain anything by coupling input and rendering in this way. For the renderer, :perun.content/markdown "<p>foo</p>" is exactly the same as :perun.content/asciidoc "<p>foo</p>"
  • If downstream tasks do have a reason to care about the input language that I'm not anticipating, this is better served by a separate key/value pair, like {:perun.content/input-language :markdown} or something like that.
  • Separate keys for input languages would make it difficult to write general post-processing tasks. For example, if you wrote a task for applying Tachyons classes to your content, and wanted to distribute it as a 3rd party library, language-specific keys get in your way.
  • it would be kind of weird to have a key called :perun.content/markdown attached to a value that isn't markdown, but is instead HTML.

That said, I do think there's some tension here, both in parsing input and in rendering output. The tension arises when we want multiple ways to produce the same kind of thing. It feels weird to have multiple tasks set the same metadata key, and for the reasons outlined above, using separate keys for the same kinds of values isn't what I'd like to see either. In my opinion, the ideal situation is having a task set a key that corresponds to the task's name. canonical-url sets :canonical-url, permalink sets :permalink, and these are super easy to understand, so I made a content task that sets :content. I think this goes a long way to clarify what the :content key is for.

The best objection to this that I've been able to think of is that I'd rather keep perun's required dependencies to a minimum, and only pull in libraries that someone actually uses. I'm not sure how to dynamically pull in dependencies yet, but I think it's possible. If I'm on the wrong track there, please let me know - that's pretty close to a deal-breaker in adding more parsers and renderers to perun, imo.

Anyway, I'll try to replace the multimethod in the manner you described, but I don't quite understand what the objection is. I think that any implementation will have to involve a function call of some sort, if dynamically loading dependencies for the pods is a goal. I also think that the multimethod approach leads to a nicer API for content. Could you explain your rationale for wanting to replace it with "specs"?

@podviaznikov
Copy link
Member

I like ideas behind this PR.

In my opinion, the ideal situation is having a task set a key that corresponds to the task's name. canonical-url sets :canonical-url, permalink sets :permalink, and these are super easy to understand, so I made a content task that sets :content. I think this goes a long way to clarify what the :content key is for.

This seems very reasonable to me.

Also I personally see no big problem in pulling additional libraries for the included batteries. Other people might have different opinion, but for me extra libraries were not an issue.

@MartyGentillon
Copy link
Contributor

It's interesting to see the different approaches to this. I have been working on something similar where each parser has it's own task, that only affects the files that match the parser.

@MartyGentillon
Copy link
Contributor

MartyGentillon commented Nov 28, 2016

In my opinion, the ideal situation is having a task set a key that corresponds to the task's name. canonical-url sets :canonical-url, permalink sets :permalink, and these are super easy to understand, so I made a content task that sets :content. I think this goes a long way to clarify what the :content key is for.

This seems very reasonable to me.

Except that this leads to an explosion of keys, and suddenly later tasks have to care about where their input comes from. It shouldn't matter if the content was generated by parsing a single file, processing the contents of a directory, or any other way. Keys should be based on the intent of the output.

In fact, I would argue that, like Ring, we should be focused primarily on what the "request map" (or perhaps, for us, page map) contains. We should be writing a spec very much like this:

https://github.com/ring-clojure/ring/blob/master/SPEC

@MartyGentillon
Copy link
Contributor

If downstream tasks do have a reason to care about the input language that I'm not anticipating, this is better served by a separate key/value pair, like {:perun.content/input-language :markdown} or something like that.

This is a really good idea. It might also be good to have a trace stored in a key listing all the transformations applied to a file. something like

{:perun/trace [:content :render]}

Could be really handy for debugging.

@bhagany
Copy link
Contributor Author

bhagany commented Nov 28, 2016

Also I personally see no big problem in pulling additional libraries for the included batteries. Other people might have different opinion, but for me extra libraries were not an issue.

Noted. I don't feel strongly about it - I was following the lead of someone else in one of the discussions here, I think maybe @Deraen? I probably won't focus on this without more prodding, then.

Except that this leads to an explosion of keys, and suddenly later tasks have to care about where their input comes from. It shouldn't matter if the content was generated by parsing a single file, processing the contents of a directory, or any other way. Keys should be based on the intent of the output.

I think we actually agree here - I'm trying to avoid an explosion of keys, while improving the intuitiveness of the interface, by providing a common interface for input parsing.

In fact, I would argue that, like Ring, we should be focused primarily on what the "request map" (or perhaps, for us, page map) contains. We should be writing a spec very much like this:

I agree here as well. Metadata keys should follow a spec, and wherever possible, perun core task names should correspond to key names in the spec. For input parsing, this is possible (hence this pull request). However, for tasks like render and collection, I don't think it's possible, because while their outputs are the same kind of thing (a whole HTML page), their inputs are fundamentally different. This makes two tasks necessary, but different metadata keys for their outputs are undesirable. Further, I don't think that user-defined tasks, or tasks that transform an existing key in-place, should be subject to any key <-> task name correspondence guideline.

It might also be good to have a trace stored in a key listing all the transformations applied to a file. something like
{:perun/trace [:content :render]}
Could be really handy for debugging.

This is a great idea.

@bhagany
Copy link
Contributor Author

bhagany commented Nov 28, 2016

I don't feel strongly about it

Hahahahaha I just reread what I said about it being a dealbreaker above. I guess I am not terribly consistent, forgive me.

@bhagany bhagany mentioned this pull request Nov 28, 2016
@martinklepsch
Copy link
Contributor

{:perun/trace [:content :render]}

👍 I like this. I'm wondering if this could also be used in a nice way to indicate what kind of parser has been used (markdown, asciidoc, ...)? Otherwise also what @bhagany suggested {:perun.content/input-language :markdown} seems fine. I agree that different keywords are probably not worth it.

Anyway, I'll try to replace the multimethod in the manner you described, but I don't quite understand what the objection is. I think that any implementation will have to involve a function call of some sort, if dynamically loading dependencies for the pods is a goal. I also think that the multimethod approach leads to a nicer API for content. Could you explain your rationale for wanting to replace it with "specs"?

"Spec" has become a terribly overloaded word lately 😄 What I mean is this:

{:file-exts ["md" "markdown"]
 :parse-form (fn [files] `(io.perun.content.markdown/parse-markdown ~files ~options))
 :pod (create-pod markdown-deps)}

You now essentially pass them to the task by passing the languages option which contains keywords that the multimethod might deliver a spec as above for. What I'm suggesting is we could skip the multimethod and pass these maps to the task directly. if you have [io.perun :as p] then you could do something like

(content :languages [p/markdown p/asciidoc])
;; p/markdown and p/asciidocs are maps like the one in the previous snippet

This way you could just write a map like above and pass it to the task. No extending of multimethods from foreign namespaces, no possibility of conflicts (think custom/different markdown implementations). You could easily add another file extension if you wanted to by using plain assoc. It just feels like the more flexible style to me but maybe I'm over-optimizing on simplicity here :)

@bhagany
Copy link
Contributor Author

bhagany commented Nov 29, 2016

"Spec" has become a terribly overloaded word lately 😄 What I mean is this:

Okay, I understood you here. My initial thought was that if we wanted to keep required dependencies to a minimum, creating the pod outside of a function call wouldn't work. If dependency minimization isn't a goal, I don't think there's much of a difference between our implementations API-wise, and yours does have the advantages you listed, so I don't have any issue using yours. I will make that change soon.

@MartyGentillon
Copy link
Contributor

Also, don't forget to update the examples. (There really need to be some tests that make sure those always work.)

bhagany added a commit to bhagany/perun that referenced this pull request Dec 5, 2016
Future content parsers will need to use the same yaml parsing that
`markdown` does. This is a modified form of changes in hashobject#113.
@bhagany
Copy link
Contributor Author

bhagany commented Dec 5, 2016

Similarly to #109, I've split the ideas here into several different PR's and future PR's, and I'm pretty happy with that direction, so I'm closing this PR in favor of the others.

@bhagany bhagany closed this Dec 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants