"Add" CSL YAML #278

Open
bdarcus opened this issue Jun 26, 2020 · 96 comments

Comments

@bdarcus
Member

bdarcus commented Jun 26, 2020

This and this suggest we can use our JSON schemas to validate a YAML alternative.

Pandoc already supports a YAML alternative (cc @jgm).

I suggest we do something with this, since it's zero work for us, and would give more options for users and developers.

Perhaps the most sensible option is just to add a sentence to the spec that mentions this possibility, without requiring implementations to support it?

Edit: actually, we say nothing about input in the spec currently. So we would need to add a section on input data, and say that our schema can validate either json or yaml.

Proposal

Based on this discussion, we should:

  1. review the existing json schema for any possible adjustments we might want to make now. Date parts are one obvious discrepancy; are there any others?
  2. add an optional field for the content markup format to parse; the html subset could be default, and we could enumerate other options; say markup: org.

Originally posted by @bdarcus in #277 (comment)

@bdarcus bdarcus changed the title from CSL YAML to "Add" CSL YAML on Jun 26, 2020
@denismaier
Member

denismaier commented Jun 26, 2020

Is pandoc's CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ in at least some respects?
E.g. regarding dates: CSL JSON has this construct:

"issued":{"date-parts":[[2015,4,2]]},

Whereas the YAML:

  issued:
    - year: 2015
      month: 4
      day: 2
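
The two date shapes carry the same information, so one can be mapped to the other mechanically. A minimal sketch in Python, assuming the pandoc-style value has already been parsed into a list of dicts with integer `year`, `month`, and `day` keys; the function name `to_date_parts` is hypothetical and not part of pandoc or any CSL library:

```python
def to_date_parts(pandoc_dates):
    """Convert a pandoc-style date list to the CSL JSON date-parts shape.

    Assumes each entry is a dict with an integer 'year' and optional
    'month' and 'day' keys (an assumption made for illustration).
    """
    parts = []
    for entry in pandoc_dates:
        date = [entry["year"]]
        if "month" in entry:
            date.append(entry["month"])
            # A day is only kept when a month is present.
            if "day" in entry:
                date.append(entry["day"])
        parts.append(date)
    return {"date-parts": parts}


issued = to_date_parts([{"year": 2015, "month": 4, "day": 2}])
```

Here `issued` has the same structure as the JSON construct quoted above.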

(I think I've tried to use the JSON schema for autocompletion and validation once, but I wasn't so lucky. Using Atom or VS Code as plain text reference managers could be very nice for some projects...)

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

Is pandocs CSL YAML actually identical to CSL JSON?

I hadn't checked, but I'm wondering if devs like @jgm would find value in this.

Advantage is we have one schema, that is always in sync with the CSL spec.

@bwiernik
Member

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

Exactly.

YAML is easily hand-editable. JSON isn't.

@denismaier
Member

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

@denismaier
Member

@retorquere Do you have any input here?

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

liuderchi/ide-yaml#56

@retorquere

Is pandocs CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ at least in some aspects.
E.g. regarding dates: CSL JSON has this construct:

To the best of my knowledge, this is the only difference. circa and season are supported here, circa at the same level as date-parts, not per-date. But I do not know of a formal spec.

@jgm

jgm commented Jun 26, 2020

Is pandoc's CSL YAML actually identical to CSL JSON?

No, not exactly. In addition to the difference noted (and maybe others which I've forgotten), the YAML bibliographies read by pandoc can have arbitrary pandoc markdown formatting. (And NOT the CSL HTML-ish formatting.) So it's not just a YAML translation of CSL JSON.

As I develop my new citeproc library, I may change things a bit to line things up more, while preserving backwards compatibility. For example, I think a date-parts field should be allowed, but I'd try to keep the current more elegant-looking syntax as an option too.

@bdarcus
Member Author

bdarcus commented Jun 26, 2020 via email

@retorquere

retorquere commented Jun 26, 2020

@jgm but the html markup that csl-json supports is also valid markdown, so for export, there'd be no problem.

Edit: wait, does pandoc only support markdown tags, and not the html tags?

What other tools besides pandoc read csl-yaml?

@bwiernik
Member

I think accepting dates in either format would be fine in either schema. We just need to specify what the order of priority for redundant parts is (cf. the issue raised that we should specify in names that specific name parts are given priority over a literal field; we could do the same with dates, perhaps with date-parts getting priority over the individual pieces).
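
One way such a priority rule could look, as a rough Python sketch. The rule itself (an explicit `date-parts` array wins over individual `year`/`month`/`day` keys) is only the possibility floated here, not something the schema defines, and `resolve_date` is a hypothetical name:

```python
def resolve_date(date):
    """Return date-parts for a date object that may carry redundant parts.

    Hypothetical rule for illustration: an explicit 'date-parts' array
    takes priority over individual 'year'/'month'/'day' keys.
    """
    if "date-parts" in date:
        return date["date-parts"]
    single = []
    for key in ("year", "month", "day"):
        # Stop at the first missing part so a day never follows a bare year.
        if key not in date:
            break
        single.append(date[key])
    return [single] if single else []


both = resolve_date({"date-parts": [[2015, 4, 2]], "year": 1999})
pieces_only = resolve_date({"year": 1999, "month": 5})
```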

With respect to markdown, I'm leery about making that universal, but I think we could add a flag to the data indicating that the data should be read as markdown.

@bwiernik
Member

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I'm a little concerned about assuming that, for example, _ or * always indicate text formatting.

@retorquere

And the markdown escapes of those and others of course. Even if you only use html for markup. I hadn't thought of that before and I'll have to think about what to do for BBT's csl-yaml export.

I think it'd need to be explicitly marked if you want markdown processing, or markdown would have to be the default for csl-yaml. I'd rather not deal with ambiguity.

Are there other csl-yaml processors? If not, then I could have BBT default to markdown.

@bwiernik
Member

As the main consumer of CSL YAML is currently and I suspect will remain pandoc, I think we could make the default markdown with an option to disable?

pandoc interprets the HTML markup, so I'd suggest BBT not worry about translating the HTML tags into markdown.

@retorquere

I'm not worried about those, but about whether to escape if I find * or #. They just mean different things to markdown and html.

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I read John to say above that he doesn't support the HTML.

@bwiernik
Member

If a user is generating CSL YAML with BBT, I'd expect that not escaping markdown characters (BBT's current behavior) is the better default.

@bwiernik
Member

bwiernik commented Jun 26, 2020

I read John to say above that he doesn't support the HTML.

I checked. The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

@denismaier
Member

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

Both yes!

@denismaier
Member

What doesn't work in vs code is style validation...

@bdarcus
Member Author

bdarcus commented Jun 26, 2020

Seems vscode doesn't support relaxng validation, and is dependent on this issue to add it.

@bwiernik
Member

Are RNC and XSD feature-compatible? There are numerous XSD validators for vscode. It might be possible to automatically generate an unofficial XSD schema from the RNC for use with editors.

@jgm

jgm commented Jun 26, 2020

The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

@retorquere

retorquere commented Jun 26, 2020

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

That and the fact that if you assume markdown, *word* becomes \emph{word}, and if you assume HTML, it becomes *word*. And if pandoc is the sole consumer of CSL-YAML, it seems better to escape.
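
A rough sketch of what escaping on export could look like, in Python. The set of characters treated as special here is an assumption chosen for illustration (backslash, backtick, `*`, `_`, `#`); it is not pandoc's complete list of Markdown-significant characters, and `escape_markdown` is a hypothetical name, not BBT's implementation:

```python
import re

# Illustrative subset only; a real exporter would need the full set of
# characters its target Markdown dialect treats as markup.
_MARKDOWN_SPECIALS = re.compile(r"([\\`*_#])")


def escape_markdown(text):
    """Backslash-escape characters that Markdown could read as markup."""
    return _MARKDOWN_SPECIALS.sub(r"\\\1", text)


escaped = escape_markdown("*word*")
```

With this, a literal `*word*` in a title is written out as `\*word\*`, so a Markdown reader keeps the asterisks instead of producing emphasis.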

@bdarcus
Member Author

bdarcus commented Jun 28, 2020 via email

@jgm

jgm commented Jun 28, 2020

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

Math is tricky. I see the point of passing it through directly to the output format. However, if you're working with MathJax you often need to escape < as &lt; (to give just one example, see http://docs.mathjax.org/en/latest/input/tex/html.html). Currently this would be completely garbled, since CSL doesn't recognize entities as such. So if you did the right thing and wrote x&lt;y, then CSL would garble it, but if you wrote x<y then it wouldn't work properly in your HTML output because <y would be interpreted by the browser as a tag.
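
The entity round trip described here can be seen with Python's standard `html` module. This only illustrates the escaping step itself; it says nothing about how any CSL processor actually treats entities:

```python
import html

tex = "x<y"

# Escape for embedding in HTML, as MathJax-in-HTML input typically requires.
escaped = html.escape(tex, quote=False)

# A consumer that understands entities can recover the original TeX.
restored = html.unescape(escaped)
```

A processor that does not recognize entities would see the five characters `&lt;y` rather than `<y`, which is the garbling problem raised above.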

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.

@bwiernik
Member

Thinking about Zotero–pandoc compatibility as the major concern I have, that usually happens via interface with BBT. If there is a defined set of HTML-like tags that Zotero supports, I think BBT or similar export translators could convert those tags to Markdown syntax fairly easily.

For math, I don’t see a word processor plugin directly supporting math input. But, if a user stored it in TeX, they could convert that to a Word OMML equation as a simple post-processing step, and it would work out of the box with pandoc.

@bdarcus
Member Author

bdarcus commented Jun 28, 2020

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

I have no firm position against these. I just wanted to keep this moving forward, and wasn't really focused on those cases because they're not the primary requirement for manuscript preparation.

But certainly we should consider them.

@bwiernik
Member

bwiernik commented Jun 28, 2020

The two fields where this might come into play are abstract and note (used, e.g., in annotated bibliographies). I could definitely see line breaks in both of those. Currently citeproc-js just disregards line breaks and renders without any white space at all.

Abstracts often have subheadings—those are usually set with bold, rather than heading markers.

Abstracts probably won’t have lists or tables, but note might. Formally supporting that might be out of scope? But could be nice. I don’t know if rich text supports these (same with links?), so that might be a thing left to individual applications/processors to decide.

@dstillman

possibly substitute numbered placeholders for the tags to avoid any unexpected HTML/RTF processing

This thing I said above didn't really make sense — it would work for a one-off bibliography but not if we were embedding CSL-JSON in a document for future processing.

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

I don't think passing anything through verbatim (as in, not processed according to the output format) actually works — depending on the output format, it could very well mean invalid/unescaped markup, and if the calling application doesn't know about it and deal with it appropriately, it's potentially a security flaw.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.

I don't think processing MathML should be the citation processor's job, for the reasons I give above: the output format abilities may not be the same as the target application abilities, it would require a duplicate bundled math processor, and it just seems generally unreasonable to ask of a citation processor.

But a version of this might work:

  1. Require MathML, and expect the citation processor to support an API for math handling. There's no reason citeproc-js needs a copy of MathJax; it just needs Zotero to provide a function that runs the MathML through its own copy of MathJax and returns the necessary output for the format and word processor being used. If a math processor isn't provided, the citation processor could use the annotation field if present or embed/throw an error if not.

  2. Support some generic mechanism for embedding typed content that needs to be handled by the calling application. This could potentially even be used for abstracts: instead of adding support for more rich-text formatting to CSL, the processor could pass type="html", data="<marquee>What a paper!</marquee>", output="RTF" to a function provided by the application that returned an appropriate string to insert into the output (which now could even be a placeholder for post-processing by the application). The same could be used for passing math, e.g., type="mathml", data="<math>…</math>", output="RTF", and it would be up to the calling application what to do if the output format and/or target application couldn't handle math. If an appropriate handler wasn't available, it could be handled as regular text (e.g., HTML-encoded) or an error could be thrown/embedded by the processor. This would keep the citation processor from needing to bundle huge processors that the calling application likely already has (a math processor, an HTML parser/sanitizer, etc.).
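
The second option could be sketched roughly as follows in Python. Every name here (`render_embedded`, the `handlers` mapping, `fake_math_handler`) is hypothetical; no existing processor is known to expose this interface, and the fallback shown is just the "regular text, HTML-encoded" case mentioned above:

```python
import html


def render_embedded(content_type, data, output, handlers):
    """Delegate typed content to an application-supplied handler.

    `handlers` maps a content type such as "mathml" or "html" to a
    callable taking (data, output) and returning a string suitable for
    the output format. All names are illustrative assumptions.
    """
    handler = handlers.get(content_type)
    if handler is None:
        # No handler available: treat the content as regular text and
        # HTML-encode it so no raw markup reaches the output.
        return html.escape(data)
    return handler(data, output)


def fake_math_handler(data, output):
    # Stand-in for an application's math processor; returns a placeholder
    # the application could replace in post-processing.
    return f"[math placeholder for {output}]"


handlers = {"mathml": fake_math_handler}
math_out = render_embedded("mathml", "<math>x</math>", "RTF", handlers)
html_out = render_embedded(
    "html", "<marquee>What a paper!</marquee>", "RTF", handlers
)
```

In this sketch the processor never interprets the embedded content itself; it either hands it to the application or neutralizes it, which is the point of keeping math and HTML processing out of the citation processor.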

@bwiernik
Member

For the case of Zotero's Word integration, would either of those solutions enable, for example, the title and abstract of this item to appear in Word as a math environment equation?

@dstillman

I actually kind of doubt it — I suspect you can't embed an equation element in the text of a Word field, which is what we would need to do. So while I don't know for sure, realistically output="RTF" probably means trying to convert MathML to UnicodeMath or AsciiMath. Still, that seems like more of a problem for a calling application that wants to deal with it — and which might need to do it in other contexts as well — than for a citation processor.

@bwiernik
Member

It seems like math, tables, lists might be beyond the scope of CSL; these might be things that we recommend applications support (e.g., math everywhere, tables and lists in abstract and note), but that is really up to the application to define?

@bwiernik
Member

@dstillman UnicodeMath would be a good compromise to be able to convert unlinked citations/bibliographies to equations with one click or a macro

bdarcus added a commit that referenced this issue Jun 30, 2020
As part of #278, and to harmonize the JSON and YAML representations
around a much more concise and expressive date format, this adds
an option to use EDTF; either as a preferred string on any date, or as
an "edtf" string property on the more verbose alternative object
representation.

While EDTF was originally an initiative of the US Library of Congress, 
ISO adopted it as part of 8601-2 in 2019.

Note: The current regular expression pattern only checks for valid 
characters.
@bdarcus
Member Author

bdarcus commented Jun 30, 2020 via email

@bwiernik
Member

Could we just leave that an open field and leave it up to processors to designate the markup they support?

@bdarcus
Member Author

bdarcus commented Jun 30, 2020 via email

@bdarcus
Member Author

bdarcus commented Jul 3, 2020

@larsgw suggested in this comment that we consider having two input schemas: one for humans (yaml + edtf), and the other for machines (json + structured data object).

I wasn't sure how easy or possible this was in json schema, but the below appears (though I am not 100% certain) to work.

Edit: no, it's not possible, it seems. In that case, we should probably just continue as planned.

@bdarcus
Member Author

bdarcus commented Jul 3, 2020

Also, @jgm, am I correct that your current date model supports ranges? If yes, how do you define an open-ended range?

@jgm

jgm commented Jul 3, 2020

It seems that this works with pandoc-citeproc to specify an open range:

  issued:
  - year: 2042
  - {}

But I wouldn't worry too much about my data model, since I'm planning to transition eventually to the new citeproc library I'm writing. It already passes more citeproc tests than pandoc-citeproc, and it's much faster and more maintainable. It uses the date-parts model that is part of current CSL.

@denismaier
Member

denismaier commented Jul 3, 2020 via email

bwiernik pushed a commit to bwiernik/schema that referenced this issue Jul 8, 2020
…language#284)

As part of citation-style-language#278, and to harmonize the JSON and YAML representations
around a much more concise and expressive date format, this adds
an option to use EDTF; either as a preferred string on any date, or as
an "edtf" string property on the more verbose alternative object
representation.

While EDTF was originally an initiative of the US Library of Congress, 
ISO adopted it as part of 8601-2 in 2019.

Note: The current regular expression pattern only checks for valid 
characters.
@bdarcus
Member Author

bdarcus commented Jul 9, 2020

It seems that this works with pandoc-citeproc to specify an open range:

  issued:
  - year: 2042
  - {}

Here's what I have in #301 @jgm:

issued:
- date-parts:
  - 2000
- {}

So it merges your model and the 1.0 JSON model to match the EDTF model (which is a date, and in levels 0 and 1, a date range, which is a two-item list of dates).

I believe date parts is better as an object (as you have), but I guess for compatibility we should keep the array. Anyone want to make the argument we should change this too? If yes, please state your case on #301. If not, we'll keep as is.

The human-readable preference, of course, would be the preferred EDTF string:

issued: 2000/..
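
To make the relationship between the EDTF string and the two-item list concrete, here is a deliberately tiny Python sketch. It assumes only plain dates (YYYY, YYYY-MM, YYYY-MM-DD) and "/" intervals where ".." marks an open end; the rest of EDTF (uncertainty markers, seasons, negative years, and so on) is not handled, and `parse_edtf` is a hypothetical helper, not part of the schema or any processor:

```python
def parse_edtf(value):
    """Split a simple EDTF string into a list of date-part lists.

    A single date yields a one-item list; an interval yields two items,
    with None standing in for an open ("..") end.
    """
    def parse_date(text):
        if text == "..":
            return None  # open end of the range
        return [int(part) for part in text.split("-")]

    return [parse_date(part) for part in value.split("/")]


open_range = parse_edtf("2000/..")
single = parse_edtf("2015-04-02")
```

The `None` in the open range plays the same role as the empty `{}` entry in the YAML list form above.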

bdarcus added a commit that referenced this issue Jul 12, 2020
This adds a definition for rich-text variables, and title-string
definitions that use those variables, and then redefines all title and
other fields to use these definitions.

Addresses in part #278
@bdarcus bdarcus mentioned this issue Jul 12, 2020
2 tasks
bdarcus added a commit that referenced this issue Jul 14, 2020
This adds an experimental csl-rich-text.yaml schema that defines a
structure for rich-text formatting in JSON.

Also adds a definition for rich-text variables, and title-string
definitions that uses that variable, and then redefines all title and
other fields to use these definitions. So it will be easy to merge the
rich text support in the future.

Addresses in part #278
bdarcus added a commit that referenced this issue Jul 15, 2020
This adds an experimental csl-rich-text.yaml schema that defines a
structure for rich-text formatting in JSON.

Also adds a definition for rich-text variables, and title-string
definitions that uses that variable, and then redefines all title and
other fields to use these definitions. So it will be easy to merge the
rich text support in the future.

Addresses in part #278
bdarcus added a commit that referenced this issue Jul 26, 2020
As part of #278, and to harmonize the JSON and YAML representations
around a much more concise and expressive date format, this adds
an option to use EDTF; either as a preferred string on any date, or as
an "edtf" string property on the more verbose alternative object
representation.

While EDTF was originally an initiative of the US Library of Congress, 
ISO adopted it as part of 8601-2 in 2019.

Note: The current regular expression pattern only checks for valid 
characters.
@bdarcus
Member Author

bdarcus commented Jun 3, 2022

Now close to two years later, I merged today #420, with examples of validating completion against the current v1.1 branch version of the schema (that allows EDTF for dates). It actually works pretty well for humans and machines, I'd say.

Much of this long thread contains very useful thoughts on a narrower aspect of this: the question of the markup, etc., within the fields. #315 was an experiment for that, though I have no idea if the idea is any good.
