"Add" CSL YAML #278

bdarcus · 2020-06-26T13:57:05Z

This and this suggest we can use our JSON schemas to validate a YAML alternative.

Pandoc already supports a YAML alternative (cc @jgm).

I suggest we do something with this, since it's zero work for us, and would give more options for users and developers.

Perhaps the most sensible option is just adding a sentence to the spec that mentions this possibility, without requiring implementations to support it?

Edit: actually, we say nothing about input in the spec currently. So we would need to add a section on input data, and say that our schema can validate either json or yaml.

Proposal

Based on this discussion, we should:

- review the existing json schema for any possible adjustments we might want to make now. Date parts are one obvious discrepancy; are there any others?
- add an optional field for the content markup format to parse; the html subset could be default, and we could enumerate other options; say markup: org.

Originally posted by @bdarcus in #277 (comment)

denismaier · 2020-06-26T14:23:50Z

Is pandoc's CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ in at least some respects?
E.g. regarding dates: CSL JSON has this construct:

"issued":{"date-parts":[[2015,4,2]]},

Whereas the YAML:

  issued:
    - year: 2015
      month: 4
      day: 2

(I think I've tried to use the JSON schema for autocompletion and validation once, but I wasn't so lucky. Using Atom or VS Code as plain text reference managers could be very nice for some projects...)
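The date discrepancy above can be sketched in code. This is a minimal illustration, assuming the YAML has already been parsed into Python lists and dictionaries; the function name `pandoc_date_to_date_parts` is hypothetical and not part of pandoc or any CSL tool.

```python
from typing import Any, Dict, List


def pandoc_date_to_date_parts(dates: List[Dict[str, Any]]) -> Dict[str, List[List[int]]]:
    """Convert a pandoc-style list of {year, month, day} mappings
    into the CSL JSON {"date-parts": [[...]]} construct."""
    date_parts = []
    for date in dates:
        parts = []
        # Stop at the first missing component: date-parts is positional,
        # so a day without a month cannot be represented.
        for key in ("year", "month", "day"):
            if key not in date:
                break
            parts.append(int(date[key]))
        date_parts.append(parts)
    return {"date-parts": date_parts}


issued_yaml_style = [{"year": 2015, "month": 4, "day": 2}]
print(pandoc_date_to_date_parts(issued_yaml_style))
# {'date-parts': [[2015, 4, 2]]}
```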

bdarcus · 2020-06-26T14:25:43Z

Is pandoc's CSL YAML actually identical to CSL JSON?

I hadn't checked, but I'm wondering whether devs like @jgm would find value in this.

The advantage is that we have one schema that is always in sync with the CSL spec.

bwiernik · 2020-06-26T14:39:44Z

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

bdarcus · 2020-06-26T14:43:55Z

People definitely seem to like YAML in my research community. Easier to manually edit if needed, like BibTeX or RIS.

Exactly.

YAML is easily hand-editable. JSON isn't.

denismaier · 2020-06-26T14:45:18Z

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

denismaier · 2020-06-26T14:49:14Z

@retorquere Do you have any input here?

bdarcus · 2020-06-26T14:49:42Z

YAML is easily hand-editable. JSON isn't.

And with auto-completion it would be even better!

liuderchi/ide-yaml#56

retorquere · 2020-06-26T14:59:57Z

Is pandoc's CSL YAML actually identical to CSL JSON? I'm not sure, but don't they differ in at least some respects?
E.g. regarding dates: CSL JSON has this construct:

To the best of my knowledge, this is the only difference. circa and season are supported here, circa at the same level as date-parts, not per-date. But I do not know of a formal spec.

jgm · 2020-06-26T16:51:53Z

Is pandoc's CSL YAML actually identical to CSL JSON?

No, not exactly. In addition to the difference noted (and maybe others which I've forgotten), the YAML bibliographies read by pandoc can have arbitrary pandoc markdown formatting. (And NOT the CSL HTML-ish formatting.) So it's not just a YAML translation of CSL JSON.

As I develop my new citeproc library, I may change things a bit to line things up more, while preserving backwards compatibility. For example, I think a date-parts field should be allowed, but I'd try to keep the current more elegant-looking syntax as an option too.

bdarcus · 2020-06-26T17:24:56Z

When we originally designed the json, the focus was on machines. But given the evolution since, now might be the time to rethink some of those decisions, so we end up with a solid representation well-suited to humans?

retorquere · 2020-06-26T17:36:03Z

@jgm but the html markup that csl-json supports is also valid markdown, so for export, there'd be no problem.

Edit: wait, does pandoc only support markdown tags, and not the html tags?

What other tools besides pandoc read csl-yaml?

bwiernik · 2020-06-26T17:38:33Z

I think accepting dates in either format would be fine in either schema. We just need to specify what the order of priority for redundant parts is (cf. the issue raised that we should specify, for names, that specific name parts are given priority over a literal field; we could do the same with dates, where perhaps date-parts gets priority over the individual pieces).
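One possible reading of that priority rule, as a sketch only: nothing here is settled by the spec, the function name `resolve_date` is hypothetical, and the input is assumed to be an already-parsed date object that may carry both forms.

```python
from typing import Any, Dict, List


def resolve_date(date: Dict[str, Any]) -> List[List[int]]:
    """Return date-parts for a date object that may redundantly contain
    both 'date-parts' and individual year/month/day keys."""
    # Assumed rule: an explicit date-parts value wins over individual pieces.
    if date.get("date-parts"):
        return date["date-parts"]
    parts = []
    for key in ("year", "month", "day"):
        if key not in date:
            break
        parts.append(int(date[key]))
    return [parts] if parts else []


# Redundant and conflicting input: date-parts takes priority.
print(resolve_date({"date-parts": [[2015, 4, 2]], "year": 1999}))
# [[2015, 4, 2]]
```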

With respect to markdown, I'm leery about making that universal, but I think we could add a flag to the data indicating that the data should be read as markdown.

bwiernik · 2020-06-26T17:46:42Z

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I'm a little concerned about assuming that, for example, _ or * always indicate text formatting.

retorquere · 2020-06-26T17:53:59Z

And the markdown escapes of those and others of course. Even if you only use html for markup. I hadn't thought of that before and I'll have to think about what to do for BBT's csl-yaml export.

I think it'd need to be explicitly marked if you want markdown processing, or markdown would have to be the default for csl-yaml. I'd rather not deal with ambiguity.

Are there other csl-yaml processors? If not, then I could have BBT default to markdown.

bwiernik · 2020-06-26T18:01:07Z

As the main consumer of CSL YAML is currently, and I suspect will remain, pandoc, I think we could make the default markdown with an option to disable?

pandoc interprets the HTML markup, so I'd suggest BBT not worry about translating the HTML tags into markdown.

retorquere · 2020-06-26T18:06:50Z

I'm not worried about those, but about whether to escape if I find * or #. They just mean different things to markdown and html.

bdarcus · 2020-06-26T18:18:16Z

@retorquere I think he is saying that pandoc CSL YAML supports markdown syntax in addition to HTML syntax.

I read John to say above that he doesn't support the HTML.

bwiernik · 2020-06-26T18:19:29Z

If a user is generating CSL YAML with BBT, I'd expect that not escaping markdown characters (BBT's current behavior) is the better default.

bwiernik · 2020-06-26T18:22:26Z

I read John to say above that he doesn't support the HTML.

I checked. The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

bdarcus · 2020-06-26T18:35:06Z

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

denismaier · 2020-06-26T18:38:07Z

And with auto-completion it would be even better!

Did you get YAML auto-completion working with vscode?

Might be cool if we could have CSL extensions for both vscode and, per the thing I started, atom, so we could give people easy-to-install auto-completing editors for CSL styles and data.

Both yes!

denismaier · 2020-06-26T18:49:43Z

What doesn't work in vs code is style validation...

bdarcus · 2020-06-26T19:00:00Z

Seems vscode doesn't support relaxng validation, and is dependent on this issue to add it.

bwiernik · 2020-06-26T19:00:03Z

Are RNC and XSD feature-compatible? There are numerous XSD validators for vscode. It might be possible to automatically generate an unofficial XSD schema from the RNC for use with editors.

jgm · 2020-06-26T19:00:39Z

The HTML markup is supported in CSL YAML by pandoc (which makes sense because it is valid HTML).

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

retorquere · 2020-06-26T19:03:08Z

Well, pandoc will pass through raw HTML as "RawInline" elements. And these will be emitted in HTML output. But if you target, say, LaTeX, they'll just be omitted. So it's not really supported.

That and the fact that if you assume markdown, *word* becomes \emph{word}, and if you assume HTML, it becomes *word*. And if pandoc is the sole consumer of CSL-YAML, it seems better to escape.
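To make the escaping option concrete, here is a minimal sketch of what an exporter could do if it assumes the consumer parses field values as markdown. The set of escaped characters is an assumption for illustration, not pandoc's exact rules, and `escape_markdown` is a hypothetical helper, not BBT's implementation.

```python
import re

# Characters that many Markdown flavours treat as markup.
# This set is an illustrative assumption, not an exhaustive or official list.
MARKDOWN_SPECIALS = re.compile(r"([\\`*_{}\[\]#])")


def escape_markdown(text: str) -> str:
    """Backslash-escape characters that a markdown reader would treat as markup."""
    return MARKDOWN_SPECIALS.sub(r"\\\1", text)


print(escape_markdown("*word*"))
# \*word\*
```

With escaping applied, a markdown reader sees literal asterisks instead of emphasis, which matches what an HTML-only reader would have shown for the unescaped string.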

bdarcus · 2020-06-28T13:59:05Z

I've added a new linked issue to the documentation repo, specific to the sub-field formatting discussion that has mostly been the focus of this thread. Hoping to do a PR later today so we can get more concrete. We've discussed all the issues, and now know enough to be specific, I think.

For this issue, whether to add a YAML representation that validates against the JSON schema, I think we should keep this open, and I think we should see if we can get it to work to everyone's satisfaction. I expect if we do this, it will result in one or more PRs on the JSON schema (say to @jgm's point on date representation), and possibly one on the documentation repo (simply to mention the YAML format, and that one can validate it against the JSON schema).

I've updated the top post to reflect what I think are next steps.

jgm · 2020-06-28T18:11:49Z

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

Math is tricky. I see the point of passing it through directly to the output format. However, if you're working with MathJax you often need to escape < as &lt; (to give just one example, see http://docs.mathjax.org/en/latest/input/tex/html.html). Currently this would be completely garbled, since CSL doesn't recognize entities as such. So if you did the right thing and wrote x&lt;y, then CSL would garble it, but if you wrote x<y then it wouldn't work properly in your HTML output because <y would be interpreted by the browser as a tag.

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.

bwiernik · 2020-06-28T18:52:58Z

Thinking about Zotero–pandoc compatibility as the major concern I have, that usually happens via interface with BBT. If there is a defined set of HTML-like tags that Zotero supports, I think BBT or similar export translators could convert those tags to Markdown syntax fairly easily.

For math, I don’t see a word processor plugin directly supporting math input. But, if a user stored it in TeX, they could convert that to a Word OMML equation as a simple post-processing step, and it would work out of the box with pandoc.

bdarcus · 2020-06-28T19:49:09Z

@bdarcus -- So I guess this means limiting abstracts to one paragraph without any block-level formatting (no tables, lists, figures, etc.). This seems reasonable but I'm not up on abstract customs in different fields. You're also excluding hyperlinks, which would not be normal in a title but might appear in an abstract.

I have no firm position against these. I just wanted to keep this moving forward, and wasn't really focused on those cases because they're not the primary requirement for manuscript preparation.

But certainly we should consider them.

bwiernik · 2020-06-28T20:54:45Z

The two fields where this might come into play are abstract and note (used, for example, in annotated bibliographies). I could definitely see line breaks in both of those. Currently citeproc-js just disregards line breaks and renders without any white space at all.

Abstracts often have subheadings; those are usually set with bold, rather than heading markers.

Abstracts probably won’t have lists or tables, but note might. Formally supporting that might be out of scope? But it could be nice. I don’t know if rich text supports these (same with links?), so that might be a thing left to individual applications/processors to decide.

dstillman · 2020-06-28T23:12:53Z

possibly substitute numbered placeholders for the tags to avoid any unexpected HTML/RTF processing

This thing I said above didn't really make sense — it would work for a one-off bibliography but not if we were embedding CSL-JSON in a document for future processing.

One approach would be to have a <math format="...">..</math> tag, where everything inside is passed through verbatim. You could then specify the format as 'TeX' or 'HTML-escaped TeX' or 'MathML' or whatever, and the output processor would have to check this and deal with it appropriately.

I don't think passing anything through verbatim (as in, not processed according to the output format) actually works — depending on the output format, it could very well mean invalid/unescaped markup, and if the calling application doesn't know about it and deal with it appropriately, it's potentially a security flaw.

Another approach would be to just insist on MathML. Note that MathML can include an annotation element into which one could put a plain-text fallback for RTF or whatever.

I don't think processing MathML should be the citation processor's job, for the reasons I give above: the output format abilities may not be the same as the target application abilities, it would require a duplicate bundled math processor, and it just seems generally unreasonable to ask of a citation processor.

But a version of this might work:

- Require MathML, and expect the citation processor to support an API for math handling. There's no reason citeproc-js needs a copy of MathJax: it just needs Zotero to provide a function that runs the MathML through its own copy of MathJax and returns the necessary output for the format and word processor being used. If a math processor isn't provided, the citation processor could use the annotation field if present or embed/throw an error if not.
- Support some generic mechanism for embedding typed content that needs to be handled by the calling application. This could potentially even be used for abstracts: instead of adding support for more rich-text formatting to CSL, the processor could pass type="html", data="<marquee>What a paper!</marquee>", output="RTF" to a function provided by the application that returned an appropriate string to insert into the output (which now could even be a placeholder for post-processing by the application). The same could be used for passing math, e.g., type="mathml", data="<math>…</math>", output="RTF", and it would be up to the calling application what to do if the output format and/or target application couldn't handle math. If an appropriate handler wasn't available, it could be handled as regular text (e.g., HTML-encoded) or an error could be thrown/embedded by the processor. This would keep the citation processor from needing to bundle huge processors that the calling application likely already has (a math processor, an HTML parser/sanitizer, etc.).
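The generic mechanism in the second option could look roughly like the following sketch. All names here (`ContentHandlers`, `register`, `render`) are hypothetical and do not correspond to any existing citeproc-js or Zotero API; the fallback shown is the "handle as regular, HTML-encoded text" choice, and throwing an error would be the other.

```python
import html
from typing import Callable, Dict

# A handler receives (data, output_format) and returns the string to insert.
Handler = Callable[[str, str], str]


class ContentHandlers:
    """Registry of application-provided handlers, keyed by content type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, content_type: str, handler: Handler) -> None:
        self._handlers[content_type] = handler

    def render(self, content_type: str, data: str, output: str) -> str:
        handler = self._handlers.get(content_type)
        if handler is None:
            # No handler available: treat the payload as regular text.
            return html.escape(data)
        return handler(data, output)


handlers = ContentHandlers()
# The calling application decides what math looks like for a given output format;
# here it just returns a placeholder for later post-processing.
handlers.register("mathml", lambda data, output: f"[math for {output}]")

print(handlers.render("mathml", "<math>…</math>", "RTF"))
# [math for RTF]
print(handlers.render("html", "<marquee>What a paper!</marquee>", "RTF"))
# &lt;marquee&gt;What a paper!&lt;/marquee&gt;
```

The point of the design is that the processor only dispatches; parsing, sanitizing, and converting stay with the application that already bundles those tools.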

bwiernik · 2020-06-28T23:20:11Z

For the case of Zotero's Word integration, would either of those solutions enable, for example, the title and abstract of this item to appear in Word as a math environment equation?

dstillman · 2020-06-28T23:40:29Z

I actually kind of doubt it — I suspect you can't embed an equation element in the text of a Word field, which is what we would need to do. So while I don't know for sure, realistically output="RTF" probably means trying to convert MathML to UnicodeMath or AsciiMath. Still, that seems like more of a problem for a calling application that wants to deal with it — and which might need to do it in other contexts as well — than for a citation processor.

bwiernik · 2020-06-29T00:22:51Z

It seems like math, tables, lists might be beyond the scope of CSL; these might be things that we recommend applications support (e.g., math everywhere, tables and lists in abstract and note), but that is really up to the application to define?

bwiernik · 2020-06-29T00:24:00Z

@dstillman UnicodeMath would be a good compromise to be able to convert unlinked citations/bibliographies to equations with one click or a macro.

As part of #278, and to harmonize the JSON and YAML representations around a much more concise and expressive date format, this adds an option to use EDTF, either as a preferred string on any date, or as an "edtf" string property on the more verbose alternative object representation. While EDTF was originally an initiative of the US Library of Congress, ISO adopted it as part of 8601-2 in 2019. Note: The current regular expression pattern only checks for valid characters.

bdarcus · 2020-06-30T21:53:46Z

Date issue now hopefully solved with the EDTF addition.

Do we want to add the optional property for the markup that @bwiernik suggested? If yes, suggested values?

- html subset (default? or maybe this is default on json, and markdown on yaml?)
- markdown (is this the value for the pandoc syntax too?)
- org

The above make sense because they have citation support in their ecosystems (and org will be getting native citation support soon). Not sure if any others would apply? LaTeX, but that seems a PITA to support, and superfluous given bibtex/biblatex?

bwiernik · 2020-06-30T21:58:10Z

Could we just leave that an open field and leave it up to processors to designate the markup they support?

bdarcus · 2020-06-30T22:12:40Z

Not sure, but I suppose.

bdarcus · 2020-07-03T18:44:16Z

@larsgw suggested in this comment that we consider having two input schemas: one for humans (yaml + edtf), and the other for machines (json + structured data object).

I wasn't sure how easy or possible this was in json schema, ~~but the below appears (though I am not 100% certain) to work.~~

Edit: no, it's not possible, it seems. In that case, we should probably just continue as planned.

bdarcus · 2020-07-03T18:53:21Z

Also, @jgm, am I correct that your current date model supports ranges? If yes, how do you define an open-ended range?

jgm · 2020-07-03T19:01:07Z

It seems that this works with pandoc-citeproc to specify an open range:

  issued:
  - year: 2042
  - {}

But I wouldn't worry too much about my data model, since I'm planning to transition eventually to the new citeproc library I'm writing. It already passes more citeproc tests than pandoc-citeproc, and it's much faster and more maintainable. It uses the date-parts model that is part of current CSL.

denismaier · 2020-07-03T19:38:34Z

Wow, that was fast. Do you already have more concrete plans when we can expect the new library?

bdarcus · 2020-07-09T14:25:25Z

It seems that this works with pandoc-citeproc to specify an open range:
  issued:
  - year: 2042
  - {}

Here's what I have in #301 @jgm:

issued:
- date-parts:
  - 2000
- {}

So it merges your model and the 1.0 JSON model to match the EDTF model (which is a date, and in levels 0 and 1, a date range, which is a two-item list of dates).

I believe date parts is better as an object (as you have), but I guess for compatibility we should keep the array. Anyone want to make the argument we should change this too? If yes, please state your case on #301. If not, we'll keep as is.

The human-readable preference, of course, would be the preferred EDTF string:

issued: 2000/..
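As a rough sketch of how the EDTF string and the list form above relate, assuming only plain YYYY, YYYY-MM, or YYYY-MM-DD dates and "/" intervals (no qualifiers such as ? or ~, no negative years, no seasons). The function name `edtf_to_dates` is hypothetical, and treating an empty side the same as ".." is a simplification.

```python
from typing import Any, Dict, List


def edtf_to_dates(edtf: str) -> List[Dict[str, Any]]:
    """Split a simple EDTF date or interval into a list of date objects
    shaped like the date-parts list form shown above."""

    def one(date: str) -> Dict[str, Any]:
        # ".." marks an open end; an empty side is treated the same way here.
        if date in ("", ".."):
            return {}
        return {"date-parts": [int(part) for part in date.split("-")]}

    return [one(side) for side in edtf.split("/")]


print(edtf_to_dates("2000/.."))
# [{'date-parts': [2000]}, {}]
```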

This adds a definition for rich-text variables, and title-string definitions that use those variables, and then redefines all title and other fields to use these definitions. Addresses in part #278

This adds an experimental csl-rich-text.yaml schema that defines a structure for rich-text formatting in JSON. Also adds a definition for rich-text variables, and title-string definitions that use that variable, and then redefines all title and other fields to use these definitions. So it will be easy to merge the rich text support in the future. Addresses in part #278

bdarcus · 2022-06-03T17:20:33Z

Now close to two years later, I merged today #420, with examples of validating completion against the current v1.1 branch version of the schema (that allows EDTF for dates). It actually works pretty well for humans and machines, I'd say.

Much of this long thread contains very useful thoughts on a narrower aspect of this: the question of the markup, etc., within the fields. #315 was an experiment for that, though I have no idea if the idea is any good.

bdarcus added enhancement input labels Jun 26, 2020

bdarcus changed the title ~~CSL YAML~~ "Add" CSL YAML Jun 26, 2020

bdarcus mentioned this issue Jun 26, 2020

Add JSON schemas to JSON store #279

Open

bdarcus mentioned this issue Jun 28, 2020

(input): Add EDTF as an option for date representation #284

Merged

bdarcus mentioned this issue Jul 9, 2020

input: Align date-parts and edtf string models #301

Merged

bdarcus mentioned this issue Jul 12, 2020

input: rich text #315

Merged
