input: rich text #315

bdarcus · 2020-07-12T15:06:37Z

Description

This adds support for rich-text handling to the JSON schema by defining new formatted string objects that can be mixed with strings in an array.

Note that I have been conservative about where I allow rich-text formatting. I don't think we should allow it everywhere; I think we should explicitly allow it only where it makes sense.

Type of change

This is a breaking change
This change requires a documentation update

cc @jgm

bwiernik

Comments on your questions:

where to put the markup property? Perhaps we need a top level object with such metadata? Also, should it be an uncontrolled string, or some kind of enum list?

A top-level object is probably a good thing in any event to specify the schema version (e.g., with date-parts being removed). Conceivably a processor might want markup in the native language for the implementation, so I'd say free text, perhaps with the specification that strings be in lowercase?

the current document language I have suggests only the HTML subset is allowed on JSON. Is that right?

maybe more for the spec, but what to say about stuff beyond the things a processor has to support? Strip (ignore)? But would that open a can of worms?

strikethrough and verbatim/code should be added to the enumerated list.

Some explicit language about other HTML syntax one way or the other should be given. Given the previous discussion, it might be reasonable to indicate that other markup syntax may be supported at the discretion of the application. Something like Dan's final suggestion from the other thread might be good. Indicate that other markup should be processed by the calling application and returned to the processor in a format that is appropriate for the target output format (e.g., HTML, RTF, etc.).

Dan also noted a potential security risk of unescaped markup.

math:
a. seems sensible to say MathML for JSON/HTML (but what about @bwiernik's suggestion of unicodemath?)

I think we should go with Dan's suggestions here. Keeping with the JSON/HTML calling for well-formed HTML, I think that calling for MathML in the JSON schema is good. Applications can choose to support LaTeX, AsciiMath, UnicodeMath, etc. as they see fit.

b. what about YAML?

pandoc currently calls for LaTeX equations in pandoc Markdown's syntax ( $...$ ). That is probably the most widely-used and well-supported somewhat-human-readable syntax. I think we can leave this as an open option for processors. I could imagine processors that target CSL YAML choosing to support LaTeX, AsciiMath, UnicodeMath, depending on their audience.

schemas/input/csl-data.json

bdarcus · 2020-07-12T17:14:00Z

strikethrough and verbatim/code should be added to the enumerated list.

I forgot about this.

Some explicit language about other HTML syntax one way or the other should be given. Given the previous discussion, it might be reasonable to indicate that other markup syntax may be supported at the discretion of the application. Something like Dan's final suggestion from the other thread might be good. Indicate that other markup should be processed by the calling application and returned to the processor in a format that is appropriate for the target output format (e.g., HTML, RTF, etc.).

This will be for the spec though; correct?

As in, we don't need to worry about it here.

bwiernik · 2020-07-12T17:18:41Z

I think we need some comment here about the syntax structure for other markup (e.g., if we go with <span class="markup-html"> to indicate that the contained text has markup, à la Dan's second suggestion).

bdarcus · 2020-07-12T17:22:50Z

I think we need some comment here about the syntax structure for other markup (e.g., if we go with <span class="markup-html"> to indicate that the contained text has markup, à la Dan's second suggestion).

Yeah, I don't really understand his comment, or how it would impact this PR.

jgm · 2020-07-13T05:43:09Z

Instead of using some HTML subset for this, another alternative would be to represent the rich text structure using JSON structures (as we do with pandoc's json output). There are many models you could adopt here, but here is the sort of thing I have in mind:

{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "formatting": "italic",
      "text": [
        {
          "formatting": "nocase",
          "text": "Paradise Lost"
        }
      ]
    }
  ]
}

or more compactly

{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "italic": [
        {
          "nocase": "Paradise Lost"
        }
      ]
    }
  ]
}

This would avoid the issues about supporting some but not all HTML in the JSON representation, and avoid the need for parsing HTML-ish tags.
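The two shapes above carry the same information, so moving between them is mechanical. A minimal Python sketch, assuming the verbose form always uses the `formatting` and `text` keys shown above; the function name `to_compact` is hypothetical and not part of any CSL tooling:

```python
def to_compact(node):
    """Convert the verbose {"formatting": ..., "text": ...} shape
    into the compact {name: content} shape."""
    if isinstance(node, str):
        return node  # plain strings pass through unchanged
    if isinstance(node, list):
        return [to_compact(child) for child in node]
    if isinstance(node, dict) and "formatting" in node:
        return {node["formatting"]: to_compact(node["text"])}
    raise TypeError(f"unexpected rich-text node: {node!r}")


verbose_subtitle = [
    "The plot of ",
    {
        "formatting": "italic",
        "text": [{"formatting": "nocase", "text": "Paradise Lost"}],
    },
]
compact_subtitle = to_compact(verbose_subtitle)
```

Either shape can be walked with the same three cases (string, array, formatting object), which is the main thing a processor gains over parsing HTML-ish tags.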

bdarcus · 2020-07-13T09:12:01Z

@jgm - with this approach, how would you deal with math, and in YAML?

bdarcus · 2020-07-13T13:00:28Z

@jgm - with this approach, how would you deal with math ...

I guess the model you're suggesting would be something like this (in JSON Schema, but YAML syntax)?

Edit: updated based on the reply.

---
definitions:
  rich-text-content:
    oneOf:
      - "$ref": "#/definitions/rich-text"
      - type: string
  rich-text:
    type: array
    items:
      anyOf:
        # A formatting object carrying exactly one of the properties below.
        - type: object
          minProperties: 1
          maxProperties: 1
          additionalProperties: false
          properties:
            italic:
              title: Italicized Text
              "$ref": "#/definitions/rich-text-content"
            bold:
              title: Bold Text
              "$ref": "#/definitions/rich-text-content"
            sc:
              title: Small-Cap Text
              "$ref": "#/definitions/rich-text-content"
            preserve:
              title: Preserve Case Text
              "$ref": "#/definitions/rich-text-content"
            code:
              title: Code/Verbatim Text
              "$ref": "#/definitions/rich-text-content"
            quote:
              title: Quote
              "$ref": "#/definitions/rich-text-content"
            strike:
              title: Strike-Through Text
              "$ref": "#/definitions/rich-text-content"
            math:
              title: Math
              type: object
              properties:
                content:
                  type: string
                format:
                  enum:
                    - tex
                    - mathml

jgm · 2020-07-13T16:08:24Z

I wasn't thinking of YAML, just JSON. It seems to me that these fill different needs. JSON is the interchange format. YAML is for humans to read and write, and some kind of markup (e.g. markdown) would be convenient to have there. I don't really see why you're trying to add a YAML format at all, actually. Seems better to let there be different YAML variants appropriate for different purposes, but all interconvertible with the canonical JSON. I'd plan to continue allowing pandoc users to use pandoc's markdown in YAML, for example, even if citeproc specified something else.

As for math, you could either have a math element that takes mathml:

{ "math": "<math display=\"block\" xmlns=\"http://www.w3.org/1998/Math/MathML\"><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>" }

or you could have variants for mathml and tex:

{ "tex-math": "x=y^2" }
{ "mathml-math": "x=y^2" }

EDIT: or maybe

{ "math": "x=y^2",
  "format": "tex" }

bdarcus · 2020-07-13T16:59:10Z

The reason is that the tech landscape around the two languages and their tools has converged such that they are virtually interchangeable. We can validate the JSON with either a JSON or YAML version of the schema, and the same for YAML data. So why not kill two birds with one stone IF it's practical? Granted, there are these two places where the human vs machine priorities do diverge: this case, and dates. But it seems easy enough to handle. I'd expect in YAML people will prioritize EDTF for dates, and markdown for subfield markup. BTW, I haven't checked, but I believe the recent changes in dates mean the CSL JSON schema will validate the current pandoc YAML.

bwiernik · 2020-07-13T17:25:21Z

@bdarcus Do you need to add something to permit non-attribute string elements to an array?

I actually rather like the JSON structure idea. That leaves it clearly up to the applications to decide how to represent these features internally, and it would also make it fairly easy to flexibly handle features the serving or target application doesn't support (e.g., Zotero could just display an AsciiMath translation of a math element in a Word document field, which doesn't support full equations; then, when field codes are removed, Zotero can convert the math to a proper Word Math environment using the MathML).

I think the 7 simple text markup options can be as you specify, @bdarcus. For other markup types, I suggest we have three attributes:

- the contents, labeled by the type of markup (e.g., math)
- a format attribute, describing the format of the contained element (e.g., html, mathml, tex, unicodemath, asciimath)
- an optional text or display attribute, which gives the contents to display in unsupported environments; this would include only regular text and the 7 simple text markup options.
  - The purpose here is to make it clear to the processor what to do when a feature isn't supported.
  - If there is no display attribute, the content is omitted.

Using my Zotero–Word example again, the item might have these data:

{ "math": "<math display=\"block\" xmlns=\"http://www.w3.org/1998/Math/MathML\"><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>",
  "format": "mathml",
  "display": "x = y^2" }

When Zotero's field codes are active, the bibliography shows the display option. When it converts the field codes to regular text, it replaces that with the MathML content to give a full equation display.
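The fallback rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `render_math` and the `supported_formats` parameter are hypothetical, and the object is assumed to carry the `math`, `format`, and `display` keys from the example above.

```python
MATHML = (
    '<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>"
)


def render_math(obj, supported_formats):
    """Return the text to emit for a math object.

    If the target environment supports obj["format"], emit the full content;
    otherwise fall back to "display", and omit the content when there is none.
    """
    if obj.get("format") in supported_formats:
        return obj["math"]
    return obj.get("display", "")


item = {"math": MATHML, "format": "mathml", "display": "x = y^2"}
in_field_code = render_math(item, supported_formats=set())
as_plain_text = render_math(item, supported_formats={"mathml"})
```

Here `in_field_code` stands for the state where field codes are active and only the display text is usable, and `as_plain_text` for the state after conversion, where the MathML can be handed to a real equation environment.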

bdarcus · 2020-07-13T17:39:32Z

@bdarcus Do you need to add something to permit non-attribute string elements to an array?

I'm not following here @bwiernik; can you restate?

Attributes aren't a feature of JSON, for example; just strings, arrays, objects.

jgm · 2020-07-13T17:39:38Z

Note that a tex fallback can be included directly in mathml:

<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">x=y^2</annotation></semantics></math>
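For a consumer that cannot render MathML, pulling that annotation back out takes only the standard library. A minimal Python sketch; the helper name `tex_fallback` is hypothetical, and it assumes the annotation uses the `application/x-tex` encoding shown above:

```python
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"


def tex_fallback(mathml):
    """Return the TeX annotation embedded in a MathML string, or None."""
    root = ET.fromstring(mathml)
    # ElementTree expands the default namespace into each tag name.
    for annotation in root.iter(MATHML_NS + "annotation"):
        if annotation.get("encoding") == "application/x-tex":
            return annotation.text
    return None


sample = (
    '<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML">'
    "<semantics><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow>"
    '<annotation encoding="application/x-tex">x=y^2</annotation>'
    "</semantics></math>"
)
tex = tex_fallback(sample)
```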

bwiernik · 2020-07-13T17:45:35Z

Sorry, by "attributes" I meant "properties"; I get the jargon of these various formats mixed up sometimes. Under rich-text, you specify that items in the array can be anyOf and then enumerate 8 properties. I take it that bare (non-property) strings can always be included in an array, or do you need to specifically allow them?

jgm · 2020-07-13T17:46:51Z

So why not kill two birds with one stone IF it's practical?

Even if some kind of markdown is supported in certain YAML fields, it may not be as expressive as pandoc's markdown. So I'd have to convert to CSL JSON (or an equivalent structure) anyway. Other people might have applications where they want to write bibliographic data in reST or HTML or whatever. If you specify the CSL JSON format people will know what they need to convert to. I just don't see what's gained by specifying formatting in the YAML format, or specifying the YAML at all. Why not just say: if you have an application that represents bibliographic data in YAML, just make sure it can be converted into CSL JSON for processing by a CSL processor?

bwiernik · 2020-07-13T17:49:34Z

@jgm That works for mathml, but what about, for example, <br> or lists in HTML? Or a tex equation? I'm thinking about this as a general solution to providing fallbacks to more complex markup.

Even if some kind of markdown is supported in certain YAML fields, it may not be as expressive as pandoc's markdown.

I think we've reached the consensus that markup in YAML is open to the application to decide, but that the format should be specified.

bdarcus · 2020-07-13T17:51:09Z

Why not just say: if you have an application that represents bibliographic data in YAML, just make sure it can be converted into CSL JSON for processing by a CSL processor?

Do you mean in general, or just in this case of sub-field formatting?

I'm definitely leaning towards being silent on the YAML in this case.

bdarcus · 2020-07-13T17:56:10Z

Sorry, by "attributes" I meant "properties"; I get the jargon of these various formats mixed up sometimes. Under rich-text, you specify that items in the array can be anyOf and then enumerate 8 properties. I take it that bare (non-property) strings can always be included in an array, or do you need to specifically allow them?

I just quickly put something together for discussion, and did not test it, so it's possible some detail or another is wrong technically.

But the idea behind it was it's a nested array of strings and formatting objects; per what @jgm was demonstrating, but that a simple string is also an option, per the definition at the top.
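To make the "nested array of strings and formatting objects" concrete, here is a minimal Python sketch that walks such a value and emits HTML. The tag choices in `TAGS` and the function name `to_html` are assumptions for illustration only; nothing in this thread prescribes an output mapping, and math is left out.

```python
from html import escape

# Assumed HTML mapping for the seven simple formatting keys (illustrative only).
TAGS = {
    "italic": ("<i>", "</i>"),
    "bold": ("<b>", "</b>"),
    "code": ("<code>", "</code>"),
    "strike": ("<s>", "</s>"),
    "quote": ("<q>", "</q>"),
    "sc": ('<span style="font-variant: small-caps">', "</span>"),
    "preserve": ('<span class="nocase">', "</span>"),
}


def to_html(content):
    """Render a string, an array, or a formatting object as HTML."""
    if isinstance(content, str):
        return escape(content)  # plain text is always escaped
    if isinstance(content, list):
        return "".join(to_html(part) for part in content)
    # Formatting object: wrap each key's content in its tag pair.
    # An unknown key raises KeyError rather than being passed through.
    parts = []
    for key, inner in content.items():
        open_tag, close_tag = TAGS[key]
        parts.append(open_tag + to_html(inner) + close_tag)
    return "".join(parts)


subtitle = ["The plot of ", {"italic": [{"preserve": "Paradise Lost"}]}]
html_subtitle = to_html(subtitle)
```

Because every string leaf is escaped and only known keys produce tags, this shape also sidesteps the unescaped-markup risk raised earlier in the review.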

bdarcus · 2020-07-14T11:52:39Z

I just added the rich-text only schema to this branch, in yaml, so easier to grok and discuss. It includes annotations, including examples.

I decided to just add top-level math-ml and math-tex properties, for consistency/simplicity.
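For a consumer, these top-level keys and the math shapes floated earlier in the thread all reduce to the same pair of format and content. A minimal Python sketch of that normalization; the function name `normalize_math` is hypothetical, and treating a bare `math` key without `format` as MathML is an assumption based on the first example given earlier:

```python
def normalize_math(obj):
    """Return (format, content) for any of the math shapes discussed in this thread."""
    if "math-tex" in obj:
        return ("tex", obj["math-tex"])
    if "math-ml" in obj:
        return ("mathml", obj["math-ml"])
    if "tex-math" in obj:
        return ("tex", obj["tex-math"])
    if "mathml-math" in obj:
        return ("mathml", obj["mathml-math"])
    if "math" in obj:
        # Assumption: a bare "math" key with no "format" holds MathML.
        return (obj.get("format", "mathml"), obj["math"])
    raise ValueError("not a math object")


top_level = normalize_math({"math-tex": "x=y^2"})
with_format = normalize_math({"math": "x=y^2", "format": "tex"})
```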

If this is the way we want to go, I can update the main schema to reflect.

This would be a significant change, though, so seems we'd want some input and testing from different developers.

It could also be that for 1.1, we simply include a note about this being experimental, but not hook it up to the main schema, and add when we actually have apps using it?

Thoughts on how to wrap this up?

bdarcus · 2020-07-14T17:19:19Z

It could also be that for 1.1, we simply include a note about this being experimental, but not hook it up to the main schema, and add when we actually have apps using it?

I've rebased and squashed this branch around this path.

So the idea is we explicitly flag this as "experimental" for 1.1, and the new schema is for developers to test this and provide feedback.

We can then add to the main schema when it's ready; as in when at least one project actually implements it.

I'd like to merge this. Any objections?

We could also merge, ask developers to look closely at this while wrapping up 1.1, and see about including it in the release.

This adds an experimental csl-rich-text.yaml schema that defines a structure for rich-text formatting in JSON. Also adds a definition for rich-text variables, and title-string definitions that use that variable, and then redefines all title and other fields to use these definitions. So it will be easy to merge the rich text support in the future. Addresses in part #278

bdarcus · 2020-07-15T14:23:19Z

OK, merged.

Now we need input from developers on whether to incorporate into the main input schema.

If and when we do integrate it, we also need feedback on how we should document this.

bdarcus · 2022-06-20T14:38:38Z

Just to add, here's the YAML schema version converted to JSON, with included examples:

{
  "description": "JSON schema for CSL input rich text representation",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://resource.citationstyles.org/schema/latest/input/json/csl-rich-text.json",
  "definitions": {
    "rich-text-content": {
      "title": "Rich Text Content",
      "description": "Rich text content can be represented as an array of strings and formatted object strings.",
      "examples": [
        {
          "title": [
            "A title with a",
            {
              "quote": "quoted string."
            }
          ]
        },
        {
          "title": [
            "A title with tex math",
            {
              "math-tex": "x=y^2"
            }
          ]
        },
        {
          "title": [
            "A title with mathml",
            {
              "math-ml": "x=y^2"
            }
          ]
        }
      ],
      "oneOf": [
        {
          "$ref": "#/definitions/rich-text"
        },
        {
          "type": "string"
        }
      ]
    },
    "rich-text": {
      "type": "array",
      "items": {
        "anyOf": [
          {
            "title": "Unformatted Sub-String",
            "type": "string"
          },
          {
            "type": "object",
            "minProperties": 1,
            "maxProperties": 1,
            "additionalProperties": false,
            "properties": {
              "bold": {
                "title": "Bold Text",
                "$ref": "#/definitions/rich-text-content"
              },
              "code": {
                "title": "Code/Verbatim Text",
                "$ref": "#/definitions/rich-text-content"
              },
              "italic": {
                "title": "Italicized Text",
                "$ref": "#/definitions/rich-text-content"
              },
              "math-ml": {
                "title": "MathML",
                "$ref": "#/definitions/rich-text-content"
              },
              "math-tex": {
                "title": "Math-TeX",
                "$ref": "#/definitions/rich-text-content"
              },
              "preserve": {
                "title": "Preserve Case Text",
                "$ref": "#/definitions/rich-text-content"
              },
              "quote": {
                "title": "Quote",
                "$ref": "#/definitions/rich-text-content"
              },
              "sc": {
                "title": "Small-Cap Text",
                "$ref": "#/definitions/rich-text-content"
              },
              "strike": {
                "title": "Strike-Through Text",
                "$ref": "#/definitions/rich-text-content"
              }
            }
          }
        ]
      }
    }
  }
}
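For developers who want to try the structure without a JSON Schema library, the intent of the schema can be approximated by a small hand-written check. This is a sketch, not a replacement for schema validation: the names `is_rich_text_content` and `is_rich_text_item` are hypothetical, and requiring exactly one formatting key per object is an assumption.

```python
FORMAT_KEYS = {
    "bold", "code", "italic", "math-ml", "math-tex",
    "preserve", "quote", "sc", "strike",
}


def is_rich_text_content(value):
    """True if value is a string or an array of strings and formatting objects."""
    if isinstance(value, str):
        return True
    if isinstance(value, list):
        return all(is_rich_text_item(item) for item in value)
    return False


def is_rich_text_item(item):
    """True if item is a bare string or a single-key formatting object."""
    if isinstance(item, str):
        return True
    if isinstance(item, dict) and len(item) == 1:
        ((key, inner),) = item.items()
        return key in FORMAT_KEYS and is_rich_text_content(inner)
    return False


title = ["A title with a", {"quote": "quoted string."}]
title_ok = is_rich_text_content(title)
```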

jgm · 2022-06-20T16:46:45Z

Are super/subscript not included?
Links and images are other things one might find in rich text (e.g. in an abstract).

bdarcus · 2022-06-20T17:03:15Z

Are super/subscript not included?

Correct, they're not, but that's just an oversight. I'll add them.

Links and images are other things one might find in rich text (e.g. in an abstract).

These would require more complex models, perhaps less relevant to citation formatting, where the most important variables for this are titles?

But I'll add this as a question somewhere, that we'll want to resolve should citeproc developers otherwise support this.

bdarcus · 2022-06-20T17:09:33Z

PS, @jgm, you see my invite for next month? We were hoping to talk about some of these issues.

No pressure though.

bdarcus · 2022-06-20T18:57:25Z

Just added sub and sup.

c3b917b

bdarcus · 2023-02-12T14:50:28Z

John's new project might be interesting for this use case; kind of a stricter and easier-to-parse markdown, with the added features of pandoc (including math, and soon, citations):

https://djot.net/

Example, with emphasis and inline math:

Einstein _derived_ $`e=mc^2`

And the AST:

  para
    str text="Einstein "
    emph
      str text="derived"
    str text=" "
    inline_math text="e=mc^2"
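An AST like this maps fairly directly onto the rich-text array discussed above. A minimal Python sketch, with several assumptions: the nodes are represented as dicts with `tag`, `text`, and `children` keys modelled on the printout above, `emph` is mapped to `italic`, inline math is treated as TeX, and the function name `djot_inlines_to_rich_text` is hypothetical.

```python
def djot_inlines_to_rich_text(nodes):
    """Map a list of djot-style inline nodes to a CSL rich-text array."""
    out = []
    for node in nodes:
        tag = node["tag"]
        if tag == "str":
            out.append(node["text"])
        elif tag == "emph":
            # Assumed mapping: djot emphasis becomes the "italic" key.
            out.append({"italic": djot_inlines_to_rich_text(node["children"])})
        elif tag == "inline_math":
            # Assumed mapping: inline math content is TeX.
            out.append({"math-tex": node["text"]})
        else:
            raise ValueError(f"unsupported inline node: {tag}")
    return out


para_children = [
    {"tag": "str", "text": "Einstein "},
    {"tag": "emph", "children": [{"tag": "str", "text": "derived"}]},
    {"tag": "str", "text": " "},
    {"tag": "inline_math", "text": "e=mc^2"},
]
rich_text = djot_inlines_to_rich_text(para_children)
```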

bwiernik · 2023-02-13T00:06:15Z

Thanks for merging

bdarcus requested review from bwiernik and denismaier July 12, 2020 15:06

bdarcus force-pushed the rich-text branch from fe65dbf to 2b8b2a2 Compare July 12, 2020 15:13

bwiernik previously requested changes Jul 12, 2020

bdarcus force-pushed the rich-text branch from c24aecc to 2af1015 Compare July 14, 2020 11:45

bdarcus force-pushed the rich-text branch from 2af1015 to f27b526 Compare July 14, 2020 12:18

bdarcus mentioned this pull request Jul 14, 2020

Add sub/main forms to json schema #310

Closed

bdarcus force-pushed the rich-text branch from b34f08d to 72f9e72 Compare July 14, 2020 17:15

bdarcus requested a review from bwiernik July 14, 2020 17:19

bdarcus force-pushed the rich-text branch from 72f9e72 to 2eb8479 Compare July 14, 2020 17:28

bdarcus merged commit 4056538 into v1.1 Jul 15, 2020

bdarcus deleted the rich-text branch August 7, 2020 17:28

bdarcus mentioned this pull request Jun 3, 2022

"Add" CSL YAML #278

Open
