input: rich text #315

Merged
merged 1 commit into from
Jul 15, 2020

bdarcus
Member

@bdarcus bdarcus commented Jul 12, 2020

Description

This adds support for rich-text handling to the JSON schema by defining new formatted-string objects that can be mixed with strings in an array.

Note that I have been conservative about where I allow rich-text formatting. I don't think we should allow it everywhere; I think we should explicitly allow it only where it makes sense.

Type of change

  • This is a breaking change
  • This change requires a documentation update

cc @jgm

Member

@bwiernik bwiernik left a comment

Comments on your questions:

  1. where to put the markup property? Perhaps we need a top level object with such metadata? Also, should it be an uncontrolled string, or some kind of enum list?

A top-level object is probably a good thing in any event to specify the schema version (e.g., with date-parts being removed). Conceivably a processor might want markup in the native language for the implementation, so I'd say free text, perhaps with the specification that strings be in lowercase?

  1. the current document language I have suggests that only the HTML subset is allowed in JSON. Is that right?

  2. maybe more for the spec, but what to say about stuff beyond the things a processor has to support? Strip (ignore)? But would that open a can of worms?

strikethrough and verbatim/code should be added to the enumerated list.

Some explicit language about other HTML syntax, one way or the other, should be given. Given the previous discussion, it might be reasonable to indicate that other markup syntax may be supported at the discretion of the application. Something like Dan's final suggestion from the other thread might be good: indicate that other markup should be processed by the calling application and returned to the processor in a format that is appropriate for the target output format (e.g., HTML, RTF, etc.).

Dan also noted a potential security risk of unescaped markup.

  1. math:
    a. seems sensible to say MathML for JSON/HTML (but what about @bwiernik's suggestion of unicodemath?)

I think we should go with Dan's suggestions here. Keeping with the JSON/HTML calling for well-formed HTML, I think that calling for MathML in the JSON schema is good. Applications can choose to support LaTeX, AsciiMath, UnicodeMath, etc. as they see fit.

b. what about YAML?

pandoc currently calls for LaTeX equations in pandoc Markdown's syntax ($...$). That is probably the most widely-used and well-supported somewhat-human-readable syntax. I think we can leave this as an open option for processors. I could imagine processors that target CSL YAML choosing to support LaTeX, AsciiMath, UnicodeMath, depending on their audience.

@bdarcus
Member Author

bdarcus commented Jul 12, 2020

strikethrough and verbatim/code should be added to the enumerated list.

I forgot about this.

Some explicit language about other HTML syntax, one way or the other, should be given. Given the previous discussion, it might be reasonable to indicate that other markup syntax may be supported at the discretion of the application. Something like Dan's final suggestion from the other thread might be good: indicate that other markup should be processed by the calling application and returned to the processor in a format that is appropriate for the target output format (e.g., HTML, RTF, etc.).

This will be for the spec though; correct?

As in, we don't need to worry about it here.

@bwiernik
Member

I think we need some comment here about the syntax structure for other markup (e.g., if we go with <span class="markup-html"> to indicate that the contained text has markup, à la Dan's second suggestion).

@bdarcus
Member Author

bdarcus commented Jul 12, 2020

I think we need some comment here about the syntax structure for other markup (e.g., if we go with <span class="markup-html"> to indicate that the contained text has markup, à la Dan's second suggestion).

Yeah, I don't really understand his comment, or how it would impact this PR.

@jgm

jgm commented Jul 13, 2020

Instead of using some HTML subset for this, another alternative would be to represent the rich text structure using JSON structures (as we do with pandoc's json output). There are many models you could adopt here, but here is the sort of thing I have in mind:

{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "formatting": "italic",
      "text": [
        {
          "formatting": "nocase",
          "text": "Paradise Lost"
        }
      ]
    }
  ]
}

or more compactly

{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "italic": [
        {
          "nocase": "Paradise Lost"
        }
      ]
    }
  ]
}

This would avoid the issues about supporting some but not all HTML in the JSON representation, and avoid the need for parsing HTML-ish tags.
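
To illustrate the point about not needing an HTML parser, here is a minimal Python sketch that walks the compact form above and emits HTML. The function name `render_html` and the tag mapping in `TAGS` are assumptions made up for this illustration; they are not part of CSL or of any existing processor.

```python
from html import escape

# Hypothetical mapping from formatting keys to HTML wrappers (an assumption
# for illustration; CSL does not define this table).
TAGS = {
    "italic": ("<i>", "</i>"),
    "bold": ("<b>", "</b>"),
    "nocase": ('<span class="nocase">', "</span>"),
}


def render_html(content):
    """Render a string, a formatting object, or a list of either to HTML."""
    if isinstance(content, str):
        # Plain strings are escaped, so stray markup cannot leak through.
        return escape(content)
    if isinstance(content, list):
        return "".join(render_html(part) for part in content)
    if isinstance(content, dict) and len(content) == 1:
        # A formatting object has exactly one key: the formatting name.
        (key, inner), = content.items()
        if key not in TAGS:
            raise ValueError(f"unsupported formatting key: {key!r}")
        open_tag, close_tag = TAGS[key]
        return open_tag + render_html(inner) + close_tag
    raise TypeError(f"unexpected rich-text node: {content!r}")


subtitle = ["The plot of ", {"italic": [{"nocase": "Paradise Lost"}]}]
print(render_html(subtitle))
```

Because every plain string passes through `escape`, this approach would also sidestep the unescaped-markup concern raised earlier in the thread.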

@bdarcus
Member Author

bdarcus commented Jul 13, 2020 via email

@bdarcus
Member Author

bdarcus commented Jul 13, 2020

@jgm - with this approach, how would you deal with math ...

I guess the model you're suggesting would be something like this (in JSON Schema, but YAML syntax)?

Edit: updated based on the reply.

---
definitions:
  rich-text-content:
    oneOf:
      - "$ref": "#/definitions/rich-text"
      - type: string
  rich-text:
    type: array
    items:
      anyOf:
        - italic:
            title: Italicized Text
            "$ref": "#/definitions/rich-text-content"
        - bold:
            title: Bold Text
            "$ref": "#/definitions/rich-text-content"
        - sc:
            title: Small-Cap Text
            "$ref": "#/definitions/rich-text-content"
        - preserve:
            title: Preserve Case Text
            "$ref": "#/definitions/rich-text-content"
        - code:
            title: Code/Verbatim Text
            "$ref": "#/definitions/rich-text-content"
        - quote:
            title: Quote
            "$ref": "#/definitions/rich-text-content"
        - strike:
            title: Strike-Through Text
            "$ref": "#/definitions/rich-text-content"
        - math:
            title: Math
            type: object
            properties:
              content:
                type: string
              format:
                enum:
                  - tex
                  - mathml

@jgm

jgm commented Jul 13, 2020

I wasn't thinking of YAML, just JSON. It seems to me that these fill different needs. JSON is the interchange format. YAML is for humans to read and write, and some kind of markup (e.g. markdown) would be convenient to have there. I don't really see why you're trying to add a YAML format at all, actually. Seems better to let there be different YAML variants appropriate for different purposes, but all interconvertible with the canonical JSON. I'd plan to continue allowing pandoc users to use pandoc's markdown in YAML, for example, even if citeproc specified something else.

As for math, you could either have a math element that takes mathml:

{ "math": "<math display=\"block\" xmlns=\"http://www.w3.org/1998/Math/MathML\"><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>" }

or you could have variants for mathml and tex:

{ "tex-math": "x=y^2" }
{ "mathml-math": "x=y^2" }

EDIT: or maybe

{ "math": "x=y^2",
  "format": "tex" }

@bdarcus
Member Author

bdarcus commented Jul 13, 2020 via email

@bwiernik
Member

@bdarcus Do you need to add something to permit non-attribute string elements to an array?

I actually rather like the JSON structure idea. That leaves it clearly up to the applications to decide how to represent these features internally, and it would also make it fairly easy to flexibly handle features the serving or target application doesn't support (e.g., Zotero could just display an AsciiMath translation of a math element in a Word document field, which doesn't support full equations; then, when field codes are removed, Zotero can convert the math to a proper Word Math environment using the MathML).

I think the 7 simple text markup options can be as you specify @bdarcus. For other markup types, I suggest we have three attributes:

  • the contents, labeled by the type of markup (e.g., math)
  • a format attribute, describing the format of the contained element (e.g., html, mathml, tex, unicodemath, asciimath)
  • an optional text or display attribute, which gives the contents to display in unsupported environments; this would include only regular text and the 7 simple text markup options.
    • The purpose here is to make it clear to the processor what to do when a feature isn't supported.
    • If there is no display attribute, the content is omitted.

Using my Zotero–Word example again, the item might have these data:

{ "math": "<math display=\"block\" xmlns=\"http://www.w3.org/1998/Math/MathML\"><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>",
  "format": "mathml",
  "display": "x = y^2" }

When Zotero's field codes are active, the bibliography shows the display option. When it converts the field codes to regular text, it replaces that with the MathML content to give a full equation display.
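
The fallback rule described above could be sketched in Python roughly as follows. The function name `choose_output` and the idea of passing in a set of supported formats are assumptions for illustration only; the property names `math`, `format`, and `display` follow the example above.

```python
def choose_output(node, supported_formats):
    """Return what a processor might emit for an extended-markup node.

    `node` is assumed to look like
    {"math": ..., "format": ..., "display": ...}, as in the example above.
    """
    if node.get("format") in supported_formats:
        # The target environment can handle the full content.
        return node["math"]
    # Unsupported format: use the display fallback, or omit the content
    # entirely when no display property is given.
    return node.get("display", "")


node = {
    "math": '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>',
    "format": "mathml",
    "display": "x = y^2",
}

print(choose_output(node, {"mathml"}))  # full MathML content
print(choose_output(node, set()))       # display fallback
```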

@bdarcus
Member Author

bdarcus commented Jul 13, 2020

@bdarcus Do you need to add something to permit non-attribute string elements to an array?

I'm not following here @bwiernik; can you restate?

Attributes aren't a feature of JSON, for example; just strings, arrays, objects.

@jgm

jgm commented Jul 13, 2020

Note that a tex fallback can be included directly in mathml:

<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">x=y^2</annotation></semantics></math>
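
As a rough sketch of how an application might read that embedded fallback, the following Python uses the standard library's `xml.etree.ElementTree` to pull the TeX annotation out of the MathML above. The helper name `tex_fallback` is hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def tex_fallback(mathml):
    """Return the TeX annotation embedded in a MathML string, or None."""
    root = ET.fromstring(mathml)
    # Elements in the default namespace are addressed as "{namespace}tag".
    for annotation in root.iter(f"{{{MATHML_NS}}}annotation"):
        if annotation.get("encoding") == "application/x-tex":
            return annotation.text
    return None


mathml = (
    '<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML">'
    "<semantics><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow>"
    '<annotation encoding="application/x-tex">x=y^2</annotation>'
    "</semantics></math>"
)

print(tex_fallback(mathml))
```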

@bwiernik
Member

Sorry, by "attributes" I meant "properties"; I get the jargon of these various formats mixed up sometimes. Under rich-text, you specify that items in the array can be anyOf and then enumerate 8 properties. I take it that bare (non-property) strings can always be included in an array, or do you need to specifically allow them?

@jgm

jgm commented Jul 13, 2020

So why not kill two birds with one stone IF it's practical?

Even if some kind of markdown is supported in certain YAML fields, it may not be as expressive as pandoc's markdown. So I'd have to convert to CSL JSON (or an equivalent structure) anyway. Other people might have applications where they want to write bibliographic data in reST or HTML or whatever. If you specify the CSL JSON format people will know what they need to convert to. I just don't see what's gained by specifying formatting in the YAML format, or specifying the YAML at all. Why not just say: if you have an application that represents bibliographic data in YAML, just make sure it can be converted into CSL JSON for processing by a CSL processor?

@bwiernik
Member

@jgm That works for mathml, but what about, for example, <br> or lists in HTML? Or a tex equation? I'm thinking about this as a general solution to providing fallbacks to more complex markup.

Even if some kind of markdown is supported in certain YAML fields, it may not be as expressive as pandoc's markdown.

I think we've reached the consensus that markup in YAML is open to the application to decide, but that the format should be specified.

@bdarcus
Member Author

bdarcus commented Jul 13, 2020

Why not just say: if you have an application that represents bibliographic data in YAML, just make sure it can be converted into CSL JSON for processing by a CSL processor?

Do you mean in general, or just in this case of sub-field formatting?

I'm definitely leaning towards being silent on the YAML in this case.

@bdarcus
Member Author

bdarcus commented Jul 13, 2020

Sorry, by "attributes" I meant "properties"; I get the jargon of these various formats mixed up sometimes. Under rich-text, you specify that items in the array can be anyOf and then enumerate 8 properties. I take it that bare (non-property) strings can always be included in an array, or do you need to specifically allow them?

I just quickly put something together for discussion, and did not test it, so it's possible some detail or another is wrong technically.

But the idea behind it is a nested array of strings and formatting objects, per what @jgm was demonstrating, with a simple string also being an option, per the definition at the top.

@bdarcus
Member Author

bdarcus commented Jul 14, 2020

I just added the rich-text only schema to this branch, in yaml, so easier to grok and discuss. It includes annotations, including examples.

I decided to just add top-level math-ml and math-tex properties, for consistency/simplicity.

If this is the way we want to go, I can update the main schema to reflect.

This would be a significant change, though, so seems we'd want some input and testing from different developers.

It could also be that for 1.1, we simply include a note about this being experimental, but not hook it up to the main schema, and add when we actually have apps using it?

Thoughts on how to wrap this up?

@bdarcus
Member Author

bdarcus commented Jul 14, 2020

It could also be that for 1.1, we simply include a note about this being experimental, but not hook it up to the main schema, and add when we actually have apps using it?

I've rebased and squashed this branch around this path.

So the idea is we explicitly flag this as "experimental" for 1.1, and the new schema is for developers to test this and provide feedback.

We can then add to the main schema when it's ready; as in when at least one project actually implements it.

I'd like to merge this. Any objections?

We could also merge, ask developers to look closely at this while wrapping up 1.1, and see about including it in the release.

@bdarcus bdarcus requested a review from bwiernik July 14, 2020 17:19
This adds an experimental csl-rich-text.yaml schema that defines a
structure for rich-text formatting in JSON.

Also adds a definition for rich-text variables, and title-string
definitions that use that variable, and then redefines all title and
other fields to use these definitions. So it will be easy to merge the
rich text support in the future.

Addresses in part #278
@bdarcus bdarcus merged commit 4056538 into v1.1 Jul 15, 2020
@bdarcus
Member Author

bdarcus commented Jul 15, 2020

OK, merged.

Now we need input from developers on whether to incorporate into the main input schema.

If and when we do integrate it, we also need feedback on how we should document this.

@bdarcus bdarcus deleted the rich-text branch August 7, 2020 17:28
@bdarcus bdarcus mentioned this pull request Jun 3, 2022
@bdarcus
Member Author

bdarcus commented Jun 20, 2022

Just to add, here's the YAML schema version converted to JSON, with included examples:

{
  "description": "JSON schema for CSL input rich text representation",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://resource.citationstyles.org/schema/latest/input/json/csl-rich-text.json",
  "definitions": {
    "rich-text-content": {
      "title": "Rich Text Content",
      "description": "Rich text content can be represented as an array of strings and formatted object strings.",
      "examples": [
        {
          "title": [
            "A title with a",
            {
              "quote": "quoted string."
            }
          ]
        },
        {
          "title": [
            "A title with tex math",
            {
              "math-tex": "x=y^2"
            }
          ]
        },
        {
          "title": [
            "A title with mathml",
            {
              "math-ml": "x=y^2"
            }
          ]
        }
      ],
      "oneOf": [
        {
          "$ref": "#/definitions/rich-text"
        },
        {
          "type": "string"
        }
      ]
    },
    "rich-text": {
      "type": "array",
      "items": {
        "anyOf": [
          {
            "title": "Unformatted Sub-String",
            "type": "string"
          },
          {
            "bold": {
              "title": "Bold Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "code": {
              "title": "Code/Verbatim Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "italic": {
              "title": "Italicized Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "math-ml": {
              "title": "MathML",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "math-tex": {
              "title": "Math-TeX",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "preserve": {
              "title": "Preserve Case Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "quote": {
              "title": "Quote",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "sc": {
              "title": "Small-Cap Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "strike": {
              "title": "Strike-Through Text",
              "$ref": "#/definitions/rich-text-content"
            }
          }
        ]
      }
    }
  }
}
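
For developers who want to experiment without a JSON Schema validator, the intended shape can be approximated with a small hand-written check. This is a sketch under assumptions: it treats each formatting object as having exactly one key drawn from the names listed in the schema as posted, which is a reading of the intent rather than something the schema itself enforces. The names `FORMAT_KEYS`, `is_rich_text_content`, and `_is_item` are hypothetical.

```python
# Formatting keys listed in the schema as posted above.
FORMAT_KEYS = {
    "bold", "code", "italic", "math-ml", "math-tex",
    "preserve", "quote", "sc", "strike",
}


def is_rich_text_content(value):
    """True if value is a plain string or an array of rich-text items."""
    if isinstance(value, str):
        return True
    if isinstance(value, list):
        return all(_is_item(item) for item in value)
    return False


def _is_item(item):
    """True if item is a bare string or a single-key formatting object."""
    if isinstance(item, str):
        return True
    if isinstance(item, dict) and len(item) == 1:
        (key, inner), = item.items()
        # The wrapped value is itself rich-text content, so nesting works.
        return key in FORMAT_KEYS and is_rich_text_content(inner)
    return False


title = ["A title with a", {"quote": "quoted string."}]
print(is_rich_text_content(title))
```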

@jgm

jgm commented Jun 20, 2022

Are super/subscript not included?
Links and images are other things one might find in rich text (e.g. in an abstract).

@bdarcus
Member Author

bdarcus commented Jun 20, 2022

Are super/subscript not included?

Yes, but just an oversight. I'll add them.

Links and images are other things one might find in rich text (e.g. in an abstract).

These would require more complex models, perhaps less relevant to citation formatting, where the most important variables for this are titles?

But I'll add this as a question somewhere, that we'll want to resolve should citeproc developers otherwise support this.

@bdarcus
Member Author

bdarcus commented Jun 20, 2022

PS, @jgm, you see my invite for next month? We were hoping to talk about some of these issues.

No pressure though.

@bdarcus
Member Author

bdarcus commented Jun 20, 2022

Just added sub and sup.

c3b917b

@bdarcus
Member Author

bdarcus commented Feb 12, 2023

John's new project might be interesting for this use case; kind of a stricter and easier-to-parse markdown, with the added features of pandoc (including math, and soon, citations):

https://djot.net/

Example, with emphasis and inline math:

Einstein _derived_ $`e=mc^2`

And the AST:

  para
    str text="Einstein "
    emph
      str text="derived"
    str text=" "
    inline_math text="e=mc^2"
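
To show how such an AST might feed the rich-text structure from this PR, here is a hedged Python sketch. It assumes the AST has already been loaded into dicts with `tag`, `text`, and `children` keys; that in-memory shape, the `TAG_TO_KEY` mapping, and the function name `to_rich_text` are all assumptions for illustration and do not describe djot's actual output format.

```python
# Hypothetical mapping from djot-style node tags to rich-text keys.
TAG_TO_KEY = {"emph": "italic", "strong": "bold"}


def to_rich_text(nodes):
    """Convert a list of inline AST nodes to a rich-text array."""
    out = []
    for node in nodes:
        tag = node["tag"]
        if tag == "str":
            out.append(node["text"])
        elif tag == "inline_math":
            # Inline math in the example is TeX, so map it to math-tex.
            out.append({"math-tex": node["text"]})
        elif tag in TAG_TO_KEY:
            out.append({TAG_TO_KEY[tag]: to_rich_text(node["children"])})
        else:
            raise ValueError(f"unsupported node tag: {tag!r}")
    return out


# The children of the "para" node from the AST above.
para = [
    {"tag": "str", "text": "Einstein "},
    {"tag": "emph", "children": [{"tag": "str", "text": "derived"}]},
    {"tag": "str", "text": " "},
    {"tag": "inline_math", "text": "e=mc^2"},
]

print(to_rich_text(para))
```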

@bwiernik
Member

Thanks for merging
