input: rich text #315
Comments on your questions:
- where to put the markup property? Perhaps we need a top level object with such metadata? Also, should it be an uncontrolled string, or some kind of enum list?
A top-level object is probably a good thing in any event to specify the schema version (e.g., with date-parts being removed). Conceivably a processor might want markup in the native language for the implementation, so I'd say free text, perhaps with the specification that strings be in lowercase?
the current document language I have suggests only the HTML subset is allowed on JSON. Is that right?
maybe more for the spec, but what to say about stuff beyond the things a processor has to support? Strip (ignore)? But would that open a can of worms?
strikethrough and verbatim/code should be added to the enumerated list.
Some explicit language about other HTML syntax, one way or the other, should be given. Given the previous discussion, it might be reasonable to indicate that other markup syntax may be supported at the discretion of the application. Something like Dan's final suggestion from the other thread might be good. Indicate that other markup should be processed by the calling application and returned to the processor in a format that is appropriate for the target output format (e.g., HTML, RTF, etc.).
Dan also noted a potential security risk of unescaped markup.
- math:
a. seems sensible to say MathML for JSON/HTML (but what about @bwiernik's suggestion of unicodemath?)
I think we should go with Dan's suggestions here. Keeping with the JSON/HTML calling for well-formed HTML, I think that calling for MathML in the JSON schema is good. Applications can choose to support LaTeX, AsciiMath, UnicodeMath, etc. as they see fit.
b. what about YAML?
pandoc currently calls for LaTeX equations in pandoc Markdown's syntax ($...$). That is probably the most widely-used and well-supported somewhat-human-readable syntax. I think we can leave this as an open option for processors. I could imagine processors that target CSL YAML choosing to support LaTeX, AsciiMath, UnicodeMath, depending on their audience.
I forgot about this.
This will be for the spec though; correct? As in, we don't need to worry about it here. |
I think we need some comment here about the syntax structure for other markup (e.g., if we go with |
Yeah, I don't really understand his comment, or how it would impact this PR. |
Instead of using some HTML subset for this, another alternative would be to represent the rich text structure using JSON structures (as we do with pandoc's
{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "formatting": "italic",
      "text": [
        {
          "formatting": "nocase",
          "text": "Paradise Lost"
        }
      ]
    }
  ]
}
or more compactly
{
  "title": "This is just a string with no markup",
  "subtitle": [
    "The plot of ",
    {
      "italic": [
        {
          "nocase": "Paradise Lost"
        }
      ]
    }
  ]
}
This would avoid the issues about supporting some but not all HTML in the JSON representation, and avoid the need for parsing HTML-ish tags. |
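To make the compact form concrete, here is a minimal Python sketch of how a consumer might walk such a structure and emit HTML. It is not taken from pandoc or any CSL processor: the `TAGS` table, the choice of HTML tags, and the `render_html` name are all assumptions made for illustration. Plain strings are escaped, which also touches on the unescaped-markup concern raised earlier in the thread.

```python
from html import escape

# Hypothetical mapping from formatting keys to HTML open/close tags.
# The keys follow the compact example above; the tag choices are assumptions.
TAGS = {
    "italic": ("<i>", "</i>"),
    "bold": ("<b>", "</b>"),
    "nocase": ('<span class="nocase">', "</span>"),
}


def render_html(node):
    """Render a string, a list of nodes, or a single-key formatting object."""
    if isinstance(node, str):
        # Escape plain text so stray markup in the data is not injected.
        return escape(node)
    if isinstance(node, list):
        return "".join(render_html(child) for child in node)
    if isinstance(node, dict) and len(node) == 1:
        key, content = next(iter(node.items()))
        if key not in TAGS:
            raise ValueError(f"unsupported formatting: {key!r}")
        open_tag, close_tag = TAGS[key]
        return open_tag + render_html(content) + close_tag
    raise TypeError(f"unexpected node: {node!r}")


subtitle = ["The plot of ", {"italic": [{"nocase": "Paradise Lost"}]}]
html = render_html(subtitle)
```

Because every node is either a string, a list, or a one-key object, the renderer needs no tag parser at all, which is the main attraction of this representation.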
@jgm - with this approach, how would you deal with math, and in YAML?
|
I guess the model you're suggesting would be something like this (in JSON Schema, but YAML syntax)? Edit: updated based on the reply. ---
definitions:
  rich-text-content:
    oneOf:
      - "$ref": "#/definitions/rich-text"
      - type: string
  rich-text:
    type: array
    items:
      anyOf:
        - italic:
            title: Italicized Text
            "$ref": "#/definitions/rich-text-content"
        - bold:
            title: Bold Text
            "$ref": "#/definitions/rich-text-content"
        - sc:
            title: Small-Cap Text
            "$ref": "#/definitions/rich-text-content"
        - preserve:
            title: Preserve Case Text
            "$ref": "#/definitions/rich-text-content"
        - code:
            title: Code/Verbatim Text
            "$ref": "#/definitions/rich-text-content"
        - quote:
            title: Quote
            "$ref": "#/definitions/rich-text-content"
        - strike:
            title: Strike-Through Text
            "$ref": "#/definitions/rich-text-content"
        - math:
            title: Math
            type: object
            properties:
              content:
                type: string
              format:
                enum:
                  - tex
                  - mathml
|
I wasn't thinking of YAML, just JSON. It seems to me that these fill different needs. JSON is the interchange format. YAML is for humans to read and write, and some kind of markup (e.g. markdown) would be convenient to have there. I don't really see why you're trying to add a YAML format at all, actually. Seems better to let there be different YAML variants appropriate for different purposes, but all interconvertible with the canonical JSON. I'd plan to continue allowing pandoc users to use pandoc's markdown in YAML, for example, even if citeproc specified something else. As for math, you could either have a math element that takes mathml:
{ "math": "<math display=\"block\" xmlns=\"http://www.w3.org/1998/Math/MathML\"><mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow></math>" }
or you could have variants for mathml and tex:
{ "tex-math": "x=y^2" }
{ "mathml-math": "x=y^2" }
EDIT: or maybe
{ "math": "x=y^2", "format": "tex" }
|
The reason is that the tech landscape around the two languages and their tools
has converged such that they are virtually interchangeable.
We can validate the JSON with either a JSON or YAML version of the schema,
and the same for YAML data.
So why not kill two birds with one stone IF it's practical?
Granted, there are these two places where the human vs machine priorities
do diverge: this case, and dates.
But it seems easy enough to handle. I'd expect in YAML people will
prioritize EDTF for dates, and markdown for subfield markup.
BTW, I haven't checked, but I believe the recent changes in dates mean the
CSL JSON schema will validate the current pandoc YAML.
|
@bdarcus Do you need to add something to permit non-attribute string elements in an array? I actually rather like the JSON structure idea. That leaves it clearly up to the applications to decide how to represent these features internally, and it would also make it fairly easy to flexibly handle features the serving or target application doesn't support (e.g., Zotero could just display an AsciiMath translation of a math element in a Word document field, which doesn't support full equations; then, when field codes are removed, Zotero can convert the math to a proper Word Math environment using the MathML). I think the 7 simple text markup options can be as you specify @bdarcus. For other markup types, I suggest we have three attributes:
Using my Zotero–Word example again, the item might have these data:
When Zotero's field codes are active, the bibliography shows the |
Note that a tex fallback can be included directly in mathml:
|
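The example that accompanied the note about a TeX fallback inside MathML did not survive in this copy of the thread. As a hedged illustration only, the sketch below assumes the fallback is carried the conventional way, in a `<semantics>` wrapper with an `<annotation encoding="application/x-tex">` child; the `tex_fallback` helper name is made up for this example.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Assumed shape: presentation MathML plus a TeX annotation inside <semantics>.
mathml = (
    f'<math xmlns="{MATHML_NS}"><semantics>'
    "<mrow><mi>x</mi><mo>=</mo><msup><mi>y</mi><mn>2</mn></msup></mrow>"
    '<annotation encoding="application/x-tex">x=y^2</annotation>'
    "</semantics></math>"
)


def tex_fallback(mathml_source):
    """Return the embedded TeX annotation, or None if there is none."""
    root = ET.fromstring(mathml_source)
    # ElementTree spells namespaced tags as "{namespace}localname".
    for annotation in root.iter(f"{{{MATHML_NS}}}annotation"):
        if annotation.get("encoding") == "application/x-tex":
            return annotation.text
    return None


tex = tex_fallback(mathml)
```

An application that cannot render MathML could call something like `tex_fallback` and hand the result to whatever TeX-aware path it already has, falling back to the raw MathML when the function returns `None`.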
Sorry, but "attributes", I meant "properties"--I get the jargon of these various formats mixed up sometimes. Under |
Even if some kind of markdown is supported in certain YAML fields, it may not be as expressive as pandoc's markdown. So I'd have to convert to CSL JSON (or an equivalent structure) anyway. Other people might have applications where they want to write bibliographic data in reST or HTML or whatever. If you specify the CSL JSON format people will know what they need to convert to. I just don't see what's gained by specifying formatting in the YAML format, or specifying the YAML at all. Why not just say: if you have an application that represents bibliographic data in YAML, just make sure it can be converted into CSL JSON for processing by a CSL processor? |
@jgm That works for mathml, but what about, for example,
I think we've reached the consensus that markup in YAML is open to the application to decide, but that the format should be specified. |
Do you mean in general, or just in this case of sub-field formatting? I'm definitely leaning towards being silent on the YAML in this case. |
I just quickly put something together for discussion, and did not test it, so it's possible some detail or another is wrong technically. But the idea behind it was it's a nested array of strings and formatting objects; per what @jgm was demonstrating, but that a simple string is also an option, per the definition at the top. |
I just added the rich-text only schema to this branch, in yaml, so easier to grok and discuss. It includes annotations, including examples. I decided to just add top-level If this is the way we want to go, I can update the main schema to reflect. This would be a significant change, though, so seems we'd want some input and testing from different developers. It could also be that for 1.1, we simply include a note about this being experimental, but not hook it up to the main schema, and add when we actually have apps using it? Thoughts on how to wrap this up? |
I've rebased and squashed this branch around this path. So the idea is we explicitly flag this as "experimental" for 1.1, and the new schema is for developers to test this and provide feedback. We can then add to the main schema when it's ready; as in when at least one project actually implements it. I'd like to merge this. Any objections? We could also merge, ask developers to look closely at this while wrapping up 1.1, and see about including it in the release. |
This adds an experimental csl-rich-text.yaml schema that defines a structure for rich-text formatting in JSON. It also adds a definition for rich-text variables and title-string definitions that use it, and then redefines all title and other fields to use these definitions, so it will be easy to merge the rich text support in the future. Addresses in part #278
OK, merged. Now we need input from developers on whether to incorporate into the main input schema. If and when we do integrate it, we also need feedback on how we should document this. |
Just to add, here's the YAML schema version converted to JSON, with included examples:
{
  "description": "JSON schema for CSL input rich text representation",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://resource.citationstyles.org/schema/latest/input/json/csl-rich-text.json",
  "definitions": {
    "rich-text-content": {
      "title": "Rich Text Content",
      "description": "Rich text content can be represented as an array of strings and formatted object strings.",
      "examples": [
        {
          "title": [
            "A title with a",
            {
              "quote": "quoted string."
            }
          ]
        },
        {
          "title": [
            "A title with tex math",
            {
              "math-tex": "x=y^2"
            }
          ]
        },
        {
          "title": [
            "A title with mathml",
            {
              "math-ml": "x=y^2"
            }
          ]
        }
      ],
      "oneOf": [
        {
          "$ref": "#/definitions/rich-text"
        },
        {
          "type": "string"
        }
      ]
    },
    "rich-text": {
      "type": "array",
      "items": {
        "anyOf": [
          {
            "title": "Unformatted Sub-String",
            "type": "string"
          },
          {
            "bold": {
              "title": "Bold Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "code": {
              "title": "Code/Verbatim Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "italic": {
              "title": "Italicized Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "math-ml": {
              "title": "MathML",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "math-tex": {
              "title": "Math-TeX",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "preserve": {
              "title": "Preserve Case Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "quote": {
              "title": "Quote",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "sc": {
              "title": "Small-Cap Text",
              "$ref": "#/definitions/rich-text-content"
            }
          },
          {
            "strike": {
              "title": "Strike-Through Text",
              "$ref": "#/definitions/rich-text-content"
            }
          }
        ]
      }
    }
  }
}
|
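For developers who want to experiment before wiring up a JSON Schema validator, the intended shape can also be checked by hand. The following Python sketch is an assumption about that intent (a string, or an array whose items are strings or single-key formatting objects), not a port of the schema; the function names are invented here, and the key set is copied from the listing above.

```python
# Formatting keys taken from the schema listing above.
FORMAT_KEYS = {
    "bold", "code", "italic", "math-ml", "math-tex",
    "preserve", "quote", "sc", "strike",
}


def is_rich_text_content(value):
    """True if value is a string or an array of valid rich-text items."""
    if isinstance(value, str):
        return True
    if not isinstance(value, list):
        return False
    return all(is_rich_text_item(item) for item in value)


def is_rich_text_item(item):
    """True if item is a string or a one-key object with a known key."""
    if isinstance(item, str):
        return True
    if isinstance(item, dict) and len(item) == 1:
        key, content = next(iter(item.items()))
        # Formatted content may itself be a string or a nested array.
        return key in FORMAT_KEYS and is_rich_text_content(content)
    return False


ok = is_rich_text_content(["A title with a", {"quote": "quoted string."}])
```

The check is deliberately strict about unknown keys; whether a processor should reject, strip, or pass through unsupported formatting is one of the open questions in this thread.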
Are super/subscript not included? |
Yes, but just an oversight. I'll add them.
These would require more complex models, perhaps less relevant to citation formatting, where the most important variables for this are titles? But I'll add this as a question somewhere, that we'll want to resolve should citeproc developers otherwise support this. |
PS, @jgm, you see my invite for next month? We were hoping to talk about some of these issues. No pressure though. |
Just added sub and sup. |
John's new project might be interesting for this use case; kind of a stricter and easier-to-parse markdown, with the added features of pandoc (including math, and soon, citations): Example, with emphasis and inline math:
And the AST:
para
  str text="Einstein "
  emph
    str text="derived"
  str text=" "
  inline_math text="e=mc^2"
|
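As a rough illustration of how an AST like this could feed the rich-text structure discussed in this PR, the Python sketch below converts a hand-built copy of the tree into an array of strings and formatting objects. The dict-based node layout, the mapping of `emph` to `italic`, and the assumption that inline math is TeX (hence `math-tex`) are all choices made for this example, not something either project specifies.

```python
def to_rich_text(node):
    """Convert one AST node (a dict) into a list of rich-text items."""
    kind = node["type"]
    if kind == "str":
        return [node["text"]]
    if kind == "inline_math":
        # Assumption: inline math source is TeX.
        return [{"math-tex": node["text"]}]
    if kind == "emph":
        # Assumption: emphasis maps to the "italic" formatting key.
        return [{"italic": children_to_rich_text(node)}]
    if kind == "para":
        return children_to_rich_text(node)
    raise ValueError(f"unsupported node type: {kind!r}")


def children_to_rich_text(node):
    """Flatten the converted children of a node into one list."""
    items = []
    for child in node.get("children", []):
        items.extend(to_rich_text(child))
    return items


# Hand-built equivalent of the AST shown above.
ast = {"type": "para", "children": [
    {"type": "str", "text": "Einstein "},
    {"type": "emph", "children": [{"type": "str", "text": "derived"}]},
    {"type": "str", "text": " "},
    {"type": "inline_math", "text": "e=mc^2"},
]}
rich = to_rich_text(ast)
```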
Thanks for merging |
Description
This adds support for rich-text handling to the JSON schema, by defining new formatted string objects, that can be mixed with strings in an array.
Note that I have been conservative on where I allow rich-text formatting. I don't think we should allow it everywhere, and I think we should only explicitly allow it where it makes sense.
Type of change
cc @jgm