Add sub/main forms to json schema #310

Closed
denismaier opened this issue Jul 12, 2020 · 48 comments
@denismaier
Member

denismaier commented Jul 12, 2020

On the other hand, there seems to be a preference for flat CSL JSON (ignoring the funky date-variables nesting). And some of the flat fields already exist like container-title-short and backcompat would be good.

If going with the flat structure, I think it's best to update the CSL JSON schema to include these properties. I don't think it makes sense to add -short variants where there is not a documented or theoretical need.

To enable the flexibility to add suffixes to CSL fields, we could look into JSON Schema pattern properties. This would allow us to keep the number of schema definitions from exploding:

{
  "type": "object",
  "patternProperties": {
    "^title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}

This looks promising. This will work for title, right?
What's the best way to add the prefix patterns? Does that work?

{
  "type": "object",
  "patternProperties": {
    "^(container-|collection-)?title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}

Or should we add one pattern per title?

{
  "type": "object",
  "patternProperties": {
    "^title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^container-title(-long|-sub|-main|-short)?$": { "type": "string" },
    "^collection-title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}

Readability is better here, but it is redundant, of course. Maybe something like this?

{
  "type": "object",
  "patternProperties": {
    "^(container-
        |collection-
        |volume-)?
        title(-long|-sub|-main|-short)?$": { "type": "string" }
  },
  "additionalProperties": false
}
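One caveat on the multi-line layout: JSON strings cannot contain literal line breaks, so a pattern spread over several lines would not parse as written. A rough sketch of one alternative, assuming the schema is generated by a small script: keep the prefixes and suffixes as readable lists and assemble the single-line pattern from them. The list contents and the `build_title_pattern` helper are hypothetical, and Python's `re` is used only as a stand-in for the regex dialect a JSON Schema validator would apply.

```python
import json
import re

# Hypothetical lists for illustration; the real set of title variables and
# suffixes would come from the CSL schema itself.
TITLE_PREFIXES = ["container-", "collection-", "volume-"]
TITLE_SUFFIXES = ["-long", "-sub", "-main", "-short"]


def build_title_pattern(prefixes, suffixes):
    """Build one anchored regex covering every prefix/suffix combination.

    The names contain only letters and hyphens, so no escaping is applied.
    """
    prefix_group = "|".join(prefixes)
    suffix_group = "|".join(suffixes)
    return f"^({prefix_group})?title({suffix_group})?$"


pattern = build_title_pattern(TITLE_PREFIXES, TITLE_SUFFIXES)

schema = {
    "type": "object",
    "patternProperties": {pattern: {"type": "string"}},
    "additionalProperties": False,
}

# Emit the schema fragment; the pattern ends up on a single line.
print(json.dumps(schema, indent=2))

# Quick check of which property names the pattern accepts.
for name in ["title", "container-title-short", "volume-title-main", "title-tiny"]:
    print(name, bool(re.search(pattern, name)))
```

This keeps the readability of a list while still producing valid JSON; the trade-off is that the schema file becomes a build artifact rather than hand-edited source.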

Originally posted by @denismaier in #271 (comment)

It seems that we currently deal with sub/main forms in the rnc schema, but there's nothing on the input side. Shouldn't we add those?
(I was about to start writing the documentation for the split-title feature and also to prepare some tests. But it looks like there are still some open questions...)

Edit: Currently, it looks like we'll support this by changing titles to objects.

@denismaier
Member Author

Opinions @bwiernik @bdarcus @dhimmel

@bwiernik
Member

For the prefixes, can we refer to an enumerated list of the title variables defined elsewhere in the schema?

@bdarcus
Member

bdarcus commented Jul 12, 2020 via email

@denismaier
Member Author

I think we should defer this.

Fine, but till when?

But concerning the tests/documentation: Can I expect title-main, title-sub, etc. to be available on the input side somehow, whether via patterns or explicitly defined?

@bdarcus
Member

bdarcus commented Jul 12, 2020 via email

@bdarcus
Member

bdarcus commented Jul 12, 2020 via email

@denismaier
Member Author

I think we assume available in styles via @form; so extracted from a full

Yes, but @form is only in the rnc. I'm talking about the json schema. We will need a way to explicitly define the main form of a title variable because the extraction mechanism might produce wrong results.

title: One --- Two --- Three:  a subtitle

Depending on the settings, a citeproc will (incorrectly) produce:

main: One
sub: Two --- Three:  a subtitle

So, we'll need to supply the main form explicitly in addition to the full form:

title: One --- Two --- Three:  a subtitle
title-main: One --- Two --- Three 
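To make the failure mode concrete, here is a minimal sketch of a naive splitter that cuts at the first configured delimiter. The function name `split_first` and the delimiter lists are assumptions for illustration only; they are not taken from any existing citeproc.

```python
import re


def split_first(title, delimiters):
    """Split a title at the first occurrence of any delimiter.

    Returns (main, sub); sub is None when no delimiter is found.
    """
    if not delimiters:
        return title.strip(), None
    pattern = "|".join(re.escape(d) for d in delimiters)
    main, *rest = re.split(pattern, title, maxsplit=1)
    return main.strip(), (rest[0].strip() if rest else None)


title = "One --- Two --- Three:  a subtitle"

# If " --- " counts as a split point, the main title is cut too early.
print(split_first(title, [" --- ", ": "]))

# With only ": " configured, the split lands where the author intended.
print(split_first(title, [": "]))
```

Which outcome a user gets depends entirely on the configured delimiters, which is why some explicit override is needed for the cases the heuristics get wrong.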

@bdarcus
Member

bdarcus commented Jul 12, 2020

I understand that.

But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.

It's just my impulse; if others feel strongly, we can consider those arguments.

It does feel somehow wrong to have four different title variants in the actual data.

To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.

@denismaier
Member Author

denismaier commented Jul 12, 2020

I understand that.

But that's a hypothetical example. I'm saying I'd prefer to see what happens in the wild, before requiring all titles to be split-able, in the data, upfront.

It's just my impulse; if others feel strongly, we can consider those arguments.

It does feel somehow wrong to have four different title variants in the actual data.

I understand your point. Having used biblatex before, I'd rather just have title-main and title-sub, but that would be a massive change, and probably not something that we would want to do.

I think for this feature to work reliably, we'd need to have this overriding mechanism. I think most users will not need to supply title-main and title-sub in the data. But for the other cases this should be possible.

To repeat history, we introduced the short variant, I am 99% certain, to handle main titles. Yes, it can be used for other purposes, but that was the primary idea, with it being flexible.

Yes, but as, e.g. @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.

In any case, users being able to supply title parts in some way was a basic assumption of what @bwiernik and I have worked out.

@bdarcus
Member

bdarcus commented Jul 12, 2020

Yes, but as, e.g. @adam3smith has already pointed out, and I completely agree with him, title-short and title-main are not necessarily identical. They may be in some cases, but in a lot of cases they won't. Using title-short as an indicator for the main form is not a good idea.

This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.

By far the most common example is this sort of pattern:

Some Title: With a Subtitle

For that, main title = short title.

So at minimum, we need to explain the difference in documentation.

@denismaier
Member Author

This is another one of those cases, like the debate about label and citekey, where whether that's the case is almost irrelevant. We do have this legacy that was based on this logic, so we have to design with it in mind.

I'm not so sure most users will have the main part of the title in title-short, regardless of the original reason for adding it, and I doubt most users were ever aware of the logic behind that. I think it's much more likely they'll have some shorter version of the main title there, because this is what style guides usually require. E.g. Chicago:

Grazer, Brian, and Charles Fishman. A Curious Mind: The Secret to a Bigger Life. New York: Simon & Schuster, 2015.
=> Curious Mind

Borel, Brooke. The Chicago Guide to Fact-Checking. Chicago: University of Chicago Press, 2016.
=> Fact-Checking

Keng, Shao-Hsun, Chun-Hung Lin, and Peter F. Orazem. “Expanding College Access in Taiwan, 1978–2014: Effects on Graduate Quality and Income Inequality.” Journal of Human Capital 11, no. 1 (Spring 2017): 1–34. https://doi.org/10.1086/690235
=> Expanding College Access

Mead, Rebecca. “The Prophet of Dystopia.” New Yorker, April 17, 2017.
=> Dystopia

Rutz, Cynthia Lillian. “King Lear and Its Folktale Analogues.” PhD diss., University of Chicago, 2013.
=> King Lear

Of course, this needs to be documented accordingly.

@bdarcus
Member

bdarcus commented Jul 12, 2020

The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.

I'm not thrilled with the idea, but think it worth considering, given the need to override auto-splitting should be rare.

@denismaier
Member Author

By far the most common example is this sort of pattern:

Some Title: With a Subtitle

For that, main title = short title.

So in that particular case, there's no need to change anything!

@denismaier
Member Author

The other option I was wondering about is whether it'd be feasible to add some sort of split instruction to the full title, as part of the sub-field formatting, akin to preserving case.

In the proposal for this feature, we proposed to split multiple subtitles in title-sub with two vertical bars ||. @dstillman has also already suggested using some sort of markup to indicate split points. I really don't care much one way or the other, as long as there is some way to override the automatic splits. We could also just use || on the full title, if that is an easy solution. @bwiernik ?

@bdarcus
Member

bdarcus commented Jul 12, 2020

We could also just use || on the full title, if that is an easy solution.

So in the common (perhaps 99% or more) case, titles stay the same, and in the other case, one could just do ...

title: Some Weird Title ... || With a Subtitle

...?

@denismaier
Member Author

Hopefully, yes.
The example above would be:

title: One --- Two --- Three:|| a subtitle

@bwiernik
Member

Yes.

That is also similar to citeproc-js existing syntax for separating family and given names when names are entered as a key value pair:

author: Jones || Davey

@denismaier
Member Author

denismaier commented Jul 12, 2020

That is also similar to citeproc-js existing syntax for separating family and given names when names are entered as a key value pair:

Just that this would be used on the standard title field...

@bwiernik Was there a reason we did not consider this option in the first place?

@bdarcus
Member

bdarcus commented Jul 12, 2020

So then details depend on the spec language.

We could have something like:

Processors must split titles according to the [insert rules], or on the || pattern.

So a processor would be splitting on whatever default split characters, and/or what is defined in locale and/or style, or ||?

Does that mean a full title is not rendered directly, but is always reassembled from the split title?

@denismaier
Member Author

Does that mean a full title is not rendered directly, but is always reassembled from the split title?

Current proposal here says (point 3):

Parsing by citeproc: If title-main and title-sub are not supplied in the data, the citeproc will derive them from title following these rules (based on existing citeproc-js behavior):

So, the answer to your question is yes. Citeprocs will always split titles into main and sub, and then reassemble. We could add a new option for @title-split "false", or similar, to disable that.

@denismaier
Member Author

So a processor would be splitting on whatever default split characters, and/or what is defined in locale and/or style, or ||?

Split characters are defined with @title-split.

@bdarcus
Member

bdarcus commented Jul 12, 2020

Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those?

So the splitting can be auto or manual, but not both?

And what did we decide about sub-sub titles? Is this valid to do internally?

>>> import re
>>> split_characters = re.compile(r'[?,:]')
>>> split_characters.split("One: Two? Three")
['One', ' Two', ' Three']

@denismaier
Member Author

denismaier commented Jul 12, 2020

Then a processor will just, for example, internally have some variable of characters to split on, and || overrides those?

So the splitting can be auto or manual, but not both?

@title-split defines the split-points for automatic splitting. E.g., with title-split="simple", processors will split on . , : , :: , ! , ? . If this leads to incorrect results for some reason, you'd have to override the automatic behaviour with ||.

And yes, concerning sub-sub titles: that's mainly it.

"One: Two? Three" will be split into:

title[@form="main"]:  One
title[@form="sub"]: 
  - Two?
  - Three

@bdarcus
Member

bdarcus commented Jul 12, 2020

So the split characters are either defined in the style OR ||; right?

In any case, this seems like the direction that's sensible. It would mean just a small change to the spec, and no change to the input schema.

@bwiernik
Member

Yeah, there was no reason to not do this in the first place. Just didn’t occur to me. This is better.

@denismaier
Member Author

So the split characters are either defined in the style OR ||; right?

Yes, processors will need to check if there are explicit split-points defined with ||, and if not, split using the split characters defined in the style.
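A minimal sketch of that check, assuming a processor written in Python. The `split_title` name and the default regex are hypothetical; the regex only approximates a "simple" setting (a colon is consumed as a delimiter, while ? and ! stay attached to the preceding part), and real split points would come from @title-split in the style.

```python
import re

# Hypothetical stand-in for the style-defined split characters.
AUTO_SPLIT = re.compile(r":\s+|(?<=[?!])\s+")

MANUAL_MARKER = "||"


def split_title(title, auto_split=AUTO_SPLIT):
    """Return (main, subtitles) for a full title string.

    Explicit || markers take precedence; automatic splitting is used only
    when no marker is present.
    """
    if MANUAL_MARKER in title:
        parts = title.split(MANUAL_MARKER)
    else:
        parts = auto_split.split(title)
    parts = [p.strip() for p in parts if p.strip()]
    if not parts:
        return "", []
    return parts[0], parts[1:]


print(split_title("One: Two? Three"))
print(split_title("Some Weird Title ... || With a Subtitle"))
```

How punctuation adjacent to a || marker should be treated (for example the colon in `Three:|| a subtitle`) is not settled in this thread, so the sketch leaves such punctuation untouched.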

@bdarcus
Member

bdarcus commented Jul 12, 2020 via email

@denismaier
Member Author

So, summing this up: I will start drafting the documentation and the test based on these assumptions:

  1. In styles, we'll have @form="sub" and @form="main" available. The standard/long form of a title will be the reassembled title.
  2. On the input side, users can provide split-points explicitly with ||.

@bdarcus
Member

bdarcus commented Jul 12, 2020

I put this placeholder in this PR I just pushed, where we can include this.

@dhimmel
Contributor

dhimmel commented Jul 12, 2020

Or should we add one pattern per title?

I like one pattern per line, so you can have title and description fields that apply to the family of variables, like container.

@bdarcus
Member

bdarcus commented Jul 13, 2020 via email

bdarcus added the input label Jul 13, 2020
@denismaier
Member Author

Allowing titles to be an object (sub/main) or array.

We considered this already, but @dstillman was not really in favour. This makes things really complicated on that side. Also, how will users override the heuristics then?

@denismaier
Member Author

So, per discussion in the rich text issue, expect the apps to create the pre-parsed data.

That would perhaps be an option with Zotero and pandoc, but I'm less optimistic with other apps. That's why we thought it best to implement this in the processor.

@bdarcus
Member

bdarcus commented Jul 14, 2020

Allowing titles to be an object (sub/main) or array.

We considered this already, but @dstillman was not really in favour. This makes things really complicated on that side. Also, how will users override the heuristics then?

Probably the best approach for zotero et al is to do titles like they do names.

@bwiernik
Member

I think both default parsing behavior and providing a common syntax for users to override default parsing behavior are necessary in the processor for CSL to be at its best with typical bibliographic data in the wild.

@bdarcus
Member

bdarcus commented Jul 14, 2020 via email

@bwiernik
Member

I think those issues really aren't that related. "Parsing" involves many things, and not all of them have the answer. I'm coming around to thinking that rich text markup might be something the processor needn't necessarily worry about (I'm doing some investigation into what journals provide there).

But parsing of titles is much more similar to testing is-numeric in my view. This is something where data exist in the wild.

So, for the many places where CSL is used outside of a person writing a manuscript, such as Cite This For Me or Open Science Framework, the only tool in the chain is the citation processor. Asking every potential little application adopting CSL to roll their own title parser, name parser, etc. seems like a huge barrier to entry.

@bdarcus
Member

bdarcus commented Jul 14, 2020 via email

@bdarcus
Member

bdarcus commented Jul 20, 2020

Could we revise the issue description to include a concise list of requirements?

I think that would help us make final decisions.

For example:

  1. sub-components of titles need to be accessed in styles
  2. delimiters among these sub-components need to be configurable in styles and locales, for full title rendering

Are those two correct?

And then what about the other wrinkle that is making this so difficult?

Is it that some styles require printing full titles without modifying the sub-component punctuation?

So in those styles, would one also need the 1 requirement above to access components?

Is this the only other requirement; so three?

@bwiernik
Member

bwiernik commented Jul 20, 2020

Is it that some styles require printing full titles without modifying the sub-component punctuation?

Yes. Chicago modifies punctuation, APA and Vancouver do not. Both types are common.

So in those styles, would one also need the 1 requirement above to access components?

I don’t think we identified any style where separate formatting of main/sub AND keeping original punctuation were needed.

We had planned for the CSL style syntax to not permit that—separate formatting of main and sub is accomplished using a group with a specified delimiter.

The data model thus needs to provide:

  1. Individual title parts
  2. The original punctuation separating those parts

Processors need to be able to:

  1. Concatenate title parts together for form="long"
  2. Replace existing delimiter punctuation with normalized punctuation
    • perhaps add delimiters if none exist?
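
As a rough illustration of these two processor tasks, the sketch below assumes a hypothetical internal representation in which each subtitle is stored together with the punctuation that originally preceded it. Neither the `render_long` name nor this representation comes from the CSL specification.

```python
from typing import List, Optional, Tuple


def render_long(
    main: str,
    subs: List[Tuple[str, str]],
    normalized: Optional[str] = None,
) -> str:
    """Reassemble a full title from its parts.

    subs holds (original_delimiter, text) pairs. When normalized is given,
    it replaces every original delimiter; otherwise the original
    punctuation is restored unchanged.
    """
    out = main
    for original_delimiter, text in subs:
        delimiter = original_delimiter if normalized is None else normalized
        out += delimiter + text
    return out


# Original punctuation preserved.
print(render_long("One", [(": ", "Two?"), (" ", "Three")]))

# Delimiter normalized by the style.
print(render_long("Some Title", [(" - ", "With a Subtitle")], normalized=": "))
```

Storing the original delimiter next to each part is one way to satisfy both data-model requirements without keeping a separate full string; whether delimiters should be added when none exist is left open, as in the list above.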

@bdarcus
Member

bdarcus commented Jul 20, 2020

The data model thus needs to provide:

  1. Individual title parts
  2. The original punctuation separating those parts

If the way I stated the third requirement is correct (and you confirmed it is), isn't it more simple than this?

Isn't it that the processor needs access to the full title, full stop?

If yes, then your second requirement is not needed; what is needed is for the full title property to be filled.

So some styles require title decomposition and recomposition, and some don't?

Why I'm asking this question.

@bwiernik
Member

bwiernik commented Jul 20, 2020

I don’t understand what you are saying your “third requirement” is.

Another way to put the requirement:

  1. Render the full title, with original punctuation
    • Main casing options:
      • No change
      • Title case
      • Uppercase main title first
      • Uppercase main title first and subtitle(s) first
  2. Render the full title, with normalized punctuation
    • Main casing options:
      • No change
      • Title case
      • Uppercase main title first
      • Uppercase main title first and subtitle(s) first
  3. Render a title part separately

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.

@bwiernik
Member

To put these together with examples:

  1. Vancouver style: no decomposition, no subtitle capitalization
  2. APA style: no decomposition, subtitle capitalization
  3. Chicago style: decomposition, subtitle capitalization
  4. ABNT style: decomposition, separate main and sub text formatting

@bdarcus
Member

bdarcus commented Jul 20, 2020

I don’t understand what you are saying your “third requirement” is.

Just to clarify this part, I meant this from above:

And then what about the other wrinkle that is making this so difficult?
Is it that some styles require printing full titles without modifying the sub-component punctuation?

But your explanation here further clarifies.

@bdarcus
Member

bdarcus commented Jul 20, 2020

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.

OK, this is a key piece I was missing.

So among styles which do not specify decomposition, if we have a full title, some will specify to modify casing, and others will specify to leave it alone.

The problem this presents is with a full title, a processor won't have access to the sub-components, so it won't be able to modify the casing.

@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?

@denismaier
Member Author

The casing requirements are why I'm not a big fan of including "full" as an element--that would require text comparison of "full" to "main" and "sub" to determine what to capitalize.

OK, this is a key piece I was missing.

Yeah, if citeprocs needed to compare "full" to "main" and "sub" to capitalize properly they just could do the whole splitting operation on their own, which is what using objects here tries to avoid.

@denismaier
Member Author

@denismaier, do you think you could modify the main post to reflect this as clearly as possible, for the record?

You mean the original PR?

@bdarcus
Member

bdarcus commented Jul 20, 2020 via email

bdarcus added a commit that referenced this issue Jul 22, 2020
To support independent formatting of main and subtitles, this converts 
title strings to objects, with "full" and "main" string properties and 
a "sub" array (to support multiple subtitles).

Also, moves "short" title variants to the new object.

addresses #310
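
For readers trying to picture the resulting data shape, here is a hedged sketch of what such a title object might look like, with a small shape check. The "full", "main", and "sub" names follow the commit message; "short" as a property name, the optionality of every property, and the `check_title_object` helper are assumptions, not part of the actual schema.

```python
# Hypothetical example record, not taken from the schema or its tests.
title = {
    "full": "Some Title: With a Subtitle",
    "main": "Some Title",
    "sub": ["With a Subtitle"],
    "short": "Some Title",  # assumed property name for the short variant
}


def check_title_object(value):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(value, dict):
        return ["title must be an object"]
    problems = []
    for key in ("full", "main", "short"):
        if key in value and not isinstance(value[key], str):
            problems.append(f"{key} must be a string")
    sub = value.get("sub", [])
    if not isinstance(sub, list) or not all(isinstance(s, str) for s in sub):
        problems.append("sub must be an array of strings")
    return problems


print(check_title_object(title))
print(check_title_object({"main": "Some Title", "sub": "With a Subtitle"}))
```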