Improve/update Schema.org JSON-LD export #7349

jggautier · 2020-10-22T18:39:52Z

In a meeting with folks from the FAIRsFAIR group (namely @kitchenprinzessin3880) who are building and testing tools to assess the "FAIRness" of datasets in Dataverse repositories (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments), some changes were recommended for the metadata that Dataverse includes in the Schema.org JSON-LD metadata it exports for datasets. I said I'd open a GitHub issue so we could record and explain these changes.

For the license property, use the @type "CreativeWork" and use "name" instead of "text":
As of Dataverse 5.1.1, the @type for the Schema.org property "license" is "Dataset". Here's an example of what that looks like:

"license": {
   "@type": "Dataset",
   "text": "CC0",
   "url": "https://creativecommons.org/publicdomain/zero/1.0/"
}

or if CC0 is waived:

"license": {
   "@type": "Dataset",
   "text": "Text the depositor entered in the Terms of Use field"
}

Google's guide for describing datasets with Schema.org says to use the "CreativeWork" @type for license and use "name".

Here's an example of what the license metadata in the Schema.org export might look like when a pull request for this issue is merged (after the "multiple license" work described at #7440 and #7742 is also merged):

If the dataset depositor chooses a license from the list of licenses:

"license": {
   "@type": "CreativeWork",
   "name": "CC0",
   "url": "https://creativecommons.org/publicdomain/zero/1.0/"
}

"license": {
   "@type": "CreativeWork",
   "name": "CC BY",
   "url": "https://creativecommons.org/licenses/by/4.0/"
}

Or if no license is chosen and a custom license is entered:

"license": {
   "@type": "CreativeWork",
   "name": "Text entered in the \"Dataset Terms\" fields"
}
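
As a rough illustration of the proposed shape, here is a minimal Python sketch that assembles the license object. The function name `build_license` and its parameters are hypothetical and are not part of Dataverse. The type is spelled `CreativeWork`, which is how Schema.org names it.

```python
import json
from typing import Optional


def build_license(name: str, url: Optional[str] = None) -> dict:
    """Build a Schema.org license object in the shape proposed in this issue.

    `name` is the license name, or the custom terms text when no license
    was chosen. `url` is added only when one is available.
    """
    license_obj = {"@type": "CreativeWork", "name": name}
    if url:
        license_obj["url"] = url
    return license_obj


# A license picked from the list of licenses.
cc0 = build_license("CC0", "https://creativecommons.org/publicdomain/zero/1.0/")

# Custom terms: json.dumps escapes the inner quotes, so the output is valid JSON.
custom = build_license('Text entered in the "Dataset Terms" fields')

print(json.dumps({"license": cc0}, indent=3))
print(json.dumps({"license": custom}, indent=3))
```

Building the object as a dictionary and serializing it avoids hand-written quoting mistakes when the custom terms text itself contains quotation marks.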

For files (in the "distribution" property):
As of Dataverse 5.1.1, here's an example of what the file metadata in the Schema.org export looks like:

"distribution": [
    {
        "@type": "DataDownload",
        "name": "cases_by_infection.tab",
        "fileFormat": "text/tab-separated-values",
        "contentSize": 56377,
        "description": "",
        "@id": "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        "identifier": "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        "contentUrl": "https://demo.dataverse.org/api/access/datafile/1653135"
    },
    {
        "@type": "DataDownload",
        "name": "DatasetDiagram.png",
        "fileFormat": "image/png",
        "contentSize": 84006,
        "description": "",
        "@id": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        "identifier": "https://doi.org/10.70122/FK2/LQKU61/SBJENW"
    }
]

Here are the changes related to file metadata being proposed in this GitHub issue:

Use "encodingFormat" instead of "fileFormat":
Google's guide for describing datasets with Schema.org says to use the property "encodingFormat" (it doesn't mention the "fileFormat" property).

contentUrl should always be added:
As of Dataverse 5.1.1, Dataverse puts each file's "download URL" in Schema.org's contentUrl property only if the file isn't restricted and its dataset has no guestbook or Terms of Use metadata. (See details about the current logic in a comment on #4371, "As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable".)

Instead, Dataverse should always include every file's "download URL" in Schema.org's contentUrl property. Then if the file is restricted or its dataset has a guestbook or Terms of Access metadata, the download URL will return the same access-restricted error that it returns now.

Add conditionsOfAccess to declare that a file is open or restricted:
@kitchenprinzessin3880 pointed to two vocabularies whose terms we might consider using as values for conditionsOfAccess, to indicate how accessible the file is: https://guidelines.openaire.eu/en/latest/literature/field_accesslevel.html and http://vocabularies.coar-repositories.org/documentation/access_rights.

Each vocabulary defines four terms. I've written in #5920 ("Access Rights metadata in OpenAIRE metadata export is being misapplied") about current problems Dataverse has with using the Access Rights terms from the info:eu-repo namespace, so I'm hesitant to use those terms. To put it briefly, some files are restricted using Dataverse's file restriction feature while the "File Request" feature is disabled, because the depositor uses a process outside of Dataverse to manage access to them. Such a file is restricted, not "closedAccess," even though people aren't able to request access to it through Dataverse's "File Request" feature. Most of the datasets in Harvard Dataverse's Murray collections are like this (e.g. there's a process outside of the Dataverse software for requesting access to restricted files in https://doi.org/10.7910/DVN/0PMZC6). Maybe we can discuss that in this issue.

Here's an example of what the file metadata in the Schema.org export might look like when a pull request for this issue is merged:

"distribution": [
    {
        "@type": "DataDownload",
        "name": "cases_by_infection.tab",
        "encodingFormat": "text/tab-separated-values",
        "contentSize": 56377,
        "description": "",
        "@id": "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        "identifier": "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        "contentUrl": "https://demo.dataverse.org/api/access/datafile/1653135",
        "conditionsOfAccess": "(to be determined)"
    },
    {
        "@type": "DataDownload",
        "name": "DatasetDiagram.png",
        "encodingFormat": "image/png",
        "contentSize": 84006,
        "description": "",
        "@id": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        "identifier": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        "contentUrl": "https://demo.dataverse.org/api/access/datafile/26",
        "conditionsOfAccess": "(to be determined)"
    }
]
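
To make the difference between the two examples concrete, here is a minimal Python sketch that converts a 5.1.1-style entry into the proposed shape. The function `upgrade_entry` and its parameters are hypothetical, and the download URL pattern is only inferred from the example URLs in this issue. The sketch leaves conditionsOfAccess out because its values are still undecided.

```python
import json


def upgrade_entry(entry: dict, site_url: str, file_id: int) -> dict:
    """Return a copy of a 5.1.1-style DataDownload entry in the proposed shape.

    - "fileFormat" is renamed to "encodingFormat" (key order is preserved).
    - "contentUrl" is added when missing, so it is always present.
    """
    upgraded = {}
    for key, value in entry.items():
        upgraded["encodingFormat" if key == "fileFormat" else key] = value
    # Assumed URL pattern, based on the example URLs quoted in this issue.
    upgraded.setdefault("contentUrl", f"{site_url}/api/access/datafile/{file_id}")
    return upgraded


old_entry = {
    "@type": "DataDownload",
    "name": "DatasetDiagram.png",
    "fileFormat": "image/png",
    "contentSize": 84006,
    "description": "",
    "@id": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
    "identifier": "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
}

new_entry = upgrade_entry(old_entry, "https://demo.dataverse.org", 26)
print(json.dumps(new_entry, indent=4))
```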


jggautier · 2020-10-22T18:49:56Z

When adding conditionsOfAccess to the file metadata, should the values be binary, e.g. open and closed, like the following?

open (or a term like it) means there are no barriers to programmatic access to the file, e.g. the contentUrl works
closed (or a term like it) means there are barriers to programmatic access to the file, e.g. the contentUrl does not work
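
A minimal sketch of that binary rule in Python, assuming the three barriers mentioned in the first comment (file restriction, guestbook, Terms of Access) are the ones that break programmatic access. The function name and parameters are hypothetical, and "open" and "closed" are placeholder terms rather than values from an agreed vocabulary.

```python
def conditions_of_access(restricted: bool, has_guestbook: bool, has_terms_of_access: bool) -> str:
    """Return "open" when nothing blocks programmatic download, else "closed"."""
    has_barrier = restricted or has_guestbook or has_terms_of_access
    return "closed" if has_barrier else "open"


# An unrestricted file in a dataset with no guestbook or terms.
print(conditions_of_access(False, False, False))

# An unrestricted file whose dataset has a guestbook.
print(conditions_of_access(False, True, False))
```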

kitchenprinzessin3880 · 2020-10-23T09:07:29Z

@jggautier if you plan to use binary, maybe this property is more appropriate? https://schema.org/isAccessibleForFree

jggautier · 2020-10-26T15:11:51Z

That makes sense to me! I think that if we use that property this way, since you've been following this issue closely, you (and the tools you're helping develop) will know what it means for a file to be "isAccessibleForFree". Hopefully others who need to use this metadata will also be able to figure out how it's being used.

The Google Research group writes on page 3 of their "Google Dataset Search by the Numbers" article that the property "is a boolean value that indicates whether or not the dataset requires a payment", but then they describe how Google Dataset Search interprets a True value to mean "open" and similar to any of the "Creative Commons and open government licenses". So I think it's fair to expect that their interpretation, applied at the dataset level, should be applied at the file level, too, right? So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.

kitchenprinzessin3880 · 2020-10-26T15:33:42Z

> So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "accessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.

What is the @type at the file level? As long as it is a subtype of CreativeWork, the property can be applied.

By the way, the tool accepts both schema.org properties (isAccessibleForFree, conditionsOfAccess), which may be used to indicate access-level metadata of a dataset.

jggautier · 2020-10-28T13:34:48Z

Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?

I meant more that if I were looking to use the metadata to build a tool or query the repository and saw isAccessibleForFree: True (or False) in the datasets' Schema.org metadata, I wouldn't know what that means exactly. For example, you mentioned earlier that PANGAEA uses isAccessibleForFree and I can see it in the schema.org metadata for this dataset, but to figure out what that means, I'd have to find information that's not present in the metadata itself. The page for that PANGAEA dataset says I need to be logged in to download the data, but PANGAEA says elsewhere that downloading most of their datasets' files doesn't require login, like the dataset at https://doi.pangaea.de/10.1594/PANGAEA.921541, whose Schema.org metadata has isAccessibleForFree: True. So now I'm thinking that isAccessibleForFree is True for PANGAEA datasets if I don't have to log in to download the data. But I can't determine this by just looking at the Schema.org metadata.

Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what "free" means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what "free" means in this context. Or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already), maybe it's okay to expect people to assume that when data repositories use isAccessibleForFree for data files, it means either that there are one or more barriers to accessing the file (isAccessibleForFree: False) or that there are no barriers (isAccessibleForFree: True).

kitchenprinzessin3880 · 2020-10-28T14:09:54Z

> Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?

yup, Thing > CreativeWork > MediaObject > DataDownload, so the property can be used with DataDownload.
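
The subtype check can be sketched in Python as a walk up the type chain. The `PARENT` table below hard-codes only the chain quoted in this comment (Schema.org defines many more types), and the function name `is_subtype` is hypothetical.

```python
from typing import Dict, Optional

# Parent links for the chain quoted above; not a full copy of Schema.org.
PARENT: Dict[str, Optional[str]] = {
    "DataDownload": "MediaObject",
    "MediaObject": "CreativeWork",
    "CreativeWork": "Thing",
    "Thing": None,
}


def is_subtype(schema_type: str, ancestor: str) -> bool:
    """Walk parent links upward and report whether `ancestor` is reached."""
    current: Optional[str] = schema_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


# DataDownload inherits CreativeWork properties such as isAccessibleForFree.
print(is_subtype("DataDownload", "CreativeWork"))
```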

kitchenprinzessin3880 · 2020-10-28T14:22:32Z

> Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what free means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what free means in this context, or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already) it's okay to expect that people should assume that when data repositories use isAccessibleForFree for data files, that means either there are one or more barriers to accessing the file (false) or there are no barriers (true).

For PANGAEA, all public datasets are set to isAccessibleForFree = True; the remaining, restricted datasets (embargoed or requiring login) are set to False. In addition to this property, we also use the 'conditionsOfAccess' property to communicate the data access level.
I agree that the property 'isAccessibleForFree' is loosely defined and mainly specified for general search, so it's not 100% applicable to scientific datasets. Let me check with other data repositories....

@ashepherd, can you please let us know how you specify the data access level at science-on-schema.org?

jggautier · 2021-02-16T19:29:20Z

Speaking of science-on-schema.org, RDA's Research Metadata Schemas WG announced updated guidelines from the ESIP Schema.org cluster for using Schema.org to describe data. It's at https://github.com/ESIPFed/science-on-schema.org (and is summarized in the RDA WG's own report). Guidelines for describing datasets specifically are at https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md.

From a quick look it seems like the guide includes ways to add metadata that Dataverse isn't mapping to its Schema.org export, and it uses different elements and structures to include more metadata. When we tackle this issue (updating Dataverse's Schema.org export), I think we should learn how well these guidelines align with FAIRsFAIR's testing tools.

kitchenprinzessin3880 · 2021-02-24T02:48:38Z

@jggautier I skimmed through the guidelines. The fields they recommend are already considered by F-UJI when evaluating a dataset, except (1) catalog and (2) linking physical samples to a dataset. In any case, I will cross-check the schema.org mappings captured as part of the tool against the recommendations from ESIP. @huberrob

jggautier · 2021-04-02T01:05:49Z

DataONE hosted a community call on "Science on Schema.org Guidelines and Experiences" (https://www.dataone.org/community-calls/soso/). Collaborative notes from the meeting are posted at https://github.com/DataONEorg/community-calls/blob/master/notes/20210401_call_notes.md.

jggautier · 2021-06-21T17:58:00Z

The upcoming "multiple license" work (#7440, #7742) will change how license metadata is mapped to Dataverse repositories' Schema.org exports (as well as the other metadata exports), so I updated this issue's first comment to reflect those changes.

eunices · 2022-01-20T00:39:55Z

Just adding that the license's "@type" should be "CreativeWork", not "Dataset", based on our Rich Results Test.

https://support.google.com/webmasters/thread/146534613?hl=en&msgid=146553381#action=helpful

adam3smith · 2022-10-12T16:12:28Z

Some additional things that we're finding based on Google's validation:

- Creator is missing a type (and thus doesn't display on Google); see #5029, "Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's".
- Description needs to be truncated at 5,000 characters.
- Related publications need to have either a name or a URL.

We'd be interested in working on all of these. I think the only contentious one is #5029, so if we could come to a decision on what to do there, we could wrap this all in one PR.
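
The three findings can be expressed as small checks. This is only a sketch in Python: the function names are hypothetical, and the checks follow the wording of the list above, not Google's validator itself.

```python
MAX_DESCRIPTION_CHARS = 5000  # limit reported above from Google's validation


def truncate_description(description: str, limit: int = MAX_DESCRIPTION_CHARS) -> str:
    """Cut the description so it is at most `limit` characters long."""
    return description[:limit]


def publication_is_valid(publication: dict) -> bool:
    """A related publication needs a non-empty name or a non-empty URL."""
    return bool(publication.get("name") or publication.get("url"))


def creator_has_type(creator: dict) -> bool:
    """A creator entry must carry an @type to be displayed."""
    return bool(creator.get("@type"))


long_text = "x" * 6000
print(len(truncate_description(long_text)))
print(publication_is_valid({"url": "https://doi.org/10.7910/DVN/0PMZC6"}))
print(creator_has_type({"name": "Example Author"}))
```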

adam3smith · 2022-10-14T17:50:44Z

@jggautier and I have what we think is a good way forward on #5029, so I think this is pretty doable and we'll try to put it onto our roadmap at QDR.

pdurbin · 2022-10-16T01:44:05Z

Related (possibly a duplicate or sub-issue):

schema.org representation of license is incorrect #7574

jggautier added the Feature: Metadata label Oct 22, 2020

jggautier mentioned this issue Feb 4, 2021

schema.org representation of license is incorrect #7574

Closed

jggautier mentioned this issue Jan 7, 2022

Support for configurable list of licenses #7920

Merged

qqmyers mentioned this issue Oct 28, 2022

IQSS/7349-5 - Use brand name for catalog #9101

Merged

qqmyers mentioned this issue Jan 20, 2023

IQSS/9100 OpenAire update for orgs #9102

Merged

jggautier mentioned this issue Jan 26, 2023

Affiliations entered in affiliation fields are parenthesized in "Datacite" and Schema.org exports #9330

Open

kcondon closed this as completed in #9101 Jan 30, 2023

scolapasta reopened this Jan 30, 2023

kcondon closed this as completed in #9086 Jan 30, 2023

scolapasta reopened this Jan 30, 2023

kcondon closed this as completed in #9085 Jan 30, 2023

scolapasta reopened this Jan 30, 2023

kcondon closed this as completed in #9089 Feb 1, 2023

jggautier mentioned this issue Feb 5, 2024

Feature Request/Idea: Include Grant ID in "Schema.org JSON-LD" metadata export format #10296

Open
