Improve/update Schema.org JSON-LD export #7349

Closed
jggautier opened this issue Oct 22, 2020 · 15 comments · Fixed by #9085, #9086, #9087, #9089 or #9101

Comments

@jggautier
Contributor

jggautier commented Oct 22, 2020

In a meeting with folks from the FAIRsFAIR group (namely @kitchenprinzessin3880) who are building and testing tools to assess the "FAIRness" of datasets in Dataverse repositories (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments), some changes were recommended for the metadata that Dataverse includes in the Schema.org JSON-LD metadata it exports for datasets. I said I'd open a GitHub issue so we could record and explain these changes.

For the license property, use the @type "CreativeWork" and use "name" instead of "text":
As of Dataverse 5.1.1, the @type for the Schema.org property "license" is "Dataset". Here's an example of what that looks like:

license: {
   @type: "Dataset",
   text: "CC0",
   url: "https://creativecommons.org/publicdomain/zero/1.0/"
}

Or, if CC0 is waived:

license: {
   @type: "Dataset",
   text: "Text the depositor entered in the Terms of Use field"
}

Google's guide for describing datasets with Schema.org says to use the "CreativeWork" @type for license and to use "name".

Here's an example of what the license metadata in the Schema.org export might look like when a pull request for this issue is merged (after the "multiple license" work described at #7440 and #7742 is also merged):

If the dataset depositor chooses a license from the list of licenses:

license: {
   @type: "CreativeWork",
   name: "CC0",
   url: "https://creativecommons.org/publicdomain/zero/1.0/"
}

license: {
   @type: "CreativeWork",
   name: "CC BY",
   url: "https://creativecommons.org/licenses/by/4.0/"
}

Or if no license is chosen and a custom license is entered:

license: {
   @type: "CreativeWork",
   name: "Text entered in the 'Dataset Terms' fields"
}
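As a rough illustration of the proposed license mapping, here is a minimal Python sketch. The helper name `build_license` is hypothetical and is not part of Dataverse; the sketch assumes the Schema.org type name is the singular "CreativeWork" and that custom terms have no URL to include.

```python
import json


def build_license(name, url=None):
    """Return a Schema.org license object typed as CreativeWork.

    Hypothetical helper for illustration only; not Dataverse code.
    """
    license_obj = {"@type": "CreativeWork", "name": name}
    if url:
        # Only licenses picked from the list carry a URL; custom terms do not.
        license_obj["url"] = url
    return license_obj


cc0 = build_license("CC0", "https://creativecommons.org/publicdomain/zero/1.0/")
custom = build_license("Text entered in the 'Dataset Terms' fields")

# Serialize the way the property might appear inside the dataset's JSON-LD.
print(json.dumps({"license": cc0}, indent=2))
```

The same function covers both cases described above: a license chosen from the list (name plus url) and custom terms (name only).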

For files (in the "distribution" property):
As of Dataverse 5.1.1, here's an example of what the file metadata in the Schema.org export looks like:

distribution: [
    {
        @type: "DataDownload",
        name: "cases_by_infection.tab",
        fileFormat: "text/tab-separated-values",
        contentSize: 56377,
        description: "",
        @id: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        identifier: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        contentUrl: "https://demo.dataverse.org/api/access/datafile/1653135"
    },
    {
        @type: "DataDownload",
        name: "DatasetDiagram.png",
        fileFormat: "image/png",
        contentSize: 84006,
        description: "",
        @id: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        identifier: "https://doi.org/10.70122/FK2/LQKU61/SBJENW"
    }
]

Here are the changes related to file metadata being proposed in this GitHub issue:

Here's an example of what the file metadata in the Schema.org export might look like when a pull request for this issue is merged:

distribution: [
    {
        @type: "DataDownload",
        name: "cases_by_infection.tab",
        encodingFormat: "text/tab-separated-values",
        contentSize: 56377,
        description: "",
        @id: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        identifier: "https://doi.org/10.70122/FK2/LQKU61/SCM19X",
        contentUrl: "https://demo.dataverse.org/api/access/datafile/1653135",
        conditionsOfAccess: (to be determined)
    },
    {
        @type: "DataDownload",
        name: "DatasetDiagram.png",
        encodingFormat: "image/png",
        contentSize: 84006,
        description: "",
        @id: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        identifier: "https://doi.org/10.70122/FK2/LQKU61/SBJENW",
        contentUrl: "https://demo.dataverse.org/api/access/datafile/26",
        conditionsOfAccess: (to be determined)
    }
]
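To make the difference between the two examples concrete, here is a small Python sketch that turns a 5.1.1-style entry into the proposed shape. The function name `upgrade_distribution` is hypothetical, and the sketch leaves out conditionsOfAccess because its values are still undecided here.

```python
def upgrade_distribution(entry, content_url=None):
    """Return a copy of a DataDownload entry using the proposed property names.

    Hypothetical helper for illustration only; not Dataverse code.
    """
    upgraded = dict(entry)  # leave the caller's dict untouched
    # The proposed export uses encodingFormat in place of fileFormat.
    if "fileFormat" in upgraded:
        upgraded["encodingFormat"] = upgraded.pop("fileFormat")
    # The proposed export shows a contentUrl for every file.
    if content_url and "contentUrl" not in upgraded:
        upgraded["contentUrl"] = content_url
    return upgraded


old_entry = {
    "@type": "DataDownload",
    "name": "DatasetDiagram.png",
    "fileFormat": "image/png",
    "contentSize": 84006,
}
new_entry = upgrade_distribution(
    old_entry, "https://demo.dataverse.org/api/access/datafile/26"
)
print(new_entry)
```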
@jggautier
Contributor Author

jggautier commented Oct 22, 2020

When adding conditionsOfAccess to the file metadata, should the values be binary, e.g. open and closed, as in the following?

  • open (or a term like it) means there are no barriers to programmatic access to the file, e.g. the contentUrl works
  • closed (or a term like it) means there are barriers to programmatic access to the file, e.g. the contentUrl does not work
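A minimal sketch of the binary idea above, in Python. The helper name `access_condition` is hypothetical, the labels "open" and "closed" are only the placeholder terms from the list, and the sketch assumes the caller already knows whether the contentUrl works.

```python
def access_condition(content_url_works):
    """Map a yes/no accessibility check to the two placeholder labels."""
    return "open" if content_url_works else "closed"


entry = {
    "@type": "DataDownload",
    "name": "cases_by_infection.tab",
    "contentUrl": "https://demo.dataverse.org/api/access/datafile/1653135",
}
# Assumption: the file is downloadable without any barrier.
entry["conditionsOfAccess"] = access_condition(content_url_works=True)
print(entry["conditionsOfAccess"])
```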

@kitchenprinzessin3880

@jggautier if you plan to use binary, maybe this property is more appropriate? https://schema.org/isAccessibleForFree

@jggautier
Contributor Author

jggautier commented Oct 26, 2020

That makes sense to me! I think that if we use that property this way, since you've been following this issue closely, you (and the tools you're helping develop) will know what it means for a file to be "isAccessibleForFree". Hopefully others who need to use this metadata will also be able to figure out how it's being used.

The Google Research group writes on page 3 of their "Google Dataset Search by the Numbers" article that the property "is a boolean value that indicates whether or not the dataset requires a payment", but then they describe how Google Dataset Search interprets a True value to mean "open" and similar to any of the "Creative Commons and open government licenses". So I think it's fair to expect that their interpretation, applied at the dataset level, should be applied at the file level, too, right? So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.

@kitchenprinzessin3880

kitchenprinzessin3880 commented Oct 26, 2020

So it shouldn't be hard for others who need to use this metadata to figure out that a file flagged as "isAccessibleForFree" is open to some degree, although the exact degree (programmatic access to the file) might not be apparent by just looking at the metadata.

What is the @type at the file level? As long as it is a subtype of CreativeWork, the property can be applied.

By the way, the tool accepts both Schema.org properties (isAccessibleForFree, conditionsOfAccess) that may be used to indicate the access level of a dataset.

@jggautier
Contributor Author

jggautier commented Oct 28, 2020

Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?

I meant more that if I were looking to use the metadata to build a tool or query the repository and saw isAccessibleForFree: True (or False) in the datasets' Schema.org metadata, I wouldn't know what that means exactly. For example, you mentioned earlier that PANGAEA uses isAccessibleForFree, and I can see it in the Schema.org metadata for this dataset, but to figure out what that means, I'd have to find information that's not present in the metadata itself. The page for that PANGAEA dataset says I need to be logged in to download the data, but PANGAEA says elsewhere that downloading most of their datasets' files doesn't require login, like the dataset at https://doi.pangaea.de/10.1594/PANGAEA.921541, whose Schema.org metadata has isAccessibleForFree: True. So now I'm thinking that isAccessibleForFree is True for PANGAEA datasets if I don't have to log in to download the data. But I can't determine this by just looking at the Schema.org metadata.

Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what "free" means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what "free" means in this context. Or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already), it's okay to expect people to assume that when data repositories use isAccessibleForFree for data files, it means either that there are one or more barriers to accessing the file (isAccessibleForFree: False) or that there are no barriers (isAccessibleForFree: True).

@kitchenprinzessin3880

kitchenprinzessin3880 commented Oct 28, 2020

Isn't "DataDownload" the @type at the file level? That's what's used in this issue's first comment. https://schema.org/DataDownload lists isAccessibleForFree, so I think it can be applied then?

yup, Thing > CreativeWork > MediaObject > DataDownload, so the property can be used with DataDownload.
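The inheritance chain above can be checked mechanically. This is only a sketch in Python: the `PARENT` table is a hand-written subset of the Schema.org hierarchy (an assumption, not fetched from schema.org), and `is_subtype` is a hypothetical helper name.

```python
# Hand-written subset of the Schema.org type hierarchy (child -> parent).
PARENT = {
    "DataDownload": "MediaObject",
    "MediaObject": "CreativeWork",
    "Dataset": "CreativeWork",
    "CreativeWork": "Thing",
}


def is_subtype(schema_type, ancestor):
    """Walk up the parent links and report whether ancestor is reached."""
    current = schema_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # None once we pass the root
    return False


# isAccessibleForFree is defined on CreativeWork, so any subtype may use it.
print(is_subtype("DataDownload", "CreativeWork"))
```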

@kitchenprinzessin3880

kitchenprinzessin3880 commented Oct 28, 2020

Does this make the metadata less FAIR? The definition of the isAccessibleForFree property doesn't define what "free" means. But maybe it's okay to expect people who need to programmatically determine a file's access level to do a little investigation into what "free" means in this context. Or, if it's already common practice to use isAccessibleForFree the way we've proposed (PANGAEA, and maybe other repositories, seem to be using it this way already), it's okay to expect people to assume that when data repositories use isAccessibleForFree for data files, it means either that there are one or more barriers to accessing the file (false) or that there are no barriers (true).

For PANGAEA, all public datasets are set to isAccessibleForFree = True; the remaining, restricted datasets (embargoed or requiring login) are set to False. In addition to this property, we also use the 'conditionsOfAccess' property to communicate the data access level.
I agree that the property 'isAccessibleForFree' is loosely defined and mainly specified for general search, so it is not 100% applicable to scientific datasets. Let me check with other data repositories...
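The rule described above (public means True, embargoed or login-required means False) could be sketched in Python as follows. The function name and the two boolean inputs are hypothetical stand-ins for illustration; this is not PANGAEA's actual code.

```python
def is_accessible_for_free(embargoed, requires_login):
    """Public datasets map to True; embargoed or login-only ones map to False."""
    return not (embargoed or requires_login)


def access_fragment(embargoed, requires_login):
    """Build the JSON-LD fragment that carries the flag for one dataset."""
    return {
        "@type": "Dataset",
        "isAccessibleForFree": is_accessible_for_free(embargoed, requires_login),
    }


public = access_fragment(embargoed=False, requires_login=False)
restricted = access_fragment(embargoed=False, requires_login=True)
print(public, restricted)
```

Note that the flag alone does not say which barrier applies, which is why a second property such as conditionsOfAccess is used alongside it.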

@ashepherd, can you please let us know how you specify the data access level at science-on-schema.org?

@jggautier
Contributor Author

jggautier commented Feb 16, 2021

Speaking of science-on-schema.org, RDA's Research Metadata Schemas WG announced updated guidelines from the ESIP Schema.org cluster for using Schema.org to describe data. It's at https://github.com/ESIPFed/science-on-schema.org (and is summarized in the RDA WG's own report). Guidelines for describing datasets specifically are at https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md.

From a quick look, it seems the guide includes ways to add metadata that Dataverse isn't mapping to its Schema.org export, and it uses different elements and structures to include more metadata. When we tackle this issue (updating Dataverse's Schema.org export), I think we should learn how well these guidelines align with FAIRsFAIR's testing tools.

@kitchenprinzessin3880

kitchenprinzessin3880 commented Feb 24, 2021

@jggautier I skimmed through the guidelines. The recommended fields are already considered by F-UJI when evaluating a dataset, except for (1) the catalog and (2) linking physical samples to a dataset. In any case, I will cross-check the schema.org mappings captured as part of the tool against the recommendations from ESIP. @huberrob

@jggautier
Contributor Author

jggautier commented Apr 2, 2021

DataONE hosted a community call on "Science on Schema.org Guidelines and Experiences" (https://www.dataone.org/community-calls/soso/). Collaborative notes from the meeting are posted at https://github.com/DataONEorg/community-calls/blob/master/notes/20210401_call_notes.md.

@jggautier
Contributor Author

The upcoming "multiple license" work (#7440, #7742) will change how license metadata is mapped to Dataverse repositories' Schema.org exports (as well as the other metadata exports), so I updated this issue's first comment to reflect those changes.

@eunices
Contributor

eunices commented Jan 20, 2022

Just adding that the license's "@type" should be "CreativeWork", not "Dataset", based on our Rich Results Test.

https://support.google.com/webmasters/thread/146534613?hl=en&msgid=146553381#action=helpful

@adam3smith
Contributor

Some additional things that we're finding based on Google's validation:

We'd be interested in working on all of these. I think the only contentious one is #5029, so if we could come to a decision on what to do there, we could wrap this all up in one PR.

@adam3smith
Contributor

@jggautier and I have what we think is a good way forward on #5029, so I think this is pretty doable, and we'll try to put it onto our roadmap at QDR.

@pdurbin
Member

pdurbin commented Oct 16, 2022

Related (possibly a duplicate or sub-issue):
