Provenance data: BRs missing primary source and BRs with more than one primary source for a single snapshot #19

Open
eliarizzetto opened this issue Feb 28, 2024 · 0 comments
In version 5 of “OpenCitations Meta RDF dataset of all bibliographic metadata and its provenance information” (https://doi.org/10.6084/m9.figshare.21747536.v5), there are 105,944,601 provenance graph objects for bibliographic resources (BRs). This does not match the number of bibliographic resources in the triplestore (105,953,699, according to the dump’s record metadata on its Figshare page): the two counts differ by 9,098.
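
To find out which BRs account for the gap between the two counts, the BR IRIs could be compared with the IRIs of the provenance graphs. The following is only a minimal sketch: it assumes that a provenance graph IRI is always the BR IRI followed by `/prov/` (as in the examples in this issue), the function names are hypothetical, and holding over 100 million IRIs in Python sets may need a lot of memory.

```python
def br_iri_from_prov_graph(prov_graph):
    """Derive the BR IRI from a provenance graph '@id' of the form '<br-iri>/prov/'."""
    graph_id = prov_graph['@id']
    suffix = '/prov/'
    if not graph_id.endswith(suffix):
        # The '<br-iri>/prov/' pattern is an assumption based on the examples in this issue.
        raise ValueError(f'Unexpected provenance graph IRI: {graph_id}')
    return graph_id[:-len(suffix)]


def brs_without_provenance(br_iris, prov_graphs):
    """Return the set of BR IRIs that have no matching provenance graph."""
    covered = {br_iri_from_prov_graph(graph) for graph in prov_graphs}
    return set(br_iris) - covered
```

Here `br_iris` would be any iterable of BR IRIs read from the data dump, and `prov_graphs` any iterable of provenance graph objects.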

Additionally, a very small number (5) of bibliographic resources have provenance data with no primary source at all: the http://www.w3.org/ns/prov#hadPrimarySource property is missing from every snapshot in their provenance graph.

E.g.:

{'@graph': [{'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/1', '@type': ['http://www.w3.org/ns/prov#Entity'], 'http://www.w3.org/ns/prov#invalidatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-12T13:31:24+00:00'}]}, {'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/2', '@type': ['http://www.w3.org/ns/prov#Entity'], 'http://purl.org/dc/terms/description': [{'@value': "The entity 'https://w3id.org/oc/meta/br/0603904264' has been deleted."}], 'http://www.w3.org/ns/prov#generatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-12T13:31:24+00:00'}], 'http://www.w3.org/ns/prov#invalidatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-12T13:31:24+00:00'}], 'http://www.w3.org/ns/prov#specializationOf': [{'@id': 'https://w3id.org/oc/meta/br/0603904264'}], 'http://www.w3.org/ns/prov#wasAttributedTo': [{'@id': 'https://w3id.org/oc/meta/prov/pa/1'}], 'http://www.w3.org/ns/prov#wasDerivedFrom': [{'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/1'}], 'https://w3id.org/oc/ontology/hasUpdateQuery': [{'@value': 'DELETE DATA { GRAPH <https://w3id.org/oc/meta/br/> { <https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/datacite/hasIdentifier> <https://w3id.org/oc/meta/id/0603634322> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/vocab/frbr/core#partOf> <https://w3id.org/oc/meta/br/061106396> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663864> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663859> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/datacite/hasIdentifier> <https://w3id.org/oc/meta/id/0603634323> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/dc/terms/title> "Bariatric Surgery Outcomes: A Single-Center Study In The United Arab Emirates." 
.<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663860> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663862> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663861> .<https://w3id.org/oc/meta/br/0603904264> <http://purl.org/spar/pro/isDocumentContextFor> <https://w3id.org/oc/meta/ar/06015663863> .<https://w3id.org/oc/meta/br/0603904264> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/spar/fabio/Expression> .<https://w3id.org/oc/meta/br/0603904264> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/spar/fabio/JournalArticle> .<https://w3id.org/oc/meta/br/0603904264> <http://prismstandard.org/namespaces/basic/2.0/publicationDate> "2015" . } }'}]}], '@id': 'https://w3id.org/oc/meta/br/0603904264/prov/'}
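
A per-snapshot check can make it easier to see where the source is missing. The sketch below is illustrative only: the function name is hypothetical, and `trimmed_graph` is a cut-down copy of the example above that keeps just a few properties.

```python
PRIMARY_SOURCE = 'http://www.w3.org/ns/prov#hadPrimarySource'


def snapshots_without_primary_source(prov_graph):
    """Return the '@id' of every snapshot that has no prov:hadPrimarySource value."""
    return [
        snapshot['@id']
        for snapshot in prov_graph['@graph']
        if not snapshot.get(PRIMARY_SOURCE)
    ]


# Cut-down copy of the example above (most properties omitted for brevity).
trimmed_graph = {
    '@id': 'https://w3id.org/oc/meta/br/0603904264/prov/',
    '@graph': [
        {'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/1'},
        {'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/2',
         'http://www.w3.org/ns/prov#wasDerivedFrom': [
             {'@id': 'https://w3id.org/oc/meta/br/0603904264/prov/se/1'}]},
    ],
}

print(snapshots_without_primary_source(trimmed_graph))
```

For this graph both snapshots are reported, since neither carries the property.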

Moreover, in the same dataset, the provenance graphs of 91,284 bibliographic resources contain at least one snapshot with more than one primary source, whereas we would expect a distinct snapshot for each primary source.

E.g.:

{'@graph': [{'@id': 'https://w3id.org/oc/meta/br/060292/prov/se/1', '@type': ['http://www.w3.org/ns/prov#Entity'], 'http://purl.org/dc/terms/description': [{'@value': "The entity 'https://w3id.org/oc/meta/br/060292' has been created."}], 'http://www.w3.org/ns/prov#generatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-01-23T17:56:32+00:00'}, {'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-24T07:17:30+00:00'}], 'http://www.w3.org/ns/prov#hadPrimarySource': [{'@id': 'https://api.datacite.org/'}, {'@id': 'https://doi.org/10.5281/zenodo.7845968'}], 'http://www.w3.org/ns/prov#invalidatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-22T19:40:25+00:00'}], 'http://www.w3.org/ns/prov#specializationOf': [{'@id': 'https://w3id.org/oc/meta/br/060292'}], 'http://www.w3.org/ns/prov#wasAttributedTo': [{'@id': 'https://w3id.org/oc/meta/prov/pa/1'}]}, {'@id': 'https://w3id.org/oc/meta/br/060292/prov/se/2', '@type': ['http://www.w3.org/ns/prov#Entity'], 'http://purl.org/dc/terms/description': [{'@value': "The entity 'https://w3id.org/oc/meta/br/060292' has been modified."}], 'http://www.w3.org/ns/prov#generatedAtTime': [{'@type': 'http://www.w3.org/2001/XMLSchema#dateTime', '@value': '2023-06-22T19:40:25+00:00'}], 'http://www.w3.org/ns/prov#hadPrimarySource': [{'@id': 'https://doi.org/10.5281/zenodo.7845968'}], 'http://www.w3.org/ns/prov#specializationOf': [{'@id': 'https://w3id.org/oc/meta/br/060292'}], 'http://www.w3.org/ns/prov#wasAttributedTo': [{'@id': 'https://w3id.org/oc/meta/prov/pa/1'}], 'http://www.w3.org/ns/prov#wasDerivedFrom': [{'@id': 'https://w3id.org/oc/meta/br/060292/prov/se/1'}], 'https://w3id.org/oc/ontology/hasUpdateQuery': [{'@value': 'INSERT DATA { GRAPH <https://w3id.org/oc/meta/br/> { <https://w3id.org/oc/meta/br/060292> <http://purl.org/spar/datacite/hasIdentifier> <https://w3id.org/oc/meta/id/0603696871> . 
} }'}]}], '@id': 'https://w3id.org/oc/meta/br/060292/prov/'}
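
In the example above, snapshot `se/1` has two values not only for the primary source but also for `generatedAtTime`. A small helper could report every property that is repeated within a snapshot. This is a sketch under assumptions: the function name is hypothetical, and treating `generatedAtTime` and `invalidatedAtTime` as single-valued per snapshot is my assumption, not something confirmed by the data model documentation.

```python
PROV = 'http://www.w3.org/ns/prov#'

# Properties assumed (not confirmed) to hold at most one value per snapshot.
SINGLE_VALUED = (
    PROV + 'hadPrimarySource',
    PROV + 'generatedAtTime',
    PROV + 'invalidatedAtTime',
)


def multi_valued_properties(prov_graph, properties=SINGLE_VALUED):
    """Map each snapshot '@id' to the listed properties that hold more than one value."""
    report = {}
    for snapshot in prov_graph['@graph']:
        repeated = [p for p in properties if len(snapshot.get(p, [])) > 1]
        if repeated:
            report[snapshot['@id']] = repeated
    return report
```

Snapshots with at most one value per property are left out of the returned dictionary, so an empty result means the graph passed the check.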

These observations can be reproduced with the following script:

import json
from zipfile import ZipFile
from tqdm import tqdm
import os
from os.path import join

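def get_provenance_data(br_rdf_path):
    """Yield every provenance graph object found in the nested 'prov/se.zip' archives of the BR dump."""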
    with ZipFile(br_rdf_path) as archive:
        for filepath in archive.namelist():
            if filepath.endswith('prov/se.zip'):
                with ZipFile(archive.open(filepath)) as prov_archive:
                    for prov_file in prov_archive.namelist():
                        if prov_file.endswith('se.json'):
                            with prov_archive.open(prov_file) as f:
                                data: list = json.load(f)
                                for obj in data:
                                    yield obj


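def check_provenance(rdf_path, outdir):
    """Count and log the provenance graphs that have no primary source, no
    prov:specializationOf, or a snapshot with more than one primary source."""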
    nosource_brs_file = join(outdir, 'nosource.txt')
    multisource_snapshots_file = join(outdir, 'multisource.txt')
    nospec_brs_file = join(outdir, 'nospec.txt')
    multisource_count = 0
    nosource_count = 0
    nospec_count = 0
    tot_prov_graphs = 0
    with open(nosource_brs_file, 'w') as nosourcef, open(multisource_snapshots_file, 'w') as multisourcef, open(nospec_brs_file, 'w') as nospecf:
        for prov_graph in tqdm(get_provenance_data(rdf_path)):
            tot_prov_graphs += 1
            out_row = dict()
            out_row['source'] = set()
            out_row['br'] = set()
            out_row['multisource'] = False
            for snapshot in prov_graph['@graph']:
                primary_source = snapshot.get('http://www.w3.org/ns/prov#hadPrimarySource')  # list|None
                if primary_source:
                    for i in primary_source:
                        out_row['source'].add(i['@id'])
                    if not out_row['multisource']:
                        if len(primary_source) > 1:
                            # ... add graph to multisource file + count multi_source
                            multisourcef.write(str(prov_graph)+'\n')
                            multisource_count += 1
                            out_row['multisource'] = True

                if snapshot.get('http://www.w3.org/ns/prov#specializationOf'):
                    out_row['br'].add(snapshot['http://www.w3.org/ns/prov#specializationOf'][0]['@id'])
                    
            if not out_row['source']:
                # ... add graph to nosource file + count nosource
                nosourcef.write(str(prov_graph)+'\n')
                nosource_count += 1

            if not out_row['br']:
                # ... add graph to nospec file + count nospec
                nospecf.write(str(prov_graph)+'\n')
                nospec_count += 1
    
    print('Total number of provenance graph objects: ', tot_prov_graphs)
    print('BRs with no primary source specified: ', nosource_count)
    print('BRs with no #specializationOf property: ', nospec_count)
    print('BRs with one or more snapshot(s) that have multiple sources specified within: ', multisource_count)

    return tot_prov_graphs, nosource_count, nospec_count, multisource_count

if __name__ == '__main__':
    rdfpath = 'path/to/rdf/br.zip'  # path to the zip archive containing bibliographic resources
    outdir = 'tmp'
    os.makedirs(outdir, exist_ok=True)
    print(check_provenance(rdfpath, outdir))
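
Note that the script writes each graph with `str(prov_graph)`, so the output files contain Python dict literals (single-quoted), not valid JSON. To load them back for further inspection, something like the following sketch could be used; `read_graphs` is a hypothetical helper, and it assumes one graph per line, as written by the script above.

```python
import ast


def read_graphs(path):
    """Yield the provenance graphs stored one per line as Python dict literals."""
    # Opened with the platform default encoding, like the script that wrote the file.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval parses Python literals only; it does not execute code.
                yield ast.literal_eval(line)
```

For example, `next(read_graphs('tmp/nosource.txt'))` would return the first graph with no primary source as a dictionary.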
@arcangelo7 arcangelo7 added the bug Something isn't working label Mar 7, 2024
@arcangelo7 arcangelo7 self-assigned this Mar 7, 2024