authenticity and integrity of dcat files and associated datasets #1526
Thanks for the feedback. We discussed the issue of integrity and authenticity in the DXWG plenary; let me try to summarize part of that discussion below. The core of our work is DCAT as a metadata model, and integrity and authenticity seem to relate more to how DCAT is provided than to the DCAT model itself. We are reluctant to address issues that are not at the core of the group's mandate, and we want to avoid a DCAT-limited perspective that could later conflict with more dedicated solutions coming from new groups working on transversal technology, which might be chosen to deliver DCAT metadata. The RDF encoding, as one of the most typical ways to serve DCAT, illustrates these concerns: a DCAT encoding in RDF typically ends up in an RDF store or a file, and in the case of an RDF store it is the chosen software that needs to ensure integrity and authenticity. Can you live with this solution until the underlying RDF solutions are delineated further?
I'm not convinced that it's wholly out of scope. One of the only features being added in this version is a checksum property, which is apparently intended to provide security protections, but it doesn't provide the expected protections if there's no way to verify the integrity or authenticity of the DCAT metadata itself. I'm not sure whether the checksum property is defined fully enough to be used interoperably in general (is there implementation experience?), but that property assumes there is already a canonical way to refer to a distribution, if not a dataset. If it's not feasible to provide standardized functionality for authenticity and integrity of DCAT files (or other distributions of the metadata) in the short term, then I think it would be reasonable to:
Postponing features has to happen sometimes. But I would strongly recommend that there be a plan to address this in the future, rather than postponing it simply to avoid dealing with it. Accessing datasets that could be tampered with, or not knowing the provenance, authorship, or integrity of a dataset, is a real and significant threat; it affects far more than just the implementers of this spec. I don't think our long-term plan can be that W3C Recommendations provide no mechanism for basic, interoperable security properties and instead rely on the hope that every individual implementation or user will figure out its own way to provide security.
Thanks, Nick, for your suggestions; we've included them in the Security and Privacy section. See the second paragraph at https://w3c.github.io/dxwg/dcat/#security_and_privacy and please feel free to suggest improvements to the draft. If you can live with the current draft, we will backlog this issue for further consideration in the next standardization round of DCAT (e.g., DCAT 4).
We have acknowledged that in the new paragraph.
This solution is adopted by DCAT-AP 2.1.0. The checksum property range in
We agree that this is a pervasive and transversal issue that impacts every vocabulary the W3C recommends, and this is the main reason why the solution should be common to all vocabularies. The RDF Dataset Canonicalization and Hash Working Group will likely provide a foundation upon which RDF vocabularies can build. In any case, any further input to consider in the next standardization round is more than welcome.
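To see why canonicalization matters here, a small sketch may help: the same RDF graph can be serialized in many byte-for-byte different ways, so a naive checksum over the serialization distinguishes documents that are semantically identical. The two Turtle strings below are illustrative examples I made up, not taken from any spec; the point is only that byte-level hashing is not graph-level hashing.

```python
import hashlib

# Two serializations of the same single RDF triple: equivalent as RDF graphs,
# but different as byte streams (prefix and whitespace choices differ).
doc_a = "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n<urn:x:d> a dcat:Dataset .\n"
doc_b = "<urn:x:d> a <http://www.w3.org/ns/dcat#Dataset> .\n"

digest_a = hashlib.sha256(doc_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(doc_b.encode("utf-8")).hexdigest()

# A byte-level checksum treats the two documents as different even though
# they express the same graph; a canonicalization step (such as the one the
# RDF Dataset Canonicalization group is working on) is what would make a
# graph-level hash well defined.
print(digest_a == digest_b)  # False
```

This is the gap a common, cross-vocabulary canonicalization-and-hash solution would close.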
@npdoty, are you satisfied with what we have added in the spec and with @riccardoAlbertoni's response above, and can we close this issue?
As an addendum, I would add the effect of persistent URIs. Suppose I have found a dataset in a portal, e.g. data.europa.eu (…). If one does not trust the harvesting portal, one could, via this mechanism, find the source portal and the original metadata. Now, one could argue that one does not trust the response of the HTTP dereferencing either, but in the end that comes down to not trusting the source.

Note that DCAT is about metadata descriptions: it describes the rules for the use of the data that it describes. Thus the issue of trust is actually far more complex than having the "original metadata descriptions". Suppose one uses data found via a DCAT metadata description, through a super-secure data ecosystem, for a data-processing task that infringes legislation; the super-secure ecosystem for DCAT will not be an argument that the processing was permitted. This is all because DCAT does not provide data, but the means to access the data. Thus the trust and legal responsibility are transferred to the data provider/data processor (in GDPR terminology) that provides access to the real data.

I also want to note that DCAT does not imply sharing the data in RDF format. I hope we agree as a community that DCAT can be implemented in many technical ways, as long as the semantics are preserved.

ps. on the checksum: that is about the file the Distribution points to, not about the metadata (the dcat:Distribution).
It's an improvement to at least have these concerns noted in the spec.

By convention (and to make it parallel with the following section), I would suggest "Security and Privacy Considerations" as a title. I think "is also not guaranteed" should be "is not also guaranteed".

You might describe addressing these concerns at both the application level and the transport level -- that may be what you mean, but we would note that in the Web context an attacker could tamper with the contents between the server and client if a security-sensitive property like a checksum were delivered over an insecure transport.

This text seems to suggest that the checksum value and algorithm aren't typically sufficient for calculating and comparing checksums, and that a publisher should separately provide instructions so that a checksum can be accurately calculated. Have there been interoperable implementations that do calculate and compare these checksums? Or is it just a case-by-case manual review of the documentation followed by a calculation of a checksum? If the latter, I'm not clear what interoperability we are getting by adding it to the spec.

(Apologies for my belated review and follow-up.)
Thanks @npdoty, PR #1579 is a joint effort to elaborate the section on the basis of your observations; see how it looks at https://w3c.github.io/dxwg/dcat/#security_and_privacy

This new phrasing of the section is quite an improvement.
The spec should address providing integrity and authenticity of DCAT files and associated datasets.
As a security matter, it's not clear how the authenticity or integrity of metadata files or the associated datasets is assured. A checksum property for the dataset file is available (new in DCAT 3), but there seems to be a risk of a kind of downgrade attack here: someone tampering with the dataset might at the same time be able to tamper with the metadata and its checksum property.
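A minimal sketch of what checksum verification looks like from the consumer's side may make the limitation concrete. The function names and the idea of reading the declared digest from the metadata record are my own illustration, not API defined by DCAT; the point is that the check only detects accidental or third-party corruption, not an attacker who controls both the file and its metadata.

```python
import hashlib
import hmac

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def distribution_matches_checksum(path: str, declared_digest: str) -> bool:
    """Compare a locally computed digest with the value declared in the metadata.

    Note the limitation discussed above: if an attacker can rewrite both the
    dataset and the metadata that carries declared_digest, this check still
    passes -- a checksum alone provides integrity only relative to a trusted
    copy of the metadata.
    """
    return hmac.compare_digest(sha256_of_file(path), declared_digest.lower())
```

Without a separate authenticity mechanism for the metadata itself, the declared digest is only as trustworthy as the channel it arrived over.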
Authenticity and integrity might be important security properties to consider; signatures, and potentially the use of a public key infrastructure, might make it possible for a consumer of a dataset to confirm who it came from and that it was received without tampering.
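To illustrate the shape of such a mechanism, here is a sketch of a detached tag over serialized metadata, using an HMAC from the Python standard library. This is a simplification: HMAC needs a shared secret, whereas the PKI-based approach suggested above would use asymmetric signatures (e.g. Ed25519) so that any consumer can verify with a public key. The key and metadata values below are placeholders invented for the example.

```python
import hashlib
import hmac

def tag_metadata(key: bytes, metadata_bytes: bytes) -> str:
    """Produce a detached authentication tag over the serialized metadata."""
    return hmac.new(key, metadata_bytes, hashlib.sha256).hexdigest()

def verify_metadata(key: bytes, metadata_bytes: bytes, tag: str) -> bool:
    """Accept the metadata only if the tag matches (constant-time compare)."""
    return hmac.compare_digest(tag_metadata(key, metadata_bytes), tag)

key = b"shared-secret"  # placeholder; a real PKI would use asymmetric key pairs
metadata = b'{"title": "example dataset", "checksum": "..."}'  # hypothetical record
tag = tag_metadata(key, metadata)

print(verify_metadata(key, metadata, tag))                 # True
print(verify_metadata(key, metadata + b" tampered", tag))  # False
```

Because the tag covers the metadata (including any embedded checksum), tampering with the dataset and rewriting the checksum no longer goes undetected, which is exactly the downgrade risk described above.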