Scheming support #281
Conversation
This allows checking whether a field should be stored as a custom field or an extra
…ext-dcat into 56-add-schema-file-dcat-ap-2.1
ckanext/dcat/profiles.py
@@ -2003,3 +2093,122 @@ def _distribution_url_graph(self, distribution, resource_dict):
    def _distribution_numbers_graph(self, distribution, resource_dict):
        if resource_dict.get('size'):
            self.g.add((distribution, SCHEMA.contentSize, Literal(resource_dict['size'])))

    # TODO: split all these classes in different files
+1
Done in #282
@amercader I've been working on DCAT this week, including adding spec-compliant HVD 2.2.0 output and scheming portions to the current 1.7 version (somewhat split across dcat and our schema extension at the moment). A couple of things have come up for making the output compliant with the HVD SHACL files (https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#validation):
gives us something like this:
The codelists get rendered like this in the .ttl:
@EricSoroos thanks for the feedback:
Great to see you are working on this same area. If you have any feedback on the general approach followed for scheming support it would be great to hear it.
@seitenbau-govdata @bellisk would love to know your take on this, and see if this approach would play well with how you are using ckanext-dcat
In terms of HVD support, the current EU DCAT 2 implementation is close; at least, it has all of the fields. This commit (derilinx/ckanext-dcat@d5ef9f4) is the difference, and it's only the codelist and two types that were required. There are some other compliance issues, like the requirement that one of license or rights needs to be available, and that applicable_legislation has to have at least one specific value. I'm looking at validation-level stuff for those (legislation already done, license/rights not).

I'm at the point of thinking that these things are more general, so I'm not clear that we'd necessarily want to add a separate profile for this: inheritance is really tricky when you're blatting items into a graph and may need to override just one piece of it. From what I can tell, the extra profiles tend to be aggregative and compatible, so realistically there are potentially a few extra fields per entity and/or additional codes/required fields. Also, I think that the changes here are more of the form of "potentially backwards-incompatible fixes to the implementation" rather than actually adding support for the profile. FWIW, I think this has been the general take previously, e.g. the geo fields are added from GeoDCAT.

For the codelists (at least on the scheming side), I've got something like this in my schema:
And then the choices_helper is this:
The codelist directory has the
Right now, this is spread over my schema plugin and the dcat plugin, but the next iteration is going to need to pull the codelists into dcat so that I can kick out the prefLabels there.
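For reference, a rough sketch of what a codelist-backed choices_helper of the kind described above can look like (the file layout, JSON format and names here are assumptions, not the actual code from that setup):

```python
# Hypothetical sketch of a codelist-backed choices_helper for ckanext-scheming.
import json
import os

CODELIST_DIR = os.path.join(os.path.dirname(__file__), "codelists")


def codelist_choices(field):
    """Return scheming choices loaded from a codelist file.

    Assumes the schema field carries a ``codelist`` key naming the file,
    e.g. ``codelist: frequency`` -> ``codelists/frequency.json`` with
    entries like ``{"uri": "...", "label": "..."}``.
    """
    codelist = field.get("codelist")
    if not codelist:
        return []
    path = os.path.join(CODELIST_DIR, f"{codelist}.json")
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return [
        {"value": entry["uri"], "label": entry.get("label", entry["uri"])}
        for entry in entries
    ]
```

Such a helper would be registered via ITemplateHelpers and referenced from the schema field with a `choices_helper` entry.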
ckanext/dcat/plugins/__init__.py
for index, item in enumerate(dataset_dict[field['field_name']]):
    for key in item:
        # Index a flattened version
        new_key = f'{field["field_name"]}_{index}_{key}'
IIUC this would let us ask solr for things like 'datasets with the 6th name field in contacts equal to "frank"', but not 'datasets containing any contact named "frank"'. If we combined the same keys from all subfields into a single field like
new_key = f'{field["field_name"]}__{key}'
then we could, right? We're not doing dynamic solr schemas so we'd have to combine everything as text fields but it should work for text searches.
Also nothing here is specific to dcat so shouldn't it be a scheming PR?
Right, given this:
With your suggestion you need to run a query like q=contact__name:*Racoon* to get a hit; with the original it would be impossible to get this hit without knowing the index, so that's obviously better.
But I think users would expect to find these results in a free text search as well. For this, the subfields need to be indexed in an extras_* field, as these are copied to the catch-all text field. So this would allow just doing q=Racoon and getting a result back, so maybe indexing new_key = f'extras_{field["field_name"]}__{key}' with the combined key values is the better approach.
Also nothing here is specific to dcat so shouldn't it be a scheming PR?
Sure, I was just getting things working in this extension. Do you want to replace the logic in SchemingNerfIndexPlugin with this one or create a separate plugin? I'll send a PR
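A minimal sketch of the combined-key indexing idea discussed in this thread, assuming it runs from an indexing hook such as IPackageController.before_dataset_index and that the scheming schema dict is available (the function name and schema access are illustrative, not the code from the PR):

```python
import json


def flatten_repeating_subfields(dataset_dict, schema):
    """Flatten repeating subfields into combined extras_* keys for Solr.

    Using an ``extras_`` prefix means the values get copied into the
    catch-all ``text`` field, so ``q=Racoon`` matches any contact name.
    """
    for field in schema.get("dataset_fields", []):
        if "repeating_subfields" not in field:
            continue
        items = dataset_dict.get(field["field_name"]) or []
        if isinstance(items, str):
            items = json.loads(items)
        combined = {}
        for item in items:
            for key, value in item.items():
                if value:
                    combined.setdefault(key, []).append(str(value))
        for key, values in combined.items():
            # e.g. extras_contact__name = "Racoon Jones Jane Doe"
            dataset_dict[f'extras_{field["field_name"]}__{key}'] = " ".join(values)
        # Keep the original value as a JSON string, which Solr can store as text.
        dataset_dict[field["field_name"]] = json.dumps(items)
    return dataset_dict
```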
- field_name: endpoint_url
  label: Endpoint URL
  preset: multiple_text
WDYT about tighter validation on this schema, e.g. BCP47 values in language, valid emails, URIs and URLs?
I think it's a great idea, but I want to do it in a second stage once all the fields are defined in the schema and the general approach is validated. I'll start to compile a list of possible validations; these are all great candidates.
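As a rough illustration of what some of those candidate validators could look like (illustrative only, not part of this PR; the BCP 47 check is deliberately loose):

```python
import re
from urllib.parse import urlparse

from ckan.plugins.toolkit import Invalid

# Loose BCP 47 shape: primary language subtag plus optional subtags,
# e.g. "en", "en-GB", "es-419". Not a full RFC 5646 grammar.
BCP47_PATTERN = re.compile(r"^[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*$")


def url_validator(value):
    """Reject values that are not http(s) URLs."""
    if not value:
        return value
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise Invalid("Value must be a valid http(s) URL")
    return value


def bcp47_validator(value):
    """Reject values that don't look like a BCP 47 language tag."""
    if value and not BCP47_PATTERN.match(value):
        raise Invalid("Value must be a BCP 47 language tag, e.g. 'en' or 'en-GB'")
    return value
```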
Mostly taken from the DCAT-AP 2.1 spec doc, adapted for CKAN
I think this is now ready to go; any further work should be done in separate PRs as this has grown quite a lot. Highlights are:
This looks good. I would be tempted to put more of the logic in the schemas, but this extension needs to maintain backwards compatibility and ckanext-scheming-less operation, so your approach makes sense.
Co-authored-by: Ian Ward <[email protected]>
As this is a `text` field that allows free text search
Scheming adds a dict with empty keys when empty repeating subfields are submitted from the form. Check that there's an actual value before creating the triples when serializing
except Invalid:
    raise Invalid(
        _(
            "Date format incorrect. Supported formats are YYYY, YYYY-MM, YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS"
This is overly restrictive. The first datetime I attempted to parse with the dcat harvester was of the form YYYY-MM-DDTHH:MM:SS.000Z, which is permitted by ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601) and the xsd:dateTime spec (https://www.w3.org/TR/xmlschema11-2/#dateTime).

It appears that this goes back to the CKAN helper date_str_to_datetime, which "Converts an ISO like date to a datetime" using a 12-year-old, regex-based, timezone-ignoring date parser. This predates Python having sensible timezone-aware datetimes and the datetime.datetime.fromisoformat function.

The core helper probably should be fixed, with its potential for knock-on changes, but in the meantime, since this is new code, perhaps we should just use datetime.datetime.fromisoformat directly here (Python 3.7 minimum for that). We'd still need to support YYYY and YYYY-MM specially in the code, because those aren't covered.

There's also a version of this in scheming (https://github.com/ckan/ckanext-scheming/blob/master/ckanext/scheming/helpers.py#L299) which also does manual date parsing, but again looks replaceable with fromisoformat.
datetime.datetime.fromisoformat() is quite limited up until Python 3.11:

"Changed in version 3.11: Previously, this method only supported formats that could be emitted by date.isoformat() or datetime.isoformat()."

So on Python 3.10 and lower these dates are not parsed:
Python 3.10.10 (main, Mar 14 2023, 15:55:23) [GCC 9.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import datetime
In [2]: datetime.datetime.fromisoformat("2011-11-04T00:05:23Z")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[2], line 1
----> 1 datetime.datetime.fromisoformat("2011-11-04T00:05:23Z")
ValueError: Invalid isoformat string: '2011-11-04T00:05:23Z'
What do you think of the approach I followed in c7b8c02? If something is not an xsd:gYear, an xsd:gYearMonth or an xsd:date, we just let dateutil parse it (which will accept the timezone values you suggested). If that parses, we serve it as an xsd:dateTime.

The tests check that 1) the input value is unchanged and 2) it's served with the correct xsd data type.
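A simplified sketch of that approach (the actual code lives in c7b8c02 and may differ in details): keep the original value, pick the most specific xsd datatype by pattern, and otherwise let dateutil confirm the value parses before serving it as xsd:dateTime.

```python
import re

from dateutil.parser import parse as parse_date
from rdflib import Literal
from rdflib.namespace import XSD

GYEAR = re.compile(r"^\d{4}$")
GYEAR_MONTH = re.compile(r"^\d{4}-\d{2}$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def date_literal(value):
    """Return an rdflib Literal for ``value`` with the most specific xsd type."""
    value = value.strip()
    if GYEAR.match(value):
        return Literal(value, datatype=XSD.gYear)
    if GYEAR_MONTH.match(value):
        return Literal(value, datatype=XSD.gYearMonth)
    if DATE.match(value):
        return Literal(value, datatype=XSD.date)
    # Anything else: let dateutil check it parses (it accepts timezone suffixes
    # like "2011-11-04T00:05:23.000Z") and serve the original string as dateTime.
    parse_date(value)  # raises a ValueError subclass if not a recognisable date
    return Literal(value, datatype=XSD.dateTime)
```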
Looks better than doing it ourselves. You're not testing for invalid dates, or conditionally valid dates like M/D/Y, but overall I'm definitely happier with getting a builtin date parser that handles the in-the-wild formats that I've seen.
Added tests in 39b4d91
Merging this big chunk of changes; any followup like #288 we can do in smaller PRs.
This PR adds initial support for seamless integration between ckanext-dcat and ckanext-scheming, providing a custom profile that modifies the dataset dicts generated and consumed by the existing profiles so they play well with the defined scheming presets.
Summary of changes
- A DCAT-AP 2.1 schema file (ckanext/dcat/schemas/dcat_ap_2.1.yaml)
- Helper methods in the RDFProfile class to access schema field definitions from datasets and resources
- A new euro_dcat_ap_scheming profile that adds support for the field serializations supported by the ckanext-scheming presets. The existing profiles (euro_dcat_ap and euro_dcat_ap_2) remain unchanged (except for some very minor backward-compatible changes regarding the handling of access services in distributions/resources). This means that existing sites will keep working as they do now, but maintainers can choose to enable scheming support if they decide to migrate to that approach. Upcoming DCAT 3 based profiles will be scheming based (in a new ckanext-dcat version).
Compatibility and release plan
Extra care has been taken to not break any existing systems. Sites using the existing euro_dcat_ap and euro_dcat_ap_2 profiles should not see any change in their current parsing and serialization functionality, and these profiles will never change their outputs. Sites willing to migrate to a scheming-based profile can do so by adding the new euro_dcat_ap_scheming profile at the end of their profile chain (the value of the ckanext.dcat.rdf.profiles config option, e.g. ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming), which will modify the existing profile outputs to the format expected by the scheming validators. Note that the scheming profile will only affect fields defined in the schema definition file, so sites can start gradually migrating different metadata fields.

This compatibility profile will be released in the next ckanext-dcat version (1.8.0). The upcoming DCAT v3 based profiles for DCAT-AP 3 and DCAT-US 3 will be scheming based and will incorporate the mapping changes described below.
Mapping changes
The main changes between the old processors (parsers and serializers) and the new scheming-based ones are:
Root level fields
Custom DCAT fields that didn't link directly to standard CKAN fields were stored as extras (see all the ones marked extra: here), so the DCAT version_notes field would end up in the extras list. In the scheming-based profile, if the field is defined in the scheming schema, it will get stored as a root level field, like all custom dataset properties.
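For illustration only (the dataset name and field value are made up), the difference in the dataset dict looks roughly like this:

```python
# Legacy behaviour: non-core DCAT fields end up in CKAN's extras list.
legacy_dataset = {
    "name": "my-dataset",
    "extras": [
        {"key": "version_notes", "value": "Some version notes"},
    ],
}

# Scheming-based profile: the field defined in the schema is a root level key.
scheming_dataset = {
    "name": "my-dataset",
    "version_notes": "Some version notes",
}
```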
List fields
The old profiles stored lists as JSON strings. By using the multiple_text preset, lists are now automatically handled as proper lists, and the form snippets UI allows providing multiple values.
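A sketch of the difference, using language as an example field (field name and values are illustrative):

```python
# Old profiles: the list is a JSON-encoded string stored in an extra.
legacy_dataset = {
    "extras": [
        {"key": "language", "value": '["en", "ca", "es"]'},
    ],
}

# With the multiple_text preset: a native list at the root of the dataset dict.
scheming_dataset = {
    "language": ["en", "ca", "es"],
}
```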
Repeating subfields
Mapping complex entities like dcat:contactPoint or dct:publisher was very limited, storing a subset of the properties of just one linked entity as prefixed extras. By using the repeating_subfields preset we can consume and present these as proper objects, and store multiple entities for those properties that have 0..n cardinality (see comment in "Issues").

Repeating subfields are also supported in resources/distributions. In this case complex objects like dcat:accessService were stored as JSON strings; they now appear as proper objects. Again, these can be easily managed via the UI thanks to the scheming form snippets.
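A rough illustration with dcat:contactPoint (keys and values are examples, not the exact schema in this PR):

```python
# Old profiles: a flat subset of one contact point stored as prefixed extras.
legacy_dataset = {
    "extras": [
        {"key": "contact_name", "value": "Point of Contact"},
        {"key": "contact_email", "value": "contact@example.org"},
    ],
}

# repeating_subfields preset: a list of objects, allowing multiple entities
# where the property has 0..n cardinality.
scheming_dataset = {
    "contact": [
        {"name": "Point of Contact", "email": "contact@example.org"},
    ],
}
```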
Issues
- For properties like dct:publisher that have 0..1 cardinality: I don't think CKAN supports "non-repeating" subfields, so it makes sense to use the repeating_subfields preset for now and create a new one in the future.
- Scheming provides date and datetime presets with nice UI form snippets, so it's tempting to use them for properties like issued and modified, but these properties support other formats like xsd:gYear or xsd:gYearMonth which will fail with these presets, so we can consider creating a new preset that extends the existing ones to support these formats.