Check for quantity in validator #500

dsleiter · 2020-12-11T07:42:02Z

Motivation

Check that the quantity of child groups/datasets/links of a group match what is defined in the spec for the group.

Fix Validator should check for quantity #197
validation functions can now return IncorrectQuantityErrors
Includes some refactoring of GroupValidator so that it is easier to improve in the future

How to test the behavior?

from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace
from hdmf.build import GroupBuilder
from hdmf.validate import ValidatorMap

def get_vmap(specs):
    spec_catalog = SpecCatalog()
    for spec in specs:
        spec_catalog.register_spec(spec, 'test.yaml')
    namespace = SpecNamespace(
        'a test namespace', 'test_namespace', [{'source': 'test.yaml'}], version='0.1.0', catalog=spec_catalog)
    return ValidatorMap(namespace)

def build_specs():
    bar = GroupSpec('A test group', data_type_def='Bar')
    foo = GroupSpec('A group containing a quantity of tests and dataasets', data_type_def='Foo',
                    groups=[GroupSpec('A collection of bars', data_type_inc='Bar', quantity=5)])
    return (foo, bar)

builder = GroupBuilder('my_foo', attributes={'data_type': 'Foo'},
                    groups=[GroupBuilder(f'bar_{n}', attributes={'data_type': 'Bar'}) for n in range(2)])
vmap = get_vmap(build_specs())
vmap.validate(builder)

Checklist

Did you update CHANGELOG.md with your changes?
Have you checked our Contributing document?
Have you ensured the PR clearly describes the problem and the solution?
Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
Have you checked to ensure that there aren't other open Pull Requests for the same change?
Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

* Fix hdmf-dev#197 * Includes new validator tests * Refactor GroupValidator for readability

dsleiter · 2020-12-11T07:45:10Z

Please let me know if this looks like it covers the expected edge cases, and I'm very open to feedback and suggestions here. It took me a while to build up my mental model of what is going on in the validator and how it relates to specs and builders, so I very well could be missing things.

I'm also very open to any feedback in terms of how I've refactored the code and whether or not it fits with the expected style of the rest of the library.

dsleiter · 2020-12-11T07:53:25Z

There were a few things I discovered while going through the validator code, and am not sure if this is something the schema shouldn't allow or if the validator doesn't work in these cases. They seemed to be best handled outside of the scope of this PR, so I thought I'd bring them up here to discuss and we can create separate issues if needed.

if a group includes two children with the same defined data type but a different name (independent of the quantity), then only the last will be validated. Eg. an electrodes group which has VectorData X, Y, and Z - in which case only Z will be validated.
if a child link has the same data type as a child group or dataset, then it will cause that group or dataset to not be validated. Should it be possible to have a link with the same data type as a child group/dataset?
if a group specifies two unnamed child groups of type A and B, where B inherits from A, should a single child of type B satisfy the spec for both B and A?

It seems like perhaps the code in GroupValidator.__init__ which creates dictionaries of the specs to review might no longer be appropriate as it limits one spec of each type. Do you agree? https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/validate/validator.py#L403

I'm still putting all the pieces together in my head, so maybe I'm missing a few things here, but hopefully this is helpful

rly · 2021-01-12T07:04:58Z

src/hdmf/validate/validator.py

+        for _, value in builder.items():
            v_builder = value


Suggested change

for _, value in builder.items():

v_builder = value

for v_builder in builder.values():

Good suggestion to just iterate over .values(), I included that change.

Removing the assignment on line 449 also looks reasonable, but on closer inspection, it would end up changing the logic for one particular edge case: if value is a LinkBuilder which links to something inheriting from BaseBuilder, and that BaseBuilder has a data_type, then it would store the BaseBuilder in the data_types map, as opposed to the current behavior which is to store the LinkBuilder in the map.

excerpt for reference:

v_builder = value if isinstance(v_builder, LinkBuilder): v_builder = v_builder.builder if isinstance(v_builder, BaseBuilder): dt = v_builder.attributes.get(self.spec.type_key()) if dt is not None: data_types[dt].append(value)

The implementation here hasn't changed from the previous version, it was just extracted into a separate function. Looking at the usage of the return value of __group_children_by_data_type after it is passed to __validate_child_data_type, there is another check if the builder is a LinkedBuilder (line 467), so it seems to me like the current implementation is consistent. However, by making the change you suggested, all unit tests still pass, so it's not obvious that making that change would be incorrect.

I think there will likely be some large changes to validation logic in the solution for the new issue: #507 so my suggestion is to verify that the solution for that issue verifies correct behavior here. Do you agree?

You are right. How you have it now, keeping the assignment on line 449, is correct. We want to pass on the fact that a LinkBuilder is associated with the data type. Good catch.

tests/unit/validator_tests/test_validate.py

rly · 2021-01-12T07:16:25Z

I'm also very open to any feedback in terms of how I've refactored the code and whether or not it fits with the expected style of the rest of the library.

The refactoring looks great. Thanks!

rly · 2021-01-12T07:40:15Z

if a group includes two children with the same defined data type but a different name (independent of the quantity), then only the last will be validated. Eg. an electrodes group which has VectorData X, Y, and Z - in which case only Z will be validated.

That is not correct/intended behavior. Each child should be validated. Can you create a new issue about this?

if a child link has the same data type as a child group or dataset, then it will cause that group or dataset to not be validated. Should it be possible to have a link with the same data type as a child group/dataset?

I wrote privately to you that this should not be allowed by the schema language; however, on further thought, I think this case is OK, and the current code that validates only the link spec and not the group/dataset spec is incorrect/bugged.

if a group specifies two unnamed child groups of type A and B, where B inherits from A, should a single child of type B satisfy the spec for both B and A?

Great question! I would say the answer is no. If a group specifies two unnamed child groups of different types, then having a single child means one of the required specs is not fulfilled.

Now what if the file contains two children of type B? Then the most restrictive spec (type B) should be matched first, and then move to less restrictive requirements, which would match the second child with the type A spec. I don't know if the validator handles this correctly yet.

It seems like perhaps the code in GroupValidator.__init__ which creates dictionaries of the specs to review might no longer be appropriate as it limits one spec of each type. Do you agree? https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/validate/validator.py#L403

Yes. If a group has multiple named instances of the same type, e.g., a DynamicTable with three differently named columns of type VectorData, then it looks like this code validates only the last instance. Also if there is an included VectorData dataset and a linked VectorData, regardless of name, then only the link is validated. These are not correct behaviors. Can you create a new issue about this?

I'm still putting all the pieces together in my head, so maybe I'm missing a few things here, but hopefully this is helpful

Very helpful! This is great attention to detail and the different edge cases here.

rly · 2021-01-12T07:48:41Z

This PR looks good to me. I suggested some minor code changes. Feel free to accept/reject. If you want, you can address some of the issues you raised above within this PR, but it could also be done in a separate PR another day.

dsleiter · 2021-01-12T23:41:14Z

Thanks for the review @rly! I created an issue to cover the additional incorrect behavior discussed above: #507

I think they are related and could be resolved together by implementing new logic for how builders are matched against specs in GroupBuilder (as opposed to the current __include_dts, __dataset_validators, and __group_validators maps), so I put them together in a single issue. But if you have a suggestion for a different approach or would prefer that they are separate issues, let me know and I'll break the issues up.

dsleiter

@rly I made a pair of small changes you suggested, and created a new issue for the remainder. Please take a look if you think that covers everything, and then I think the PR is ready.

dsleiter · 2021-01-12T23:46:17Z

src/hdmf/validate/validator.py

+        for _, value in builder.items():
            v_builder = value


Good suggestion to just iterate over .values(), I included that change.

Removing the assignment on line 449 also looks reasonable, but on closer inspection, it would end up changing the logic for one particular edge case: if value is a LinkBuilder which links to something inheriting from BaseBuilder, and that BaseBuilder has a data_type, then it would store the BaseBuilder in the data_types map, as opposed to the current behavior which is to store the LinkBuilder in the map.

excerpt for reference:

v_builder = value if isinstance(v_builder, LinkBuilder): v_builder = v_builder.builder if isinstance(v_builder, BaseBuilder): dt = v_builder.attributes.get(self.spec.type_key()) if dt is not None: data_types[dt].append(value)

The implementation here hasn't changed from the previous version, it was just extracted into a separate function. Looking at the usage of the return value of __group_children_by_data_type after it is passed to __validate_child_data_type, there is another check if the builder is a LinkedBuilder (line 467), so it seems to me like the current implementation is consistent. However, by making the change you suggested, all unit tests still pass, so it's not obvious that making that change would be incorrect.

I think there will likely be some large changes to validation logic in the solution for the new issue: #507 so my suggestion is to verify that the solution for that issue verifies correct behavior here. Do you agree?

* dev: Fix building of Data with no dtype spec and value DataIO wrapping DCI (hdmf-dev#513) Allow np.bool_ as a valid type when bool is specified in spec (hdmf-dev#505) Check for quantity in validator (hdmf-dev#500) added driver option for ros3 (hdmf-dev#506) Use dockerhub authentication (hdmf-dev#432)

* upstream/dev: Fix building of Data with no dtype spec and value DataIO wrapping DCI (hdmf-dev#513) Allow np.bool_ as a valid type when bool is specified in spec (hdmf-dev#505) Check for quantity in validator (hdmf-dev#500) added driver option for ros3 (hdmf-dev#506) Use dockerhub authentication (hdmf-dev#432)

dsleiter added 2 commits December 11, 2020 18:08

Check for quantity in validator

6831432

* Fix hdmf-dev#197 * Includes new validator tests * Refactor GroupValidator for readability

Add a unit test to validate IncorrectQuantityError message

c12e7b9

dsleiter and others added 3 commits December 11, 2020 19:00

Update changelog

13b6fe2

Merge branch 'dev' into enh/validator_check_quantity

088156a

Merge branch 'dev' into enh/validator_check_quantity

f4fee85

rly reviewed Jan 12, 2021

View reviewed changes

tests/unit/validator_tests/test_validate.py Show resolved Hide resolved

rly previously approved these changes Jan 12, 2021

View reviewed changes

dsleiter mentioned this pull request Jan 12, 2021

Ensure all specs are validated #507

Closed

5 tasks

Include small changes suggested in PR review

8ff9591

dsleiter dismissed rly’s stale review via 8ff9591 January 12, 2021 23:48

dsleiter commented Jan 12, 2021

View reviewed changes

rly approved these changes Jan 13, 2021

View reviewed changes

rly merged commit c20660c into hdmf-dev:dev Jan 13, 2021

dsleiter deleted the enh/validator_check_quantity branch January 14, 2021 03:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check for quantity in validator #500

Check for quantity in validator #500

dsleiter commented Dec 11, 2020 •

edited

Loading

dsleiter commented Dec 11, 2020

dsleiter commented Dec 11, 2020

rly Jan 12, 2021

dsleiter Jan 12, 2021

rly Jan 13, 2021

rly commented Jan 12, 2021

rly commented Jan 12, 2021 •

edited

Loading

rly commented Jan 12, 2021

dsleiter commented Jan 12, 2021

dsleiter left a comment

dsleiter Jan 12, 2021

	for _, value in builder.items():
	v_builder = value
	for v_builder in builder.values():

Check for quantity in validator #500

Check for quantity in validator #500

Conversation

dsleiter commented Dec 11, 2020 • edited Loading

Motivation

How to test the behavior?

Checklist

dsleiter commented Dec 11, 2020

dsleiter commented Dec 11, 2020

rly Jan 12, 2021

Choose a reason for hiding this comment

dsleiter Jan 12, 2021

Choose a reason for hiding this comment

rly Jan 13, 2021

Choose a reason for hiding this comment

rly commented Jan 12, 2021

rly commented Jan 12, 2021 • edited Loading

rly commented Jan 12, 2021

dsleiter commented Jan 12, 2021

dsleiter left a comment

Choose a reason for hiding this comment

dsleiter Jan 12, 2021

Choose a reason for hiding this comment

dsleiter commented Dec 11, 2020 •

edited

Loading

rly commented Jan 12, 2021 •

edited

Loading