Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for quantity in validator #500

Merged
merged 6 commits into from
Jan 13, 2021

Conversation

dsleiter
Copy link
Contributor

@dsleiter dsleiter commented Dec 11, 2020

Motivation

Check that the quantity of child groups/datasets/links of a group match what is defined in the spec for the group.

How to test the behavior?

from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace
from hdmf.build import GroupBuilder
from hdmf.validate import ValidatorMap

def get_vmap(specs):
    spec_catalog = SpecCatalog()
    for spec in specs:
        spec_catalog.register_spec(spec, 'test.yaml')
    namespace = SpecNamespace(
        'a test namespace', 'test_namespace', [{'source': 'test.yaml'}], version='0.1.0', catalog=spec_catalog)
    return ValidatorMap(namespace)

def build_specs():
    bar = GroupSpec('A test group', data_type_def='Bar')
    foo = GroupSpec('A group containing a quantity of tests and dataasets', data_type_def='Foo',
                    groups=[GroupSpec('A collection of bars', data_type_inc='Bar', quantity=5)])
    return (foo, bar)

builder = GroupBuilder('my_foo', attributes={'data_type': 'Foo'},
                    groups=[GroupBuilder(f'bar_{n}', attributes={'data_type': 'Bar'}) for n in range(2)])
vmap = get_vmap(build_specs())
vmap.validate(builder)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

* Fix hdmf-dev#197
* Includes new validator tests
* Refactor GroupValidator for readability
@dsleiter
Copy link
Contributor Author

Please let me know if this looks like it covers the expected edge cases, and I'm very open to feedback and suggestions here. It took me a while to build up my mental model of what is going on in the validator and how it relates to specs and builders, so I very well could be missing things.

I'm also very open to any feedback in terms of how I've refactored the code and whether or not it fits with the expected style of the rest of the library.

@dsleiter
Copy link
Contributor Author

There were a few things I discovered while going through the validator code, and am not sure if this is something the schema shouldn't allow or if the validator doesn't work in these cases. They seemed to be best handled outside of the scope of this PR, so I thought I'd bring them up here to discuss and we can create separate issues if needed.

  1. if a group includes two children with the same defined data type but a different name (independent of the quantity), then only the last will be validated. Eg. an electrodes group which has VectorData X, Y, and Z - in which case only Z will be validated.

  2. if a child link has the same data type as a child group or dataset, then it will cause that group or dataset to not be validated. Should it be possible to have a link with the same data type as a child group/dataset?

  3. if a group specifies two unnamed child groups of type A and B, where B inherits from A, should a single child of type B satisfy the spec for both B and A?

It seems like perhaps the code in GroupValidator.__init__ which creates dictionaries of the specs to review might no longer be appropriate as it limits one spec of each type. Do you agree? https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/validate/validator.py#L403

I'm still putting all the pieces together in my head, so maybe I'm missing a few things here, but hopefully this is helpful

Comment on lines 448 to 449
for _, value in builder.items():
v_builder = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for _, value in builder.items():
v_builder = value
for v_builder in builder.values():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion to just iterate over .values(), I included that change.

Removing the assignment on line 449 also looks reasonable, but on closer inspection, it would end up changing the logic for one particular edge case: if value is a LinkBuilder which links to something inheriting from BaseBuilder, and that BaseBuilder has a data_type, then it would store the BaseBuilder in the data_types map, as opposed to the current behavior which is to store the LinkBuilder in the map.

excerpt for reference:

            v_builder = value
            if isinstance(v_builder, LinkBuilder):
                v_builder = v_builder.builder
            if isinstance(v_builder, BaseBuilder):
                dt = v_builder.attributes.get(self.spec.type_key())
                if dt is not None:
                    data_types[dt].append(value)

The implementation here hasn't changed from the previous version, it was just extracted into a separate function. Looking at the usage of the return value of __group_children_by_data_type after it is passed to __validate_child_data_type, there is another check if the builder is a LinkedBuilder (line 467), so it seems to me like the current implementation is consistent. However, by making the change you suggested, all unit tests still pass, so it's not obvious that making that change would be incorrect.

I think there will likely be some large changes to validation logic in the solution for the new issue: #507 so my suggestion is to verify that the solution for that issue verifies correct behavior here. Do you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. How you have it now, keeping the assignment on line 449, is correct. We want to pass on the fact that a LinkBuilder is associated with the data type. Good catch.

@rly
Copy link
Contributor

rly commented Jan 12, 2021

I'm also very open to any feedback in terms of how I've refactored the code and whether or not it fits with the expected style of the rest of the library.

The refactoring looks great. Thanks!

@rly
Copy link
Contributor

rly commented Jan 12, 2021

  1. if a group includes two children with the same defined data type but a different name (independent of the quantity), then only the last will be validated. Eg. an electrodes group which has VectorData X, Y, and Z - in which case only Z will be validated.

That is not correct/intended behavior. Each child should be validated. Can you create a new issue about this?

  1. if a child link has the same data type as a child group or dataset, then it will cause that group or dataset to not be validated. Should it be possible to have a link with the same data type as a child group/dataset?

I wrote privately to you that this should not be allowed by the schema language; however, on further thought, I think this case is OK, and the current code that validates only the link spec and not the group/dataset spec is incorrect/bugged.

  1. if a group specifies two unnamed child groups of type A and B, where B inherits from A, should a single child of type B satisfy the spec for both B and A?

Great question! I would say the answer is no. If a group specifies two unnamed child groups of different types, then having a single child means one of the required specs is not fulfilled.

Now what if the file contains two children of type B? Then the most restrictive spec (type B) should be matched first, and then move to less restrictive requirements, which would match the second child with the type A spec. I don't know if the validator handles this correctly yet.

It seems like perhaps the code in GroupValidator.__init__ which creates dictionaries of the specs to review might no longer be appropriate as it limits one spec of each type. Do you agree? https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/validate/validator.py#L403

Yes. If a group has multiple named instances of the same type, e.g., a DynamicTable with three differently named columns of type VectorData, then it looks like this code validates only the last instance. Also if there is an included VectorData dataset and a linked VectorData, regardless of name, then only the link is validated. These are not correct behaviors. Can you create a new issue about this?

I'm still putting all the pieces together in my head, so maybe I'm missing a few things here, but hopefully this is helpful

Very helpful! This is great attention to detail and the different edge cases here.

rly
rly previously approved these changes Jan 12, 2021
@rly
Copy link
Contributor

rly commented Jan 12, 2021

This PR looks good to me. I suggested some minor code changes. Feel free to accept/reject. If you want, you can address some of the issues you raised above within this PR, but it could also be done in a separate PR another day.

@dsleiter dsleiter mentioned this pull request Jan 12, 2021
5 tasks
@dsleiter
Copy link
Contributor Author

Thanks for the review @rly! I created an issue to cover the additional incorrect behavior discussed above: #507

I think they are related and could be resolved together by implementing new logic for how builders are matched against specs in GroupBuilder (as opposed to the current __include_dts, __dataset_validators, and __group_validators maps), so I put them together in a single issue. But if you have a suggestion for a different approach or would prefer that they are separate issues, let me know and I'll break the issues up.

Copy link
Contributor Author

@dsleiter dsleiter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rly I made a pair of small changes you suggested, and created a new issue for the remainder. Please take a look if you think that covers everything, and then I think the PR is ready.

Comment on lines 448 to 449
for _, value in builder.items():
v_builder = value
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion to just iterate over .values(), I included that change.

Removing the assignment on line 449 also looks reasonable, but on closer inspection, it would end up changing the logic for one particular edge case: if value is a LinkBuilder which links to something inheriting from BaseBuilder, and that BaseBuilder has a data_type, then it would store the BaseBuilder in the data_types map, as opposed to the current behavior which is to store the LinkBuilder in the map.

excerpt for reference:

            v_builder = value
            if isinstance(v_builder, LinkBuilder):
                v_builder = v_builder.builder
            if isinstance(v_builder, BaseBuilder):
                dt = v_builder.attributes.get(self.spec.type_key())
                if dt is not None:
                    data_types[dt].append(value)

The implementation here hasn't changed from the previous version, it was just extracted into a separate function. Looking at the usage of the return value of __group_children_by_data_type after it is passed to __validate_child_data_type, there is another check if the builder is a LinkedBuilder (line 467), so it seems to me like the current implementation is consistent. However, by making the change you suggested, all unit tests still pass, so it's not obvious that making that change would be incorrect.

I think there will likely be some large changes to validation logic in the solution for the new issue: #507 so my suggestion is to verify that the solution for that issue verifies correct behavior here. Do you agree?

@rly rly merged commit c20660c into hdmf-dev:dev Jan 13, 2021
@dsleiter dsleiter deleted the enh/validator_check_quantity branch January 14, 2021 03:00
satra added a commit to satra/hdmf that referenced this pull request Feb 21, 2021
* dev:
  Fix building of Data with no dtype spec and value DataIO wrapping DCI (hdmf-dev#513)
  Allow np.bool_ as a valid type when bool is specified in spec (hdmf-dev#505)
  Check for quantity in validator (hdmf-dev#500)
  added driver option for ros3 (hdmf-dev#506)
  Use dockerhub authentication (hdmf-dev#432)
satra added a commit to satra/hdmf that referenced this pull request Feb 21, 2021
* upstream/dev:
  Fix building of Data with no dtype spec and value DataIO wrapping DCI (hdmf-dev#513)
  Allow np.bool_ as a valid type when bool is specified in spec (hdmf-dev#505)
  Check for quantity in validator (hdmf-dev#500)
  added driver option for ros3 (hdmf-dev#506)
  Use dockerhub authentication (hdmf-dev#432)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validator should check for quantity
2 participants