Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix caching configuration #3785

Merged
merged 18 commits into from
Mar 2, 2020
Merged

Conversation

greschd
Copy link
Member

@greschd greschd commented Feb 21, 2020

Fixes #3673

This changes the meaning of the identifier used in the caching configuration from being an entry point string, to being the process_type of the node.

Since this change removes the ability to select process classes via subclassing (isinstance check), we instead allow the configuration to contain globbing patterns, as implemented in the fnmatch standard library.

Multiple globbing patterns can apply to a single process type, as long as one of them is uniquely the most specific.

Since the new configuration allows arbitrary string, the validation of the caching configuration on load is now less strict -- we can not check if the process type exists.

Dominik Gresch added 4 commits February 20, 2020 17:54
Change the 'identifier' to always refer to the process_type. Use
the 'fnmatch' stdlib to allow globbing in the configuration.

This is technically a breaking change, because the interface
of 'get_use_cache', 'enable_caching', 'disable_caching' no longer
allows the 'node_class' interface. However, the previous
version silently did nothing when this argument is passed, so
this just makes the issue more apparent.
When using globbing in the caching configuration, we want the
ability to override a general glob (e.g. aiida.calculations:*)
with a more specific one (e.g. aiida.calculations:aiida-diff.*).
This commit implements the logic and tests for this behavior.
To check which identifier pattern is more specific, the patterns
are fnmatch'ed against each other. If all other patterns match
a given pattern, it is the most specific.
@greschd greschd requested a review from sphuber February 21, 2020 13:34
@greschd
Copy link
Member Author

greschd commented Feb 22, 2020

Thinking about this some more, the logic for checking which pattern is "most specific" is incorrect in corner cases. The criterion for being the most specific pattern should be that the set of identifiers that it matches is a strict subset of all other identifiers. Here's a counter example for the current logic:

With patterns

pattern1 = 'a????c?`
pattern2 = 'a[ba]c*'

pattern2 is identified as the most specific. However, considering the identifiers

A = 'afcdecf'
B = 'abccccd'
C = 'abccccde'

pattern1 matches {A, B}, while pattern2 matches {B, C}. Clearly, it is not a strict subset.

The root of this problem lies with the ? and [...] matching kinds - for the latter, a pattern will not even match itself. I have a felling restricting matches to * would be enough to make everything work, but I still need to convince myself fully.

Should we just restrict the patterns so that they can not contain [, ], or ?? @sphuber what's the "grammar" of the process_type, could it ever contain one of these characters?

Besides this issue, we still need to update the documentation about caching configuration.

@greschd
Copy link
Member Author

greschd commented Feb 22, 2020

Also, we should be using fnmatch.fnmatchcase instead of fnmatch.fnmatch. Or instead of using fnmatch, we could split the pattern by *, re.escape the parts, and then join them with .* again to get a regex that does only * - matching.

Dominik Gresch added 2 commits February 24, 2020 08:22
The 'fnmatch' implementation allows using '?' (any single
character) or '[abc]' (any of the characters in brackets) matches.
These are problematic when trying to determine the 'most specific'
match to a given identifier. To avoid this, we instead implement
matching of _only_ '*' wildcards with 're'. This is similar to how
'fnmatch' is implemented, except for removing the extra logic
needed for '?' and '[]' matches.
@greschd greschd added this to the v1.1.1 milestone Feb 24, 2020
@greschd
Copy link
Member Author

greschd commented Feb 24, 2020

Also, we should be using fnmatch.fnmatchcase instead of fnmatch.fnmatch. Or instead of using fnmatch, we could split the pattern by *, re.escape the parts, and then join them with .* again to get a regex that does only * - matching.

I decided to go with the re solution, because that is pretty much exactly what fnmatch is doing also: https://github.com/python/cpython/blob/3.8/Lib/fnmatch.py#L74 (but it's also implementing the ? and [...] matching).

The documentation is now also updated, so this is ready for review.

@sphuber sphuber modified the milestones: v1.1.1, v1.1.2 Feb 24, 2020
@sphuber sphuber modified the milestones: v1.1.2, v1.1.1 Feb 25, 2020
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @greschd looking good. I agree with your assessment of only supporting *is the better option. It should give more than enough flexibility and keeps the implementation not overly complex. Just one question if we should maybe add a little more validation on values in the configuration file.

aiida/manage/caching.py Outdated Show resolved Hide resolved
try:
type_check(config[ConfigKeys.DEFAULT.value], bool)
type_check(config[ConfigKeys.ENABLED.value], list)
type_check(config[ConfigKeys.DISABLED.value], list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an explicit check on the actual values in the disabled and enabled list? It can only be a string with a valid process type string, optional with wildcard. Valid process types are:

  • aiida.plugins.entry_point.is_valid_entry_point_string returns True
  • Or it is a python module path. Should be possible to write a validation function for. I think essentially [A-Za-z0-9_\.]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree it would be good to have at least the validation that it could be a valid process type string. Because of the possible wildcards, we probably have to implement the check separately and can't re-use the is_valid_entry_point_string.

For the entry points, as far as I see the valid form is

<group_name>:<something>, where

  • group_name is one of aiida.data, aiida.calculations, ..., as defined by the entry_point_group_to_module_path_map
  • something can be literally anything except it can not contain another semicolon

For the fully qualified python name, each colon-separated part must fulfill

name.isidentifier() and not keyword.iskeyword(name)

The only two restrictions I'm aware of (compared to the simple regex) is that it can't start with a number, and can't be a keyword (like in, for, else, ...).

All this is a bit tricky to check for due to the wildcards, but I think I can come up with a reasonable solution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, had a crack at this in _validate_identifier_pattern.

tests/manage/test_caching_config.py Outdated Show resolved Hide resolved
@greschd greschd requested a review from sphuber March 1, 2020 17:51
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more minor changes. The logic of the identifier validator seems OK, just have two questions there

'the `node_class` argument is deprecated and will be removed in `v2.0.0`. '
'Use the `identifier` argument instead', AiidaDeprecationWarning
)
type_check(identifier, (type(None), str))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_none=True

aiida/manage/caching.py Show resolved Hide resolved
aiida/manage/caching.py Outdated Show resolved Hide resolved
aiida/manage/caching.py Show resolved Hide resolved
@greschd greschd requested a review from sphuber March 2, 2020 14:27
@sphuber sphuber merged commit dcd40af into aiidateam:develop Mar 2, 2020
@greschd greschd deleted the fix_caching_config branch March 2, 2020 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Caching and the configuration of caching is broken
2 participants