Fix caching configuration #3785

greschd · 2020-02-21T13:34:53Z

This changes the meaning of the identifier used in the caching configuration from being an entry point string, to being the process_type of the node.

Since this change removes the ability to select process classes via subclassing (isinstance check), we instead allow the configuration to contain globbing patterns, as implemented in the fnmatch standard library.

Multiple globbing patterns can apply to a single process type, as long as one of them is uniquely the most specific.

Since the new configuration allows arbitrary string, the validation of the caching configuration on load is now less strict -- we can not check if the process type exists.

Change the 'identifier' to always refer to the process_type. Use the 'fnmatch' stdlib to allow globbing in the configuration. This is technically a breaking change, because the interface of 'get_use_cache', 'enable_caching', 'disable_caching' no longer allows the 'node_class' interface. However, the previous version silently did nothing when this argument is passed, so this just makes the issue more apparent.

When using globbing in the caching configuration, we want the ability to override a general glob (e.g. aiida.calculations:*) with a more specific one (e.g. aiida.calculations:aiida-diff.*). This commit implements the logic and tests for this behavior. To check which identifier pattern is more specific, the patterns are fnmatch'ed against each other. If all other patterns match a given pattern, it is the most specific.

greschd · 2020-02-22T00:03:48Z

Thinking about this some more, the logic for checking which pattern is "most specific" is incorrect in corner cases. The criterion for being the most specific pattern should be that the set of identifiers that it matches is a strict subset of all other identifiers. Here's a counter example for the current logic:

With patterns

pattern1 = 'a????c?`
pattern2 = 'a[ba]c*'

pattern2 is identified as the most specific. However, considering the identifiers

A = 'afcdecf'
B = 'abccccd'
C = 'abccccde'

pattern1 matches {A, B}, while pattern2 matches {B, C}. Clearly, it is not a strict subset.

The root of this problem lies with the ? and [...] matching kinds - for the latter, a pattern will not even match itself. I have a felling restricting matches to * would be enough to make everything work, but I still need to convince myself fully.

Should we just restrict the patterns so that they can not contain [, ], or ?? @sphuber what's the "grammar" of the process_type, could it ever contain one of these characters?

Besides this issue, we still need to update the documentation about caching configuration.

greschd · 2020-02-22T00:06:01Z

Also, we should be using fnmatch.fnmatchcase instead of fnmatch.fnmatch. Or instead of using fnmatch, we could split the pattern by *, re.escape the parts, and then join them with .* again to get a regex that does only * - matching.

The 'fnmatch' implementation allows using '?' (any single character) or '[abc]' (any of the characters in brackets) matches. These are problematic when trying to determine the 'most specific' match to a given identifier. To avoid this, we instead implement matching of _only_ '*' wildcards with 're'. This is similar to how 'fnmatch' is implemented, except for removing the extra logic needed for '?' and '[]' matches.

greschd · 2020-02-24T08:32:09Z

Also, we should be using fnmatch.fnmatchcase instead of fnmatch.fnmatch. Or instead of using fnmatch, we could split the pattern by *, re.escape the parts, and then join them with .* again to get a regex that does only * - matching.

I decided to go with the re solution, because that is pretty much exactly what fnmatch is doing also: https://github.com/python/cpython/blob/3.8/Lib/fnmatch.py#L74 (but it's also implementing the ? and [...] matching).

The documentation is now also updated, so this is ready for review.

sphuber

Thanks @greschd looking good. I agree with your assessment of only supporting *is the better option. It should give more than enough flexibility and keeps the implementation not overly complex. Just one question if we should maybe add a little more validation on values in the configuration file.

aiida/manage/caching.py

sphuber · 2020-02-29T20:49:34Z

aiida/manage/caching.py

+    try:
+        type_check(config[ConfigKeys.DEFAULT.value], bool)
+        type_check(config[ConfigKeys.ENABLED.value], list)
+        type_check(config[ConfigKeys.DISABLED.value], list)


Should we add an explicit check on the actual values in the disabled and enabled list? It can only be a string with a valid process type string, optional with wildcard. Valid process types are:

aiida.plugins.entry_point.is_valid_entry_point_string returns True

Or it is a python module path. Should be possible to write a validation function for. I think essentially [A-Za-z0-9_\.]

Yes, I agree it would be good to have at least the validation that it could be a valid process type string. Because of the possible wildcards, we probably have to implement the check separately and can't re-use the is_valid_entry_point_string.

For the entry points, as far as I see the valid form is

<group_name>:<something>, where

group_name is one of aiida.data, aiida.calculations, ..., as defined by the entry_point_group_to_module_path_map

something can be literally anything except it can not contain another semicolon

For the fully qualified python name, each colon-separated part must fulfill

name.isidentifier() and not keyword.iskeyword(name)

The only two restrictions I'm aware of (compared to the simple regex) is that it can't start with a number, and can't be a keyword (like in, for, else, ...).

All this is a bit tricky to check for due to the wildcards, but I think I can come up with a reasonable solution.

Alright, had a crack at this in _validate_identifier_pattern.

tests/manage/test_caching_config.py

sphuber

Two more minor changes. The logic of the identifier validator seems OK, just have two questions there

sphuber · 2020-03-02T10:34:15Z

aiida/manage/caching.py

-            'the `node_class` argument is deprecated and will be removed in `v2.0.0`. '
-            'Use the `identifier` argument instead', AiidaDeprecationWarning
-        )
+    type_check(identifier, (type(None), str))


allow_none=True

aiida/manage/caching.py

Co-Authored-By: Sebastiaan Huber <[email protected]>

Dominik Gresch added 4 commits February 20, 2020 17:54

(dirty commit) add parts of the caching config tests

bf20924

Improve type checking and fix calcfunction caching tests

dc8267a

greschd requested a review from sphuber February 21, 2020 13:34

Fix docstring to refer to unique ConfigurationError

cec4ae4

Dominik Gresch added 2 commits February 24, 2020 08:22

Check that case is taken into account in caching config tests

6d7935c

greschd added this to the v1.1.1 milestone Feb 24, 2020

Update configuration description in docs

05806f5

greschd and others added 2 commits February 24, 2020 09:32

Merge branch 'develop' into fix_caching_config

8ab4966

Make reference to CalculationNode unique.

9061b20

sphuber modified the milestones: v1.1.1, v1.1.2 Feb 24, 2020

Merge branch 'develop' into fix_caching_config

32bdab9

sphuber modified the milestones: v1.1.2, v1.1.1 Feb 25, 2020

sphuber requested changes Feb 29, 2020

View reviewed changes

Dominik Gresch and others added 3 commits March 1, 2020 17:30

Use 'allow_none' in 'type_check', remove debug print statement

67c83e7

Add validation for the identifier patterns.

969c72f

Merge branch 'develop' into fix_caching_config

27d757d

greschd requested a review from sphuber March 1, 2020 17:51

Dominik Gresch added 2 commits March 1, 2020 18:53

Fix type in _validate_identifier_pattern docstring

9866948

Add test for pattern validation in caching context managers

e5661f7

sphuber requested changes Mar 2, 2020

View reviewed changes

greschd and others added 2 commits March 2, 2020 14:27

Fix typo in comment

db9a60d

Co-Authored-By: Sebastiaan Huber <[email protected]>

Add example to comment about swallowed separators

884ceab

greschd requested a review from sphuber March 2, 2020 14:27

sphuber approved these changes Mar 2, 2020

View reviewed changes

sphuber merged commit dcd40af into aiidateam:develop Mar 2, 2020

greschd deleted the fix_caching_config branch March 2, 2020 14:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix caching configuration #3785

Fix caching configuration #3785

greschd commented Feb 21, 2020

greschd commented Feb 22, 2020

greschd commented Feb 22, 2020 •

edited

Loading

greschd commented Feb 24, 2020

sphuber left a comment

sphuber Feb 29, 2020

greschd Mar 1, 2020

greschd Mar 1, 2020

sphuber left a comment

sphuber Mar 2, 2020

Fix caching configuration #3785

Fix caching configuration #3785

Conversation

greschd commented Feb 21, 2020

greschd commented Feb 22, 2020

greschd commented Feb 22, 2020 • edited Loading

greschd commented Feb 24, 2020

sphuber left a comment

Choose a reason for hiding this comment

sphuber Feb 29, 2020

Choose a reason for hiding this comment

greschd Mar 1, 2020

Choose a reason for hiding this comment

greschd Mar 1, 2020

Choose a reason for hiding this comment

sphuber left a comment

Choose a reason for hiding this comment

sphuber Mar 2, 2020

Choose a reason for hiding this comment

greschd commented Feb 22, 2020 •

edited

Loading