
Dataset factories #2635

Merged
merged 29 commits into from
Jul 6, 2023

Conversation

@ankatiyar (Contributor) commented Jun 2, 2023

Description

Resolves #2423. This PR introduces the dataset factories feature. See #2670 for the documentation on how to use it.

Development notes

Kedro run workflow

--> kedro run in the terminal
--> Session is created and run
--> Context is created, which loads all the config from the conf_source
--> DataCatalog.from_config(catalog: dict) initialises the DataCatalog object

  • At this point the DataCatalog is created with explicitly mentioned datasets in DataCatalog._data_sets
  • Sort the patterns and save them to DataCatalog._dataset_patterns
  • Also save load_version and save_version to self._load_version and self._save_version
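The config split described above can be sketched as follows. This is an illustrative simplification, not Kedro's actual implementation, and `split_catalog_config` is a hypothetical name:

```python
def split_catalog_config(catalog_config: dict) -> tuple[dict, dict]:
    """Separate explicit dataset entries from factory patterns.

    An entry whose name contains '{' is treated as a dataset factory
    pattern; everything else is an explicit dataset definition.
    """
    data_sets, dataset_patterns = {}, {}
    for name, config in catalog_config.items():
        if "{" in name:
            dataset_patterns[name] = config
        else:
            data_sets[name] = config
    return data_sets, dataset_patterns
```

For example, a catalog with entries "cars" and "{name}_csv" would be split into one explicit dataset and one pattern.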

--> runner.run(pipeline, catalog, .., ..)

  • loop through all datasets used by the pipeline and save all that exist (either as explicit datasets or patterns) to registered_ds
  • The check for existence will materialise and add the datasets to the catalog (in DataCatalog.__contains__())

--> Pipeline is run and catalog._get_dataset(dataset_name) -> AbstractDataSet: no major change, because datasets should already be materialised by the existence check

  • Add datasets that exist as patterns as materialised datasets to the catalog
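The runner's existence check can be illustrated with a toy catalog. The class and helper below are hypothetical stand-ins (real Kedro matches patterns with the parse library rather than this regex translation):

```python
import re

class ToyCatalog:
    """Minimal stand-in: a name is 'in' the catalog if it is an explicit
    dataset or matches a factory pattern such as '{name}_csv'."""

    def __init__(self, data_sets, dataset_patterns):
        self._data_sets = dict(data_sets)
        self._dataset_patterns = list(dataset_patterns)

    def __contains__(self, dataset_name):
        if dataset_name in self._data_sets:
            return True
        for pattern in self._dataset_patterns:
            # Translate each '{placeholder}' into a greedy regex group.
            regex = re.sub(r"\{[^}]*\}", "(.+)", pattern)
            if re.fullmatch(regex, dataset_name):
                return True
        return False

def collect_registered(pipeline_datasets, catalog):
    # The runner keeps only the datasets the catalog can supply,
    # whether explicit or pattern-matched.
    return [name for name in pipeline_datasets if name in catalog]
```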

NOTE : The tests are not updated to reflect the latest changes (TODO)

Toy project to test with : https://github.com/ankatiyar/space-patterns

Follow-up tasks

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

merelcht and others added 3 commits May 22, 2023 17:48
@ankatiyar ankatiyar requested a review from merelcht June 5, 2023 10:10
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar changed the title [Draft] Dataset factories - Lazy resolving approach [Draft] Dataset factories Jun 8, 2023
@ankatiyar ankatiyar marked this pull request as ready for review June 12, 2023 10:36
@ankatiyar ankatiyar requested a review from idanov as a code owner June 12, 2023 10:36
@ankatiyar ankatiyar changed the title [Draft] Dataset factories Dataset factories Jun 12, 2023
@merelcht (Member) left a comment

I've done a first review of the code in data_catalog.py and left some questions. My main question is what the purpose of the _pattern_name_matches_cache is and the methods that update it.

(6 resolved review comments on kedro/io/data_catalog.py)
@ankatiyar (Contributor, Author) commented Jun 13, 2023

@merelcht The reason I added self._pattern_name_matches_cache is because the first thing we need to do in the runner is check if all the datasets used in the pipeline exist in the catalog using catalog.exists_in_catalog(). At this step if we find a pattern that matches, it's added to the cache dict which stores dataset name -> pattern match. (Just so when we do catalog._get_dataset(dataset_name) we don't have to iterate over all the patterns again).

This might be overkill and I can revert to what the prototype was doing but I thought for large catalogs, this would avoid doing the same iteration twice. What do you think?

The workflow for then fetching a dataset is ->

  1. If the dataset is in self._data_sets - go to step 3
  2. If not, I reuse self.exists_in_catalog() to check if it matches a pattern.
    i. If this is the first time we're checking for the dataset's existence (e.g. when someone is using DataCatalog standalone and the runner hasn't populated the cache), add the "name -> pattern" entry to self._pattern_name_matches_cache.
    ii. Get the factory pattern for the dataset name from self._pattern_name_matches_cache.
    iii. Get the resolved dataset using _get_resolved_dataset(), which resolves the config and then loads an instance of AbstractDataSet.
    iv. Add the resolved dataset to self._data_sets.
  3. Load the dataset from self._data_sets and return it.
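The fetch flow in steps 1-3 can be sketched like this. The class is illustrative only: it uses a regex stand-in for the parse library, and "resolving" is reduced to copying the pattern's config:

```python
import re

class PatternCatalog:
    def __init__(self, data_sets, dataset_patterns):
        self._data_sets = dict(data_sets)                # materialised datasets
        self._dataset_patterns = dict(dataset_patterns)  # pattern -> config
        self._pattern_name_matches_cache = {}            # dataset name -> pattern

    def _match_pattern(self, dataset_name):
        for pattern in self._dataset_patterns:
            regex = re.sub(r"\{[^}]*\}", "(.+)", pattern)
            if re.fullmatch(regex, dataset_name):
                return pattern
        return None

    def exists_in_catalog(self, dataset_name):
        if dataset_name in self._data_sets:
            return True
        pattern = self._match_pattern(dataset_name)
        if pattern is not None:
            # Step 2.i: populate the cache on first sight of the name.
            self._pattern_name_matches_cache[dataset_name] = pattern
            return True
        return False

    def _get_dataset(self, dataset_name):
        if dataset_name not in self._data_sets:             # step 1
            if not self.exists_in_catalog(dataset_name):    # step 2
                raise KeyError(dataset_name)
            pattern = self._pattern_name_matches_cache[dataset_name]  # step 2.ii
            # Steps 2.iii-iv: resolve the config and materialise the dataset.
            self._data_sets[dataset_name] = dict(self._dataset_patterns[pattern])
        return self._data_sets[dataset_name]                # step 3
```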

@merelcht (Member) left a comment

Thanks for clarifying the use of the cache @ankatiyar, it makes more sense now! I think we can keep it, but perhaps give it a slightly shorter name, e.g. _pattern_matches_cache.
In general, I struggled to follow the code because of some of the method names. I tried giving suggestions on how to improve them, but naming is quite personal, so it would be good to get some other opinions as well 🙂

One thing every single user in the interviews mentioned is a warning about the catch-all pattern, so we must really implement that, including tests.

Another thing mentioned several times was a warning for when multiple patterns match a dataset; it would be good to have a test showing how that happens and whether we can warn about it too.

(resolved review comments on kedro/io/data_catalog.py and tests/io/test_data_catalog.py)
@ankatiyar (Contributor, Author) commented:
@merelcht I'll leave the warning about catch-all patterns and multiple matches out of this PR and do it as part of #2668.

@noklam (Contributor) left a comment

Dropping some comments for the first round of review.

(resolved review comment on kedro/io/data_catalog.py)
Comment on lines 197 to 204
self._sorted_dataset_patterns = sorted(
self.dataset_patterns.keys(),
key=lambda pattern: (
-(_specificity(pattern)),
-pattern.count("{"),
pattern,
),
)
Comment (Contributor):

This should be refactored into a method with its own unit tests.
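Refactored along those lines, the sort might look like this sketch; it assumes _specificity counts the literal characters outside placeholders, which is consistent with its use in the snippet above but is not taken verbatim from the PR:

```python
import re

def _specificity(pattern: str) -> int:
    """Literal characters outside '{...}' placeholders."""
    return len(re.sub(r"\{[^}]*\}", "", pattern))

def sort_patterns(dataset_patterns: dict) -> list[str]:
    """Most specific pattern first, more placeholders break ties,
    then alphabetical - mirroring the sort key in the snippet above."""
    return sorted(
        dataset_patterns.keys(),
        key=lambda pattern: (
            -_specificity(pattern),
            -pattern.count("{"),
            pattern,
        ),
    )
```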

(resolved review comments on kedro/io/data_catalog.py)
@@ -567,16 +642,47 @@ def list(self, regex_search: str | None = None) -> list[str]:
) from exc
return [dset_name for dset_name in self._data_sets if pattern.search(dset_name)]

def exists_in_catalog_config(self, dataset_name: str) -> bool:
Comment (Contributor):

Is this the cache that adds matched patterns into the catalog? If so, why do we need to keep it separately instead of just creating the entry in the catalog?

Reply (Contributor, Author):

This function just adds the "dataset name -> pattern" entry to self._pattern_matches_cache; it does not create an entry in the catalog. That's because this function is also called in the runner to check for the existence of the pipeline's datasets in the catalog. The actual adding of a dataset to the catalog happens in self._get_dataset().
This avoids doing the pattern matching multiple times when it's called inside a Session run, while the actual resolving and adding still happens at _get_dataset time to allow standalone DataCatalog usage.

Comment (Member):

This isn't a good name for a public method; it's way too descriptive. I would suggest changing it to def __contains__(self, dataset_name):. This will allow us to write a test like if dataset_name in catalog: ..., which is far more intuitive and nicer than catalog.exists_in_catalog_config(dataset_name).

(resolved review comment on kedro/io/data_catalog.py)
@ankatiyar ankatiyar requested review from merelcht and noklam July 3, 2023 15:02
@merelcht (Member) left a comment

I reviewed both this PR and #2743 and I definitely prefer the approach here with adding load_versions and save_version arguments instead of raw_catalog. I left some more comments, but in general I think this is nearly finished.

(resolved review comments on kedro/io/data_catalog.py)
Comment on lines 357 to 358
Args:
pattern: The factory pattern
Comment (Member):

For completeness this doc string should also include the return type/value.

Reply (Contributor, Author):

It's a private method; the Args: part is leftover from when it was a helper function outside DataCatalog. I've removed it.

(resolved review comments on kedro/io/data_catalog.py)
@merelcht (Member) left a comment

I left one more minor comment, but apart from that I'm very happy with how this looks! It would be good to get another review from @noklam or @idanov, but from my side this is good to go 👍

(resolved review comments on kedro/io/data_catalog.py)
Signed-off-by: Ankita Katiyar <[email protected]>
@noklam (Contributor) left a comment

I have some comments here, but I figured it is easier if I drop all of them in one draft PR; it's not ready yet. I will ping you when it's ready. (Reviewing via a PR since it is easier.)

self,
data_sets: dict[str, AbstractDataSet] = None,
feed_dict: dict[str, Any] = None,
layers: dict[str, set[str]] = None,
dataset_patterns: dict[str, dict[str, Any]] = None,
Comment (Contributor):

Maybe it would be good to introduce keyword-only arguments here, similar to the proposal, to give us more freedom to rearrange the arguments later without breaking changes. The current argument list does not make much sense.

@@ -170,8 +184,12 @@ def __init__(
self._data_sets = dict(data_sets or {})
self.datasets = _FrozenDatasets(self._data_sets)
self.layers = layers
# Keep a record of all patterns in the catalog.
# {dataset pattern name : dataset pattern body}
self._dataset_patterns = dict(dataset_patterns or {})
Comment (Contributor):

This still confuses me a lot; is there any difference? Isn't dataset_patterns a dict already?

Suggested change
self._dataset_patterns = dict(dataset_patterns or {})
self._dataset_patterns = dataset_patterns or {}

# Keep a record of all patterns in the catalog.
# {dataset pattern name : dataset pattern body}
self._dataset_patterns = dict(dataset_patterns or {})
self._load_versions = dict(load_versions or {})
Comment (Contributor):

Same as self._dataset_patterns

catalog = copy.deepcopy(catalog) or {}
credentials = copy.deepcopy(credentials) or {}
save_version = save_version or generate_timestamp()
load_versions = copy.deepcopy(load_versions) or {}
layers: dict[str, set[str]] = defaultdict(set)
Comment (Contributor):

Same argument: it would be good to have keyword-only arguments here, even though we are going to remove layers soon.

Comment on lines 322 to 324
if "{" in pattern:
return True
return False
@noklam (Contributor) commented Jul 6, 2023:

Suggested change
if "{" in pattern:
return True
return False
return "{" in pattern

@noklam (Contributor) commented Jul 6, 2023

As discussed, let's separate the Type Alias proposed in #2770.
I will close #2770 and approve this PR now. :)

@noklam noklam self-requested a review July 6, 2023 13:45
@noklam (Contributor) left a comment

LGTM now! Let's ship it and do the type alias separately.

ankatiyar and others added 3 commits July 6, 2023 15:13
Signed-off-by: Ankita Katiyar <[email protected]>
* Add warning for catch-all patterns

Signed-off-by: Ankita Katiyar <[email protected]>

* Update warning message

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar merged commit 6da8bde into main Jul 6, 2023
3 checks passed
@ankatiyar ankatiyar deleted the feature/dataset-factories-for-catalog branch July 6, 2023 15:06
Comment on lines +329 to +333
for pattern, _ in data_set_patterns.items():
result = parse(pattern, data_set_name)
if result:
return pattern
return None
Comment (Member):

This could be nicely rewritten to:

matches = (parse(pattern, data_set_name) for pattern in data_set_patterns.keys())
return next(filter(None, matches), None)
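A runnable illustration of the suggested idiom, using a regex translation as a stand-in for the parse library; note that the original loop returns the matching pattern itself, so this sketch does the same:

```python
import re

def match_pattern(data_set_patterns: dict, data_set_name: str):
    """Return the first pattern that matches the name, or None."""
    def _parses(pattern):
        # Stand-in for parse(pattern, data_set_name).
        regex = re.sub(r"\{[^}]*\}", "(.+)", pattern)
        return re.fullmatch(regex, data_set_name)

    matches = (pattern for pattern in data_set_patterns if _parses(pattern))
    return next(matches, None)
```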

missing_keys = [
key
for key in load_versions.keys()
if not (cls._match_pattern(sorted_patterns, key) or key in catalog)
Comment (Member):

This test should be reordered, since key in catalog is easier to test and probably will match more often. No need to go through _match_pattern for all the non-patterned datasets.

Comment on lines +352 to +355
sorted_patterns = {}
for key in sorted_keys:
sorted_patterns[key] = data_set_patterns[key]
return sorted_patterns
Comment (Member):

Shorter and neater.

return {key: data_set_patterns[key] for key in sorted_keys}


Successfully merging this pull request may close these issues.

Eliminate the need of having all datasets defined in the catalog
4 participants