
[PROTOTYPE NOT TO BE MERGED] Dataset factories prototype #2560

Closed · wants to merge 14 commits

Conversation

@merelcht (Member) commented May 5, 2023

Description

#2510

Prototype for the dataset factories proposal in #2423.
This doesn't include any sophisticated parsing rules; it just uses the basic parse() method from https://github.com/r1chardj0n3s/parse#readme.

Development notes

  • When the DataCatalog class instance gets instantiated from the config in https://github.com/kedro-org/kedro/blob/807c4e64fb/kedro/framework/context/context.py#L287, it now checks for entries that are patterns (they contain "}"), skips those during dataset creation, and instead adds them to a new "dataset_patterns" dictionary kept on the catalog.
  • Any dataset that matches a pattern gets resolved at load time and is then added to the catalog as a proper dataset instance, so resolution only happens once and not at every access. (See the sketch after this list for how the matching works.)
  • I'm not super happy yet with remove_pattern_matches(): it's a helper method to exclude any datasets that match a pattern from the runner checks for missing inputs, free outputs, and unregistered datasets that need a default dataset.
  • Bonus: added two CLI commands (really ugly code, but just to show what could be possible): kedro catalog show prints the raw catalog config files to the CLI, and kedro catalog resolve resolves any patterns in the catalog config file against the datasets used in the pipeline and prints the result to the CLI.
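
A minimal sketch of the matching mechanics, assuming the parse library linked above; the pattern and filepath here are hypothetical, not taken from the prototype:

from parse import parse

# A catalog entry name containing "}" is treated as a pattern.
result = parse("{namespace}.{dataset_name}", "germany.companies")
print(result.named)  # {'namespace': 'germany', 'dataset_name': 'companies'}

# format_map then fills the matched values into templated config fields:
filepath = "data/01_raw/{namespace}/{dataset_name}.csv".format_map(result.named)
print(filepath)  # data/01_raw/germany/companies.csv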

Test project that I've used to test the prototype: https://github.com/merelcht/dataset-factories

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@merelcht self-assigned this May 5, 2023
@merelcht linked an issue May 5, 2023 that may be closed by this pull request
Signed-off-by: Merel Theisen <[email protected]>
Comment on lines +200 to +211
pipeline_datasets = []
for pipe in pipelines.keys():
    pl_obj = pipelines.get(pipe)
    if pl_obj:
        pipeline_ds = pl_obj.data_sets()
        for ds in pipeline_ds:
            pipeline_datasets.append(ds)
    else:
        existing_pls = ", ".join(sorted(pipelines.keys()))
        raise KedroCliError(
            f"'{pipe}' pipeline not found! Existing pipelines: {existing_pls}"
        )
@idanov (Member) commented May 10, 2023:
Arguably shorter and more readable than the original snippet:

Suggested change
ds_groups = [pipe.data_sets() for pipe in pipelines.values() if pipe]
pipeline_datasets = [ds for group in ds_groups for ds in group]

Not sure why we need the error?

Comment on lines +218 to +230
for ds_name, ds_config in catalog_conf.items():
    if "}" in ds_name:
        for pipeline_dataset in set(pipeline_datasets):
            result = parse(ds_name, pipeline_dataset)
            if result:
                config_copy = copy.deepcopy(ds_config)
                # Match results to patterns in catalog entry
                for key, value in config_copy.items():
                    if isinstance(value, Iterable) and "}" in value:
                        string_value = str(value)
                        config_copy[key] = string_value.format_map(result.named)
                catalog_copy[pipeline_dataset] = config_copy

Member:

I find this extremely hard to read without the comments. I think it would be much better if the code actually did what the comments say, step by step, rather than nesting levels.

Moreover, this complexity seems to have hidden a logical incorrectness in the code: datasets that appear exactly as they are in the catalog shouldn't be matched against the patterns, which the original code doesn't ensure.

Suggested change
skip = [ds_name for ds_name in catalog_conf.keys() if "}" not in ds_name]
datasets = set(pipeline_datasets) - set(skip)
patterns = [ds_name for ds_name in catalog_conf.keys() if "}" in ds_name]
matches = [(ds, pattern, parse(pattern, ds)) for ds in datasets for pattern in patterns]
matches = [(ds, pattern, result) for ds, pattern, result in matches if result]
for ds, pattern, result in matches:
    cfg = copy.deepcopy(catalog_conf[pattern])
    # Fill the matched values into the catalog entry's templated fields
    for key, value in cfg.items():
        if isinstance(value, Iterable) and "}" in value:
            string_value = str(value)
            cfg[key] = string_value.format_map(result.named)
    catalog_copy[ds] = cfg

The code for populating the template should probably be a separate function, for readability and also so that we can call it recursively, since each value can be a dictionary containing other values, and so on. A sketch of such a helper follows.
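
A hypothetical recursive helper along those lines; the name and structure are illustrative only, not part of the prototype:

def _materialise(config, named):
    # Recursively fill parse() match results (`named`) into a catalog config
    # template, descending into nested dictionaries and lists.
    if isinstance(config, dict):
        return {key: _materialise(value, named) for key, value in config.items()}
    if isinstance(config, list):
        return [_materialise(item, named) for item in config]
    if isinstance(config, str) and "}" in config:
        return config.format_map(named)
    return config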

The original code does all the matching and doesn't stop at the first match; maybe that's intended, so we can see all possible matches?

@merelcht (Member Author) commented May 11, 2023:

Oh yeah, you can completely ignore this code. As I wrote in the description, this is just a bonus for the prototype to show what's possible, and in no way meant to be merged. I just went for a quick implementation, because this needs proper parsing logic and will change anyway.

@@ -520,6 +543,34 @@ def add_feed_dict(self, feed_dict: Dict[str, Any], replace: bool = False) -> Non

self.add(data_set_name, data_set, replace)

def match_name_against_dataset_factories(self, dataset_input_name: str) -> Optional[AbstractDataSet]:
Member:
Crazily long name 😅

"""
dataset = None
# Loop through all dataset patterns and check if the given dataset name has a match.
for dataset_name, dataset_config in self.dataset_patterns.items():
Member:
The dataset_name here is a pattern and the dataset_config is a template/factory, right? We could use that terminology for ease and readability.

Comment on lines +559 to +569
config_copy = copy.deepcopy(dataset_config)
# Match results to patterns in catalog entry
for key, value in config_copy.items():
    # Find all dataset fields that need to be resolved with
    # the values that were matched.
    if isinstance(value, Iterable) and "}" in value:
        string_value = str(value)
        # result.named: {'root_namespace': 'germany', 'dataset_name': 'companies'}
        # format_map fills in dict values into a string with {...} placeholders
        # of the same key name.
        config_copy[key] = string_value.format_map(result.named)
Member:
As above, I'd make this a function for readability and call it recursively down the line, returning the materialised config dictionary.

        config_copy[key] = string_value.format_map(result.named)
# Create dataset from catalog config.
dataset = AbstractDataSet.from_config(dataset_name, config_copy)
return dataset
Member:
Probably worth caching the match here before returning and adding it to self._data_sets.
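
A minimal sketch of that caching, reusing the prototype's dataset_input_name parameter and the catalog's _data_sets dictionary; otherwise illustrative:

# Create dataset from catalog config, then cache the resolved instance
# so the pattern only has to be matched once per dataset name.
dataset = AbstractDataSet.from_config(dataset_name, config_copy)
self._data_sets[dataset_input_name] = dataset
return dataset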

Comment on lines +622 to +633
if self.dataset_patterns:
    dataset_list_minus_matched = []
    for dataset in dataset_list:
        # If dataset matches a pattern, remove it from the list.
        for dataset_name in self.dataset_patterns.keys():
            result = parse(dataset_name, dataset)
            if result:
                break
        else:
            dataset_list_minus_matched.append(dataset)
    return set(dataset_list_minus_matched)
return dataset_list
Member:

This is much nicer:

Suggested change
if not self.dataset_patterns:
    return dataset_list
dataset_list_minus_matched = set(dataset_list)
for dataset in dataset_list:
    matches = (parse(pattern, dataset) for pattern in self.dataset_patterns.keys())
    # If the dataset matches any pattern, remove it from the set.
    if any(matches):
        dataset_list_minus_matched -= {dataset}
return dataset_list_minus_matched

As a general rule:

  1. The shorter case in an if statement should always go first.
  2. If there's a return, there's no need for an else (like in the original code).
  3. There's usually a nicer solution than using a break in a loop (though not always).

@@ -255,6 +261,7 @@ class to be loaded is specified with the key ``type`` and their
>>> catalog.save("boats", df)
"""
data_sets = {}
dataset_patterns = {}
Contributor:
What is the difference between these two patterns?

@merelcht (Member Author):
Closing this prototype PR, because the proper implementation is now open: #2635

@merelcht closed this Jun 13, 2023
@merelcht deleted the dataset-factories-prototype branch August 8, 2023 12:28
Successfully merging this pull request may close these issues: Prototype dataset factories

3 participants