
[PROTOTYPE NOT TO BE MERGED] Dataset factories prototype #2560

Closed · wants to merge 14 commits

Conversation

@merelcht (Member) commented May 5, 2023

Description

#2510

Prototype for the dataset factories proposal in #2423.
This doesn't include any sophisticated parsing rules; it just uses the basic parse() method from https://github.com/r1chardj0n3s/parse#readme.

Development notes

  • When the DataCatalog class instance gets instantiated from the config in https://github.com/kedro-org/kedro/blob/807c4e64fb/kedro/framework/context/context.py#L287, it now checks for entries that are patterns (they contain "}"), skips those during dataset creation, and instead adds them to a new "dataset_patterns" dictionary kept on the catalog.
  • Any dataset that matches a pattern gets resolved at load time and is then added to the catalog as a proper dataset instance, so resolution only happens once and not at every access. (See the sketch after this list for how the matching works.)
  • I'm not super happy yet with remove_pattern_matches(): it's a helper method to exclude any datasets that match a pattern from the runner checks for missing inputs, free outputs, and unregistered datasets that need a default dataset.
  • Bonus: added two CLI commands (really ugly code, but just to show what could be possible): kedro catalog show prints the raw catalog config files to the CLI, and kedro catalog resolve resolves any patterns in the catalog config file against the datasets used in the pipeline and prints the result to the CLI.
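
A minimal sketch of the matching mechanics, assuming the parse library linked above; the pattern and filepath here are hypothetical, not taken from the prototype:

from parse import parse

# A catalog entry name containing "}" is treated as a pattern.
result = parse("{namespace}.{dataset_name}", "germany.companies")
print(result.named)  # {'namespace': 'germany', 'dataset_name': 'companies'}

# format_map then fills the matched values into templated config fields:
filepath = "data/01_raw/{namespace}/{dataset_name}.csv".format_map(result.named)
print(filepath)  # data/01_raw/germany/companies.csv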

Test project that I've used to test the prototype: https://github.com/merelcht/dataset-factories

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@merelcht self-assigned this May 5, 2023
@merelcht linked an issue May 5, 2023 that may be closed by this pull request
Signed-off-by: Merel Theisen <[email protected]>
Comment on lines +200 to +211
pipeline_datasets = []
for pipe in pipelines.keys():
    pl_obj = pipelines.get(pipe)
    if pl_obj:
        pipeline_ds = pl_obj.data_sets()
        for ds in pipeline_ds:
            pipeline_datasets.append(ds)
    else:
        existing_pls = ", ".join(sorted(pipelines.keys()))
        raise KedroCliError(
            f"'{pipe}' pipeline not found! Existing pipelines: {existing_pls}"
        )
@idanov (Member) commented May 10, 2023:
Arguably shorter and more readable than the original snippet:

Suggested change
ds_groups = [pipe.data_sets() for pipe in pipelines.values() if pipe]
pipeline_datasets = [ds for group in ds_groups for ds in group]

Not sure why we need the error?

Comment on lines +218 to +230
for ds_name, ds_config in catalog_conf.items():
    if "}" in ds_name:
        for pipeline_dataset in set(pipeline_datasets):
            result = parse(ds_name, pipeline_dataset)
            if result:
                config_copy = copy.deepcopy(ds_config)
                # Match results to patterns in catalog entry
                for key, value in config_copy.items():
                    if isinstance(value, Iterable) and "}" in value:
                        string_value = str(value)
                        config_copy[key] = string_value.format_map(result.named)
                catalog_copy[pipeline_dataset] = config_copy

Member:

I find this extremely hard to read without the comments. I think it would be much better if the code actually did what the comments say, step by step, rather than nesting levels.

Moreover, this complexity seems to have hidden a logical incorrectness in the code: datasets that appear exactly as they are in the catalog shouldn't be matched against the patterns, which the original code doesn't ensure.

Suggested change
skip = [ds_name for ds_name in catalog_conf.keys() if "}" not in ds_name]
datasets = set(pipeline_datasets) - set(skip)
patterns = [ds_name for ds_name in catalog_conf.keys() if "}" in ds_name]
matches = [(ds, pattern, parse(pattern, ds)) for ds in datasets for pattern in patterns]
matches = [(ds, pattern, result) for ds, pattern, result in matches if result]
for ds, pattern, result in matches:
    cfg = copy.deepcopy(catalog_conf[pattern])
    # Fill the matched values into the catalog entry's templated fields
    for key, value in cfg.items():
        if isinstance(value, Iterable) and "}" in value:
            string_value = str(value)
            cfg[key] = string_value.format_map(result.named)
    catalog_copy[ds] = cfg

The code for populating the template should probably be a separate function, for readability and also so that we can call it recursively, since each value can be a dictionary containing other values, and so on. A sketch of such a helper follows.
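
A hypothetical recursive helper along those lines; the name and structure are illustrative only, not part of the prototype:

def _materialise(config, named):
    # Recursively fill parse() match results (`named`) into a catalog config
    # template, descending into nested dictionaries and lists.
    if isinstance(config, dict):
        return {key: _materialise(value, named) for key, value in config.items()}
    if isinstance(config, list):
        return [_materialise(item, named) for item in config]
    if isinstance(config, str) and "}" in config:
        return config.format_map(named)
    return config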

The original code does all the matching and doesn't stop at the first match; maybe that's intended, so we can see all possible matches?

@merelcht (Member Author) commented May 11, 2023:

Oh yeah, you can completely ignore this code. As I wrote in the description, this is just a bonus for the prototype to show what's possible, and in no way meant to be merged. I just went for a quick implementation, because this needs proper parsing logic and will change anyway.

@@ -520,6 +543,34 @@ def add_feed_dict(self, feed_dict: Dict[str, Any], replace: bool = False) -> Non

self.add(data_set_name, data_set, replace)

def match_name_against_dataset_factories(self, dataset_input_name: str) -> Optional[AbstractDataSet]:
Member:
Crazily long name 😅

"""
dataset = None
# Loop through all dataset patterns and check if the given dataset name has a match.
for dataset_name, dataset_config in self.dataset_patterns.items():
Member:
The dataset_name here is a pattern and the dataset_config is a template/factory, right? We could use that terminology for ease and readability.

Comment on lines +559 to +569
config_copy = copy.deepcopy(dataset_config)
# Match results to patterns in catalog entry
for key, value in config_copy.items():
    # Find all dataset fields that need to be resolved with
    # the values that were matched.
    if isinstance(value, Iterable) and "}" in value:
        string_value = str(value)
        # result.named: {'root_namespace': 'germany', 'dataset_name': 'companies'}
        # format_map fills in dict values into a string with {...} placeholders
        # of the same key name.
        config_copy[key] = string_value.format_map(result.named)
Member:
As above, I'd make this a function for readability and call it recursively down the line, returning the materialised config dictionary.

        config_copy[key] = string_value.format_map(result.named)
# Create dataset from catalog config.
dataset = AbstractDataSet.from_config(dataset_name, config_copy)
return dataset
Member:
Probably worth caching the match here before returning and adding it to self._data_sets.
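
A minimal sketch of that caching, reusing the prototype's dataset_input_name parameter and the catalog's _data_sets dictionary; otherwise illustrative:

# Create dataset from catalog config, then cache the resolved instance
# so the pattern only has to be matched once per dataset name.
dataset = AbstractDataSet.from_config(dataset_name, config_copy)
self._data_sets[dataset_input_name] = dataset
return dataset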

Comment on lines +622 to +633
if self.dataset_patterns:
    dataset_list_minus_matched = []
    for dataset in dataset_list:
        # If dataset matches a pattern, remove it from the list.
        for dataset_name in self.dataset_patterns.keys():
            result = parse(dataset_name, dataset)
            if result:
                break
        else:
            dataset_list_minus_matched.append(dataset)
    return set(dataset_list_minus_matched)
return dataset_list
Member:

This is much nicer:

Suggested change
if not self.dataset_patterns:
    return dataset_list
dataset_list_minus_matched = set(dataset_list)
for dataset in dataset_list:
    matches = (parse(pattern, dataset) for pattern in self.dataset_patterns.keys())
    # If the dataset matches any pattern, remove it from the set.
    if any(matches):
        dataset_list_minus_matched -= {dataset}
return dataset_list_minus_matched

As a general rule:

  1. The shorter case in an if statement should always go first.
  2. If there's a return, there's no need for an else (like in the original code).
  3. There's usually a nicer solution than using a break in a loop (though not always).

@@ -255,6 +261,7 @@ class to be loaded is specified with the key ``type`` and their
>>> catalog.save("boats", df)
"""
data_sets = {}
dataset_patterns = {}
Contributor:
What is the difference between these two patterns?

@merelcht (Member Author):
Closing this prototype PR, because the proper implementation is now open: #2635

@merelcht closed this Jun 13, 2023
@merelcht deleted the dataset-factories-prototype branch August 8, 2023 12:28
Successfully merging this pull request may close these issues: Prototype dataset factories

3 participants