[DataCatalog]: Refactor dataset factory resolution logic #3925

ElenaKhaustova · 2024-06-04T17:10:01Z

Description

The current design complicates dataset pattern resolution and causes confusion:
Resolution logic lives in the private _get_dataset() method, which pushes users toward the private API because getting the same result through the public exists() method is not straightforward.
Developers often forget that dataset factory resolution requires calling _get_dataset(), which leads to bugs.
Resolution logic is duplicated between the DataCatalog class and the CLI, which makes it harder to maintain.

We propose:

Move the resolution logic out of _get_dataset(), make it consistent across all modules, and expose it to users via the public API.
Explore the feasibility of implementing simpler resolution logic for dataset factories to ensure that datasets are resolved when needed without iterating through all of them.
Enhance documentation for advanced users to clearly explain the dataset resolution process and the usage of dataset factories.
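
To make the first proposal concrete, here is a minimal sketch of what a standalone resolution component could look like. It is an illustration only, not Kedro's implementation: the `PatternResolver` class, its `match_pattern()` and `resolve()` methods, the `_fill()` helper, and the "more literal characters wins" ordering are all hypothetical, and the sketch uses plain regular expressions rather than whatever matching Kedro uses internally.

```python
import re
from typing import Any, Dict, Optional


def _fill(config: Any, fields: Dict[str, str]) -> Any:
    """Recursively substitute {placeholder} values into a config template."""
    if isinstance(config, dict):
        return {key: _fill(value, fields) for key, value in config.items()}
    if isinstance(config, str):
        for name, value in fields.items():
            config = config.replace("{" + name + "}", value)
    return config


class PatternResolver:
    """Hypothetical resolver that lives outside the catalog (assumption)."""

    def __init__(self, patterns: Dict[str, Dict[str, Any]]) -> None:
        # Assumed ordering rule: patterns with more literal text are tried first.
        self._patterns = dict(
            sorted(patterns.items(), key=lambda item: -self._specificity(item[0]))
        )

    @staticmethod
    def _specificity(pattern: str) -> int:
        # Count the characters that are not part of a {placeholder}.
        return len(re.sub(r"\{.*?\}", "", pattern))

    @staticmethod
    def _to_regex(pattern: str) -> "re.Pattern[str]":
        # "{name}_csv" becomes ^(?P<name>.+)_csv$
        parts = re.split(r"\{(\w+)\}", pattern)
        regex = ""
        for index, part in enumerate(parts):
            # Odd indices hold placeholder names, even indices hold literal text.
            regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        return re.compile(f"^{regex}$")

    def match_pattern(self, ds_name: str) -> Optional[str]:
        """Return the first pattern that matches ds_name, or None."""
        for pattern in self._patterns:
            if self._to_regex(pattern).match(ds_name):
                return pattern
        return None

    def resolve(self, ds_name: str) -> Optional[Dict[str, Any]]:
        """Return the config for ds_name with placeholders filled in, or None."""
        pattern = self.match_pattern(ds_name)
        if pattern is None:
            return None
        fields = self._to_regex(pattern).match(ds_name).groupdict()
        return _fill(self._patterns[pattern], fields)


resolver = PatternResolver(
    {
        "{name}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{name}.csv"},
        "{default}": {"type": "MemoryDataset"},
    }
)
config = resolver.resolve("reviews_csv")
print(config)
```

With a component like this, the catalog, the CLI, and plugins could all call the same public `resolve()` instead of each reaching into a private method.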

This issue also relates to a more global question raised by @astrojuanlu: "The most important philosophical question here is "opening up" the DataCatalog abstraction and make datasets first-class citizens, and not an implementation detail. This was mentioned as far back as 2022 #1778 (comment)"

Context

Kedro-Viz case

After obtaining the catalog, Kedro-Viz populates its catalog repositories. At this point it hits a limitation: the DataCatalog does not include datasets resolved from factory patterns. To work around this, Kedro-Viz uses methods like pipeline.data_sets() and pipeline.datasets() to access the datasets and then calls _get_dataset() on them. The workaround is needed because the public API cannot lazily load datasets and resolve factory patterns, which Kedro-Viz requires, especially before starting the server.

https://github.com/kedro-org/kedro-viz/blob/8fe5fa4810bb639013222d4bf1da3d9d337fb6d3/package/kedro_viz/data_access/managers.py#L72

MLFlow case

The kedro-mlflow plugin calls the DataCatalog.exists() method in the after_pipeline_run hook to resolve factory patterns so that it can log pipeline artifacts. The maintainers find this unintuitive and note that people often forget to do it, which leads to bugs that are hard to find.

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/framework/hooks/mlflow_hook.py#L365
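
The pitfall behind both cases can be shown with a toy model. The sketch below is not Kedro code: `ToyCatalog`, its `names()` method, and the suffix-based matching are hypothetical simplifications, and `exists()` here only reports whether a name can be resolved. It illustrates how a pattern-backed dataset stays invisible until some call happens to trigger resolution as a side effect.

```python
class ToyCatalog:
    """Toy stand-in for a catalog with lazy factory resolution (not Kedro code)."""

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly declared datasets
        self._patterns = dict(patterns)  # suffix -> config template (simplified)

    def names(self):
        # Only reports datasets that have already been materialised.
        return sorted(self._datasets)

    def _get_dataset(self, name):
        # Resolution is hidden here: a pattern match registers the dataset.
        if name not in self._datasets:
            for suffix, template in self._patterns.items():
                if name.endswith(suffix):
                    self._datasets[name] = {**template, "name": name}
                    break
            else:
                raise KeyError(name)
        return self._datasets[name]

    def exists(self, name):
        # Public method whose side effect is to resolve the pattern.
        try:
            self._get_dataset(name)
        except KeyError:
            return False
        return True


catalog = ToyCatalog(
    datasets={"raw": {"type": "MemoryDataset"}},
    patterns={"_csv": {"type": "CSVDataset"}},
)
before = catalog.names()        # pattern-backed dataset is not listed yet
catalog.exists("reviews_csv")   # called only for its side effect
after = catalog.names()         # now the resolved dataset appears
print(before, after)
```

A caller that forgets the `exists()` call sees only the explicitly declared datasets, which is the kind of silent omission described in the two cases above.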

Logic duplication

Currently, dataset factory resolution logic resides in two places: the DataCatalog._get_dataset() method and the list_datasets() CLI.

kedro/kedro/io/data_catalog.py

Line 385 in 27f5405

def _get_dataset(

kedro/kedro/framework/cli/catalog.py

Line 85 in 27f5405

# resolve any factory datasets in the pipeline

This makes the logic hard to maintain and keep consistent: every change has to be made in two places, including the tests.

An example of such PR: #3859


ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 4, 2024

ElenaKhaustova added this to the Redesign the API for IO (catalog) milestone Jun 4, 2024

iamelijahko mentioned this issue Jun 6, 2024

Research summary of insights for redesigning Kedro's data catalog API #3934

Open

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Open

ElenaKhaustova self-assigned this Aug 2, 2024

This was referenced Aug 5, 2024

Design DataCatalog2.0 #3995

Open

[DataCatalog2.0]: Draft of AbstractDataCatalog and KedroDataCatalog (work in progress) #4070

Closed

[DataCatalog2.0]: Refactor catalog CLI (work in progress) #4071

Closed

ElenaKhaustova mentioned this issue Aug 20, 2024

[DataCatalog]: Move pattern resolution logic outside of DataCatalog #4110

Closed

ElenaKhaustova mentioned this issue Aug 29, 2024

[DataCatalog2.0]: Move pattern resolution logic to the separate component #4123

Merged

7 tasks

merelcht assigned noklam, merelcht and ankatiyar Sep 2, 2024

ElenaKhaustova closed this as completed Sep 12, 2024
