[DataCatalog2.0]: Move pattern resolution logic - project cli #4124

ElenaKhaustova · 2024-08-29T09:46:39Z

Description

This PR is part of #4110 and builds on top of #4123.

It serves as an example of the changes required to keep both implementations side by side, so that users can switch between them and use either one; see the related comment.

For the reviewers: this PR does not include tests, please see the suggested order of work in this comment

Development notes

This PR modifies the context, session, runners and the kedro run command to make them compatible with KedroDataCatalog and DataCatalogConfigResolver.

Currently, one can use the old DataCatalog as usual without any changes, or use KedroDataCatalog with kedro run by adding the following flag:

kedro run --new_catalog

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>

astrojuanlu · 2024-08-30T10:52:06Z

kedro/framework/cli/project.py

@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None:
    help=PARAMS_ARG_HELP,
    callback=_split_params,
 )
+@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP)


How does this flag play along with the DATA_CATALOG_CLASS defined in settings.py?

kedro/kedro/templates/project/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/settings.py

Lines 44 to 46 in 7a16e1a

# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog

I did this temporarily to simplify running the project with both catalogs without modifying DATA_CATALOG_CLASS for the project before running it. This flag and DATA_CATALOG_CLASS_NEW will go away later on.
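
A minimal sketch of how a temporary flag like this could choose between the two settings. Only the names DATA_CATALOG_CLASS and DATA_CATALOG_CLASS_NEW come from this thread; `OldCatalog`, `NewCatalog`, `Settings` and `pick_catalog_class` are hypothetical stand-ins and not Kedro's actual code.

```python
from dataclasses import dataclass


class OldCatalog:
    """Stand-in for the existing DataCatalog (hypothetical)."""


class NewCatalog:
    """Stand-in for KedroDataCatalog (hypothetical)."""


@dataclass(frozen=True)
class Settings:
    """Toy settings object mirroring the two setting names from the thread."""

    DATA_CATALOG_CLASS: type = OldCatalog
    DATA_CATALOG_CLASS_NEW: type = NewCatalog


def pick_catalog_class(settings: Settings, new_catalog: bool = False) -> type:
    """Return the catalog class to instantiate for this run."""
    if new_catalog:
        return settings.DATA_CATALOG_CLASS_NEW
    return settings.DATA_CATALOG_CLASS


print(pick_catalog_class(Settings()).__name__)
print(pick_catalog_class(Settings(), new_catalog=True).__name__)
```

Without the flag, the project-level setting alone decides which class is used, which is the existing mechanism the review comments refer to.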

idanov · 2024-09-02T12:40:31Z

kedro/framework/cli/project.py

@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None:
    help=PARAMS_ARG_HELP,
    callback=_split_params,
 )
+@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP)


Is this suggestion just for the PoC, or is it meant to be included in the final code? Adding a temporary flag just for the new catalog brings very little benefit for the API surface it adds. Removing ANY flag from the CLI is a breaking change, including this one. Could we just use the existing mechanism, as pointed out by @astrojuanlu?

idanov · 2024-09-02T12:42:45Z

kedro/runner/runner.py

+
+        # Seems like it's not needed even for DataCatalog
+        # catalog = catalog.shallow_copy()


Will that stay in? Or just for the PoC?

It should go IMO, as it doesn't seem useful, but I was going to double-check why it was there.

idanov · 2024-09-02T12:47:26Z

kedro/runner/runner.py

@@ -99,9 +101,13 @@ def run(
            )

        # Identify MemoryDataset in the catalog
+        catalog_datasets = (


That's exactly the kind of thing an abstract DataCatalog (or a Protocol to this end) will prevent. Having code to check what type of instance we are working with is a bit of an anti-pattern in most cases (bar some exceptional cases emulating pattern matching). Could we somehow create a minimal interface which is used within Kedro, give it a name and program against it?
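
A minimal sketch of the Protocol idea, assuming a hypothetical `CatalogProtocol` with only `load`, `save` and `exists`. Which methods the real minimal interface should contain is exactly the open design question, so the method set here is an assumption, and `InMemoryCatalog` and `run_step` are illustrative stand-ins.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CatalogProtocol(Protocol):
    """Hypothetical minimal interface a runner could program against."""

    def load(self, name: str) -> Any: ...

    def save(self, name: str, data: Any) -> None: ...

    def exists(self, name: str) -> bool: ...


class InMemoryCatalog:
    """Toy catalog; satisfies CatalogProtocol structurally, no inheritance."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data


def run_step(catalog: CatalogProtocol, source: str, target: str) -> None:
    """Copy one dataset to another using only the protocol's methods."""
    catalog.save(target, catalog.load(source))


catalog = InMemoryCatalog()
catalog.save("raw", [1, 2, 3])
run_step(catalog, "raw", "copy")
```

Code written against such an interface never needs to check which concrete catalog class it received.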

idanov · 2024-09-02T12:51:29Z

kedro/runner/runner.py

@@ -380,7 +398,7 @@ def _find_initial_node_group(pipeline: Pipeline, nodes: Iterable[Node]) -> list[

 def run_node(
    node: Node,
-    catalog: DataCatalog,
+    catalog: DataCatalog | KedroDataCatalog,


If this is for the PoC, could we (just temporarily) come up with something like:

type AbstractDataCatalog = DataCatalog | KedroDataCatalog

And then simply use that everywhere until we design the AbstractDataCatalog?
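
Note that the `type` statement is Python 3.12+ syntax (PEP 695). A minimal sketch of the same temporary alias for older interpreters, using `typing.Union`; the two classes below are empty stand-ins for illustration, and `describe` is a hypothetical helper.

```python
from typing import Union


class DataCatalog:
    """Stand-in for kedro.io.DataCatalog (hypothetical, for illustration)."""


class KedroDataCatalog:
    """Stand-in for the new catalog class (hypothetical, for illustration)."""


# Plain assignment works on interpreters that predate the `type` statement.
AbstractDataCatalog = Union[DataCatalog, KedroDataCatalog]


def describe(catalog: AbstractDataCatalog) -> str:
    """Accept either catalog implementation through the shared alias."""
    return type(catalog).__name__
```

Signatures such as `run_node` would then reference the single alias, so swapping it for a real abstract class later touches only one definition.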

merelcht · 2024-09-03T10:54:18Z

kedro/runner/runner.py

@@ -110,9 +116,10 @@ def run(
        free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

        # Register the default dataset pattern with the catalog


How does this work for the new Catalog?

All patterns are managed by DataCatalogConfigResolver, which stores dataset_patterns, default_pattern and runtime_patterns explicitly:

kedro/kedro/io/catalog_config_resolver.py

Line 98 in 506470a

self._runtime_patterns: Patterns = {}

Previously, we used a shallow copy only to add runtime_patterns (the line you're pointing to). Now we have a dedicated method for that:

kedro/kedro/io/catalog_config_resolver.py

Line 246 in 506470a

def add_runtime_patterns(self, dataset_patterns: Patterns) -> None:

We add runtime patterns to the DataCatalogConfigResolver at the session level, where the runner is already initialised, so the runner does not depend on DataCatalogConfigResolver:

kedro/kedro/framework/session/session.py

Line 409 in 506470a

catalog_config_resolver.add_runtime_patterns(runner._extra_dataset_patterns)
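
A toy sketch of the arrangement described above, not Kedro's implementation. Only the `add_runtime_patterns` signature and the `_extra_dataset_patterns` attribute name come from the snippets quoted here; `ToyConfigResolver`, `ToyRunner`, `list_patterns`, the ordering it returns and the example pattern values are assumptions made for illustration.

```python
from typing import Any

Patterns = dict[str, dict[str, Any]]


class ToyConfigResolver:
    """Toy resolver that keeps the three pattern groups separate."""

    def __init__(self, dataset_patterns: Patterns, default_pattern: Patterns) -> None:
        self._dataset_patterns: Patterns = dict(dataset_patterns)
        self._default_pattern: Patterns = dict(default_pattern)
        self._runtime_patterns: Patterns = {}

    def add_runtime_patterns(self, dataset_patterns: Patterns) -> None:
        """Merge extra patterns supplied at run time, e.g. by a runner."""
        self._runtime_patterns = {**self._runtime_patterns, **dataset_patterns}

    def list_patterns(self) -> list[str]:
        """Return pattern names: dataset, then default, then runtime."""
        return [
            *self._dataset_patterns,
            *self._default_pattern,
            *self._runtime_patterns,
        ]


class ToyRunner:
    """Toy runner exposing the patterns it wants registered (hypothetical)."""

    _extra_dataset_patterns: Patterns = {"{default}": {"type": "MemoryDataset"}}


resolver = ToyConfigResolver({"{name}_csv": {"type": "pandas.CSVDataset"}}, {})
# The session-level step: hand the runner's patterns to the resolver.
resolver.add_runtime_patterns(ToyRunner()._extra_dataset_patterns)
print(resolver.list_patterns())
```

Because the hand-off happens outside the runner, the runner only has to expose its patterns and never needs a reference to the resolver or a copy of the catalog.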

ElenaKhaustova · 2024-09-09T15:39:27Z

Replaced by #4123 and #4151.

ElenaKhaustova added 15 commits August 21, 2024 16:36

Added --new_catalog key for run command (18c3464)
Implemented _get_catalog_new from the context (7282979)
Updated context and session (efe5a6d)
Simplified dataset init (50c862f)
Updated runner (e018b4b)
Updated sequential runner (8f0471c)
Updated rich markup (01c89c0)
Make SequentialRunner work (1e9bdd9)
Update ThreadRunner (e9ed105)
Updated ParallelRunner (9e93e0e)
Merge branch '4110-move-pattern-resolution-logic' into 4110-move-pattern-resolution-logic-project-cli (1741083)
Merge branch '4110-move-pattern-resolution-logic' into 4110-move-pattern-resolution-logic-project-cli (e2322ca)
Merge branch '4110-move-pattern-resolution-logic' into 4110-move-pattern-resolution-logic-project-cli (4addf3f)
Refactored context and session (2380c4a)
Refactored runners (da6d693)

ElenaKhaustova changed the title ~~Move pattern resolution logic project cli~~ [DataCatalog2.0]: Move pattern resolution logic project cli Aug 29, 2024

ElenaKhaustova changed the title ~~[DataCatalog2.0]: Move pattern resolution logic project cli~~ [DataCatalog2.0]: Move pattern resolution logic - project cli Aug 29, 2024

This was referenced Aug 29, 2024

[DataCatalog2.0]: Run project with new catalog (work in progress) #4084 (Closed)
[DataCatalog2.0]: Move pattern resolution logic - catalog cli #4130 (Closed)
Design DataCatalog2.0 #3995 (Open)

ElenaKhaustova marked this pull request as ready for review August 29, 2024 22:08

ElenaKhaustova requested a review from merelcht as a code owner August 29, 2024 22:08

ElenaKhaustova requested review from astrojuanlu, idanov, noklam, DimedS, lrcouto and ankatiyar August 29, 2024 22:09

astrojuanlu reviewed Aug 30, 2024

Merge branch '4110-move-pattern-resolution-logic' into 4110-move-pattern-resolution-logic-project-cli (506470a)

idanov requested changes Sep 2, 2024

merelcht reviewed Sep 3, 2024

ElenaKhaustova closed this Sep 9, 2024
