Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog2.0]: Move pattern resolution logic - project cli #4124

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Aug 29, 2024

Description

This PR is a part of #4110 and is done on top of #4123

It serves as an example of changes required to keep both implementations together so that users can switch between them and use both, see the related comment

For the reviewers: this PR does not include tests, please see the suggested order of work in this comment

Development notes

This PR includes modification of context, session, runners and kedro run command to be compatible with KedroDataCatalog and DataCatalogConfigResolver.

Currently, one can use old DataCatalog as usual without any changes or use KedroDataCatalog with kedro run by adding the following flag:

kedro run --new_catalog

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
…ern-resolution-logic-project-cli

Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova changed the title Move pattern resolution logic project cli [DataCatalog2.0]: Move pattern resolution logic project cli Aug 29, 2024
@ElenaKhaustova ElenaKhaustova changed the title [DataCatalog2.0]: Move pattern resolution logic project cli [DataCatalog2.0]: Move pattern resolution logic - project cli Aug 29, 2024
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review August 29, 2024 22:08
@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None:
help=PARAMS_ARG_HELP,
callback=_split_params,
)
@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this flag play along with the DATA_CATALOG_CLASS defined in settings.py?

# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this suggestion just for the PoC or to be included in the final code? Adding a temporary flag just for the new catalog has a very low API surface to benefit ratio. Removing ANY flag from the CLI is a breaking change, including this. Could we just utilise the existing methodology as pointed by @astrojuanlu ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this temporarily to simplify running the project with both catalogs without modifying DATA_CATALOG_CLASS for the project before running it. This flag and DATA_CATALOG_CLASS_NEW will go away later on.

@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None:
help=PARAMS_ARG_HELP,
callback=_split_params,
)
@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this suggestion just for the PoC or to be included in the final code? Adding a temporary flag just for the new catalog has a very low API surface to benefit ratio. Removing ANY flag from the CLI is a breaking change, including this. Could we just utilise the existing methodology as pointed by @astrojuanlu ?

Comment on lines +86 to +88

# Seems like it's not needed even for DataCatalog
# catalog = catalog.shallow_copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will that stay in? Or just for the PoC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should go IMO as it doesn't seem useful, but I was going to double-check why it was there

@@ -99,9 +101,13 @@ def run(
)

# Identify MemoryDataset in the catalog
catalog_datasets = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly the kind of thing an abstract DataCatalog (or a Protocol to this end) will prevent. Having code to check what type of instance we are working with is a bit of an anti-pattern in most cases (bar some exceptional cases emulating pattern matching). Could we somehow create a minimal interface which is used within Kedro, give it a name and program against it?

@@ -380,7 +398,7 @@ def _find_initial_node_group(pipeline: Pipeline, nodes: Iterable[Node]) -> list[

def run_node(
node: Node,
catalog: DataCatalog,
catalog: DataCatalog | KedroDataCatalog,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for the PoC, could we (just temporarily) come up with something like:

type AbstractDataCatalog = DataCatalog | KedroDataCatalog

And then simply use that everywhere until we design the AbstractDataCatalog?

@@ -110,9 +116,10 @@ def run(
free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

# Register the default dataset pattern with the catalog
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work for the new Catalog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the patterns are managed by DataCatalogConfigResolver which has dataset_patterns, default_pattern and runtime_patterns and stores them explicitly -

self._runtime_patterns: Patterns = {}

Previously we used shallow copy just to add runtime_patterns (line that you're pointing to). Now we have a dedicated method for that -

def add_runtime_patterns(self, dataset_patterns: Patterns) -> None:

We add runtime patterns to the DataCatalogConfigResolver at the level of the session where runner is already initialised, so runner does not depend on DataCatalogConfigResolver.

catalog_config_resolver.add_runtime_patterns(runner._extra_dataset_patterns)

@ElenaKhaustova
Copy link
Contributor Author

Replaced with #4123 and #4151

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants