-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataCatalog2.0]: Move pattern resolution logic - project cli #4124
[DataCatalog2.0]: Move pattern resolution logic - project cli #4124
Conversation
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
…ern-resolution-logic-project-cli
…ern-resolution-logic-project-cli Signed-off-by: Elena Khaustova <[email protected]>
…ern-resolution-logic-project-cli
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None: | |||
help=PARAMS_ARG_HELP, | |||
callback=_split_params, | |||
) | |||
@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this flag play along with the DATA_CATALOG_CLASS
defined in settings.py
?
Lines 44 to 46 in 7a16e1a
# Class that manages the Data Catalog. | |
# from kedro.io import DataCatalog | |
# DATA_CATALOG_CLASS = DataCatalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this suggestion just for the PoC or to be included in the final code? Adding a temporary flag just for the new catalog has a very low API surface to benefit ratio. Removing ANY flag from the CLI is a breaking change, including this. Could we just utilise the existing methodology as pointed by @astrojuanlu ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this temporarily to simplify running the project with both catalogs without modifying DATA_CATALOG_CLASS
for the project before running it. This flag and DATA_CATALOG_CLASS_NEW
will go away later on.
…ern-resolution-logic-project-cli
@@ -199,6 +200,7 @@ def package(metadata: ProjectMetadata) -> None: | |||
help=PARAMS_ARG_HELP, | |||
callback=_split_params, | |||
) | |||
@click.option("--new_catalog", "new_catalog", is_flag=True, help=NEW_CATALOG_ARG_HELP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this suggestion just for the PoC or to be included in the final code? Adding a temporary flag just for the new catalog has a very low API surface to benefit ratio. Removing ANY flag from the CLI is a breaking change, including this. Could we just utilise the existing methodology as pointed by @astrojuanlu ?
|
||
# Seems like it's not needed even for DataCatalog | ||
# catalog = catalog.shallow_copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will that stay in? Or just for the PoC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should go IMO as it doesn't seem useful, but I was going to double-check why it was there
@@ -99,9 +101,13 @@ def run( | |||
) | |||
|
|||
# Identify MemoryDataset in the catalog | |||
catalog_datasets = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly the kind of thing an abstract DataCatalog (or a Protocol to this end) will prevent. Having code to check what type of instance we are working with is a bit of an anti-pattern in most cases (bar some exceptional cases emulating pattern matching). Could we somehow create a minimal interface which is used within Kedro, give it a name and program against it?
@@ -380,7 +398,7 @@ def _find_initial_node_group(pipeline: Pipeline, nodes: Iterable[Node]) -> list[ | |||
|
|||
def run_node( | |||
node: Node, | |||
catalog: DataCatalog, | |||
catalog: DataCatalog | KedroDataCatalog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is for the PoC, could we (just temporarily) come up with something like:
type AbstractDataCatalog = DataCatalog | KedroDataCatalog
And then simply use that everywhere until we design the AbstractDataCatalog
?
@@ -110,9 +116,10 @@ def run( | |||
free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets) | |||
|
|||
# Register the default dataset pattern with the catalog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work for the new Catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the patterns are managed by DataCatalogConfigResolver
which has dataset_patterns
, default_pattern
and runtime_patterns
and stores them explicitly -
kedro/kedro/io/catalog_config_resolver.py
Line 98 in 506470a
self._runtime_patterns: Patterns = {} |
Previously we used shallow copy just to add runtime_patterns
(line that you're pointing to). Now we have a dedicated method for that -
kedro/kedro/io/catalog_config_resolver.py
Line 246 in 506470a
def add_runtime_patterns(self, dataset_patterns: Patterns) -> None: |
We add runtime patterns to the DataCatalogConfigResolver
at the level of the session where runner is already initialised, so runner does not depend on DataCatalogConfigResolver
.
kedro/kedro/framework/session/session.py
Line 409 in 506470a
catalog_config_resolver.add_runtime_patterns(runner._extra_dataset_patterns) |
Description
This PR is a part of #4110 and is done on top of #4123
It serves as an example of changes required to keep both implementations together so that users can switch between them and use both, see the related comment
For the reviewers: this PR does not include tests, please see the suggested order of work in this comment
Development notes
This PR includes modification of
context
,session
,runners
andkedro run
command to be compatible withKedroDataCatalog
andDataCatalogConfigResolver
.Currently, one can use old
DataCatalog
as usual without any changes or useKedroDataCatalog
withkedro run
by adding the following flag:Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file