[DataCatalog2.0]: Move pattern resolution logic to the separate component #4123

Merged
ElenaKhaustova merged 89 commits into main from 4110-move-pattern-resolution-logic on Sep 12, 2024

Conversation

ElenaKhaustova
Contributor

@ElenaKhaustova ElenaKhaustova commented Aug 29, 2024

Description

Solves #4110

Relates to #3925

Please see the suggested order of work in this comment

Development notes

This PR includes the following:

  1. Moving the pattern resolution logic to a separate component, CatalogConfigResolver
  2. Updating DataCatalog to use CatalogConfigResolver internally
  3. Some refactoring so that the kedro run and kedro catalog commands keep working after the catalog updates
  4. The DataCatalog interface remains the same, so no breaking changes are introduced
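
To make the idea of pattern resolution concrete, here is a minimal sketch of a stand-alone resolver that matches a dataset name against a factory-style pattern and fills in the config. It is not Kedro's CatalogConfigResolver: the class ToyPatternResolver, its resolve method, and the example pattern are all hypothetical, and the sketch assumes each placeholder appears only once per pattern.

```python
import re
from typing import Any, Dict, Optional


class ToyPatternResolver:
    """Hypothetical stand-in for a pattern resolver (not Kedro's API)."""

    def __init__(self, patterns: Dict[str, Dict[str, Any]]) -> None:
        self._patterns = patterns

    @staticmethod
    def _to_regex(pattern: str) -> "re.Pattern[str]":
        # Split "{name}_csv" into literals (even indexes) and
        # placeholder names (odd indexes), then build a named-group regex.
        parts = re.split(r"\{(\w+)\}", pattern)
        pieces = [
            re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
            for i, part in enumerate(parts)
        ]
        return re.compile("".join(pieces))

    def resolve(self, ds_name: str) -> Optional[Dict[str, Any]]:
        """Return the config for the first matching pattern, or None."""
        for pattern, config in self._patterns.items():
            match = self._to_regex(pattern).fullmatch(ds_name)
            if match is None:
                continue
            groups = match.groupdict()
            # Substitute captured values into string-valued config entries.
            return {
                key: value.format(**groups) if isinstance(value, str) else value
                for key, value in config.items()
            }
        return None


resolver = ToyPatternResolver(
    {"{name}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{name}.csv"}}
)
resolved = resolver.resolve("cars_csv")
```

Keeping this logic in its own class means a catalog only has to ask the resolver for a config by name, which is the kind of separation the list above describes.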

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review September 9, 2024 15:53
@ElenaKhaustova ElenaKhaustova self-assigned this Sep 10, 2024
kedro/framework/cli/catalog.py (outdated, resolved)
kedro/framework/cli/catalog.py (outdated, resolved)
kedro/framework/cli/catalog.py (outdated, resolved)
for ds_name in datasets:
is_param = ds_name.startswith("params:") or ds_name == "parameters"
if ds_name in explicit_datasets or is_param:
for ds_name in pipeline_datasets:
Member

Not a big fan of continue since it harms the flow of thinking while reading the code. Could we do this instead:

for ds_name in (pipeline_datasets - explicit_datasets - filter(is_parameter, pipeline_datasets)):

Contributor Author

We will have to do this since explicit_datasets is a dictionary:

for ds_name in (pipeline_datasets - set(explicit_datasets.keys()) - filter(is_parameter, pipeline_datasets)):

and for me, it looks way more complex to understand than

for ds_name in pipeline_datasets:
    if ds_name in explicit_datasets or is_parameter(ds_name):
        continue

It's also slightly worse from the complexity point of view though I agree both cost nothing.

Curious what others think.
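
For comparison, the two styles discussed above can be sketched side by side. This is an illustration only: is_parameter mirrors the check quoted in the diff context, while the function names and sample data are hypothetical. Note that subtracting a filter object from a set raises a TypeError in Python, so the set-based variant has to wrap the filter in set() as well as convert the dictionary.

```python
from typing import Dict, List, Set


def is_parameter(ds_name: str) -> bool:
    # Same check as in the diff context above.
    return ds_name.startswith("params:") or ds_name == "parameters"


def unresolved_with_continue(
    pipeline_datasets: Set[str], explicit_datasets: Dict[str, dict]
) -> List[str]:
    """Loop style: skip explicit datasets and parameters with `continue`."""
    result = []
    for ds_name in pipeline_datasets:
        if ds_name in explicit_datasets or is_parameter(ds_name):
            continue
        result.append(ds_name)
    return result


def unresolved_with_sets(
    pipeline_datasets: Set[str], explicit_datasets: Dict[str, dict]
) -> Set[str]:
    """Set style: both operands must be sets, hence the two set() calls."""
    parameters = set(filter(is_parameter, pipeline_datasets))
    return pipeline_datasets - set(explicit_datasets) - parameters


# Hypothetical sample data for illustration.
sample_pipeline = {"cars", "trucks", "params:alpha", "parameters"}
sample_explicit = {"cars": {"type": "pandas.CSVDataset"}}
```

Both functions select the same names; the loop does one pass with constant-time membership checks, while the set version builds two temporary sets first.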

kedro/framework/session/session.py (outdated, resolved)
kedro/io/catalog_config_resolver.py (outdated, resolved)
kedro/io/catalog_config_resolver.py (outdated, resolved)
Member

@merelcht merelcht left a comment

This is looking really good @ElenaKhaustova! I love the separation of dataset factories resolution logic from the catalog ⭐

I left some comments and questions, mostly around naming/positioning.

kedro/io/data_catalog.py (resolved)
kedro/io/data_catalog.py (outdated, resolved)
kedro/io/data_catalog.py (outdated, resolved)
kedro/io/data_catalog.py (outdated, resolved)
kedro/io/data_catalog.py (resolved)
tests/io/test_data_catalog.py (resolved)
kedro/io/catalog_config_resolver.py (outdated, resolved)
kedro/io/catalog_config_resolver.py (outdated, resolved)
kedro/io/catalog_config_resolver.py (outdated, resolved)
Member

@merelcht merelcht left a comment

Thanks for addressing my comments @ElenaKhaustova, I don't have any blocking remarks anymore. Don't forget to add this change to the release notes!

kedro/io/data_catalog.py (resolved)
kedro/io/data_catalog.py (resolved)
tests/io/test_data_catalog.py (resolved)
Contributor

@ankatiyar ankatiyar left a comment

Great work @ElenaKhaustova ⭐ 👏🏾 💯
Left a small comment but looks good!

kedro/io/catalog_config_resolver.py (outdated, resolved)
Member

@astrojuanlu astrojuanlu left a comment

I see that there are no changes in DataCatalog.from_config, only in DataCatalog.__init__ (rarely used in interactive contexts) and so my understanding is that this has zero user impact 👍🏼

self._dataset_patterns = dataset_patterns or {}
self._config_resolver = config_resolver or CatalogConfigResolver()

# Kept to avoid breaking changes
Member

👍
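
The constructor lines quoted in this thread use an optional-argument fallback: a resolver is accepted if the caller passes one, otherwise a default is created. A rough sketch of that pattern follows, under stated assumptions: ToyResolver and ToyCatalog are hypothetical stand-ins, and the way the legacy _dataset_patterns attribute is populated here is a guess for illustration, not the code merged in this PR.

```python
from typing import Dict, Optional


class ToyResolver:
    """Hypothetical resolver standing in for CatalogConfigResolver."""

    def __init__(self, patterns: Optional[Dict[str, dict]] = None) -> None:
        self.patterns = dict(patterns or {})


class ToyCatalog:
    """Hypothetical catalog showing the optional-resolver fallback."""

    def __init__(
        self,
        dataset_patterns: Optional[Dict[str, dict]] = None,
        config_resolver: Optional[ToyResolver] = None,
    ) -> None:
        # Use the injected resolver if given, otherwise build a default one.
        # Assumption: the default is seeded from the legacy argument.
        self._config_resolver = config_resolver or ToyResolver(dataset_patterns)
        # Legacy attribute kept so code that reads it does not break
        # (assumption: it simply mirrors the resolver's patterns).
        self._dataset_patterns = self._config_resolver.patterns


default_catalog = ToyCatalog()
shared_resolver = ToyResolver({"{name}_csv": {"type": "pandas.CSVDataset"}})
injected_catalog = ToyCatalog(config_resolver=shared_resolver)
legacy_catalog = ToyCatalog(dataset_patterns={"{name}_csv": {}})
```

Because the new argument is optional and the old attribute is still set, existing callers of such a constructor would see no difference, which matches the "zero user impact" reading above.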

@ElenaKhaustova ElenaKhaustova merged commit 7e02653 into main Sep 12, 2024
41 checks passed
@ElenaKhaustova ElenaKhaustova deleted the 4110-move-pattern-resolution-logic branch September 12, 2024 10:03
5 participants