Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kedro catalog factory list CLI command #2796

Conversation

AhdraMeraliQB
Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB commented Jul 14, 2023

Ignore the target branch for now, #2793 to be merged first.

Description

This PR introduces the command `kedro catalog factory list <edited> which lists out all dataset factories in the catalog config in order of how they are matched (i.e. if an explicitly named dataset could match several of the factories in the catalog, it would be resolved with the first on the list.)

The priority is determined as such:

  1. Decreasing specificity (number of characters outside the curly brackets)
  2. Decreasing number of placeholders (number of curly bracket pairs)
  3. Alphabetically

Question for reviewers

An alternative suggestion for the name of the command was kedro catalog patterns. I went with factories to attempt to streamline the terminology used in the docs and other communications, but am otherwise impartial. Any thoughts on this?

Added note on streamlining the terminology

It looks like dataset factory is equiv to factory pattern, catalog factory, dataset pattern, dataset factory pattern and others that have been used interchangeably. As per #2670 any mentions of a catalog entry that makes use of placeholders will be referred to as a dataset factory. Catalog factories refer to all dataset factories within a specific catalog.

What this means in practice / Why this is useful

As seen in the test case, a catalog is created with the following patterns:

  1. an_example_{place}_{holder},
  2. an_example_{placeholder},
  3. an_{example_placeholder},
  4. on_{example_placeholder},

The explicitly declared an_example_data_set could match any of patterns 1, 2, or 3: an_example_{data}_{set}, an_example_{data_set}, an_{example_data_set}. Priority matching means that it will be resolved with the first pattern.
kedro catalog factories allows users to confirm the order in which factory patterns are considered for matching.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB linked an issue Jul 14, 2023 that may be closed by this pull request
Ahdra Merali added 5 commits July 14, 2023 11:14
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review July 14, 2023 10:44
Signed-off-by: Ahdra Merali <[email protected]>
@ankatiyar
Copy link
Contributor

Implementation wise, manually tested and it works as expected! 💯
Suggestion on the name of the commands -
Could we do something like kedro catalog factories --list and kedro catalog factories --resolve for #2791? IMO kedro catalog factories does not explain what it does enough.
We could even do kedro catalog list --factories for this. Would like to hear what everyone else thinks?

@merelcht
Copy link
Member

Implementation wise, manually tested and it works as expected! 💯 Suggestion on the name of the commands - Could we do something like kedro catalog factories --list and kedro catalog factories --resolve for #2791? IMO kedro catalog factories does not explain what it does enough. We could even do kedro catalog list --factories for this. Would like to hear what everyone else thinks?

I think this is an interesting point. I tend to agree that kedro catalog factories isn't very self explanatory. It would be good to get @amandakys's designer point of view. We need to ensure this is consistent with the other CLI commands, just like the exercise we did for kedro run and all the flags there.

@merelcht
Copy link
Member

This new command should also be added to the docs with a short explanation.

@amandakys
Copy link

amandakys commented Jul 14, 2023

I had a look at the existing 'list' style commands and we have:

kedro catalog list - list datasets per pipeline per type

  • also has the optional argument --pipelines that allows you to further narrow down the list of datasets returned by pipeline name.

kedro registry list - list all registered pipelines in your project
we also have kedro registry describe which outputs all the nodes in a specified pipeline


As a note here - it could be more intuitive to reduce the number of keywords by changing kedro registry describe to be kedro registry list --pipelines=<pipeline-name> or kedro register list --nodes --pipelines=<pipeline-name>. There is potential for discussion here on our use of arguments, when to use as a input vs as an argument to a --pipeline flag. If there was interest in this work i'd be happy to do an audit


The convention appears to be putting the verb at the end of the command with additional parameters being used to narrow down the output. So initial thoughts would be that kedro catalog factory list is the most consistent.

However, it does depend on whether there are future plans for providing different types of outputs to allow users to better understand their catalog i.e. returning a full 'expanded' catalog, or providing a dataset name and returning which pattern it matched. In this case we could consider the addition of another key word like resolve or adding arguments to the list command like --datasets

@astrojuanlu
Copy link
Member

astrojuanlu commented Jul 14, 2023

quick xref #2402 and #2723

(This is only tangentially related to the current discussion, but adding them since I saw a mention of kedro registry list)

@merelcht
Copy link
Member

Thanks for the detailed analysis @amandakys ! I think reading this and keeping in mind that we have #2791 coming up I would suggest going for kedro catalog factory list and then the command for #2791 can be kedro catalog factory resolve.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I left some small comments. When those are resolved + the name of the command this will be pretty much ready for merging 🙂

kedro/framework/cli/catalog.py Outdated Show resolved Hide resolved
tests/framework/cli/test_catalog.py Show resolved Hide resolved
Ahdra Merali added 2 commits July 19, 2023 09:49
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB requested review from noklam and SajidAlamQB and removed request for ankatiyar July 19, 2023 09:04
@AhdraMeraliQB AhdraMeraliQB changed the title Add kedro catalog factories CLI command Add kedro catalog factory list CLI command Jul 19, 2023
@AhdraMeraliQB AhdraMeraliQB marked this pull request as draft July 19, 2023 09:16
@AhdraMeraliQB
Copy link
Contributor Author

AhdraMeraliQB commented Jul 19, 2023

Fixed!
Coverted back to draft as currently only works with kedro catalog "factory list" and not kedro catalog factory list

Ahdra Merali added 2 commits July 19, 2023 10:20
Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation looks good and concise the test are also well-structured and verifies the core functionality. I've just left some minor comments and nitpicks. I think it will be ready to merge once the the other reviewer comments are also addressed. Happy to approve once thats done. 😄

kedro/framework/cli/catalog.py Outdated Show resolved Hide resolved
kedro/framework/cli/catalog.py Outdated Show resolved Hide resolved
tests/framework/cli/test_catalog.py Show resolved Hide resolved
Ahdra Merali added 5 commits July 20, 2023 10:49
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB
Copy link
Contributor Author

AhdraMeraliQB commented Jul 20, 2023

Still in draft mode as I've got another question for reviewers:

I've refactored test_get_cli_structure_help with two approaches, iterative and recursive. I'm not particularly partial to either method; whilst iteration is faster and more memory-efficient, we are not handling any considerable amounts of data with this test.

Does anyone have any opinions on this?

@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review July 20, 2023 17:20
@AhdraMeraliQB AhdraMeraliQB marked this pull request as draft July 20, 2023 17:20
@AhdraMeraliQB
Copy link
Contributor Author

Also to note - the code that is tested by test_get_cli_structure_help (and the tests themselves) are duplicated in kedro-telemetry which brings two things:

  1. Any refactoring decided here should be applied there
  2. Why do we have duplicate code? Can it be removed from Kedro framework?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @AhdraMeraliQB ! ⭐ The recursive approach in the cli test is great 👍

tests/tools/test_cli.py Outdated Show resolved Hide resolved
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review July 21, 2023 14:01
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementations look all good, I have other opinion but I have seen most people agree with the current naming, so I don't want to block it because of this.

I just want to leave my comment. I am not a fan of kedro catalog factory list because it is hard to remember and I have never seen any cli that is 4 levels deep. While data factory is something we called, I really think this should be just an advance feature of catalog. Similarly we have kedro pipeline instead of kedro modular_pipeline.

In additional, having everything as one command is easier to discover and read in documentations compare to multiple commands. I feel like this is a bit similar to our discussion with pipeline structure (flat vs hierarchical)

@astrojuanlu
Copy link
Member

I agree with @noklam that kedro catalog factory list seemed unusually long for me as well. @amandakys point above about consistency is important though 🤔 @noklam do you have any alternatives in mind, taking into account the discussion above?

I agree we should try to limit bikeshedding

@noklam
Copy link
Contributor

noklam commented Jul 24, 2023

In my mind I prefer just kedro catalog list and introduce new arguments - - resolve for the addition functions

Pro:

  • consistent with the current command, less intrusive.

Cons:

  • May blurred things a bit but I don't think this is too important.

If we are going to change these CLI, it's a breaking change and we cannot add it later. Introducing a new command is not breaking but it is more intrusive we introduce a new one and later reactor it into other commands.

@AhdraMeraliQB
Copy link
Contributor Author

After discussion with @merelcht, @noklam, @amandakys, @ankatiyar, and @astrojuanlu we have decided to implement this functionality under the command kedro catalog rank, and keep #2791 as kedro catalog resolve. As this affects several choices made on this branch, I will be closing this in favour of a new, clean PR.

@AhdraMeraliQB AhdraMeraliQB deleted the feat/add-catalog-pattens-command branch September 8, 2023 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add kedro catalog rank command
7 participants