Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update starters to use new /conf structure #2753

Closed
Tracked by #2752
amandakys opened this issue Jul 3, 2023 · 6 comments · Fixed by kedro-org/kedro-starters#141
Closed
Tracked by #2752

Update starters to use new /conf structure #2753

amandakys opened this issue Jul 3, 2023 · 6 comments · Fixed by kedro-org/kedro-starters#141
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@amandakys
Copy link

amandakys commented Jul 3, 2023

Description

We want to update the new /conf folder structure to look like

├── conf
    └── base
	├── catalog_{pipeline-name}.yml          #blank files
	└── parameters_{pipeline-name}.yml       #blank files
        ├── catalog.yml          #blank files   
	└── parameters.yml       #blank files

with separate catalog and parameters files for each pipeline in the project

rather than

├── conf
    └── base
        └── parameters.yml          
	    ├── {pipeline-name}.yml     #blank files
        ├── catalog.yml              #blank files   
	└── parameters.yml           #blank files

This change needs to be made across the starters

This is a follow up on the work to update the kedro pipeline create command #2752

@amandakys amandakys added the Issue: Feature Request New feature or improvement to existing feature label Jul 3, 2023
@deepyaman
Copy link
Member

What goes in catalog_{pipeline-name}.yml? By default, how would a user know whether to put it in one or the other (especially for datasets at pipeline boundaries)? Should I define all of the datasets for a pipeline in the respective catalog_{pipeline-name}.yml? Does where I define the dataset have any effect for things like micro packaging?

Feel like it's these are "hard" questions to know what to do, especially as a newer user. Apologies if this was already discussed/clear.

@astrojuanlu
Copy link
Member

The scope of this is replacing

├── conf
    └── base
        └── pipeline_name
            └── catalog.yml

with

├── conf
    └── base
	├── catalog_{pipeline-name}.yml

so, no new complexity or indirection is added, we just make the file hierarchy more flat.

You raise a good point but that's not what we are intending to solve here, I'd say. @amandakys correct me if I'm wrong.

@amandakys
Copy link
Author

amandakys commented Jul 12, 2023

@astrojuanlu to clarify in the current state the kedro pipeline create command does not create a new catalog file for the pipeline, it only creates a parameters.yml file so the existing state would look like

├── conf
    └── base
        └── parameters
            └── {pipeline-name}.yml

In the structure proposed in this ticket, it is expected for all non pipeline specific parameters and catalog entries to go in the non pipeline specific catalog.yml and parameters.yml files. We could also choose to name them catalog_global.yml or something similar to better indicate their function.

├── conf
    └── base
        └── catalog.yml
        └── parameters.yml
        └── catalog_{pipeline-name}.yml
        └── parameters_{pipeline-name}.yml

To your point about complexity @deepyaman - I don't think this introduced any ambiguity that doesn't already exist as users are currently free to define the scope of their catalog files. The behaviour of kedro pipeline create has previously already introduced a convention regarding the scope of config files to be pipeline specific. For parameters, the separation between global parameters and pipeline specific parameters already exists.

For new users in particular, the source of confusion being that they would be unsure as to where to put catalog entries is potentially minimised as it introduces the concept of pipeline-specific separation. The extant default would be to place all their catalog entries into base/catalog.yml until such time that this becomes untenable, and this would be linked to other work about managing the excessive length of catalog files #2423

Alternatively we could continue with the trend of not providing platform specific catalog files upon creating a new pipeline. In which case we would only provide the parameters_{pipeline-name}.yml but not the catalog_{pipeline-name}.yml and leave that for users to add if they want (which is what they currently have to do)

@astrojuanlu
Copy link
Member

Quick comment on one of the points:

We could also choose to name them catalog_global.yml or something similar to better indicate their function.

I'd be -1 on that, since it can create ambiguity with a pipeline called global.

@astrojuanlu
Copy link
Member

I'm going to vote on continuity over consistency here. I agree splitting catalog files puts them at the same level as parameters, which increases consistency and informs users that they can do such a thing. But I also agree with @deepyaman and @yetudada that it might create confusion for arguably the most important file of the whole project. On top of that, we were not creating separate catalog files before, so by continuing to use a "global" one by default, we keep the same behavior.

@DimedS DimedS self-assigned this Jul 31, 2023
@DimedS
Copy link
Contributor

DimedS commented Aug 2, 2023

I discovered that currently, we have only one starter (spaceflights) that contains a 'parameters' folder with two pipeline_name.yml files inside. There are no starters with a catalog folder. So, it seems that we only need to move and rename the two spaceflights YAML files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants