Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for customization of pipeline templates #2701

Merged
merged 71 commits into from
Sep 12, 2023
Merged

Allow for customization of pipeline templates #2701

merged 71 commits into from
Sep 12, 2023

Conversation

jasonmhite
Copy link
Contributor

@jasonmhite jasonmhite commented Jun 19, 2023

Implement ability to override the cookiecutter template used to generate pipelines so that the user can customize them. Controlled via settings.PIPELINE_TEMPLATE_PATH.

Description

The current implementation provides a way to make custom starter templates, however the template used to generate new pipelines is hardcoded. My team would like to be able to customize the way pipelines are generated to a format that is more amenable to our workflow. In particular, we would also like to be able to include this as part of the starter project template, so that our team members can have the new pipeline template automatically when they use our starter. Per the discussion in #2543 this is currently not possible any other way than modifying Kedro because the pipeline template path is hardcoded, hence this PR.

Development notes

This only requires a very small change to _create_pipeline in kedro/framework/cli/pipeline.py. When populating the template_path variable, my patch first checks settings.py to see if variable PIPELINE_TEMPLATE_PATH has been set. If it has been, it uses that value as the path for the pipeline cookiecutter template. If it is not set, it falls back to the old default of Path(kedro.__file__).parent / "templates" / "pipeline", thus the change should be completely backwards-compatible.

I'm marking this PR as a work in progress because I'm not sure if there is a better way to do the check in settings.py. I'd also like some feedback on what the preferred way to document and test this feature.

Actually implementing a working template is up to the user. This repo demonstrates what I want, and I have tested that this template works perfectly. I have also tested with overriding the pipeline template in an existing project. I have further tested that the patch doesn't break any of my existing projects that don't attempt to override the pipeline template.

It's a bit out of scope for this PR since it's more how you would implement the actual starter template, but as a note you need to make sure cookiecutter doesn't attempt to render the embedded pipeline template in your starter. You can see how I did this in cookiecutter.json, but the tl;dr is you add the template path to the _copy_without_render array like so:

{
    "project_name": "New Kedro Project",
    "repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
    "python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
    "kedro_version": "{{ cookiecutter.kedro_version }}",
    "_copy_without_render": [
        ".templates"
    ]
}

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@jasonmhite
Copy link
Contributor Author

@idanov Any thoughts?

@astrojuanlu
Copy link
Member

Hi @jasonmhite, sorry we thought this was still in progress because it's marked as draft. Is it ready for discussion?

@jasonmhite
Copy link
Contributor Author

@astrojuanlu Yeah, I was hoping someone could take a look before I marked it as ready to review. Specifically I wanted to resolve three things first:

  1. Are you ok with the way I did an attribute lookup on settings.py or is there a more preferred way?
  2. What's the best place to document this feature?
  3. What do you wanna see in the way of tests?

@datajoely
Copy link
Contributor

Just want to say, users have been asking for this for a long time and this is a neat solution @jasonmhite

@noklam noklam self-requested a review June 23, 2023 09:48
@noklam
Copy link
Contributor

noklam commented Jun 23, 2023

Thank you @jasonmhite for the great start!

  • Are you ok with the way I did an attribute lookup on settings.py or is there a more preferred way?
    I find that more intuitive to have it as an extra arguments in the CLI. It will be a bit magical if the CLI reading config from projects, I don't think we did this to any CLI commands.
    Also, do you end up in a project specific pipeline template or usually shared across multiple project? As I expect this to be closer to a starter, so using project's settings seems a bit weird to me.
  • What's the best place to document this feature?

I think it's best to update it in CLI and piplines doc. i.e. create_a_pipeline.md, modular_pipeline.md. I don't know what's the best way to showcase this feature, as this is more an advance and opinionated feature, each company/team may have their own convention. Cc @stichbury for idea.

In addition, we need to document WHY one may do this and what it is useful for. We also need to make it clear that user cannot change the parameters file arbitrarily (It could break ConfigLoader), I think this is mostly about the pipeline.py right?

  • What do you wanna see in the way of tests?
    If we are going the CLI way, you should be able to find existing tests. The new option need to be tested, at least for a success and failing case. Where is this "template" gonna sit in? I think it's best to support these options.
  • Local / path
  • Git

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Jun 23, 2023

@noklam So the goal for me is that we want to distribute the pipeline template as part of a starter among collaborators. So my team can share a common starter and then "automatically" they get the pipeline template we prefer. This includes a slightly different file layout for the pipeline code and also adding some standard imports of internal tools we use, but there are lots of other possible uses. I put it in settings.py since that is easy to distribute with a starter and in my mind that really is a project specific setting; I might mix-and-match different templates across projects depending on what I'm doing (but never within) and this is the simplest way I could think of that covers all our uses without breaking stuff. The new pipeline template is embedded in the demo starter I linked in the original issue, though I guess there's no reason why it has to be within a starter and the way I implemented things is pretty flexible.

I'm kind of averse to making it an extra flag in the CLI because if I understand what you're suggesting, every time we made a new pipeline users would have to remember to kedro pipeline create --pipeline-template=.templates or something like that, which is kind of tedious. It also seems to be a lot more complicated to implement. I do think catching an AttributeError reading from settings.py and falling back to the default is kind of clunky but it seemed to be the simplest option that doesn't break anything.

I don't really envision that this is a feature a normal user would use directly, in my mind it's just a place that someone writing custom starter can use if they also want to override the way pipelines are generated. It seems like a mistake to hardcode the pipeline template when starter templates are possible, but I also doubt it's something users will often want to manually override.

Regarding docs/testing, good point on the parameters files being a bit sensitive. I do see this as a "developer" feature akin to writing plugins and not something a normal user would do routinely, which is why I was curious where you think it's best to document. But it's also a bit of a "you'd better know what you're doing if you want to write a custom pipeline template" situation, so there are likely to be some gotchas.

@antonymilne
Copy link
Contributor

Thank you very much for opening a PR for discussion @jasonmhite and for all the careful thought you've obviously put into this! I didn't even think of the obvious problem about needing to use _copy_without_render, so amazing work realising that and figuring out the solution also 💯

My initial reaction here is similar to @noklam, that I was expecting tempate_path to be a CLI flag on kedro pipeline create rather than a variable in settings.py. I do see why you've done it this way, and I could be persuaded that it's a good approach, but my intuition is that a CLI flag would be better here. Would be very interested in @idanov's thoughts on whether this is a good use of settings.py or not.

Let me propose something that would satisfy both our preference for a CLI flag and your requirement for not needing to manually specify template_path every time. How about template_path has a default value that will look in <your_project_root>/templates/pipeline? This way you would still be able to ship your pipeline template with a starter (without editing settings.py) and just do kedro pipeline create to use it, just like you want to do. You could also create pipelines from multiple templates in the same project this way by doing kedro pipeline create --template_path=somewhere/else/pipeline. I get that this might not be a requirement for you but I think might be a legitimate use case for other people, and the settings.py approach does not enable it.

@jasonmhite
Copy link
Contributor Author

@antonymilne Actually I'm perfectly fine with defaulting to checking for a static templates path within the project directory, it covers my needs. I assume the flow you're imagining is to first check for the existence of something like <project_root>/templates/pipeline and if the folder doesn't exist fall back to the previous default global path? I certainly also see why people might want to use multiple templates within a project, so I definitely see the use case for having the CLI flag in addition to the ability to override the default.

I'll update my implementation to do the path checks as described above. I'll also take a look at the CLI code to see how much trouble it is to implement; I'm not super familiar with the Kedro codebase yet so I may need help on that depending on how complicated it is.

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Jun 27, 2023

Annnnd... done. Added a check for a templates directory and a CLI flag. The CLI flag takes precedence, then the templates dir, then fallback to a global default. If that looks like an acceptable solution I can work on documentation/tests.

One concern is this doesn't do much validation to check if the target folder is a valid template, but I'm of the opinion of that it's the user's responsibility. I/we can spell out the limitations in the docs.

@antonymilne

@datajoely
Copy link
Contributor

Separately to this awesome contribution @jasonmhite, it would be great to learn what you (and others) are customising in your pipeline templates :)

@jasonmhite
Copy link
Contributor Author

@datajoely The example project starter+pipeline template I linked in the original comment shows the main stuff. Basically just address some minor nuisances.

  1. We like to combine pipelines and node definitions into one file, at least at initially. Perhaps if this file gets too long we can break it out separately later, but it's nice to start with the simpler layout. This also bypasses having to import new nodes in the pipelines files.
  2. The combined file from 2 we prefer to name pipeline_{blah name}.py. Having 3 or 4 different files named nodes.py or pipeline.py open in the editor is kinda confusing and appending the pipeline name makes it easier to navigate tabs in Vim/PyCharm/etc.
  3. Add some prebaked includes we commonly use.

@antonymilne
Copy link
Contributor

This looks great, thank you very much @jasonmhite! Just what I was imagining. I'm fine with it not doing the validation check to see if it's a valid template, just like you have it now.

There's just one thing I'm not sure about: should we be restricting template_path to a local folder like you do now or do we allow for the full generality of what cookiecutter allows for template_path, which could also be e.g. a link to a github repository.

Making it more general is an easy change:

  • template_path argument just becomes a string rather than a Path
  • we no longer check to see if the directory exists
  • the stuff in _create_pipeline remains as you have it now, since cookiecutter will work with URLs etc. already

Pros of making this change:

  • it's more powerful/general for no real extra work
  • it's more consistent with how we handle starter templates, which also allow this generality

Cons:

  • it's not really necessary since this is a new feature and we don't know if anyone would want to use the more general functionality. So no one will mind if we don't make it general to begin with
  • we'll never achieve the same level of generality as starter templates without also adding checkout and directory arguments, which I don't want to do now since it will be too much additional complexity I think

So overall I don't really mind if we make it general or just keep it as you have it here. wdyt @noklam @astrojuanlu?

@noklam
Copy link
Contributor

noklam commented Jul 2, 2023

I think it would be a nice to have to make it general. I would prefer we can extend it without any breaking change if we are not going to do this now.

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Jul 5, 2023

Sorry for the delay, juggling lots of different things.

Regarding support for cookiecutter templates it doesn't matter to me but I can make that change. I will note that this kinda goes back to the putting a config option settings.py issue vs just checking for a static local directory. If the template dir was a project level setting that would also make it trivial to support pointing to a VCS template as the "default" pipeline for a project, which would be consistent with the way that it would accept a VCS URL on the command line flag. In the current implementation the only option for an embedded default pipeline is a static local folder and I don't see a natural way to let it point to a VCS URL without bumping it up to a setting of some sort.

@noklam noklam marked this pull request as ready for review July 6, 2023 13:33
@noklam noklam requested a review from merelcht as a code owner July 6, 2023 13:33
noklam
noklam previously approved these changes Jul 6, 2023
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy about the PR as it. We can always add more support to it if it turns out to be common interest.

I won't worry too much about the Path issue for now since this is expect to be a CLI arg, even if we later change the type it shouldn't be considered as a breaking change.

@noklam noklam requested a review from astrojuanlu July 6, 2023 13:35
@noklam
Copy link
Contributor

noklam commented Jul 6, 2023

DCO & some linting issue failing

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this contribution @jasonmhite! ⭐ I've read the full conversation and I'm happy to proceed with the implementation like this. We can always make it more general if there's a need for it from users. I'm also happy with not adding validation, but that should be mentioned clearly in the docs so users know what a "valid" template requires.

In terms of tests, it would be good to add some unit tests into tests/framework/cli/pipeline/test_pipeline.py to verify the expected behaviour.

For docs, I would add a description about this behaviour to https://docs.kedro.org/en/stable/kedro_project_setup/starters.html#how-to-create-a-kedro-starter and perhaps also to https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html @stichbury what are your thoughts on this?

@merelcht merelcht linked an issue Jul 7, 2023 that may be closed by this pull request
@jasonmhite
Copy link
Contributor Author

So I've been looking at writing some unit tests. The test framework you guys have set up is fairly intricate, could I get someone more familiar with the harness to help me with writing them?

@noklam
Copy link
Contributor

noklam commented Jul 10, 2023

Sure, maybe start with creating the test without the implementation to see what tests u want to create?

@ankatiyar ankatiyar added the Community Issue/PR opened by the open-source community label Jul 13, 2023
@jasonmhite
Copy link
Contributor Author

Just an update: still working on this, haven't had much time the past couple weeks.

@noklam
Copy link
Contributor

noklam commented Jul 27, 2023

Thanks for the update, let us know if you need help if get stuck!

@noklam
Copy link
Contributor

noklam commented Aug 10, 2023

@jasonmhite just checking is this ready to be review again?

@jasonmhite
Copy link
Contributor Author

@noklam Yes sorry, I have to find time for this on the side. I wrote what I consider sufficient unit tests, but the way I mocked up injecting the template files during the tests is pretty ugly. I was hoping to maybe clean that up a bit, but actually thinking about it maybe it'd be better if I got some help on that from you. Mocking files is not something I've done much of in pytest.

jmnunezd and others added 24 commits September 11, 2023 10:58
* Update on credentials.md

Updating example code in credentials.md to make it usable when copy/pasting. Since as of now it will generate a TypeError

Signed-off-by: Jose <[email protected]>

* adding a more explicit code example, now we directly show the user that project_path is a pathlib.Path object

Signed-off-by: Jose <[email protected]>

* Update docs/source/configuration/credentials.md

Co-authored-by: Jo Stichbury <[email protected]>

---------

Signed-off-by: Jose <[email protected]>
Co-authored-by: Jo Stichbury <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
* Change starters to use OmegaConfigLoader

Signed-off-by: lrcouto <[email protected]>

* Fix linter error

Signed-off-by: lrcouto <[email protected]>

* Suppress import outside toplevel linting for starters template

Signed-off-by: L. R. Couto <[email protected]>

* Fix linter error

Signed-off-by: lrcouto <[email protected]>

* Add changes to release notes

Signed-off-by: lrcouto <[email protected]>

---------

Signed-off-by: lrcouto <[email protected]>
Signed-off-by: L. R. Couto <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
… `ipynb_checkpoints` (#2977)

* Check plugins implement valid hooks

Signed-off-by: Nok Chan <[email protected]>

* Add release note

Signed-off-by: Nok Chan <[email protected]>

* Staging work - add custom functions to check hidden folder and files. Tests still failing

Signed-off-by: Nok Chan <[email protected]>

* Fix test - checkpoints should use the same environment

Signed-off-by: Nok Chan <[email protected]>

* Revert "Check plugins implement valid hooks"

This reverts commit f10bede.

* Update RELEASE.md

Co-authored-by: Ankita Katiyar <[email protected]>

* fix lint

Signed-off-by: Nok <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
…upport defaults `0` or `None` (#2976)

* Add None support to globals

Signed-off-by: Ankita Katiyar <[email protected]>

* Add warning when default value is used

Signed-off-by: Ankita Katiyar <[email protected]>

* Check keys

Signed-off-by: Ankita Katiyar <[email protected]>

* Nok's suggestions

Signed-off-by: Ankita Katiyar <[email protected]>

* Create the test to check the non-existing keys

Signed-off-by: Nok <[email protected]>

* add more tests to catch case when global key is not a dict

Signed-off-by: Nok <[email protected]>

* Fix the null test

Signed-off-by: Nok <[email protected]>

* Introduce sentinel value _NO_VALUE

Signed-off-by: Nok <[email protected]>

* rename test

Signed-off-by: Nok <[email protected]>

* Improve error mesasge and raise InterpolationResolutionError when key does not exist and no default

Signed-off-by: Nok <[email protected]>

* Fix non exist default test

Signed-off-by: Nok <[email protected]>

* Fix test

Signed-off-by: Nok <[email protected]>

* Use omegaconf to replace the custom resolving logic

Signed-off-by: Nok <[email protected]>

* uncommented some tests

Signed-off-by: Nok <[email protected]>

* Remove dead code

Signed-off-by: Ankita Katiyar <[email protected]>

* Update error message

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Nok <[email protected]>
Co-authored-by: Nok <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
* Fix docstrings on kedro/extras/datasets

Signed-off-by: lrcouto <[email protected]>

* Fix formatting error

Signed-off-by: lrcouto <[email protected]>

* Explicitly define code block

Signed-off-by: lrcouto <[email protected]>

* Fix empty line under code block

Signed-off-by: lrcouto <[email protected]>

* Fix formatting error

Signed-off-by: lrcouto <[email protected]>

* Fix broken link

Signed-off-by: lrcouto <[email protected]>

* Bump kedro-datasets version

Signed-off-by: Ankita Katiyar <[email protected]>

* Update links in partitioned and incremental datasets

Signed-off-by: Ankita Katiyar <[email protected]>

* Update links in partitioned and incremental datasets

Signed-off-by: Ankita Katiyar <[email protected]>

* Update links in partitioned dataset

Signed-off-by: Ankita Katiyar <[email protected]>

* Update links in partitioned dataset

Signed-off-by: Ankita Katiyar <[email protected]>

* Add polars.GenericDataSet to .rst

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: lrcouto <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
#2966)

* Minor changes to create a PR and test Vale styles

Signed-off-by: Jo Stichbury <[email protected]>

* fix some vale warnings

Signed-off-by: Jo Stichbury <[email protected]>

---------

Signed-off-by: Jo Stichbury <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
* Resynced

Signed-off-by: rxm7706 <[email protected]>

* restore lower bound for pluggy

Signed-off-by: rxm7706 <[email protected]>

---------

Signed-off-by: rxm7706 <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: rxm7706 <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
* Check plugins implement valid hooks

Signed-off-by: Nok Chan <[email protected]>

* add Metadtahook

Signed-off-by: Nok Chan <[email protected]>

* Fix docs according to comments

Signed-off-by: Nok <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
Co-authored-by: Jo Stichbury <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
… tests (#3010)

* Bump build version

Signed-off-by: Ankita Katiyar <[email protected]>

* Remove telemetry from test default starter

Signed-off-by: Ankita Katiyar <[email protected]>

* Add package_name back

Signed-off-by: Ankita Katiyar <[email protected]>

* Pin build only for 3.7

Signed-off-by: Ankita Katiyar <[email protected]>

* Try upgrade pip

Signed-off-by: Ankita Katiyar <[email protected]>

* Add constraint

Signed-off-by: Ankita Katiyar <[email protected]>

* Update for windows

Signed-off-by: Ankita Katiyar <[email protected]>

* Run build-reqs with backtracking resolver

Signed-off-by: Ankita Katiyar <[email protected]>

* Run build-reqs with backtracking resolver

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Signed-off-by: Jo Stichbury <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
Co-authored-by: Jo Stichbury <[email protected]>
Signed-off-by: Jason Hite <[email protected]>
@jasonmhite
Copy link
Contributor Author

@ankatiyar Should be fixed now.

Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @jasonmhite! ⭐

@ankatiyar ankatiyar merged commit d3171f1 into kedro-org:main Sep 12, 2023
61 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Override new pipeline template