Template filepaths with command line arguments #602

Closed
satyakiroy1992 opened this issue Nov 8, 2020 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@satyakiroy1992

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."
Is there any way to template configurations based on command-line arguments?
I need a data pipeline that has a dynamic folder path inside raw_01. E.g., if I am running the pipeline for store_1, then the data will be in raw_01/{store_01}/file.csv, and similarly for store_2.
I am able to do this with the DataCatalog API, but is there a way to do it with the catalog YML?

Context

Why is this change important to you? How would you use it? How can it benefit other users?
This is useful when you have multiple models for different stores, etc., and the data is organized in folders.

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.
If there were a way to set the globals_dict in TemplatedConfigLoader from the command-line params, I think that could solve the problem.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@satyakiroy1992 satyakiroy1992 added the Issue: Feature Request New feature or improvement to existing feature label Nov 8, 2020
@mzjp2
Contributor

mzjp2 commented Nov 8, 2020

I can get this working (on 0.16.6 at least) by doing the following:

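# conf/base/catalog.yml
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/${folder_name}/iris.csv

# src/<project_name>/hooks.py
from typing import Iterable

import click
from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

# Method of the project's hooks class (hence `self`)
@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    # With silent=True the context is None when not invoked through the CLI
    click_ctx = click.get_current_context(silent=True)
    cli_params = (click_ctx.params.get("params") or {}) if click_ctx else {}
    return TemplatedConfigLoader(conf_paths, globals_dict={
        "folder_name": cli_params.get("folder_name")
    })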

and:

$ kedro run --params folder_name:non-existent-folder-name-cause-error
<snip>
[Errno 2] No such file or directory: '[redacted]/data/non-existent-folder-name-cause-error/iris.csv'

You can also add your own CLI option to your kedro_cli.py instead of hijacking the existing --params option (which also feeds the value into Kedro's parameter namespace), and then change click_ctx.params.get("params") to click_ctx.params.get("my_custom_option").

Is this roughly what you were looking for?
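
To make the substitution step concrete, here is a minimal, standard-library-only sketch of how a `${folder_name}` placeholder in the catalog could be filled from a globals dictionary. It is an illustration of the idea, not Kedro's actual implementation; `fill_placeholders` is a hypothetical helper and the catalog is written as a plain dict rather than loaded from YAML.

```python
import re
from typing import Any, Dict

# Matches placeholders such as ${folder_name}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def fill_placeholders(config: Any, globals_dict: Dict[str, Any]) -> Any:
    """Recursively replace ${key} in every string of a nested config."""
    if isinstance(config, dict):
        return {k: fill_placeholders(v, globals_dict) for k, v in config.items()}
    if isinstance(config, list):
        return [fill_placeholders(v, globals_dict) for v in config]
    if isinstance(config, str):
        # Unknown keys are left untouched so the problem stays visible
        return _PLACEHOLDER.sub(
            lambda m: str(globals_dict.get(m.group(1), m.group(0))), config
        )
    return config


catalog = {
    "example_iris_data": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/${folder_name}/iris.csv",
    }
}
resolved = fill_placeholders(catalog, {"folder_name": "store_01"})
print(resolved["example_iris_data"]["filepath"])  # data/store_01/iris.csv
```

Leaving unknown placeholders in place is a choice made for this sketch only; it keeps a missing key easy to spot in the resulting path.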

@mzjp2
Contributor

mzjp2 commented Nov 8, 2020

Another option: with the flexibility of globals_dict, you don't need to rely on CLI parameters and can use environment variables instead. I find this useful when running in different deployments (dev/prod), for example on cloud infrastructure, where it's simply a matter of setting the environment variables:

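# src/<project_name>/hooks.py
import os
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(conf_paths, globals_dict={
        "folder_name": os.getenv("FOLDER_NAME")
    })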

and having FOLDER_NAME='folder_name' set in my environment variables.

@WaylonWalker
Contributor

Alternatively, I often use steel_toes for things like this. It can append a branch name to the filepath, for instance data/iris_store_01.csv, data/iris_store_02.csv.

I like @mzjp2's solution better for your use case, but this is an alternative.

@satyakiroy1992
Author

Thanks a lot @mzjp2.
I got the first solution working.
I also understand the second solution. Is there any advantage to using an OS environment variable versus setting FOLDER_NAME in a globals.yml with TemplatedConfigLoader and passing the env argument in the CLI?

@WaylonWalker
Contributor

Any advantage of using os environment variable vs setting FOLDER_NAME

In my opinion, it depends on where you are running. For instance, if you are deploying with one of the various Docker options, changing the YAML requires building a new image, but most of the time there is a way to set environment variables for the deployment.
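
The two approaches can also be combined: keep defaults in a globals file and let the deployment override them through the environment. Below is a rough sketch of that layering. It assumes the file-based globals have already been loaded into a dict and that each key maps to an upper-cased environment variable name; `merge_globals` and that naming convention are hypothetical, not part of Kedro.

```python
import os
from typing import Dict, Mapping, Optional


def merge_globals(
    file_globals: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Start from file-based defaults and let environment variables override them."""
    environ = os.environ if environ is None else environ
    merged = dict(file_globals)
    for key in file_globals:
        env_name = key.upper()  # folder_name -> FOLDER_NAME (assumed convention)
        if env_name in environ:
            merged[key] = environ[env_name]
    return merged


defaults = {"folder_name": "store_01"}  # what a globals.yml might contain

# Local run: nothing set in the environment, so the file default is used
local_run = merge_globals(defaults, environ={})

# Deployed run: the container sets FOLDER_NAME, so no new image is needed
deployed = merge_globals(defaults, environ={"FOLDER_NAME": "store_02"})
```

The merged dict could then be passed as globals_dict, so the image keeps its YAML defaults and only the deployment settings change.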

@satyakiroy1992
Author

Ah I see. Thank you.

@everdark

Just want to point out that the solution in the thread using register_config_loader no longer works in 0.18.
Instead, it should be something like:

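from kedro.config import TemplatedConfigLoader

# The loader class must be set explicitly for the args below to apply to it
CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
    "globals_dict": {
        # programmatically create your globals here
    },
}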

in settings.py to change the globals dynamically.
However, I'm not sure how to get the click context in settings.py if you want to pass information from the CLI into the globals instead.
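
One possible workaround that avoids the click context altogether is to pass the values through environment variables set on the command line, as suggested earlier in the thread. The sketch below collects every variable with a given prefix into a globals dictionary. The `KEDRO_GLOBAL_` prefix and the `globals_from_env` helper are assumptions made up for this example, not Kedro features.

```python
import os
from typing import Dict, Mapping, Optional

PREFIX = "KEDRO_GLOBAL_"  # hypothetical naming convention, not a Kedro feature


def globals_from_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Turn KEDRO_GLOBAL_FOLDER_NAME=x into {"folder_name": "x"}."""
    environ = os.environ if environ is None else environ
    return {
        name[len(PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }


# In settings.py this could feed the loader arguments shown above
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
    "globals_dict": globals_from_env(),
}
```

Under these assumptions, a run would look like `KEDRO_GLOBAL_FOLDER_NAME=store_01 kedro run`. It does not read real CLI options, so it only sidesteps the click-context question rather than answering it.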
