Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "globals" functionality to OmegaConfigLoader #2175

Closed
merelcht opened this issue Jan 5, 2023 · 21 comments
Closed

Add "globals" functionality to OmegaConfigLoader #2175

merelcht opened this issue Jan 5, 2023 · 21 comments
Labels
Component: Configuration Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@merelcht
Copy link
Member

merelcht commented Jan 5, 2023

Description

In the new OmegaConfigLoader templating works out of the box for parameters, but not for catalog files. This is because the catalog has to have certain characteristics and is loaded through settings

project_path=self.project_path, conf_dictionary=conf_catalog
)
conf_creds = self._get_config_credentials()
catalog = settings.DATA_CATALOG_CLASS.from_config(
catalog=conf_catalog,
credentials=conf_creds,
load_versions=load_versions,
save_version=save_version,
)
feed_dict = self._get_feed_dict()
catalog.add_feed_dict(feed_dict)

This also means that currently users cannot share "global" values between parameter and catalog configuration files.

Context

Users struggle with large catalog files, so we should make it possible for them to do templating with this new config loader.

This issue only focusses on solving the need of having a central place to store certain values: e.g. file paths, bucket names, constant values.

  • Users need this to reduce the chance of error in catalog configuration files (e.g. having to type a file path 10 times instead of 1).
  • And to have a central location with values that are considered "global"

Questions

  • Do globals need to work across environments? Yes, this is essential functionality.

Possible Implementation

Require template values to be preceded by a special character, e.g. _ so they're not read as a dataset in the catalog validation.

You could then have a catalog like this:

_pandas:
  type: pandas.CSVDataSet

example_iris_data:
  type: ${_pandas.type}
  filepath: data/01_raw/iris.csv

Globals can be "activated" for each config file by adding "globals*" to the config patterns:

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "parameters": ["parameters*", "parameters*/**", "**/parameters*", "globals*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*", "globals*"],
    }
}

⚠️ If we go for this solution and share globals among configuration, then all template values need to be preceded by a special character to skip catalog validation.

Possible Alternative

Add globals functionality similar to that of the TemplatedConfigLoader.

  • Do users make use of both the globals file and globals_dict?
@merelcht merelcht added the Issue: Feature Request New feature or improvement to existing feature label Jan 5, 2023
@merelcht merelcht added this to the Configuration overhaul milestone Jan 5, 2023
@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jan 16, 2023
@MatthiasRoels
Copy link

Continuing our discussion from #2122, I modified my take on the OmegaConfLoader class so that we can use globals. You can find the code on my GitHub here.

For me, there are still two open to-do's:

  1. Expand the logic from _get_merge_strategy so that we can allow different strategies for user-defined config_patterns.
  2. But this is optional, make sure users can tweak the _build_conf_paths method. In this way, we can allow for more complex hierarchical patterns (similar to what Hydra offers)

All feedback is welcome! More than happy to contribute this implementation too!

@noklam noklam changed the title Add templating functionality for catalog in OmegaConfLoader Add templating functionality for catalog in OmegaConfigLoader Mar 14, 2023
@merelcht merelcht changed the title Add templating functionality for catalog in OmegaConfigLoader Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader Mar 15, 2023
@astrojuanlu
Copy link
Member

A user confused by this: conversation in Slack.

@noklam
Copy link
Contributor

noklam commented Mar 22, 2023

Adding another data point from Slack

@noklam
Copy link
Contributor

noklam commented Mar 28, 2023

Issues:

Currently we required a custom configloader to solve the issues. I wasn't sure why we never put the fix in 0.18.x, but I think it's worth to bring this issue up again.
Hacky-solution:

Bonus:
Do we have a better way to avoid "templating" SQL query, often we warn user about the risk of sql injection but maybe there are something we can do from the framework.

This is a question that asked frequently:
https://kedro-org.slack.com/archives/C03RKP2LW64/p1680018283721969
https://kedro-org.slack.com/archives/C03RKP2LW64/p1680517556262889

@noklam
Copy link
Contributor

noklam commented Mar 28, 2023

Question @merelcht
Does this global apply to catalog only? or do we expect it works across all config pattern?
One of the common pattern people do is kedro run --params=path:something, then this path could be used as parameters and catalog at the same time.

@merelcht
Copy link
Member Author

merelcht commented Apr 5, 2023

Summary of the Technical Design discussion on 5/4/2023:

General comments:

  • Allowing "global" values in one place is a desirable feature to have in the OmegaConfigLoader.
  • Users can use globals to have a centralised place for values used frequently in configuration files, reducing the change of spelling mistakes.
  • It also allows users to have a single catalog files and then globals for e.g. production/testing environments.
  • Whatever solution we come up with, it needs to consistent across all configuration files. So any syntax should be the same for parameters and catalog files.
  • The reason for the globals_dict to exist in TemplatedConfigLoader is because it allows you to pass python directly, it's Turing complete, which isn't the case for a globals.yml file.

Comments on the proposed solution above:

  • Do we really need this special character (e.g. _) to indicate it's a templated value? Is there perhaps another, more clever, way to skip catalog validation?
    • If this proves to be very difficult, we could decide to ship a first version with the special character and later on a more sophisticated where the character isn't required anymore and catalog validation is done differently.
    • How to distinguish between things that should be datasets but might have mistakes and things that aren't datasets but template values instead?
  • The pro of using the special character is that it would allow for a super easy implementation to introduce templating into the catalog with OmegaConfigLoader
  • The con of using the special character is that it introduces special Kedro syntax into the omegaconf features.
  • It's very important that globals can work across environments. The proposed solution prototype doesn't enable that, so more changes are required.

Follow up actions:

  • Investigate/prototype alternative to special character. Find another way to skip catalog validation.
  • How are users using the globals_dict and do we need similar functionality for OmegaConfigLoader?

@MatthiasRoels
Copy link

MatthiasRoels commented Apr 5, 2023

Trying to understand the summary of the discussion, but I think it is easier to come up with a generic "example" and describe its desired behaviour. Let's assume we have the following structure:

conf/
  - base/
    - globals.yaml
    - parameters.yaml
    - catalog.yaml
  - prod/
    - globals.yaml
    - parameters.yaml
    - catalog.yaml

When we do e.g. kedro run --env=prod, we want to

  • be able to combine base/globals.yaml and prod/globals.yaml so that we can use the combined globals values in both catalog and parameters, right?
  • when resolving the globals, we don't want the individual global values to show up in either the parameters or catalog dict, for example if we have:
-- globals.yaml
s3_base_uri: ""

we don't want s3_base_uri to show up in our catalog dict when returned by the OmegaConfLoader, so we should be smart about this (hence the proposed special char).

  • Last point we have to think about is the soft-merge functionality as described in Use soft-merge for certain configuration files for OmegaConfLoader #2122. With OmegaConf, the default is in fact to do soft-merge but that might not always be what we want. The question is, how to determine when this is desired and when it isn't? For parameters, this seems useful but it might lead to strange behaviour with e.g. catalog.

Anyway, I hope we can use this "example" to clarify behaviour.

@noklam
Copy link
Contributor

noklam commented Apr 5, 2023

when resolving the globals, we don't want the individual global values to show up in either the parameters or catalog dict, for example if we have:

This is a good point that we didn't discuss explicitly, but I think it's nice to achieve.

@MatthiasRoels
Copy link

This is a good point that we didn't discuss explicitly, but I think it's nice to achieve.

The simple solution: assuming there is always a globals key in front of interpolations e.g. ${globals.s3_base_uri}. If you then load globals as {“globals”: {actual content of globals}}, you can always pop this key before returning the final dict! As a bonus, this globals key serves as your special character (which also allows for easy searching?)

@noklam
Copy link
Contributor

noklam commented Apr 5, 2023

I would categorise it as a variant to the _ suffix

@MatthiasRoels
Copy link

Absolutely, but far easier to filter out!

@merelcht
Copy link
Member Author

The globals suffix is a nice idea, but I'm wondering if it will be readable enough in practice. In omegaconf the dot syntax is handled as a nested structure (https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#config-node-interpolation) so suppose you'd have globals both in your catalog and your parameters values for both would all have to go under the globals key in globals.yml:

catalog.yml

example_iris_data:
  type: ${globals.data_type}
  filepath: data/01_raw/iris.csv
parameters.yml

train_fraction: 0.8
random_state: 2
target_column: ${globals.column}
globals.yml


globals:
  data_type: pandas.CSVDataSet
  column: species

@MatthiasRoels
Copy link

@merelcht, this isn’t exactly what I had in (although this might be cleaner). What I thought about doing was the following:

catalog.yml

example_iris_data:
  type: ${globals.data_type}
  filepath: data/01_raw/iris.csv
parameters.yml

train_fraction: 0.8
random_state: 2
target_column: ${globals.column}
globals.yml

data_type: pandas.CSVDataSet
column: species

And only during loading of globals.yaml, we create an object with a globals key:

globals_dict = {
    “globals”: {
        “data_type”: “pandas.CSVDataSet”,
        “column”: “species”,
    }
}

@merelcht
Copy link
Member Author

@merelcht, this isn’t exactly what I had in (although this might be cleaner). What I thought about doing was the following:

Ah okay I see, thanks for clarifying! That's indeed a nicer solution. I'll do some more thinking about this 😄

@merelcht merelcht changed the title Add templating functionality to allow "global" values in the catalog in OmegaConfigLoader Add "globals" functionality to OmegaConfigLoader Apr 12, 2023
@bgereke
Copy link

bgereke commented Apr 19, 2023

I'm probably off-base with these comments, but adding on the off chance that I'm not. I'm imagining three scenarios:

  1. a single globals.yml that is shared across environments, parameters, and catalog:

    conf/

    • globals.yaml
    • base/
      • parameters.yaml
      • catalog.yaml
    • prod/
      • parameters.yaml
      • catalog.yaml
  2. a separate globals.yml in each environment that is NOT shared across environments but IS shared across parameters and catalog within its own environment:

    conf/

    • base/
      • globals.yaml
      • parameters.yaml
      • catalog.yaml
    • prod/
      • globals.yaml
      • parameters.yaml
      • catalog.yaml
  3. similar to 2, but the globals.yml are resolved in some way across environments (e.g., if a global is missing from prod/globals.yml, it could be taken from base/globals.yml

My personal opinion based on prior experience w/ Kedro is that option 3 might overcomplicate things and result in unintended behavior when users don't fully understand what/how parameters are being shared/inherited across environments. I've run into an issue in the past with spark.yml when handing my kedro project off to an MLE to orchestrate with Airflow. In that case, we created a prod environment with the spark.yml deleted. When he attempted to set the spark configs on the cluster from Airflow, he was confused why none of his configs were taking effect. The reason turned out to be that his configs were being overwritten by the spark.yml in the base environment and we ended up modifying SparkHooks to prevent this override.

@noklam
Copy link
Contributor

noklam commented Apr 19, 2023

@bgereke Thanks for your comments. I have a couple follow up questions.

a separate globals.yml in each environment that is NOT shared across environments but IS shared across parameters and catalog within its own environment:

Does that mean things like "sparks" should be excluded? and why do you need a global only for parameters and catalog?
What will be defined in globals.yml and base/globals.yml separately?

When it's only file-base, I think it's obvious that local config should always win. i.e. local > base > globals. But which config should be override if one is using the command line kedro run --params? Or should we not enable this at all?

I agree 3. is confusing and it's hard to even figure out where's the value coming from.

So overall you think there is a need of globals (similar to environment variable) and an environment-awared globals

@bgereke
Copy link

bgereke commented Apr 19, 2023

@noklam I'll see if I can answer your Qs:

Does that mean things like "sparks" should be excluded? Why do you need a global only for parameters and catalog?

Not totally sure what you're asking here, but I hadn't considered the option of templating spark.yml. I'm probably a little skeptical of whether that would be a good idea.

What will be defined in globals.yml and base/globals.yml separately?

I was actually considering options 1 and 2 as mutually exclusive in my comment above. Either globals go in conf/globals.yml or conf/env/globals.yml but never both. Probably I'm not thinking enough about breaking changes for option 1 since globals.yml already lives inside environments. If that's the case, I'd vote for globals NOT being shared across environments but would need to better understand the use cases where this is needed and why those cases can't be solved by creating an environment-specific globals.yml.

But which config should be override if one is using the command line kedro run --params? Or should we not enable this at all?

I also think globals.yml should be overridable by --params by default. In fact, I already create a custom TemplatedConfigLoader to enable this today as my primary use for globals.yml is to integrate with other systems like Airflow and Weight and Biases. Another thing to consider in these cases is whether or not to support nested config in globals.yml since that can complicate integrating with those systems.

@noklam
Copy link
Contributor

noklam commented Apr 19, 2023

Thanks for clarifying, it makes sense as a mutually exclusive option!

Not totally sure what you're asking here, but I hadn't considered the option of templating spark.yml. I'm probably a little skeptical of whether that would be a good idea.

catalog , parameters and logging come out of the box with Kedro Project, but you can also extend the pattern to anything else. You can access your spark config with `config_loader["sparks"] once you register the pattern (this was added a few months ago, so I am not sure if you are using it already)

I also think globals.yml should be overridable by --params by default. In fact, I already create a custom TemplatedConfigLoader to enable this today as my primary use for globals.yml is to integrate with other systems like Airflow and Weight and Biases.

What kind of configuration do you override usually? curious of what goes in the globals.yml and the --param arugment.

Another thing to consider in these cases is whether or not to support nested config in globals.yml since that can complicate integrating with those systems.

Nested config for global will work for OmegaConfigLoader by default.

Currently we have a parameters_globals, which is roughly 2. except it only works for parameters now. The tricky bit is--params will be a leaking global that get pass to all environment.

@astrojuanlu
Copy link
Member

astrojuanlu commented May 23, 2023

Late to the party, just wanted to note that now we're saying that templating does not work for catalog files:

```{note}
Templating currently only works for parameter files, but not for catalog files.
```

However, it does work - I tested with this catalog.yml:

dataset1:
  type: pandas.CSVDataSet
  filepath: ${.metadata.location}
  metadata:
    location: train.csv

dataset2:
  type: ${dataset1.type}
  filepath: ${dataset1.filepath}

More precisely, at the moment it's not allowed to use templated variables from parameter files in catalog files (the scope of this issue).

@ankatiyar
Copy link
Contributor

ankatiyar commented Jul 13, 2023

Summary of Technical Design Discussion on 13/07/2023

  • After the introduction of variable interpolation in OmegaConfigLoader, two issues still stand -
    • Sharing variables across different configs (catalog, parameters etc)
    • Sharing variables across different envs (base, local etc)

Solutions discussed :

  1. Sharing all keys starting with "_" across environment and configs: This might lead to confusion about what is being shared across environments. Not distinct enough from general variable interpolation.
  2. globals.yml with keys starting with "_" passed through settings.py as globals_pattern(similar to TemplatedConfigLoader): More intentional that Solution 1
  3. Solution 2 + custom "globals" resolver - This the solution we will implement -> Add globals functionality to OmegaConfigLoader using custom "globals" resolver #2794

Other Details :

Q: Configurable globals_pattern?
A: Allow users to pass globals_pattern through settings.py but have it be globals_pattern: {"*globals.yml"} by default.

Q: Where would globals.yml go?
A: The globals.yml files would stay within conf/base/ and conf/local/ instead of a top level conf/globals.yml. The common keys in local/globals.yml will overwrite the ones in base/globals.yml

conf/
  |_ base/
    |_ globals.yml
    |_ catalog.yml
  |_ local/
    |_ globals.yml
    |_ catalog.yml

Q: Soft or hard merge common keys?
A: Overwrite keys from base/globals.yml with keys from local/globals.yml for now but allow soft-merging in Kedro 0.19

Q: Allow globals_dict like TemplatedConfigLoader?`
A: No, this use case for dynamically providing global values can be solved using "custom resolvers"

@merelcht
Copy link
Member Author

Closing in favour of the implementation ticket: #2794

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Configuration Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Archived in project
Development

No branches or pull requests

6 participants