[Configuration Management] Allow config in config or config in code. #3808

sheldontsen-qb · 2024-04-12T09:48:25Z

sheldontsen-qb
Apr 12, 2024

Description

Was just thinking that it would be great to let users pick between config in code vs config in yaml. What I mean by this is:

input_obj = KedroDataset(...)
output_obj = KedroDataset(...)

@dataclass
class MyParam:
   a = ...
   b = ...
my_param = MyParam()

Pipeline(
   [
      Node(func, ["input", "params:my_param"], "output"),  # currently strings, how it works right now
      Node(func, [input_obj, my_param], output_obj),  # actual objects
   ]
)

It doesn't take many lines of code (I monkeypatched a few files just to check), and it would add another dimension of flexibility to how kedro can be used, giving users the option to decide how they would like to manage their configuration.

Context

Right now I cannot use dataclasses to manage my configuration, which is a reasonable pattern to want. I've seen some people prefer to fully leverage the IDE, while others hold that parameters and code should be clearly separated. I believe kedro should offer this type of flexibility to end users while still prescribing a preferred default.

Possible Implementation

I monkeypatched a few files and got a small prototype working a while back. The change goes where kedro calls catalog.load on the input string. For starters, we can allow the following:

    # requires: from dataclasses import is_dataclass
    for name in node.inputs:
        hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)

        if isinstance(name, str):
            inputs[name] = catalog.load(name)  # original catalog.load(name) logic, then pass to the node
        elif isinstance(name, AbstractDataset):
            inputs[str(name)] = name.load()  # call the dataset load method
        elif is_dataclass(name):
            inputs[str(name)] = name  # pass directly to the node
        elif isinstance(name, object):  # always true, so anything goes through
            inputs[str(name)] = name  # tbh no difference from above unless we want to do something custom
        else:
            # unreachable while the catch-all branch above exists; not sure if we want limiting or not
            raise TypeError("only str, AbstractDataset, dataclass allowed for now?")

Changes can be made here:

kedro/kedro/runner/runner.py

Line 494 in 44817b8

inputs[name] = catalog.load(name)

Note the last elif basically means anything goes through, so I can define any object in any function used by a node and it cleanly gets passed through. Limiting to dataclasses also seems incomplete.

To make it flexible you could move this mapping of object to load_func to settings.py so users can always handle this themselves instead of putting if..elif in the kedro codebase.
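The registry idea above could be sketched roughly as follows. This is a minimal illustration, not kedro's implementation: `INPUT_LOADERS`, `resolve_input`, and `FakeCatalog` are hypothetical names, and the stand-in catalog exists only so the sketch runs without kedro installed.

```python
from dataclasses import dataclass, is_dataclass
from typing import Any, Callable

# Hypothetical registry: each entry pairs a predicate with a loader.
# In the proposal this list would live in settings.py; INPUT_LOADERS
# is an assumed name, not an existing kedro setting.
InputLoader = tuple[Callable[[Any], bool], Callable[[Any, Any], Any]]


class FakeCatalog:
    """Stand-in for a data catalog so the sketch runs without kedro."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def load(self, name: str) -> Any:
        return self._data[name]


INPUT_LOADERS: list[InputLoader] = [
    # Strings keep today's behaviour: look the name up in the catalog.
    (lambda obj: isinstance(obj, str), lambda obj, catalog: catalog.load(obj)),
    # Objects exposing a callable load() are treated like datasets.
    (lambda obj: callable(getattr(obj, "load", None)), lambda obj, catalog: obj.load()),
    # Dataclass instances are passed to the node untouched.
    (lambda obj: is_dataclass(obj) and not isinstance(obj, type), lambda obj, catalog: obj),
]


def resolve_input(obj: Any, catalog: Any) -> Any:
    """Return the value a node should receive for one declared input."""
    for matches, load in INPUT_LOADERS:
        if matches(obj):
            return load(obj, catalog)
    raise TypeError(f"No loader registered for input of type {type(obj).__name__}")


@dataclass
class MyParam:
    a: int = 1
    b: str = "x"


catalog = FakeCatalog({"input": [1, 2, 3]})
loaded = resolve_input("input", catalog)
param = resolve_input(MyParam(), catalog)
```

Because the registry is an ordered list, users could prepend their own entries to override the defaults, and anything without a matching entry fails with a clear error instead of silently passing through.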

Possible Alternatives

N/A

datajoely · 2024-04-12T09:50:54Z

datajoely
Apr 12, 2024
Collaborator

Prototype from @limdauto a few years ago along the same lines: https://web.archive.org/web/20210921071139/https://kedrozerotohero.com/experiments/define-data-catalog-using-python

0 replies

datajoely · 2024-04-12T09:51:51Z

datajoely
Apr 12, 2024
Collaborator

@benhorsburgh also had a similar idea about marshaling / unmarshaling parameters with Pydantic

0 replies

sheldontsen-qb · 2024-04-12T09:53:10Z

sheldontsen-qb
Apr 12, 2024
Author

Yes, this could also at the same time be used to unpack parameters into a kedro node so users have a way to avoid writing:


def my_func(df, p):
    p1 = p.get("p1")
    ...
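For illustration, one way to avoid the `p.get(...)` pattern is to convert the parameters dictionary into a dataclass before it reaches the node function. This is only a sketch under assumptions: `MyParams`, its fields, and `from_mapping` are hypothetical, and the conversion step is done by hand here rather than by kedro.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class MyParams:
    # Hypothetical parameter names, mirroring "p1" in the snippet above.
    p1: float
    p2: int = 10


def from_mapping(cls: type, raw: dict[str, Any]):
    """Build a dataclass from a dict, rejecting keys the class does not declare."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"Unexpected parameters: {sorted(unknown)}")
    return cls(**raw)


def my_func(df: list[float], p: MyParams) -> list[float]:
    # Attribute access replaces p.get("p1"), so a misspelled name raises
    # AttributeError instead of silently returning None.
    return [x * p.p1 + p.p2 for x in df]


params = from_mapping(MyParams, {"p1": 2.0})
result = my_func([1.0, 2.0], params)
```

The node body then works with typed attributes, which is also what lets an IDE autocomplete the parameter names.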

0 replies

astrojuanlu · 2024-04-12T12:56:20Z

astrojuanlu
Apr 12, 2024
Maintainer

Thanks for opening this @sheldontsen-qb ! This has come up a few times so it's good to start collecting some use cases. I'm moving this to Discussions to also link it to a similar one we had not so long ago #3788

0 replies

astrojuanlu · 2024-04-12T12:56:40Z

astrojuanlu
Apr 12, 2024
Maintainer

Also reposting this https://sre.google/workbook/configuration-specifics/ from @datajoely

0 replies

sheldontsen-qb · 2024-04-22T08:28:19Z

sheldontsen-qb
Apr 22, 2024
Author

Hello folks, any update on this?

1 reply

astrojuanlu May 7, 2024
Maintainer

Hi @sheldontsen-qb, I gave this a very quick look.

We have an ongoing workstream on redesigning the DataCatalog https://github.com/kedro-org/kedro/milestone/12 so there's a chance we will be able to look into this ~this year.

I reckon that it's not a very exciting timeframe though. So I quickly considered what workarounds exist.

Initially I thought that you could maybe get away by creating a custom *ConfigLoader, but it's not really the case. In the end, the DataCatalog is going to be instantiated from a dictionary:

kedro/kedro/framework/context/context.py

Lines 223 to 232 in 9da8b37

    
    conf_catalog = self.config_loader["catalog"]
    # turn relative paths in conf_catalog into absolute paths
    # before initializing the catalog
    conf_catalog = _convert_paths_to_absolute_posix(
        project_path=self.project_path, conf_dictionary=conf_catalog
    )
    conf_creds = self._get_config_credentials()
    catalog: DataCatalog = settings.DATA_CATALOG_CLASS.from_config(
        catalog=conf_catalog,

And then the "parameters" configuration gets added:

kedro/kedro/framework/context/context.py

Lines 238 to 239 in 9da8b37

    
    feed_dict = self._get_feed_dict()
    catalog.add_feed_dict(feed_dict)

So it looks like users could potentially write their own *DataCatalog for this (notice that settings.py allows you to override the DATA_CATALOG_CLASS) but it's not very clear to me how. Also, the DataCatalog doesn't have a formal interface, so it's not clear which methods should be implemented and what they should return without scrutinizing the codebase. But again, if we intend to redesign it (see first sentence) it might not make sense for us to document the current interface.

At this point I'd need some help from the engineering team. We will try to give this a closer look ~soon.

Notice that this use case has already been requested, with other names: #3788

And finally, let me just say that I'm not the only one who thinks that the current design of parameters is a bit weird. As I said in #3732 (comment):

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

Calychas · 2024-05-17T10:45:42Z

Calychas
May 17, 2024

Maybe also worth looking into OmegaConf's structured_config, although not sure of its usability in this case

0 replies

noklam · 2024-08-14T15:57:36Z

noklam
Aug 14, 2024
Collaborator

#4085

This isn't directly related to the topic but I think it's interesting. Essentially it tries to merge config in YAML and config in code. You could probably extend this via a hook or something similar that converts a dataclass into the dictionary form.

from __future__ import annotations

from typing import Any, Callable

from kedro.config import OmegaConfigLoader

# CATALOG is assumed to be a dict of dataset definitions declared in code
# (it is not shown in this snippet).


class CustomConfigLoader(OmegaConfigLoader):

    def __init__(
        self,
        conf_source: str,
        env: str | None = None,
        runtime_params: dict[str, Any] | None = None,
        *,
        config_patterns: dict[str, list[str]] | None = None,
        base_env: str | None = None,
        default_run_env: str | None = None,
        custom_resolvers: dict[str, Callable] | None = None,
        merge_strategy: dict[str, str] | None = None,
    ):
        super().__init__(
            conf_source=conf_source,
            env=env,
            runtime_params=runtime_params,
            config_patterns=config_patterns,
            base_env=base_env,
            default_run_env=default_run_env,
            custom_resolvers=custom_resolvers,
            merge_strategy=merge_strategy,
        )
        # Entries defined in code override the YAML catalog on key conflicts.
        self["catalog"] = {**self["catalog"], **CATALOG}
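The "convert dataclass into the dictionary form" step could look roughly like the sketch below, using only the standard library. `ModelOptions`, `CodeParams`, `deep_merge`, and the sample `yaml_params` are all hypothetical; the YAML side is represented by a plain dict standing in for whatever a config loader returns. As in the snippet above, values defined in code win on conflicts.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ModelOptions:
    test_size: float = 0.2
    random_state: int = 3


@dataclass
class CodeParams:
    # Hypothetical parameters defined in code rather than in YAML.
    model_options: ModelOptions = field(default_factory=ModelOptions)


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for what a config loader would return for "parameters".
yaml_params = {"model_options": {"test_size": 0.3, "features": ["a", "b"]}}

# asdict() recurses into nested dataclasses, producing plain dictionaries.
code_params = asdict(CodeParams())
merged = deep_merge(yaml_params, code_params)
```

A shallow `{**a, **b}` merge would replace the whole `model_options` entry, so a recursive merge is used here to keep YAML-only keys such as `features`.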

0 replies
