
Convert an existing project into a project using Kedro as a Library + Framework #2924

Closed
yetudada opened this issue Aug 14, 2023 · 5 comments
Assignees: noklam
Labels: Issue: Feature Request (New feature or improvement to existing feature) · Stage: Technical Design 🎨 (Ticket needs to undergo technical design before implementation)

Comments

@yetudada
Contributor

yetudada commented Aug 14, 2023

Description

This follows the insights from #2901; one of the issues flagged is how we can make it easier to use Kedro with existing projects. This task focuses on putting yourself in the shoes of our users as they refactor their work to adopt Kedro.

Possible Implementation

The scope of this task involves taking this project and converting it into a project that uses Kedro as a:

  • Library (so try to use the ConfigLoader, DataCatalog and Pipeline)
  • Framework (so try to adopt the framework and project template)

You can work with the post authors, Daisy Wood and Mitchell West. Mitchell additionally added: "Happy to work with the Kedro team for guidance getting our approach up and running if it would be helpful? Next steps for this approach is to hopefully create a standalone python package, so would be great to see what a Kedro implementation would look like."

Outcomes

Share your learnings on what steps you took and what the process was like. It would be great to see a process diagram and possible workflow improvements that could be made.

@yetudada added the Issue: Feature Request label Aug 14, 2023
@noklam self-assigned this Aug 14, 2023
@yetudada added the Stage: Technical Design 🎨 label Aug 21, 2023
@noklam
Contributor

noklam commented Aug 22, 2023

Resource

#2924 (comment)

My refactoring exercise repository (see the original notebook and the refactored notebook)
branches: refactor-with-library and refactor-kedro-pipeline (the latter only partially covers the pre-processing pipeline)

Migration Process

  1. Pure notebook from scratch
  2. Notebook using YAML + DataCatalog
  3. Notebook using OmegaConfigLoader + DataCatalog
  4. Move the notebook to a Kedro pipeline (refactor into nodes and pipelines)
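
A minimal sketch of what steps 2 and 3 can look like inside the notebook, assuming a conf/ folder with base/ and local/ environments and a catalog entry named companies (these names are illustrative, not taken from the repository above):

# A minimal sketch of steps 2 and 3, assuming a catalog.yml entry such as:
#
#   companies:
#     type: pandas.CSVDataSet
#     filepath: data/01_raw/companies.csv

# Step 2: plain YAML + DataCatalog inside the notebook
import yaml
from kedro.io import DataCatalog

with open("conf/base/catalog.yml") as f:
    conf_catalog = yaml.safe_load(f)

catalog = DataCatalog.from_config(conf_catalog)
companies = catalog.load("companies")

# Step 3: swap the manual YAML read for OmegaConfigLoader, which adds
# templating/resolvers and environment merging on top
from kedro.config import OmegaConfigLoader

config_loader = OmegaConfigLoader("conf", base_env="base", default_run_env="local")
catalog = DataCatalog.from_config(config_loader["catalog"])
companies = catalog.load("companies")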

Methodology

I started with a notebook and put it in folder/nested_folder, an unusual location chosen to break as many assumptions as possible. During refactoring, I tried to make one change at a time and re-run the program to ensure the logic wasn't changed, although sometimes this is not possible.

Issues I encountered

  1. Templating with parameters.yml quickly hits a plateau and requires OmegaConfigLoader to template advanced types. I used OmegaConf.register_new_resolver("T", dynamic_injection, replace=True) to handle types like torch.bfloat16 or other arbitrary types (a resolver sketch follows this list).

  2. Using OmegaConfigLoader in a notebook is not easy. You need OmegaConfigLoader(".", base_env="", default_run_env="").

  3. DataCatalog does not take conf_source like ConfigLoader does; instead, a function _convert_paths_to_absolute_posix is used in KedroContext.

  4. AutoModelForCausalLM.from_pretrained takes positional and keyword arguments. How do I specify them in a node?

  5. tokenizer is a global singleton used implicitly in functions (nodes), i.e. it is referenced inside a function but not passed as an argument. In a notebook this is fine because everything runs sequentially, but it is not true when you run it as a Kedro node. Kedro wants "pure Python functions", yet stateful objects are being passed around; for example, LabelEncoder needs to be fit before transform, so changing the order changes the result (a pipeline sketch that makes the encoder explicit follows this list).

  6. %reload_kedro is slow, but I need to run it a lot when refactoring code.

  7. %load_ext autoreload and %autoreload 2 are helpful, but they do not work every time; %reload_kedro is still sometimes needed to refresh nodes.py code (not sure why yet).

  8. Hot reload of parameters.yml and catalog.yml comes in handy during refactoring; are we sure we want to deprecate it? (See Removing KedroContext params and catalog hot-reload.)

  9. Undocumented feature: you can change src to another source directory with source_dir in pyproject.toml.
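
A hedged sketch of items 1 and 2 above. Only the resolver key "T" and the OmegaConfigLoader call come from the notes; dynamic_injection is an illustrative implementation, and the resolver is registered globally with OmegaConf, which OmegaConfigLoader uses under the hood:

import importlib

from omegaconf import OmegaConf
from kedro.config import OmegaConfigLoader

def dynamic_injection(dotted_path: str):
    """Resolve a dotted path such as 'torch.bfloat16' to the actual object."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# Registered globally with OmegaConf, so OmegaConfigLoader can resolve
# parameters.yml entries such as:
#   dtype: "${T:torch.bfloat16}"
OmegaConf.register_new_resolver("T", dynamic_injection, replace=True)

# Item 2: running from a notebook that sits next to the config files means
# pointing conf_source at "." and blanking out the environments
config_loader = OmegaConfigLoader(".", base_env="", default_run_env="")
params = config_loader["parameters"]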
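
And a minimal sketch of one way to make the stateful object in item 5 explicit: the fitted LabelEncoder becomes a node output that downstream nodes receive as an input, instead of a module-level singleton. All dataset and function names here are illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from kedro.pipeline import node, pipeline

def fit_encoder(labels: pd.Series) -> LabelEncoder:
    encoder = LabelEncoder()
    encoder.fit(labels)
    return encoder  # persisted via the catalog, e.g. as a pickle.PickleDataSet

def encode_labels(labels: pd.Series, encoder: LabelEncoder) -> pd.Series:
    return pd.Series(encoder.transform(labels), index=labels.index)

preprocessing = pipeline(
    [
        node(fit_encoder, inputs="raw_labels", outputs="label_encoder"),
        node(encode_labels, inputs=["raw_labels", "label_encoder"], outputs="encoded_labels"),
    ]
)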

@noklam
Copy link
Contributor

noklam commented Aug 23, 2023

Themes that will not be discussed in this TD

Python Functions vs Nodes

Having parameters.yml is the first step, but it doesn't end there. You may want to organise your parameter groups, yet sometimes you find that Kedro forces you to organise them in a particular (unintended) way, i.e. 5. Kedro doesn't support the full Python function signature.

For example

def my_function(*model_kwargs, **kwargs):
    return "something"

Hot-reload

Issues 6, 7 and 8 are related to the larger milestone "Improve the usability and debugging experience for Jupyter notebooks".

@noklam
Contributor

noklam commented Aug 23, 2023

2023-08-23

Discussion of the DataCatalog path-conversion issue

  • Should we move this to DataCatalog.from_config and add a new argument data_source (similar to conf_source)?
  • Or should we change ConfigLoader, because DataCatalog expects a dict and the responsibility for reading config falls to ConfigLoader? It's blurry, because this is not exactly "reading" config; it changes the content of catalog.yml.
  • Or DataCatalog._convert_path or DataCatalog.convert_path. The current workaround looks like this:

from pathlib import Path

from kedro.framework.context.context import _convert_paths_to_absolute_posix  # private helper used by KedroContext; adjust the import path to your Kedro version
from kedro.io import DataCatalog

conf_catalog = config_loader["catalog"]  # config (config_loader is an OmegaConfigLoader)
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog)  # config/catalog
catalog = DataCatalog.from_config(conf_catalog)  # catalog
catalog.load("example_data")

Discussed Solutions

  1. DataCatalog.from_config(data_source=<some_folder>)
    1.1 DataCatalog.from_config(c, path_parameters=("filepath", "file")) - extra parameters to define which keys are treated as paths.
  2. config_loader["catalog"] + an optional argument root to change the default root, i.e. OmegaConfigLoader(root="some_path")
    Raised by @astrojuanlu
  3. Remove the logic from DataCatalog and add a custom resolver instead (see the sketch after this list):
    OmegaConfigLoader(custom_resolvers={"path": lambda p: abspath(...)})  # but it is not backwards compatible
    """
    test_ds:
      type: pandas.CSVDataSet
      filepath: "${path:data/01_raw/thing.csv}"
    """
  4. Somehow the config loader transforms the keys in a backwards-compatible way?
  5. Substitution for the root of the project:
    """
    test_ds:
      type: pandas.CSVDataSet
      filepath: "${root}/data/01_raw/thing.csv"
    """

Notes:

  • 2,3 & 5 requires config_loader

  • No obvious implementation for 4.

  • Mostly favor 1 and it variants

  • @SajidAlamQB and @ankatiyar thinks that configs should be handle by config_loader because it clarifies the responsibility of reading config.

    • Some idea about introducing a new parameters data_source, root etc in ConfigLoader and resolve it automatically with `config_loader["catalog"]
  • @deepyaman, @astrojuanlu ,@noklam and @amandakys favors to change DataCatalog.from_config(new_argument)

    • options include data_source or root
    • @astrojuanlu also suggested that we may even provide specific pattern to merge. ie.e DataCatalog.from_config(file_parameters=["filepath", "path"])
  • @amandakys and @deepyaman point out that if we choose to do it in config_loader, user will be forced to use config loader with DataCatalog.

  • The decision is to change DataCatalog.from_config, we haven't decided what should be the name of the argument.

@noklam
Contributor

noklam commented Aug 31, 2023

Closing as the exercise is finished and we identified a few key issues. Relevant comments are spread across these issues: #1460, #2819 (comment)
