
Convert an existing project into a project using Kedro as a Library + Framework #2924

Closed
yetudada opened this issue Aug 14, 2023 · 5 comments
Assignees: noklam
Labels: Issue: Feature Request (New feature or improvement to existing feature) · Stage: Technical Design 🎨 (Ticket needs to undergo technical design before implementation)

Comments

@yetudada
Contributor

yetudada commented Aug 14, 2023

Description

This follows the insights from #2901; one of the issues flagged is how we can make it easier to use Kedro with existing projects. This task focuses on putting yourself in the shoes of our users as they refactor their work to adopt Kedro.

Possible Implementation

The scope of this task involves taking this project and converting it into a project that uses Kedro as a:

  • Library (so try to use the ConfigLoader, DataCatalog and Pipeline)
  • Framework (so try to adopt the framework and project template)

You can work with the post authors, Daisy Wood and Mitchell West. Mitchell additionally added: "Happy to work with the Kedro team for guidance getting our approach up and running if it would be helpful? Next steps for this approach is to hopefully create a standalone python package, so would be great to see what a Kedro implementation would look like."

Outcomes

Share your learnings on what steps you took and what the process was like. It would be great to see a process diagram and possible workflow improvements that could be made.

@yetudada added the Issue: Feature Request label Aug 14, 2023
@noklam self-assigned this Aug 14, 2023
@yetudada added the Stage: Technical Design 🎨 label Aug 21, 2023
@noklam
Contributor

noklam commented Aug 22, 2023

Resource

#2924 (comment)

My refactoring exercise repository (see the original notebook and the refactored notebook)
branches: refactor-with-library and refactor-kedro-pipeline (the latter only partially covers the pre-processing pipeline)

Migration Process

  1. Pure notebook from scratch
  2. Notebook using YAML + DataCatalog
  3. Notebook using OmegaConfigLoader + DataCatalog
  4. Move the notebook to a Kedro pipeline (refactor into nodes and pipelines)
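
A minimal sketch of what steps 2 and 3 can look like inside the notebook, assuming a conf/ folder with base/ and local/ environments and a catalog entry named companies (these names are illustrative, not taken from the repository above):

# A minimal sketch of steps 2 and 3, assuming a catalog.yml entry such as:
#
#   companies:
#     type: pandas.CSVDataSet
#     filepath: data/01_raw/companies.csv

# Step 2: plain YAML + DataCatalog inside the notebook
import yaml
from kedro.io import DataCatalog

with open("conf/base/catalog.yml") as f:
    conf_catalog = yaml.safe_load(f)

catalog = DataCatalog.from_config(conf_catalog)
companies = catalog.load("companies")

# Step 3: swap the manual YAML read for OmegaConfigLoader, which adds
# templating/resolvers and environment merging on top
from kedro.config import OmegaConfigLoader

config_loader = OmegaConfigLoader("conf", base_env="base", default_run_env="local")
catalog = DataCatalog.from_config(config_loader["catalog"])
companies = catalog.load("companies")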

Methodology

I started with a notebook and put it in folder/nested_folder, an unusual location chosen to break as many assumptions as possible. During refactoring, I tried to make one change at a time and re-run the program to ensure the logic wasn't changed, although sometimes this is not possible.

Issues I encountered

  1. Templating with parameters.yml quickly hits a plateau and requires OmegaConfigLoader to template advanced types. I used OmegaConf.register_new_resolver("T", dynamic_injection, replace=True) to handle types like torch.bfloat16 or other arbitrary types (a resolver sketch follows this list).

  2. Using OmegaConfigLoader in a notebook is not easy. You need OmegaConfigLoader(".", base_env="", default_run_env="").

  3. DataCatalog does not take conf_source like ConfigLoader does; instead, a function _convert_paths_to_absolute_posix is used in KedroContext.

  4. AutoModelForCausalLM.from_pretrained takes positional and keyword arguments. How do I specify them in a node?

  5. tokenizer is a global singleton used implicitly in functions (nodes), i.e. it is referenced inside a function but not passed as an argument. In a notebook this is fine because everything runs sequentially, but it is not true when you run it as a Kedro node. Kedro wants "pure Python functions", yet stateful objects are being passed around; for example, LabelEncoder needs to be fit before transform, so changing the order changes the result (a pipeline sketch that makes the encoder explicit follows this list).

  6. %reload_kedro is slow, but I need to run it a lot when refactoring code.

  7. %load_ext autoreload and %autoreload 2 are helpful, but they do not work every time; %reload_kedro is still sometimes needed to refresh nodes.py code (not sure why yet).

  8. Hot reload of parameters.yml and catalog.yml comes in handy during refactoring; are we sure we want to deprecate it? (See Removing KedroContext params and catalog hot-reload.)

  9. Undocumented feature: you can change src to another source directory with source_dir in pyproject.toml.
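
A hedged sketch of items 1 and 2 above. Only the resolver key "T" and the OmegaConfigLoader call come from the notes; dynamic_injection is an illustrative implementation, and the resolver is registered globally with OmegaConf, which OmegaConfigLoader uses under the hood:

import importlib

from omegaconf import OmegaConf
from kedro.config import OmegaConfigLoader

def dynamic_injection(dotted_path: str):
    """Resolve a dotted path such as 'torch.bfloat16' to the actual object."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# Registered globally with OmegaConf, so OmegaConfigLoader can resolve
# parameters.yml entries such as:
#   dtype: "${T:torch.bfloat16}"
OmegaConf.register_new_resolver("T", dynamic_injection, replace=True)

# Item 2: running from a notebook that sits next to the config files means
# pointing conf_source at "." and blanking out the environments
config_loader = OmegaConfigLoader(".", base_env="", default_run_env="")
params = config_loader["parameters"]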
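
And a minimal sketch of one way to make the stateful object in item 5 explicit: the fitted LabelEncoder becomes a node output that downstream nodes receive as an input, instead of a module-level singleton. All dataset and function names here are illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from kedro.pipeline import node, pipeline

def fit_encoder(labels: pd.Series) -> LabelEncoder:
    encoder = LabelEncoder()
    encoder.fit(labels)
    return encoder  # persisted via the catalog, e.g. as a pickle.PickleDataSet

def encode_labels(labels: pd.Series, encoder: LabelEncoder) -> pd.Series:
    return pd.Series(encoder.transform(labels), index=labels.index)

preprocessing = pipeline(
    [
        node(fit_encoder, inputs="raw_labels", outputs="label_encoder"),
        node(encode_labels, inputs=["raw_labels", "label_encoder"], outputs="encoded_labels"),
    ]
)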

@noklam
Copy link
Contributor

noklam commented Aug 23, 2023

Themes that will not be discussed in this TD

Python Functions vs Nodes

Having parameters.yml is the first step, but it doesn't end there. You may want to organise your parameter groups, yet sometimes you find that Kedro forces you to organise them in a particular (unintended) way, i.e. 5. Kedro doesn't support the full Python function signature.

For example

def my_function(*model_kwargs, **kwargs):
    return "something"

Hot-reload

Issues 6, 7 and 8 are related to the larger milestone "Improve the usability and debugging experience for Jupyter notebooks".

@noklam
Contributor

noklam commented Aug 23, 2023

2023-08-23

Discussion of the DataCatalog path-conversion issue

  • Should we move this to DataCatalog.from_config and add a new argument data_source (similar to conf_source)?
  • Or should we change ConfigLoader, because DataCatalog expects a dict and the responsibility for reading config falls to ConfigLoader? It's blurry, because this is not exactly "reading" config; it changes the content of catalog.yml.
  • Or DataCatalog._convert_path or DataCatalog.convert_path. The current workaround looks like this:

from pathlib import Path

from kedro.framework.context.context import _convert_paths_to_absolute_posix  # private helper used by KedroContext; adjust the import path to your Kedro version
from kedro.io import DataCatalog

conf_catalog = config_loader["catalog"]  # config (config_loader is an OmegaConfigLoader)
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog)  # config/catalog
catalog = DataCatalog.from_config(conf_catalog)  # catalog
catalog.load("example_data")

Discussed Solutions

  1. DataCatalog.from_config(data_source=<some_folder>)
    1.1 DataCatalog.from_config(c, path_parameters=("filepath", "file")) - extra parameters to define which keys are treated as paths.
  2. config_loader["catalog"] + an optional argument root to change the default root, i.e. OmegaConfigLoader(root="some_path")
    Raised by @astrojuanlu
  3. Remove the logic from DataCatalog and add a custom resolver instead (see the sketch after this list):
    OmegaConfigLoader(custom_resolvers={"path": lambda p: abspath(...)})  # but it is not backwards compatible
    """
    test_ds:
      type: pandas.CSVDataSet
      filepath: "${path:data/01_raw/thing.csv}"
    """
  4. Somehow the config loader transforms the keys in a backwards-compatible way?
  5. Substitution for the root of the project:
    """
    test_ds:
      type: pandas.CSVDataSet
      filepath: "${root}/data/01_raw/thing.csv"
    """

Notes:

  • 2,3 & 5 requires config_loader

  • No obvious implementation for 4.

  • Mostly favor 1 and it variants

  • @SajidAlamQB and @ankatiyar thinks that configs should be handle by config_loader because it clarifies the responsibility of reading config.

    • Some idea about introducing a new parameters data_source, root etc in ConfigLoader and resolve it automatically with `config_loader["catalog"]
  • @deepyaman, @astrojuanlu ,@noklam and @amandakys favors to change DataCatalog.from_config(new_argument)

    • options include data_source or root
    • @astrojuanlu also suggested that we may even provide specific pattern to merge. ie.e DataCatalog.from_config(file_parameters=["filepath", "path"])
  • @amandakys and @deepyaman point out that if we choose to do it in config_loader, user will be forced to use config loader with DataCatalog.

  • The decision is to change DataCatalog.from_config, we haven't decided what should be the name of the argument.

@noklam
Contributor

noklam commented Aug 31, 2023

Closing as the exercise is finished and we identified a few key issues. Relevant comments are spread across these issues: #1460, #2819 (comment)
