
Kedro to Dataiku

Convert Kedro project to Dataiku project in minutes

This tool deploys a Kedro (>=0.16.5) project on a Dataiku DSS instance without modifying the original Kedro project.

Automatic
Fast
Flexible

Features

Create Dataiku datasets automatically based on Kedro dataset catalog
Convert Kedro nodes into Dataiku recipes
Convert Kedro pipelines into Dataiku flow
Create flow zones in Dataiku project based on Kedro pipeline segmentation
Load all raw input data for the Kedro project into the corresponding datasets in the Dataiku project
Support PySpark through PySpark recipes in Dataiku
Support source code clone from git repository
Enable code editing via Dataiku project library

Adaptation of Kedro project

Because a Dataiku flow is essentially based on pandas datasets, and every Kedro node is converted into a Dataiku recipe, it is recommended that the inputs and outputs of Kedro nodes be dataframes. A Pandas DataFrame, a PySpark DataFrame, or a dictionary of Pandas DataFrames is saved into Dataiku datasets, which can be previewed; other types of inputs/outputs (array, string, dictionary, etc.) are saved in managed folders as pickle objects.
Nodes in Kedro must have distinct function names.
Because a Dataiku recipe must have at least one dataset output, it is recommended that each Kedro node also has at least one dataset output. If a Kedro node has no explicit output, this tool automatically creates a dummy dataset output, which carries no meaningful data.
Local data files must be placed under the data folder.
To create Dataiku flow zones automatically, pipeline segmentation must be defined in context.pipelines. The keys of context.pipelines can be used to define zones. For example, if the pipelines are defined as follows, then ["int","primary","master","modeling"] can be used to define the zones.
```
return {
    "int": int,
    "primary": primary,
    "master": master,
    "modeling": modeling,
    "__default__": (
        int
        + primary
        + master
        + modeling
    ),
}
```
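
The zone names can also be derived from the pipeline dictionary rather than typed by hand. A minimal sketch, assuming the dictionary has the shape shown above; `zone_names` is a hypothetical helper, not part of kedro_to_dataiku, and the string values merely stand in for Kedro pipeline objects:

```python
def zone_names(pipelines):
    """Return the pipeline keys usable as zone names, skipping "__default__"."""
    return [name for name in pipelines if name != "__default__"]


# Placeholder strings stand in for the Kedro Pipeline objects.
pipelines = {
    "int": "int pipeline",
    "primary": "primary pipeline",
    "master": "master pipeline",
    "modeling": "modeling pipeline",
    "__default__": "combined pipeline",
}

zone_list = zone_names(pipelines)
print(zone_list)
```

The resulting list keeps the order in which the pipelines were registered.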

Installation

As the package depends on dataiku, which is an internal module of a DSS instance, it is recommended to install and use this package inside Dataiku DSS.

Install it in a Dataiku DSS code environment like any other pip package, or install it from a Jupyter notebook with

%pip install kedro_to_dataiku

The required packages "dataiku" and "kedro" (>=0.16.5) are the ones that already exist in the Dataiku DSS environment.

Instead of installing the package, one can also upload the kernel file https://github.com/ppvastar/kedro_to_dataiku/blob/main/kedro_to_dataiku/kedro_to_dataiku.py to the Dataiku project library under

lib/python

so that one can

from kedro_to_dataiku import *

in project code.

Usage

Create a managed folder in the Dataiku project, for example "workspace".
Compress (into zip) and upload the whole Kedro project root folder (containing subfolders like data, conf, src, etc.) into the managed folder, and uncompress it there.
Open a Jupyter notebook in Dataiku and follow the steps below:

Initial set up

import dataiku
from kedro_to_dataiku import *

### the absolute path to the Kedro project root folder in the Dataiku DSS filesystem.
kedro_project_path=dataiku.Folder("workspace").get_path()+"[relative path of the kedro project root folder]"
### package_name: name of the folder in "[kedro project root folder]/src/" which contains "nodes" and "pipelines" subfolders
package_name="[Kedro project package name]"
### set the dataset connection (location): "filesystem_managed" or any other established connection (like S3) in Dataiku DSS.
connection="filesystem_managed"
### data format in Dataiku dataset: csv or PARQUET_HIVE
format_type="csv"
### define the recipe type: "python", or "pyspark" to create PySpark recipes.
recipe_type="python" 
### if False, use the source code residing in kedro_project_path+"/src". If True, use the source code imported as a Dataiku python library; this option makes it possible to edit the source code residing in the library.
src_in_lib=False 
### a list of zones to be created. They are from the keys of context.pipelines in the Kedro project. Example: ["int","primary","master","master_ds","modeling"]. Or just keep it as None so that no zones will be created automatically.
zone_list=None
### whether to load the raw input data into Dataiku datasets.
load_data=False
### if some inputs/outputs of the Kedro project are not in Pandas dataframe/Spark dataframe/dictionary of Pandas dataframe format, they will be saved in managed folders instead of Dataiku datasets. It is critical to list them here.
folder_list=None
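
Entries named in folder_list are the ones stored in managed folders as pickle objects. The sketch below only illustrates what that storage amounts to for a plain dictionary, using the standard library. The helper names, the ".pkl" file naming, and the temporary directory standing in for a managed folder path are assumptions for illustration, not the package's implementation:

```python
import pickle
import tempfile
from pathlib import Path


def save_as_pickle(folder_path, name, obj):
    """Write obj to <folder_path>/<name>.pkl and return the file path."""
    target = Path(folder_path) / f"{name}.pkl"
    with open(target, "wb") as handle:
        pickle.dump(obj, handle)
    return target


def load_pickle(path):
    """Read back an object written by save_as_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A temporary directory stands in for the managed folder path.
with tempfile.TemporaryDirectory() as folder_path:
    model_params = {"learning_rate": 0.01, "classes": ["a", "b"]}
    saved = save_as_pickle(folder_path, "example_model", model_params)
    restored = load_pickle(saved)

print(restored == model_params)
```

Unlike a dataset, an object stored this way cannot be previewed as a table; it can only be loaded back in code.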

Fast creation and clean

## fast creation and clean
### one command to create the projects
create_all(kedro_project_path, package_name, connection, recipe_type,folder_list,zone_list,load_data,format_type,src_in_lib)
### one command to clean the projects. Make sure not to delete the managed folder hosting the Kedro project.
delete_all(excluded=["workspace"])
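
create_all is called here with nine positional arguments, so the order is easy to get wrong. One way to guard against that is to keep the settings in a dictionary and unpack them in the order used in this README. This is only a sketch: `CREATE_ALL_ORDER` and `ordered_args` are hypothetical helpers, and the path and package name are placeholders.

```python
# Positional order of create_all as it is called in this README.
CREATE_ALL_ORDER = (
    "kedro_project_path",
    "package_name",
    "connection",
    "recipe_type",
    "folder_list",
    "zone_list",
    "load_data",
    "format_type",
    "src_in_lib",
)


def ordered_args(settings):
    """Return the settings as a tuple in the positional order above."""
    missing = [key for key in CREATE_ALL_ORDER if key not in settings]
    if missing:
        raise KeyError(f"missing settings: {missing}")
    return tuple(settings[key] for key in CREATE_ALL_ORDER)


settings = {
    "kedro_project_path": "/path/to/workspace/my_project",  # placeholder
    "package_name": "my_project",  # placeholder
    "connection": "filesystem_managed",
    "format_type": "csv",
    "recipe_type": "python",
    "src_in_lib": False,
    "zone_list": None,
    "load_data": False,
    "folder_list": None,
}

args = ordered_args(settings)
print(args)
# Inside DSS, with kedro_to_dataiku imported: create_all(*args)
```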

Create the project step by step

### create datasets
input_list,dataset_list=create_datasets(kedro_project_path, package_name,connection,folder_list,format_type,src_in_lib)
### create recipes
create_recipes(kedro_project_path, package_name,folder_list,recipe_type,src_in_lib)
### create zones
create_zones(zone_list,folder_list,kedro_project_path, package_name,src_in_lib)
### load raw input datasets
load_input_datasets(input_list,kedro_project_path, package_name,src_in_lib)

Try some other tools:

### list all datasets
act_on_project(target="dataset",cmd="list")
### clear data in all datasets
act_on_project(target="dataset",cmd="clear",excluded=None)
### delete all datasets
act_on_project(target="dataset",cmd="delete",excluded=None)

### return all recipes
act_on_project(target="recipe",cmd="list")
### delete all recipes
act_on_project(target="recipe",cmd="delete",excluded=None)

### return all zones
act_on_project(target="zone",cmd="list")
### delete all zones except "Default". Caution: do not delete the Default zone, otherwise the project flow will be corrupted.
act_on_project(target="zone",cmd="delete")

### return all folders
act_on_project(target="folder",cmd="list")
### delete all folders
act_on_project(target="folder",cmd="delete",excluded=['workspace'])

In Dataiku, the source code in a managed folder is not editable. To make simple, fast edits to the code within Dataiku after deployment, one can import the source code into the project library (https://doc.dataiku.com/dss/latest/python/reusing-code.html), which is editable. To do this, load (one can use git) the folder in "[kedro project root folder]/src/" that usually contains the "nodes" and "pipelines" subfolders into the lib/python path (keep the module name the same as the Kedro package name), and then set
```
src_in_lib=True 
```
in the previously mentioned steps.

By doing so, the source code (nodes, pipelines, etc.) in the library "lib/python/[package_name]" will be used instead of the original one under "[kedro project root folder]/src/[package_name]".
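
The effect of src_in_lib on where the code is read from can be summarised in a few lines. A minimal sketch of the rule as described here, not the package's own code; `source_package_dir` and its `lib_root` parameter are hypothetical:

```python
import posixpath


def source_package_dir(kedro_project_path, package_name, src_in_lib,
                       lib_root="lib/python"):
    """Return the directory whose code is used, following the src_in_lib rule."""
    if src_in_lib:
        # Editable copy in the Dataiku project library.
        return posixpath.join(lib_root, package_name)
    # Original copy inside the managed folder.
    return posixpath.join(kedro_project_path, "src", package_name)


print(source_package_dir("/data/workspace/iris", "iris", src_in_lib=False))
print(source_package_dir("/data/workspace/iris", "iris", src_in_lib=True))
```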

One can also clone Kedro project from git repository to the managed folder we created previously.

git_url="[git repository URL]"
kedro_project_path_in_git="[relative path of Kedro project root folder on git repository]"
### Keep existing data folder under the kedro_project_path in Dataiku managed folder
clone_from_git(kedro_project_path,git_url,kedro_project_path_in_git)

When code has been edited under path lib/python/[package_name], one may want to copy it back to the managed folder (for further download or other operations). To do this, one can use copy_lib function:

### overwrite=False, module lib/python/[package_name] will be copied to a new folder: [kedro_project_path]/src/[package_name]_lib
copy_lib(kedro_project_path,package_name,overwrite=False)
### overwrite=True, [kedro_project_path]/src/[package_name] in managed folder will just be overwritten with the lib/python/[package_name]
copy_lib(kedro_project_path,package_name,overwrite=True)
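
The two overwrite modes differ only in the destination folder. The sketch below reproduces that behaviour on temporary directories with the standard library. `copy_lib_sketch` is a hypothetical stand-in, not the package's copy_lib, and replacing an already existing target folder is an assumption:

```python
import shutil
import tempfile
from pathlib import Path


def copy_lib_sketch(lib_root, kedro_project_path, package_name, overwrite=False):
    """Copy <lib_root>/<package_name> into <kedro_project_path>/src.

    overwrite=False -> src/<package_name>_lib
    overwrite=True  -> src/<package_name> (the existing folder is replaced)
    """
    source = Path(lib_root) / package_name
    suffix = "" if overwrite else "_lib"
    target = Path(kedro_project_path) / "src" / f"{package_name}{suffix}"
    if target.exists():
        shutil.rmtree(target)  # assumption: an existing target is replaced
    shutil.copytree(source, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Temporary folders stand in for the project library and managed folder.
    lib_root = Path(tmp) / "lib" / "python"
    project = Path(tmp) / "workspace" / "my_project"
    (lib_root / "my_project").mkdir(parents=True)
    (lib_root / "my_project" / "nodes.py").write_text("# edited in the library\n")
    (project / "src" / "my_project").mkdir(parents=True)
    (project / "src" / "my_project" / "nodes.py").write_text("# original\n")

    kept = copy_lib_sketch(lib_root, project, "my_project", overwrite=False)
    kept_name = kept.name
    original_text = (project / "src" / "my_project" / "nodes.py").read_text()

    replaced = copy_lib_sketch(lib_root, project, "my_project", overwrite=True)
    replaced_text = (replaced / "nodes.py").read_text()

print(kept_name)
```

With overwrite=False the original package stays untouched next to the copy, which makes it easy to compare the two before committing to the edited version.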

Example

Take the Kedro 0.17.0 Iris starter project as an example (https://github.com/ppvastar/kedro_example_iris).

Create a "workspace" managed folder in Dataiku, and then upload the project root folder "iris" into "workspace".

Then run code like the following in a Dataiku project Jupyter notebook:

from kedro_to_dataiku import *
import dataiku

kedro_project_path=dataiku.Folder("workspace").get_path()+"/iris"
package_name="iris"
## change the connection according to actual situation
connection="S3_DSS" 
recipe_type="python"
src_in_lib=False
load_data=True
format_type="csv"
folder_list=["example_model","example_predictions"]
zone_list=["ds","de"]

create_all(kedro_project_path, package_name, connection, recipe_type,folder_list,zone_list,load_data,format_type,src_in_lib)

As a result, the Dataiku flow is created and the raw input data is loaded immediately.

The flow is ready for execution. Note that this simple Iris example is not typical, because the final node produces no output (apart from some log information reporting the model accuracy). As a result, the Dataiku flow ends with a dummy dataset called "report_accurary_dummy_output", which is not used.

To clear everything that was just generated:

delete_all(excluded=["workspace"])
