
Enable Kedro-Viz functionality through a notebook, without Kedro Framework. #1459

Closed

NeroOkwa opened this issue Jul 21, 2023 · 7 comments

@NeroOkwa
Contributor

NeroOkwa commented Jul 21, 2023

Description

Make it possible to use Kedro-Viz (pipeline visualisation and experiment tracking) from a notebook, without the Kedro framework.

For example, I would be able to build a pipeline in a notebook with nodes that output metrics, then call %run_viz, and Kedro-Viz would open with a view of my pipeline and experiments.

Context

Currently, Kedro-Viz is tightly coupled to the Kedro framework, making it impossible for non-Kedro users to use Kedro-Viz. This was highlighted as a pain point in the experiment tracking user research:

"In this case if I really like experiment tracking I might not consider using it if it isn't a kedro project... I am not sure it is a good direction to go with it being completely integrated, especially if there is a new thing like Mlflow"

Secondly, from the non-technical user research in #1280 we discovered a group of 'low-code' users who only use notebooks (e.g. data analysts, junior data scientists, researchers). This is a sizeable group (estimated at 70%) within data teams. Providing notebook access to Kedro-Viz would make it easier for these users to adopt it.

What's happening?

If I wanted to use Kedro-Viz in a notebook without the Kedro framework, this would not be possible. So if I had a setup like this:

my-project
├── my-notebook.ipynb
├── Customer-Churn-Records.csv
├── parameters.yml
├── catalog.yml
└── requirements.txt

Then I’d never be able to see a pipeline visualisation, even if I had:
requirements.txt

kedro==0.18.11
kedro-viz==6.3.3
kedro-datasets[pandas.CSVDataSet]~=1.1

my-notebook.ipynb

from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog
from kedro.pipeline import node, pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from typing import Dict
import logging
import pandas as pd


### Insert something new to load catalog.yml and parameters.yml
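# --- A possible way to fill this gap today, sketched with existing Kedro APIs ---
# (DataCatalog.from_config and add_feed_dict). The file names match the project
# layout above; everything else here is illustrative rather than an agreed design.
import yaml

with open("catalog.yml") as f:
    catalog_conf = yaml.safe_load(f)
with open("parameters.yml") as f:
    params = yaml.safe_load(f)

catalog = DataCatalog.from_config(catalog_conf)
# Mirror how the framework exposes parameters: one "params:<name>" entry per key,
# plus the full dictionary under "parameters".
catalog.add_feed_dict({f"params:{k}": v for k, v in params.items()}, replace=True)
catalog.add_feed_dict({"parameters": params}, replace=True)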


def preprocess_data(data: pd.DataFrame) -> pd.DataFrame:
    data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
    le = LabelEncoder()
    data['Gender'] = le.fit_transform(data['Gender'])
    data = pd.get_dummies(data, columns=['Geography', 'Card Type'])
    return data


def split_data(data: pd.DataFrame, test_size: float, random_state: int) -> Dict:
    X = data.drop(columns='Exited')
    y = data['Exited']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return dict(train=(X_train, y_train), test=(X_test, y_test))


def train_model(train: Dict, random_state: int) -> RandomForestClassifier:
    X_train, y_train = train['train']
    rf_clf = RandomForestClassifier(random_state=random_state)
    rf_clf.fit(X_train, y_train)
    return rf_clf


def evaluate_model(model: RandomForestClassifier, test: Dict) -> None:
    X_test, y_test = test['test']
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    log = logging.getLogger(__name__)
    log.info("Model Accuracy: %s", accuracy)
    log.info("Confusion Matrix: \n%s", confusion_mat)
    log.info("Classification Report: \n%s", class_report)


my_pipeline = pipeline([
    node(preprocess_data, "customers", "preprocessed_customers"),
    node(split_data, ["preprocessed_customers", "params:test_size", "params:random_state"], "split_data"),
    node(train_model, ["split_data", "params:random_state"], "rf_model"),
    node(evaluate_model, ["rf_model", "split_data"], None),
])

%run_viz my_pipeline

It should be possible to see the following in another cell in my Jupyter notebook, with the option to open it up in another tab:

(Screenshot: the expected Kedro-Viz flowchart of the pipeline, rendered inline in the notebook.)

Outcome

A user will be able to use Kedro-Viz from a notebook, without needing to set up a Kedro project.


@datajoely
Contributor

I love this!

@yetudada
Contributor

> I love this!

What do you love about this? 😄

@datajoely
Contributor

I have two thoughts:

  1. This is a neat way of making Kedro Viz useful to people who don't want the complexity of the IDE and may be a stepping stone to getting people into that space.

  2. The second point is something I know others have mentioned before: it annoys me that we need to load a valid Kedro project, with all of its imports and dependencies, just to visualise the pipeline flow. Kedro-Viz (in my mind) should load instantly; you shouldn't have to wait for Spark to spin up, especially because you can't run the pipeline anyway. I've long thought Kedro should be able to create a session lazily, so you can read the pipeline structure for Viz cheaply without incurring the other costs.

@astrojuanlu
Member

Idea: a kedro-openlineage plugin that emits static OpenLineage metadata events, either in ndjson format or to an HTTP endpoint, which are then consumed by Kedro Viz. This is possible with openlineage-python 1.0, released yesterday.
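A rough sketch of what such a plugin might emit, assuming the openlineage-python 1.0 client classes (RunEvent, Run, Job, Dataset, Serde); the namespace, producer string and output file name below are illustrative, not an agreed design:

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState
from openlineage.client.serde import Serde


def node_to_event(node, namespace="my-notebook"):
    # One static COMPLETE event per Kedro node, describing only its inputs and outputs.
    return RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace=namespace, name=node.name),
        producer="kedro-openlineage",
        inputs=[Dataset(namespace=namespace, name=i) for i in node.inputs],
        outputs=[Dataset(namespace=namespace, name=o) for o in node.outputs],
    )


# Write ndjson that Kedro-Viz (or any other OpenLineage consumer) could later read.
with open("lineage.ndjson", "w") as f:
    for n in my_pipeline.nodes:
        f.write(Serde.to_json(node_to_event(n)) + "\n")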

@datajoely
Contributor

100000%. There are also lots of LF AI projects there; we should definitely do this.

@datajoely
Contributor

This thread on Slack shows a user wanting to merge Viz from three different Kedro projects that can't exist side by side because they have conflicting dependencies. Kedro-Viz doesn't need to run these projects; it just needs to visualise the pipeline structure:
https://linen-slack.kedro.org/t/14142730/hi-everyone-is-it-possible-to-combine-multiple-kedro-project#d84d8f45-eecc-4c1b-b639-4556c1edcd76

@noklam
Contributor

noklam commented Mar 25, 2024

I realised I didn't leave a comment here. I created this last year: https://github.com/noklam/kedro-viz-lite. I don't actually remember whether I succeeded in the end; the logic is mostly in https://github.com/noklam/kedro-viz-lite/blob/main/kedro_viz_lite/core.py.

This led to my subsequent proposals for kedro viz build and the Kedro-Viz GitHub Pages deployment.

My use case for this is exploring pipeline structure, particularly when I need to confirm that a pipeline works as expected with namespaces. The alternative is creating a full-blown Kedro project, which is a lot of boilerplate. All I care about is the DAG, and that should be enough as long as I have the DataCatalog and the Pipeline. It's also because kedro viz is quite slow to start up, which makes it hard when I just want to debug quickly (--reload sometimes breaks completely if I have an incomplete Kedro project).
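For illustration, a minimal sketch of the kind of namespace check I mean, using the kedro.pipeline helper (the dataset and namespace names are made up):

from kedro.pipeline import node, pipeline


def clean(df):
    return df.dropna()


base = pipeline([node(clean, "raw", "cleaned", name="clean")])

# Re-using the same structure under a namespace prefixes the free inputs and
# outputs; checking the resulting dataset names in Viz is exactly the point.
namespaced = pipeline(base, namespace="churn")
print(namespaced.all_inputs())   # expected: {'churn.raw'}
print(namespaced.all_outputs())  # expected: {'churn.cleaned'}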

If this add a bit context, I was writing https://noklam.github.io/blog/posts/understand_namespace/2023-09-26-understand-kedro-namespace-pipeline.html when I think about this.

@kedro-org kedro-org locked and limited conversation to collaborators Mar 27, 2024
@rashidakanchwala rashidakanchwala converted this issue into discussion #1833 Mar 27, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
