
Rectify "modular pipelines" terminology #2723

Open
astrojuanlu opened this issue Jun 23, 2023 · 23 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Bug Report 🐞 Bug that needs to be fixed Type: Parent Issue

Comments

@astrojuanlu
Member

Description

We're making various distinctions in our documentation about "Pipelines" and "Modular pipelines", for example in the TOC:

[Screenshot of the documentation table of contents, where "Pipelines" and "Modular pipelines" appear as separate entries]

And in our wording:

In many typical Kedro projects, a single (“main”) pipeline increases in complexity as the project evolves. To keep your project fit for purpose, we recommend that you create modular pipelines, which are logically isolated and can be reused.

This wrapper really unlocks the power of modular pipelines.
from kedro.pipeline.modular_pipeline import pipeline

To the point that I believed namespaces were the same as modular pipelines.

However, it turns out that Pipelines and Modular Pipelines are mostly the same thing, and that kedro.pipeline.modular_pipeline.pipeline is not a wrapper over kedro.pipeline.pipeline: they're the same function.

from .modular_pipeline import pipeline
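For illustration, a minimal check (written against the Kedro version current at the time of this issue, so treat the exact module layout as an assumption) that both import paths resolve to the same object:

from kedro.pipeline import pipeline as public_pipeline
from kedro.pipeline.modular_pipeline import pipeline as aliased_pipeline

# Because of the re-export above, both names are bound to the same function object.
assert public_pipeline is aliased_pipeline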

This is also related to this comment that I didn't fully understand back then: #2402 (comment)

Context

It's a key concept for reusability that many users rely on.

Possible Implementation

  • Remove mentions of "modular pipelines" and just talk about "pipelines", some of which are registered (in pipeline_registry.register_pipelines) and some of which aren't.
  • In the pages that currently talk about "modular pipelines", replace the wording with "namespaced pipelines" or just "namespaces".
  • Remove mentions of kedro.pipeline.modular_pipeline.pipeline and just use kedro.pipeline.pipeline everywhere (xref Simplify api hierarchy #712); see the sketch below.
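To make the proposal concrete, a hedged sketch of what a pipeline registry could look like after the change (the project, dataset and node names here are made up for illustration):

# src/my_project/pipeline_registry.py  (hypothetical project)
from kedro.pipeline import Pipeline, node, pipeline  # no modular_pipeline import needed

def clean(raw):  # placeholder node function
    return raw

def register_pipelines() -> dict[str, Pipeline]:
    data_processing = pipeline(
        [node(clean, inputs="raw_data", outputs="clean_data", name="clean")],
        namespace="data_processing",  # a "namespaced pipeline", not a separate concept
        inputs="raw_data",            # keep the free input un-prefixed
    )
    return {
        "data_processing": data_processing,  # a registered pipeline
        "__default__": data_processing,
    }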

Possible Alternatives

There may be less disruptive paths, but I can't think of alternative ways of rectifying the current terminology.

@astrojuanlu astrojuanlu added Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Feature Request New feature or improvement to existing feature labels Jun 23, 2023
@astrojuanlu
Member Author

astrojuanlu commented Jun 25, 2023

Maybe ❓ Move pipeline from modular_pipeline.py to pipeline.py and delete modular_pipeline.py. This would break any from kedro.pipeline.modular_pipeline import pipeline imports, but not from kedro.pipeline import pipeline.

From #1147

@stichbury
Contributor

We had extensive discussions about how to refer to pipelines and did some user research. I've looked for the notes, but because it was a couple of years ago and I think they were on the internal GitHub repo, I cannot find them. @yetudada and @idanov may have them, or @merelcht, but I think we should revisit the discussion given that you've found the usage misleading as it currently stands.

@astrojuanlu
Member Author

I'm happy to have a look at those notes, but regardless, I think this terminology is unnecessarily complicated as it stands today. It gives the impression that there are 3 kinds of pipelines:

  • Just "pipelines"
  • Modular pipelines
  • Registered pipelines

When in fact, there's only one ("pipelines", which under the hood in Kedro are built with the modular_pipeline.pipeline helper), some of which happen to be registered (with pipeline autodiscovery, all of them in most cases).

Maybe let's chat about this next week.

@noklam
Contributor

noklam commented Jul 18, 2023

I would suggest reviewing modular pipelines as a whole.

  1. I had a long discussion on Slack with one of our users. The docs are confusing and even I struggled to understand them.

The example also uses a new pipeline built around a cooking analogy, which is nice, but the problem is that this pipeline does not exist anywhere. This is an advanced feature and one of the more complicated ones; playing with the pipeline and seeing it in Kedro-Viz helps a lot in understanding it.

https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#how-to-use-a-modular-pipeline-with-different-parameters.

  2. Many users have been using tags over namespaces, and currently a namespace basically just prefixes datasets. People prefer a flat structure over many hierarchies. For example, Strip project template #2756 is making this change for pipeline creation. On the other hand, keeping the structure makes the pipeline more isolated and easier to work with for micro-packaging, but I think this is less important. We also need to think about how this will work for universal deployment. What's the best way to organise pipelines easily and translate (compile) a Kedro DAG to other tools?

@MatthiasRoels

I agree with @noklam here that we should review modular pipelines as a whole. For smaller pipelines and projects (where there are fewer pipelines in general), there is no actual issue other than the confusing terminology.

But for projects with lots of pipelines (and pipelines with lots of nodes), I think there is room for improvement in the concept of a kedro pipeline itself. In my view, there are 3 points of view to take into account when designing a solution:

  1. deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how kedro implements the concept!). However, translating a kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node, and the pod needs to start, run, finish and communicate its status to the orchestrator). With many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the best scenario is one where you can map a collection of nodes onto one step in an orchestration tool (however that would work).
  2. pipeline "discovery": kedro-viz is the best tool for the job here! But for big pipelines it might be helpful to see additional structure; collapsing nodes of the same namespace is very helpful, but it would also be useful to have a view of how deployment works out, i.e. which nodes are mapped onto the same step in the orchestrator.
  3. development: to bring additional structure to big pipelines, it is useful to create sub-pipelines to re-use bigger chunks of work. This is more or less what a modular pipeline wants to achieve, but the need to introduce namespaces makes it quite complex to use, I guess?

Anyway, these are just my thoughts on the topic.

@astrojuanlu
Member Author

Thanks @MatthiasRoels for the writeup! About (1), indeed @noklam has some thoughts about this; the granularity issue when deploying Kedro projects is something we want to look into (we have another issue about it, but I don't remember which one it is). For (2), I've seen what Kedro-Viz looks like for huge projects and it indeed needs more work. And (3), what do you mean by sub-pipelines without namespaces?

@MatthiasRoels

MatthiasRoels commented Aug 10, 2023

Cool, I am curious about @noklam's thoughts on this!

(3) what do you mean by sub-pipelines without namespaces?

This is not what I meant. What I wanted to say was that the concept of namespaces might be complex for some users when you just want to make a subset of nodes re-usable as a whole. But I might be wrong on this too!

@astrojuanlu
Member Author

For the record (because I keep losing this link): issue in the private repository that collected research around terminology https://github.com/quantumblacklabs/private-kedro/issues/806

@noklam
Contributor

noklam commented Sep 20, 2023

I need to get better at GitHub notifications; I only saw this in an email today 😅

(3) what do you mean by sub-pipelines without namespaces?
Currently namespaces are mainly used for two purposes:

  1. Kedro-Viz: the ability to filter and collapse pipelines.
  2. To avoid name conflicts: you cannot have two datasets with identical names, so you apply a namespace to add a prefix (a small sketch follows below).

I guess this is what you mean by using sub-pipelines without namespaces?
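For illustration of point 2, a small sketch of the prefixing behaviour (dataset, node and namespace names here are invented):

from kedro.pipeline import node, pipeline

def fit(data):  # placeholder node function
    return data

base = pipeline([node(fit, inputs="features", outputs="model", name="fit")])

# Applying a namespace prefixes dataset and node names, so the same pipeline
# can be reused twice without name conflicts.
train_a = pipeline(base, namespace="team_a")
train_b = pipeline(base, namespace="team_b")

print(sorted((train_a + train_b).all_outputs()))
# ['team_a.model', 'team_b.model']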

@noklam
Contributor

noklam commented Sep 20, 2023

deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how kedro implements the concept!). However, translating a kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node, and the pod needs to start, run, finish and communicate its status to the orchestrator). With many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the best scenario is one where you can map a collection of nodes onto one step in an orchestration tool (however that would work).

IMO, we need to clarify what should be done on the Kedro side versus in a Kedro plugin. Kedro shouldn't map to a specific orchestrator; that should be a plugin's job. The idea of collapsing a modular pipeline/sub-pipeline into an orchestrator node could potentially be done by Kedro. Here is an old idea that was proposed.

The 1-to-1 node mapping is a topic that comes up repeatedly, and at this point I think we can agree it is bad in most cases. The logical first step is 1 pipeline = 1 node; of course it varies a lot per deployment, and it also depends on how you structure your pipeline and how granular it is.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication; https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the run is split up and executed separately. So this orchestration step needs to happen before the pieces are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and only persist the data that is necessary for communication with other orchestrator nodes.
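As a rough illustration of "only persist what crosses the boundary", a sketch using the public Pipeline API (the pipeline contents are invented):

from kedro.pipeline import node, pipeline

def step(x):  # placeholder node function
    return x

sub = pipeline([
    node(step, inputs="raw", outputs="intermediate", name="first"),
    node(step, inputs="intermediate", outputs="features", name="second"),
])

# Free inputs/outputs are what other orchestrator nodes need to see;
# everything else could stay in memory inside the collapsed node.
boundary = sub.inputs() | sub.outputs()       # {'raw', 'features'}
internal = sub.all_outputs() - sub.outputs()  # {'intermediate'}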

@MatthiasRoels

I guess this is what you mean by using sub-pipelines without namespaces?

That’s exactly what I meant!

@MatthiasRoels

MatthiasRoels commented Sep 20, 2023

Kedro shouldn't map to a specific orchestrator, this should be a plugin job.

Absolutely agree! But on the kedro side, some prep work can definitely be done that can be re-used in different plugins.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication; https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

Assuming you're talking about orchestrator nodes, that's exactly what you want to do. IMO, an object store (S3, GCS, MinIO, …) should work fine for the majority of use cases!

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the run is split up and executed separately. So this orchestration step needs to happen before the pieces are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and only persist the data that is necessary for communication with other orchestrator nodes.

That’s not necessarily true. You need to persist at least all datasets required by other orchestration nodes, but that doesn’t mean you don’t need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the data required by a plugin to create the target orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted datasets.

@noklam
Contributor

noklam commented Sep 20, 2023

I would imagine some sort of kedro compile method/cli where you construct the required data to be used by a plugin to create the required orchestrator resource (e.g. Airflow DAG).

I've always wanted to specify which data to persist (or keep in memory) at runtime without touching the catalog; that's for the interactive workflow.

That’s not necessarily true. You need to persist at least all datasets required by other orchestration nodes, but that doesn’t mean you don’t need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the data required by a plugin to create the target orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted datasets.

True, I focused on the minimal data that is required; of course in practice you will want to customise. This is consistent with defaulting to 1 pipeline = 1 orchestrator node, where you may want to further collapse pipelines or you may need more granularity. So this should be the default if no config is given.

@MatthiasRoels

A bit of a braindump here, but think of an easy example where I have a kedro project consisting of 2 pipelines, A and B (and obviously a __default__, which is the sum of the two). If I then, at least conceptually, think about the process of creating a deployment for these two pipelines, the first step is to figure out the order in which you need to run them. There are three options:

  1. A first, then B
  2. B first, then A
  3. A and B in parallel

Actually, there is a fourth option, but that should result in a "compile" error: the scenario where A depends on dataset_1 and produces dataset_2, whereas B depends on dataset_2 and produces dataset_1 (this looks like an artificial scenario, but believe me, if your kedro pipelines are big enough, this can happen).

So kedro core (not a plugin) needs to figure out the correct order of execution as well as the exact kedro command required to run pipeline A resp. B. I think with that info, you can then create specific plugins that create the target deployment; in my specific case, that would be a k8s resource for Argo Workflows. I'm even imagining starting from either a predefined template or a custom one provided by the user.
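A very rough sketch of how that ordering step could work using only the public Pipeline API; there is no existing kedro compile command, so this is just an illustration of the dependency check, including the cyclic case that should fail:

from kedro.pipeline import Pipeline

def pipeline_order(pipelines: dict[str, Pipeline]) -> list[str]:
    """Topologically sort pipelines: A must run before B if B consumes an output of A."""
    deps = {
        name: {
            other
            for other, p in pipelines.items()
            if other != name and pipe.inputs() & p.outputs()
        }
        for name, pipe in pipelines.items()
    }
    order: list[str] = []
    while deps:
        ready = [name for name, upstream in deps.items() if not (upstream - set(order))]
        if not ready:
            raise ValueError(f"Cyclic pipeline dependency: {sorted(deps)}")
        order.extend(sorted(ready))
        for name in ready:
            del deps[name]
    return order

For the A/B example above, the mutual-dependency case would raise the "compile" error described earlier, while independent pipelines simply come out in a deterministic order and could be scheduled in parallel.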

I see two potential starting points:

  1. we start simple and ask the user to provide a list of kedro pipelines to orchestrate, so that we can focus on implementing what I discussed above, as well as some plugins
  2. we focus on how we can split a particular pipeline into the parts that we want to orchestrate (either by tag, namespace, ...). I'm also wondering if we can somehow automatically create that split if we sum two pipelines; wouldn't it be cool if we could collapse A + B into "super-nodes" A and B? This way, the user can just specify one pipeline (__default__) and kedro automatically figures out the different parts to orchestrate.

@astrojuanlu
Member Author

This conversation branched off quite a bit, so I'll try to re-centre the main question:

Can somebody explain to me like I'm 5 years old what makes a "modular pipeline" different from a "pipeline"?

@astrojuanlu
Member Author

astrojuanlu commented Oct 26, 2023

  • "A pipeline" = kedro.pipeline.pipeline.Pipeline, a Python class | "what you see in Kedro-Viz" | "what you can execute with kedro run" | "the combination of different Pipeline objects through +" | ... (possibly many things, this term is abused)
  • "A namespace(d) pipeline" = A pipeline (with the definition above) using namespaces (hence namespace=... in the initializer)
  • "A modular pipeline" = "The output of kedro pipeline create, hence a Python sub-package that is compatible with kedro micropkg package"

And more:

So, if I'm correct, "a pipeline" and "a modular pipeline", depending on context, might be two entirely different categories of things: the former a Python class, the latter a directory structure. Furthermore: a modular pipeline contains a pipeline (kedro.pipeline.pipeline.Pipeline) definition.

And this is where this terminology, in my opinion, falls apart: a "modular pipeline" is not a kedro.pipeline.pipeline.Pipeline "gone modular", it's a wrapper (in the form of a bunch of Python modules with a specific structure) of a kedro.pipeline.pipeline.Pipeline. There is no IS-A (inheritance) relationship between "modular pipeline" and "pipeline", but rather a HAS-A (composition) relationship. A "modular pipeline" is not a pipeline, and it's not even a module because it's a package (a bunch of modules).
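To make the HAS-A relationship concrete, here is roughly what kedro pipeline create data_processing generates (paths and contents reproduced from memory, so details may vary between Kedro versions):

# src/<package_name>/pipelines/data_processing/   <- the "modular pipeline" (a Python package)
#     __init__.py   (re-exports create_pipeline)
#     nodes.py      (plain node functions)
#     pipeline.py   (shown below)

# pipeline.py: the package *contains* a Pipeline; it is not itself one.
from kedro.pipeline import Pipeline, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([])  # the generated stub, where nodes would be listed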

Not that I have better ideas now (and also I don't want to boil the ocean), but I wanted to at least give my interpretation.

@astrojuanlu
Member Author

A bit more insight on modular pipelines https://github.com/quantumblacklabs/private-kedro/issues/752#issuecomment-736680109 (private link)

(@idanov if you consent, you could copy-paste that comment here)

@stichbury
Contributor

I'm removing the documentation label from this as we have a docs task (#1998) to cover improvement of the docs about modular pipelines. This ticket (to my mind) covers the philosophy of how we talk about modular pipelines and the language we want to use in communicating with users. It needs to happen ahead of the docs work and then, when all is agreed, the docs can be overhauled. So #1998 is dependent on this (a "child" if you like), but this isn't a docs ticket.

@astrojuanlu
Member Author

After we merge #3948, I think the only things left are doing one last pass over the Kedro framework docs and reviewing the Kedro-Viz ones.

As far as I understand (after 1 year of chewing on this issue), Kedro-Viz mostly cares about 2 things:

  • The pipeline registry. It doesn't matter if the registry refers to modular pipelines (hence those wrapped in the kedro pipeline create directory structure) or not.
  • Namespaces. These are shown as nested ("sub-pipelines" as @MatthiasRoels called them).

Since Kedro-Viz doesn't really have a user guide, there is not much to review. The word "modular" appears exactly once in the docs:

https://github.com/kedro-org/kedro-viz/blob/1d14055f5a75ba32e6db37f3bb8a24aec71986b8/docs/source/index.md?plain=1#L17

The codebase is another thing though. kedro-org/kedro-viz#1941 refers to "modular pipelines" and so do all the Python classes, but it's actually talking about namespaces. I reckon that doing a Search & Replace might have big, unintended consequences (cc @rashidakanchwala) so it's probably not worth the effort, but at least user-facing documentation should make the concepts crystal clear.

@astrojuanlu astrojuanlu added Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Bug Report 🐞 Bug that needs to be fixed and removed Issue: Feature Request New feature or improvement to existing feature labels Jun 30, 2024
@astrojuanlu
Member Author

astrojuanlu commented Jul 2, 2024

So, long story short:

@astrojuanlu
Member Author

Moving this back to our Inbox so we can re-prioritise.

@astrojuanlu
Member Author

Opened a child issue exclusively about deciding on a better name: #4016
