
Rectify "modular pipelines" terminology #2723

Open
astrojuanlu opened this issue Jun 23, 2023 · 23 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Bug Report 🐞 Bug that needs to be fixed Type: Parent Issue

Comments

@astrojuanlu
Member

Description

We're making various distinctions in our documentation about "Pipelines" and "Modular pipelines", for example in the TOC:

[Screenshot of the documentation table of contents, where "Pipelines" and "Modular pipelines" appear as separate entries]

And in our wording:

In many typical Kedro projects, a single (“main”) pipeline increases in complexity as the project evolves. To keep your project fit for purpose, we recommend that you create modular pipelines, which are logically isolated and can be reused.

This wrapper really unlocks the power of modular pipelines.
from kedro.pipeline.modular_pipeline import pipeline

To the point that I believed namespaces were the same as modular pipelines.

However, it turns out that Pipelines and Modular Pipelines are mostly the same thing, and that kedro.pipeline.modular_pipeline.pipeline is not a wrapper over kedro.pipeline.pipeline: they're the same function.

from .modular_pipeline import pipeline
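For illustration, a minimal check (written against the Kedro version current at the time of this issue, so treat the exact module layout as an assumption) that both import paths resolve to the same object:

from kedro.pipeline import pipeline as public_pipeline
from kedro.pipeline.modular_pipeline import pipeline as aliased_pipeline

# Because of the re-export above, both names are bound to the same function object.
assert public_pipeline is aliased_pipeline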

This is also related to this comment that I didn't fully understand back then: #2402 (comment)

Context

It's a key concept for reusability that many users rely on.

Possible Implementation

  • Remove mentions of "modular pipelines" and just talk about "pipelines", some of which are registered (in pipeline_registry.register_pipelines) and some of which aren't.
  • In the pages that currently talk about "modular pipelines", replace the wording with "namespaced pipelines" or just "namespaces".
  • Remove mentions of kedro.pipeline.modular_pipeline.pipeline and just use kedro.pipeline.pipeline everywhere (xref Simplify api hierarchy #712); see the sketch below.
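To make the proposal concrete, a hedged sketch of what a pipeline registry could look like after the change (the project, dataset and node names here are made up for illustration):

# src/my_project/pipeline_registry.py  (hypothetical project)
from kedro.pipeline import Pipeline, node, pipeline  # no modular_pipeline import needed

def clean(raw):  # placeholder node function
    return raw

def register_pipelines() -> dict[str, Pipeline]:
    data_processing = pipeline(
        [node(clean, inputs="raw_data", outputs="clean_data", name="clean")],
        namespace="data_processing",  # a "namespaced pipeline", not a separate concept
        inputs="raw_data",            # keep the free input un-prefixed
    )
    return {
        "data_processing": data_processing,  # a registered pipeline
        "__default__": data_processing,
    }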

Possible Alternatives

There may be less disruptive paths, but I can't think of alternative ways of rectifying the current terminology.

@astrojuanlu astrojuanlu added Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Feature Request New feature or improvement to existing feature labels Jun 23, 2023
@astrojuanlu
Member Author

astrojuanlu commented Jun 25, 2023

Maybe ❓ Move pipeline from modular_pipeline.py to pipeline.py and delete modular_pipeline.py. This would break any from kedro.pipeline.modular_pipeline import pipeline imports, but not from kedro.pipeline import pipeline.

From #1147

@stichbury
Contributor

We had extensive discussions about how to refer to pipelines and did some user research. I've looked for the notes, but because it was a couple of years ago and I think they were on the internal GitHub repo, I cannot find them. @yetudada and @idanov may have them, or @merelcht, but I think we should revisit the discussion given that you've found the usage misleading as it currently stands.

@astrojuanlu
Member Author

I'm happy to have a look at those notes, but regardless, I think this terminology is unnecessarily complicated as it stands today. It gives the impression that there are 3 kinds of pipelines:

  • Just "pipelines"
  • Modular pipelines
  • Registered pipelines

When in fact, there's only one ("pipelines", which under the hood in Kedro are built with the modular_pipeline.pipeline helper), some of which happen to be registered (with pipeline autodiscovery, all of them in most cases).

Maybe let's chat about this next week.

@noklam
Contributor

noklam commented Jul 18, 2023

I would suggest reviewing modular pipelines as a whole.

  1. I had a long discussion on Slack with one of our users. The docs are confusing and even I struggled to understand them.

The example also uses a new pipeline built around a cooking analogy, which is nice, but the problem is that this pipeline does not exist anywhere. This is an advanced feature and one of the more complicated ones; playing with the pipeline and seeing it in Kedro-Viz helps a lot in understanding it.

https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#how-to-use-a-modular-pipeline-with-different-parameters.

  2. Many users have been using tags over namespaces, and currently a namespace basically just prefixes datasets. People prefer a flat structure over many hierarchies. For example, Strip project template #2756 is making this change for pipeline creation. On the other hand, keeping the structure makes the pipeline more isolated and easier to work with for micro-packaging, but I think this is less important. We also need to think about how this will work for universal deployment. What's the best way to organise pipelines easily and translate (compile) a Kedro DAG to other tools?

@MatthiasRoels

I agree with @noklam here that we should review modular pipelines as a whole. For smaller pipelines and projects (where there are fewer pipelines in general), there is no actual issue other than the confusing terminology.

But for projects with lots of pipelines (and pipelines with lots of nodes), I think there is room for improvement in the concept of a kedro pipeline itself. In my view, there are 3 points of view to take into account when designing a solution:

  1. deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how kedro implements the concept!). However, translating a kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node, and the pod needs to start, run, finish and communicate its status to the orchestrator). With many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the best scenario is one where you can map a collection of nodes onto one step in an orchestration tool (however that would work).
  2. pipeline "discovery": kedro-viz is the best tool for the job here! But for big pipelines it might be helpful to see additional structure; collapsing nodes of the same namespace is very helpful, but it would also be useful to have a view of how deployment works out, i.e. which nodes are mapped onto the same step in the orchestrator.
  3. development: to bring additional structure to big pipelines, it is useful to create sub-pipelines to re-use bigger chunks of work. This is more or less what a modular pipeline wants to achieve, but the need to introduce namespaces makes it quite complex to use, I guess?

Anyway, these are just my thoughts on the topic.

@astrojuanlu
Member Author

Thanks @MatthiasRoels for the writeup! About (1), indeed @noklam has some thoughts about this; the granularity issue when deploying Kedro projects is something we want to look into (we have another issue about it, but I don't remember which one it is). For (2), I've seen what Kedro-Viz looks like for huge projects and it indeed needs more work. And (3), what do you mean by sub-pipelines without namespaces?

@MatthiasRoels

MatthiasRoels commented Aug 10, 2023

Cool, I am curious about @noklam's thoughts on this!

(3) what do you mean by sub-pipelines without namespaces?

This is not what I meant. What I wanted to say was that the concept of namespaces might be complex for some users when you just want to make a subset of nodes re-usable as a whole. But I might be wrong on this too!

@astrojuanlu
Member Author

For the record (because I keep losing this link): issue in the private repository that collected research around terminology https://github.com/quantumblacklabs/private-kedro/issues/806

@noklam
Contributor

noklam commented Sep 20, 2023

I need to get better at GitHub notifications; I only saw this in an email today 😅

(3) what do you mean by sub-pipelines without namespaces?
Currently namespaces are mainly used for two purposes:

  1. Kedro-Viz: the ability to filter and collapse pipelines.
  2. To avoid name conflicts: you cannot have two datasets with identical names, so you apply a namespace to add a prefix (a small sketch follows below).

I guess this is what you mean by using sub-pipelines without namespaces?
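For illustration of point 2, a small sketch of the prefixing behaviour (dataset, node and namespace names here are invented):

from kedro.pipeline import node, pipeline

def fit(data):  # placeholder node function
    return data

base = pipeline([node(fit, inputs="features", outputs="model", name="fit")])

# Applying a namespace prefixes dataset and node names, so the same pipeline
# can be reused twice without name conflicts.
train_a = pipeline(base, namespace="team_a")
train_b = pipeline(base, namespace="team_b")

print(sorted((train_a + train_b).all_outputs()))
# ['team_a.model', 'team_b.model']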

@noklam
Contributor

noklam commented Sep 20, 2023

deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how kedro implements the concept!). However, translating a kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node, and the pod needs to start, run, finish and communicate its status to the orchestrator). With many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the best scenario is one where you can map a collection of nodes onto one step in an orchestration tool (however that would work).

IMO, we need to clarify what should be done on the Kedro side versus in a Kedro plugin. Kedro shouldn't map to a specific orchestrator; that should be a plugin's job. The idea of collapsing a modular pipeline/sub-pipeline into an orchestrator node could potentially be done by Kedro. Here is an old idea that was proposed.

The 1-to-1 node mapping is a topic that comes up repeatedly, and at this point I think we can agree it is bad in most cases. The logical first step is 1 pipeline = 1 node; of course it varies a lot per deployment, and it also depends on how you structure your pipeline and how granular it is.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication; https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the run is split up and executed separately. So this orchestration step needs to happen before the pieces are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and only persist the data that is necessary for communication with other orchestrator nodes.
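As a rough illustration of "only persist what crosses the boundary", a sketch using the public Pipeline API (the pipeline contents are invented):

from kedro.pipeline import node, pipeline

def step(x):  # placeholder node function
    return x

sub = pipeline([
    node(step, inputs="raw", outputs="intermediate", name="first"),
    node(step, inputs="intermediate", outputs="features", name="second"),
])

# Free inputs/outputs are what other orchestrator nodes need to see;
# everything else could stay in memory inside the collapsed node.
boundary = sub.inputs() | sub.outputs()       # {'raw', 'features'}
internal = sub.all_outputs() - sub.outputs()  # {'intermediate'}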

@MatthiasRoels

I guess this is what you mean by using sub-pipelines without namespaces?

That’s exactly what I meant!

@MatthiasRoels

MatthiasRoels commented Sep 20, 2023

Kedro shouldn't map to a specific orchestrator, this should be a plugin job.

Absolutely agree! But on the kedro side, some prep work can definitely be done that can be re-used in different plugins.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication; https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

Assuming you're talking about orchestrator nodes, that's exactly what you want to do. IMO, an object store (S3, GCS, MinIO, …) should work fine for the majority of use cases!

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the run is split up and executed separately. So this orchestration step needs to happen before the pieces are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and only persist the data that is necessary for communication with other orchestrator nodes.

That’s not necessarily true. You need to persist at least all datasets required by other orchestration nodes, but that doesn’t mean you don’t need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the data required by a plugin to create the target orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted datasets.

@noklam
Contributor

noklam commented Sep 20, 2023

I would imagine some sort of kedro compile method/cli where you construct the required data to be used by a plugin to create the required orchestrator resource (e.g. Airflow DAG).

I've always wanted to specify which data to persist (or keep in memory) at runtime without touching the catalog; that's for the interactive workflow.

That’s not necessarily true. You need to persist at least all datasets required by other orchestration nodes, but that doesn’t mean you don’t need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the data required by a plugin to create the target orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted datasets.

True, I focused on the minimal data that is required; of course in practice you will want to customise. This is consistent with defaulting to 1 pipeline = 1 orchestrator node, where you may want to further collapse pipelines or you may need more granularity. So this should be the default if no config is given.

@MatthiasRoels

A bit of a braindump here, but think of an easy example where I have a kedro project consisting of 2 pipelines, A and B (and obviously a __default__, which is the sum of the two). If I then, at least conceptually, think about the process of creating a deployment for these two pipelines, the first step is to figure out the order in which you need to run them. There are three options:

  1. A first, then B
  2. B first, then A
  3. A and B in parallel

Actually, there is a fourth option, but that should result in a "compile" error: the scenario where A depends on dataset_1 and produces dataset_2, whereas B depends on dataset_2 and produces dataset_1 (this looks like an artificial scenario, but believe me, if your kedro pipelines are big enough, this can happen).

So kedro core (not a plugin) needs to figure out the correct order of execution as well as the exact kedro command required to run pipeline A resp. B. I think with that info, you can then create specific plugins that create the target deployment; in my specific case, that would be a k8s resource for Argo Workflows. I'm even imagining starting from either a predefined template or a custom one provided by the user.
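A very rough sketch of how that ordering step could work using only the public Pipeline API; there is no existing kedro compile command, so this is just an illustration of the dependency check, including the cyclic case that should fail:

from kedro.pipeline import Pipeline

def pipeline_order(pipelines: dict[str, Pipeline]) -> list[str]:
    """Topologically sort pipelines: A must run before B if B consumes an output of A."""
    deps = {
        name: {
            other
            for other, p in pipelines.items()
            if other != name and pipe.inputs() & p.outputs()
        }
        for name, pipe in pipelines.items()
    }
    order: list[str] = []
    while deps:
        ready = [name for name, upstream in deps.items() if not (upstream - set(order))]
        if not ready:
            raise ValueError(f"Cyclic pipeline dependency: {sorted(deps)}")
        order.extend(sorted(ready))
        for name in ready:
            del deps[name]
    return order

For the A/B example above, the mutual-dependency case would raise the "compile" error described earlier, while independent pipelines simply come out in a deterministic order and could be scheduled in parallel.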

I see two potential starting points:

  1. we start simple and ask the user to provide a list of kedro pipelines to orchestrate, so that we can focus on implementing what I discussed above, as well as some plugins
  2. we focus on how we can split a particular pipeline into the parts that we want to orchestrate (either by tag, namespace, ...). I'm also wondering if we can somehow automatically create that split if we sum two pipelines; wouldn't it be cool if we could collapse A + B into "super-nodes" A and B? This way, the user can just specify one pipeline (__default__) and kedro automatically figures out the different parts to orchestrate.

@astrojuanlu
Member Author

This conversation branched off quite a bit, so I'll try to re-centre the main question:

Can somebody explain to me like I'm 5 years old what makes a "modular pipeline" different from a "pipeline"?

@astrojuanlu
Member Author

astrojuanlu commented Oct 26, 2023

  • "A pipeline" = kedro.pipeline.pipeline.Pipeline, a Python class | "what you see in Kedro-Viz" | "what you can execute with kedro run" | "the combination of different Pipeline objects through +" | ... (possibly many things, this term is abused)
  • "A namespace(d) pipeline" = A pipeline (with the definition above) using namespaces (hence namespace=... in the initializer)
  • "A modular pipeline" = "The output of kedro pipeline create, hence a Python sub-package that is compatible with kedro micropkg package"

And more:

So, if I'm correct, "a pipeline" and "a modular pipeline", depending on context, might be two entirely different categories of things: the former a Python class, the latter a directory structure. Furthermore: a modular pipeline contains a pipeline (kedro.pipeline.pipeline.Pipeline) definition.

And this is where this terminology, in my opinion, falls apart: a "modular pipeline" is not a kedro.pipeline.pipeline.Pipeline "gone modular", it's a wrapper (in the form of a bunch of Python modules with a specific structure) of a kedro.pipeline.pipeline.Pipeline. There is no IS-A (inheritance) relationship between "modular pipeline" and "pipeline", but rather a HAS-A (composition) relationship. A "modular pipeline" is not a pipeline, and it's not even a module because it's a package (a bunch of modules).
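To make the HAS-A relationship concrete, here is roughly what kedro pipeline create data_processing generates (paths and contents reproduced from memory, so details may vary between Kedro versions):

# src/<package_name>/pipelines/data_processing/   <- the "modular pipeline" (a Python package)
#     __init__.py   (re-exports create_pipeline)
#     nodes.py      (plain node functions)
#     pipeline.py   (shown below)

# pipeline.py: the package *contains* a Pipeline; it is not itself one.
from kedro.pipeline import Pipeline, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([])  # the generated stub, where nodes would be listed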

Not that I have better ideas now (and also I don't want to boil the ocean), but I wanted to at least give my interpretation.

@astrojuanlu
Member Author

A bit more insight on modular pipelines https://github.com/quantumblacklabs/private-kedro/issues/752#issuecomment-736680109 (private link)

(@idanov if you consent, you could copy-paste that comment here)

@stichbury
Contributor

I'm removing the documentation label from this as we have a docs task (#1998) to cover improvement of the docs about modular pipelines. This ticket (to my mind) covers the philosophy of how we talk about modular pipelines and the language we want to use in communicating with users. It needs to happen ahead of the docs work and then, when all is agreed, the docs can be overhauled. So #1998 is dependent on this (a "child" if you like), but this isn't a docs ticket.

@astrojuanlu
Member Author

After we merge #3948, I think the only things left are doing one last pass over the Kedro framework docs and reviewing the Kedro-Viz ones.

As far as I understand (after 1 year of chewing on this issue), Kedro-Viz mostly cares about 2 things:

  • The pipeline registry. It doesn't matter if the registry refers to modular pipelines (hence those wrapped in the kedro pipeline create directory structure) or not.
  • Namespaces. These are shown as nested ("sub-pipelines" as @MatthiasRoels called them).

Since Kedro-Viz doesn't really have a user guide, there is not much to review. The word "modular" appears exactly once in the docs:

https://github.com/kedro-org/kedro-viz/blob/1d14055f5a75ba32e6db37f3bb8a24aec71986b8/docs/source/index.md?plain=1#L17

The codebase is another thing though. kedro-org/kedro-viz#1941 refers to "modular pipelines" and so do all the Python classes, but it's actually talking about namespaces. I reckon that doing a Search & Replace might have big, unintended consequences (cc @rashidakanchwala) so it's probably not worth the effort, but at least user-facing documentation should make the concepts crystal clear.

@astrojuanlu astrojuanlu added Component: Documentation 📄 Issue/PR for markdown and API documentation Issue: Bug Report 🐞 Bug that needs to be fixed and removed Issue: Feature Request New feature or improvement to existing feature labels Jun 30, 2024
@astrojuanlu
Member Author

astrojuanlu commented Jul 2, 2024

So, long story short:

@astrojuanlu
Member Author

Moving this back to our Inbox so we can re-prioritise.

@astrojuanlu
Member Author

Opened a child issue exclusively about deciding on a better name: #4016
