
Modular Pipeline breaks when external outputs are used as inputs to the same modular pipeline and another modular pipeline #1105

Closed
rashidakanchwala opened this issue Sep 29, 2022 · 16 comments · Fixed by #1651
Labels: Issue: Bug Report, Python

Comments

@rashidakanchwala (Contributor) commented Sep 29, 2022

Description

If an output of a modular pipeline is used as an input to a node in the same modular pipeline, or as an input to a node in another modular pipeline, then Kedro-viz no longer recognises it as an external output of that modular pipeline.

Context

User @david-zihao-xu reported this issue on his Kedro-project.

See a visualization of the problem here.

Steps to Reproduce

  1. On the demo project, notice that 'Prm Spine Table' does not have any input. When you expand it, you can see it is an output of a node from the modular pipeline 'Ingestion'. But when you collapse it again, it has no connection with Ingestion.
  2. In demo-project/src/pipeline/data-ingestion/pipeline.py, remove the lambda function which uses 'Prm_spine_table' as an input.
  3. Once that is removed, it all works fine: Ingestion now has 'Prm Spine Table' as an output.

The above shows that this only happens when an internal output (an output created inside a modular pipeline) is used both as an internal input and as an external input to another modular pipeline.

Expected Result

This needs some technical discussion, because 'Prm_spine_table' should ideally be an external output of the modular pipeline 'ingestion'. But if it sits outside the pipeline, how can it also be an input to that same pipeline? This is where the interaction becomes circular, and it is not clear how we should represent it.

Checklist

  • Include labels so that we can categorise your issue
@limdauto (Collaborator):

@tynandebold @rashidakanchwala iirc it was a deliberate decision to throw away edges to avoid cycles. If you scroll to the Edge cases section in this PDF, I noted a couple of them down: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf

@rashidakanchwala (Contributor, Author):

Thanks @limdauto, this is so helpful. I remember something along these lines being discussed back then, but could not find which document had it.

Also FYI - @jmholzer

@tynandebold (Member):

Notes from Technical Design session:

  • The current implementation seems to be removing the inbound edge to a node instead of the outbound
  • Let's investigate in the code how it's being done and play around with removing the other edge
  • We should aim to work out the possible combinations of these happenings and see what will be the best solution

@NeroOkwa (Contributor) commented Nov 25, 2022

Context

This pain point was also surfaced in the Kedro-Viz adoption synthesis #987:

Sometimes in the flowchart, the layers can't be activated because of circular dependency. This makes it harder to navigate
your flowchart in Kedro-Viz.

Supporting quotes

  • “Sometimes on the layer level we have, we run into this that the layer can't be activated because of circular dependencies. I think there is an open issue for that, but that has been (especially in like big projects) a pain points sometimes because the graph that you then end up with is really much harder to navigate as opposed to with the layers that, that we created in the catalogs. And yeah, just being able to debug which data set is causing the issue would already be incredibly helpful”.

@DrDaDe commented Jan 24, 2023

I'd like to push this issue and provide another simple minimal working example:

pipeline code:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:

    sub_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
        ],
        inputs={"dataset_1"},
        outputs={"dataset_3"},
        namespace="sub_pipeline"
    )

    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            sub_pipeline,
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"}
    )
    return new_pipeline

catalog:

dataset_in:
  type: json.JSONDataSet
  filepath: data/dataset_in.json

dataset_out:
  type: json.JSONDataSet
  filepath: data/dataset_out.json

dataset_in.json:

{
  "key": "value"
}

Expanded Pipeline (correct): [image]

Collapsed Pipeline (incorrect - dataset_3 should be an output of the overall pipeline): [image]

@DrDaDe commented Oct 6, 2023

Hi @tynandebold ,

I wanted to re-raise this issue.

I was very happy when I saw that you fixed "incorrect rendering of datasets in modular pipelines" in the context of #1123 and #1439, and I went back to my old minimal working example to check whether it works now.

Unfortunately, it still does not.

Same code as before:

Expanded: [image]
Collapsed: [image]

Dataset 3 is still not connected to the collapsed pipeline even though it is explicitly listed as an output of that pipeline.

I actually don't even need a nested scenario for that. Removing namespace="sub_pipeline" produces the same error: I get a properly connected chain of tasks/datasets if main_pipeline is expanded, but a fully disconnected dataset_3 if main_pipeline is collapsed.

I find it kind of strange that even after all the effort you put in to solve #1123, this seemingly trivial case still produces errors. But at the same time I carry high hopes that with all the learnings from 0.6.4, you may be able to fix this easily after fixing #1123.

For me, unfortunately, fixing this error is still vital because my steadily growing production pipeline has several datasets which are used at multiple steps in the pipeline and I still cannot properly modularise the pipeline without breaking the visualisation.

Cheers

@tynandebold (Member):

Thanks for raising this again @DrDaDe.

I've added it to our next sprint, which starts next week on Monday. We'll have a look then.

@rashidakanchwala (Contributor, Author) commented Oct 19, 2023

Hi @DrDaDe - this is a known issue in Kedro-viz and we currently do not have a way to resolve it.

This is because dataset_3 acts as both an input and an output of main_pipeline. That creates a cyclic dependency, and as cycles are not allowed in DAGs, we remove these edges. We are working on better Kedro-viz error handling that informs users when this happens.

@DrDaDe commented Oct 27, 2023

Hi @rashidakanchwala,

Thank you for giving an update on this matter.

However, I have to disagree with the cyclic dependency problem as a general explanation. Again, here is my example from above, simplified as much as possible:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"}
    )
    return new_pipeline

It produces the exact same graphs as already provided in my earlier message. In particular, dataset_3 gets disconnected if the main pipeline is collapsed.

Clearly, there is no cyclic dependency at all in this pipeline, neither at the expanded nor at the collapsed level. At least not logically.

If, as you suggest, the edge is removed due to cyclic dependency detection, then there must either be an algorithmic error in that detection, or another artifact in the code which inadvertently creates a cyclic dependency even though logically there is none.

But to be honest, I believe the root of this problem lies somewhere else.

@rashidakanchwala (Contributor, Author) commented Oct 27, 2023

@DrDaDe - this is a correct DAG in Kedro. However, the cyclic dependency is introduced by the collapsed pipeline view in Kedro-viz:

dataset_3 is an output of main_pipeline.step3
dataset_3 is an input of main_pipeline.step4

In the collapsed view, where you can see neither step3 nor step4, a cycle is introduced in the DAG: dataset_3 becomes both an input and an output of main_pipeline.
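
The mechanism can be sketched in plain Python (a minimal illustration, not Kedro-viz's actual code; node names follow the example above):

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in graph}

    def visit(n):
        colour[n] = GREY
        for m in graph[n]:
            if colour[m] == GREY or (colour[m] == WHITE and visit(m)):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

# Expanded view: step3 produces dataset_3, step4 consumes it -- a chain.
expanded = [("step3", "dataset_3"), ("dataset_3", "step4")]
assert not has_cycle(expanded)

# Collapsed view: step3 and step4 merge into one "main_pipeline" node,
# so dataset_3 is simultaneously its output and its input.
collapsed = [("main_pipeline", "dataset_3"), ("dataset_3", "main_pipeline")]
assert has_cycle(collapsed)
```

Collapsing is effectively a node contraction, and contracting both the producer and the consumer of dataset_3 into one super-node turns the chain into a two-node cycle.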

Please see more details on the edge case section of this pdf - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf

cc - @noklam

This is a known issue, but it would be interesting to hear from you how you would visualise this in a collapsed view.

@DrDaDe commented Nov 1, 2023

Hi @rashidakanchwala,

first of all: Many thanks for the explanation and link to the source. I understand the issue a little better now and maybe we can sort this out faster:

What I agree with:

  • Dataset 3 is definitely considered an output of the modular pipeline, because I explicitly declare it to be.
  • If a dependency cycle occurs, edges have to be removed to be able to construct a tree view.
  • If dataset 3 is considered an input for the collapsed modular pipeline, a cycle occurs. (But mind the if, see below).

What I don't agree with

  • Dataset 3 should never be considered an input for that pipeline. Neither do I declare it to be, nor does it make sense logically: an input to a pipeline must not and cannot be something the pipeline itself produces.

If you could tell me the part of the code which determines the inputs and outputs for the modular pipeline, I might be able to investigate further myself (I was unable to dig through the code on my own, as I am not very good at understanding frontend project code).

Hence I can only vaguely guess what the issue is: I believe the algorithm creates dataset_3 as an external node because I declare it as an output. If the algorithm then checks whether any external output node (including the one I just created) is used as an input to any of the internal nodes of the pipeline, it will declare dataset_3 an input node, and we have a problem.

From this random guess and not knowing what the actual algorithm is, I propose two possible solutions:

  • If a dataset is the output of any node inside a modular pipeline, then it cannot be an input dataset of that modular pipeline.
  • If a dataset is the output of a modular pipeline, then it cannot be an input dataset of that same modular pipeline.
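
Proposal 2 could be sketched like this (illustrative only; the function name and signature are hypothetical, not the Kedro-viz API):

```python
def collapsed_inputs(node_inputs, node_outputs, declared_outputs):
    """Datasets consumed inside the pipeline but not produced by it.

    Anything the pipeline's own nodes produce, or that is declared as a
    pipeline output, is excluded from the collapsed view's inputs.
    """
    produced = set(node_outputs) | set(declared_outputs)
    return set(node_inputs) - produced

# The minimal example above: dataset_3 is consumed by step4 but also
# produced by step3 and declared as a pipeline output.
node_inputs = {"dataset_in", "dataset_1", "dataset_2", "dataset_3"}
node_outputs = {"dataset_1", "dataset_2", "dataset_3", "dataset_out"}
declared_outputs = {"dataset_out", "dataset_3"}

print(collapsed_inputs(node_inputs, node_outputs, declared_outputs))
# {'dataset_in'}
```

With this filter, dataset_3 is never treated as an input of main_pipeline, so no cycle can arise from it in the collapsed view.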

Let me know if I can be of any more help. Also, if you could tell me which part of the code determines the inputs and outputs of the collapsed pipeline, I'd be happy to check and test for myself. It would also be interesting if you could provide counter-examples to my proposal above, i.e. examples where it is important that internal datasets are considered as input datasets even though not explicitly declared as such.

Addendum: another thought: why does the algorithm believe dataset_3 is an input but not, e.g., dataset_2? The fact that I declare dataset_3 as an output seems to label it as a potential candidate for an input node. Proposal 2 should therefore fix this straightforwardly.

@DrDaDe commented Nov 1, 2023

If anyone stumbles across here: A colleague of mine found a nice workaround:

If I have a dataset like dataset3 (I'll call it internal_dataset) which is both an external output and an "internal input", I can circumvent the problem as follows:

  • add a pseudonode with a trivial lambda function which maps the input to itself
  • use internal_dataset as an input and define a new dataset external_dataset as output of that node
  • use external_dataset as the output of the modular pipeline.

Now the ambiguity is broken: internal_dataset is purely internal (produced and consumed), and external_dataset is purely external (only produced, and available as an external node).

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="pseudodataset_3",
                 name="pseudostep3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "pseudodataset_3"}
    )
    return new_pipeline
[images: expanded and collapsed views of the workaround]

Grain of salt: The additional pseudonode may create confusion when reading the pipeline. In my case, for example, my trivial chain of nodes now becomes an actual tree with a branch.

@rashidakanchwala (Contributor, Author) commented Nov 2, 2023

> Let me know if I can be of any more help. Also, if you could tell me which part of the code determines the inputs and outputs of the collapsed pipeline, I'd be happy to check and test for myself. It would also be interesting if you could provide counter-examples to my proposal above, i.e. examples where it is important that internal datasets are considered as input datasets even though not explicitly declared as such.

@DrDaDe - Thanks for the above suggestions and workarounds :) Your thoughts are very helpful. We will investigate the above in our next sprint. Also here's where the code resides - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/managers.py#L389

> Why is it that the algorithm believes dataset3 is an input but not, e.g., dataset2? The fact that I declare dataset3 as "output" seems to label it as a potential "candidate" for an input node. Proposal 2 should therefore straightforwardly fix this.

That is because when dataset_3 becomes an external output of the modular pipeline, in the collapsed mode it lies outside the modular pipeline, whereas dataset_2 resides inside it and is therefore hidden when the modular pipeline is collapsed.

@DrDaDe commented Nov 2, 2023

Hi @rashidakanchwala

First of all: My minimal working example has now boiled further down to this:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_middle",
                 name="step_in"),
            node(lambda x: x,
                 inputs="dataset_middle",
                 outputs="dataset_out",
                 name="step_out"
            )
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_middle"}
    )
    return new_pipeline

I tried running my pipeline through the create_modular_pipelines_tree_for_registered_pipeline method you referenced and checked what happens:

Here is the list of "inputs", "outputs", "external inputs" and "external outputs":

inputs: ['main_pipeline.dataset_in']
outputs: ['dataset_out']
external_inputs: ['dataset_middle']
external_outputs: ['dataset_out', 'dataset_middle']

Note that the cycle detector looks at the "descendants", which are computed as follows:

descendants = nx.descendants(digraph, modular_pipeline_id)
bad_inputs = modular_pipeline.inputs.intersection(descendants)

Here are the descendants the graph produces:

['dataset_out']

Funnily, since this does not intersect with "inputs", nothing gets removed! And I actually observed while debugging that the "cycle node remover" code is never executed.

So as a fun summary, what you suspected indeed happens: dataset_middle is considered an "external_input" but not an input. Still, it has nothing to do with the code you referenced, because the cycle detector looks for overlaps between inputs and descendants, not external inputs.
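
This observation can be reproduced with a tiny stand-in for the descendants-based check (a plain-Python sketch, not the actual Kedro-viz code; the edge list is an assumption that mirrors the values printed above):

```python
def descendants(edges, start):
    """All nodes reachable from `start` in a directed graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Collapsed graph as Kedro-viz apparently builds it: dataset_middle is
# only an *external* output, so just dataset_out hangs off the pipeline.
edges = [("main_pipeline.dataset_in", "main_pipeline"),
         ("main_pipeline", "dataset_out")]
inputs = {"main_pipeline.dataset_in"}

bad_inputs = inputs & descendants(edges, "main_pipeline")
print(bad_inputs)  # set() -- no intersection, so the cycle remover never fires
```

Since dataset_middle never appears in either the inputs set or the descendants, the intersection is empty and the remover has nothing to remove.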

To be honest, I am even more confused than before, but I hope we can get closer to the core of the problem :-)

My follow-up questions would be:

  • Which earlier part of the code decides that "dataset_middle" is an input? (Because this is where I disagree.)
  • Which later part of the code translates this pipeline into the collapsed view?

@rashidakanchwala (Contributor, Author) commented Nov 8, 2023

@DrDaDe, you are right. On further investigation based on your point above, we realised that this part of the code cancels out external_inputs and external_outputs, and this prevents dataset_middle from being an output of the modular pipeline.

This is indeed a bug! We are planning to address it in our upcoming sprint and will keep you updated once the fix is released. Thank you for bringing this to our attention!

@rashidakanchwala (Contributor, Author):

@DrDaDe - The fix for this issue has been released in the latest Kedro-viz 7.0.0. Thank you so much for your support and relentless pursuit to have this issue fixed! 😄
