Modular Pipeline breaks when external outputs are used as inputs to the same modular pipeline and another modular pipeline #1105
Comments
@tynandebold @rashidakanchwala iirc it was a deliberate decision to throw away edges to avoid cycles. If you scroll to the Edge cases section in this PDF, I noted a couple of them down: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf |
Notes from Technical Design session:
|
Context: This pain point was also an output from the Kedro-Viz adoption synthesis #987. Sometimes in the flowchart, the layers can't be activated because of a circular dependency. This makes it harder to navigate. Supporting quotes
|
Hi @tynandebold, I wanted to re-raise this issue. I was very happy when I saw that you fixed "incorrect rendering of datasets in modular pipelines." in the context of #1123 and #1439, and I went back to my old minimal working example to check whether it works now. Unfortunately, it still does not. Same code as before: dataset_3 is still not connected to the collapsed pipeline even though it is explicitly listed as an output of that pipeline. I don't even need a nested scenario for that. Removing namespace="sub_pipeline" produces the same error: I get a properly connected chain of tasks and datasets if main_pipeline is expanded, but a fully disconnected dataset_3 if main_pipeline is collapsed. I find it kind of strange that even after all the effort you put in to solve #1123, this seemingly trivial case still produces errors. At the same time, I have high hopes that with all the learnings from 0.6.4, you may be able to fix this easily after fixing #1123. For me, unfortunately, fixing this error is still vital because my steadily growing production pipeline has several datasets which are used at multiple steps in the pipeline, and I still cannot properly modularise the pipeline without breaking the visualisation. Cheers |
Thanks for raising this again @DrDaDe. I've added it to our next sprint, which starts next week on Monday. We'll have a look then. |
Hi @DrDaDe - this is a known issue in Kedro-viz and we currently do not have a way to resolve it. This is because dataset_3 acts as both an input and an output of main_pipeline. This creates a cyclic dependency, and as cycles are not allowed in DAGs, we remove these edges. |
Thank you for giving an update on this matter. However, I have to disagree with the cyclic dependency problem as a general explanation. Again, here is my example from above, simplified as much as possible:

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"),
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"},
    )
    return new_pipeline

It produces the exact same graphs as already provided in my earlier message. In particular, dataset_3 gets disconnected if the main pipeline is collapsed. Clearly, there is no cyclic dependency at all in this pipeline, neither at the expanded nor at the collapsed level. At least not logically. If, as you suggest, the edge is removed due to a cyclic dependency detection, then there must either be an algorithmic error in that detection or another artifact in the code which involuntarily created a cyclic dependency even though logically there is none. But to be honest, I believe the root of this problem lies somewhere else. |
@DrDaDe - this is a correct DAG in Kedro. However, the cyclic dependency is introduced by the collapsed pipeline view in Kedro-viz. dataset_3 is an output of main_pipeline.step3 and an input of main_pipeline.step4. In a collapsed view, where you can see neither step3 nor step4, a cycle is introduced in the DAG because dataset_3 becomes both an input and an output of main_pipeline. Please see more details in the edge case section of this pdf - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf cc - @noklam This is a known issue, but it would be interesting to understand from you how you would visualise this in a collapsed view. |
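To make the collapsed-view argument above concrete, here is a minimal sketch in plain Python. It is not Kedro-viz code: the helper names `collapse` and `has_two_cycle` are hypothetical, dataset names are written without a namespace prefix for brevity, and the set of datasets that stay visible after collapsing is an assumption made for illustration.

```python
# Edges of the expanded four-step example from this thread.
EXPANDED_EDGES = [
    ("dataset_in", "step1"), ("step1", "dataset_1"),
    ("dataset_1", "step2"), ("step2", "dataset_2"),
    ("dataset_2", "step3"), ("step3", "dataset_3"),
    ("dataset_3", "step4"), ("step4", "dataset_out"),
]
TASKS = {"step1", "step2", "step3", "step4"}
# Assumption: the free input and the declared outputs remain visible
# when the pipeline is collapsed; all other datasets are hidden.
VISIBLE = {"dataset_in", "dataset_3", "dataset_out"}


def collapse(edges, tasks, visible, name="main_pipeline"):
    """Merge all tasks into one super-node and drop hidden datasets."""
    collapsed = set()
    for src, dst in edges:
        src = name if src in tasks else src
        dst = name if dst in tasks else dst
        # Keep an edge only if both ends are still drawn.
        if {src, dst} <= visible | {name} and src != dst:
            collapsed.add((src, dst))
    return collapsed


def has_two_cycle(edges):
    """True if some pair of vertices is connected in both directions."""
    return any((dst, src) in edges for src, dst in edges)


collapsed_edges = collapse(EXPANDED_EDGES, TASKS, VISIBLE)
print(has_two_cycle(EXPANDED_EDGES))   # expanded graph
print(has_two_cycle(collapsed_edges))  # collapsed graph
```

Under these assumptions the expanded graph has no cycle, while the collapsed graph contains both `main_pipeline -> dataset_3` and `dataset_3 -> main_pipeline`. The cycle is therefore a property of the naive collapse, not of the Kedro pipeline itself.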
First of all, many thanks for the explanation and the link to the source. I understand the issue a little better now and maybe we can sort this out faster: What I agree with:
What I don't agree with
If you could tell me the part of the code which determines the inputs and outputs for the modular pipeline, I might be able to investigate further myself (I was unable to dig through the code by myself as I am not very good at understanding frontend project code). Hence I can only vaguely guess what the issue is: I believe the algorithm creates dataset_3 as an external node because I declare it as an output. If the algorithm then checks whether any external output node (including the one I just created) is used as an input for any of the internal nodes of the pipeline, it will declare dataset_3 as an input node and we have a problem. From this guess, and without knowing what the actual algorithm is, I propose two possible solutions:
Let me know if I can be of any more help. Also, if you could tell me which part of the code determines the inputs and outputs of the collapsed pipeline, I'd be happy to check and test for myself. It would also be interesting if you could provide counter-examples to my proposal above, i.e. examples where it is important that internal datasets are considered as input datasets even though not explicitly declared as such. Addendum: Another thought of mine: Why is it that the algorithm believes dataset3 is an input but not, e.g., dataset2? The fact that I declare dataset3 as "output" seems to label it as a potential "candidate" for an input node. Proposal 2 should therefore straightforwardly fix this. |
@DrDaDe - Thanks for the above suggestions and workarounds :) Your thoughts are very helpful. We will investigate the above in our next sprint. Also here's where the code resides - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/managers.py#L389
That is because dataset3 becomes an external output of the modular pipeline: in collapsed mode it lies outside the modular pipeline, whereas dataset2 resides inside the modular pipeline and is hence hidden when the modular pipeline is collapsed. |
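The difference between dataset3 and dataset2 described above can be sketched as follows. Kedro prefixes datasets of a namespaced pipeline with the namespace unless they are mapped through `inputs=` or `outputs=`. The helper `namespaced_name` is hypothetical, and the idea that "inside" versus "outside" can be read off the prefix is an assumption for illustration, not a statement about how Kedro-viz actually decides it.

```python
def namespaced_name(dataset, namespace, declared):
    """Mimic the renaming: declared datasets keep their name,
    every other dataset gets the namespace as a prefix."""
    return dataset if dataset in declared else f"{namespace}.{dataset}"


# Values taken from the four-step example in this thread.
declared_outputs = {"dataset_out", "dataset_3"}
datasets = ["dataset_1", "dataset_2", "dataset_3", "dataset_out"]

names = {d: namespaced_name(d, "main_pipeline", declared_outputs)
         for d in datasets}

# Assumption: a dataset carrying the prefix lives inside the pipeline
# and is hidden on collapse; one without the prefix stays outside.
inside = {d for d, full in names.items()
          if full.startswith("main_pipeline.")}
outside = set(datasets) - inside

print(sorted(inside))
print(sorted(outside))
```

In this sketch dataset_1 and dataset_2 end up inside the collapsed node, while dataset_3 and dataset_out stay outside. That is why only a declared output such as dataset_3 can appear on both sides of the collapsed pipeline.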
First of all: My minimal working example has now boiled further down to this:

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_middle",
                 name="step_in"),
            node(lambda x: x,
                 inputs="dataset_middle",
                 outputs="dataset_out",
                 name="step_out"),
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_middle"},
    )
    return new_pipeline

I tried running my pipeline through the Here is the list of "inputs", "outputs", "external inputs" and "external outputs":
Note that the cycle-detector looks for the "descendants", which are computed as follows:

descendants = nx.descendants(digraph, modular_pipeline_id)
bad_inputs = modular_pipeline.inputs.intersection(descendants)

Here are the descendants the graph produces
Funnily, since this does not intersect with "inputs", nothing gets removed! And I actually observed while debugging, that the "cycle node remover" code is never executed. So as a fun summary: What you suspected indeed happens: To be honest, I am even more confused than before but I hope maybe we can get closer to the core of the problem :-) My followup questions would be:
|
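The descendant-based check quoted in the comment above can be sketched without networkx. The graph and the input sets below are hypothetical and are not the values observed while debugging. `descendants` is a plain-Python stand-in with the same contract as `networkx.descendants` (everything reachable from the source, excluding the source), and `bad_inputs` is an illustrative helper name.

```python
from collections import deque


def descendants(graph, source):
    """All vertices reachable from `source`, excluding `source` itself."""
    seen = set()
    queue = deque([source])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen and nxt != source:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bad_inputs(graph, pipeline_id, inputs):
    """Inputs that are also downstream of the pipeline, i.e. cycle candidates."""
    return set(inputs) & descendants(graph, pipeline_id)


# Hypothetical collapsed graph for the two-step example (adjacency lists).
graph = {
    "dataset_in": ["main_pipeline"],
    "main_pipeline": ["dataset_middle", "dataset_out"],
}

# If dataset_middle is not registered as an input, nothing is flagged.
print(bad_inputs(graph, "main_pipeline", {"dataset_in"}))
# If it is registered as an input, it is flagged for edge removal.
print(bad_inputs(graph, "main_pipeline", {"dataset_in", "dataset_middle"}))
```

The sketch shows why the intersection can be empty: the check only fires for datasets that are both registered as inputs of the modular pipeline and reachable from it. If dataset_middle never appears in the input set, the removal code is never triggered.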
@DrDaDe, you are right. On further investigation based on your point above, we realised that this part of the code cancels This is indeed a bug! We are planning to address this issue in our upcoming sprint and will keep you updated once the fix is released. Thank you for bringing this to our attention! |
@DrDaDe - The fix for this issue has been released in the latest Kedro-viz 7.0.0. Thank you so much for your support and relentless pursuit to have this issue fixed! 😄 |
Description
If an output from a modular pipeline is used both as an input to a node in the same modular pipeline and as an input to a node in another modular pipeline, then Kedro-viz no longer recognises it as an external output of the modular pipeline.
Context
User @david-zihao-xu reported this issue on his Kedro-project.
See a visualization of the problem here.
Steps to Reproduce
The above shows that this only happens when an internal output (an output created inside a modular pipeline) is used as both an internal input and an external input to another modular pipeline.
Expected Result
This needs some technical discussion because 'Prm_spine_table' should ideally be an external output of the modular pipeline "ingestion", but if it's outside, then how can it be an input to the same modular pipeline? This is where it becomes a circular interaction, and it's not clear how we should represent it.
Checklist