
Allow multiple nodes to return same output #806

Closed
CapTen101 opened this issue Jun 24, 2021 · 11 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@CapTen101

Description

I want to visualize a particular pipeline in which multiple nodes point to the same output. This raises an error in the current version of Kedro (0.17.4). Below is the error I get:

Error: Output(s) ['target_1', 'target_2'] are returned by more than one nodes. Node outputs must be unique.
node(
    node_dummy_fun,
    inputs=["source_1"],
    outputs="target_1",
    name="job_id_x2",
),
node(
    node_dummy_fun,
    inputs=None,
    outputs="target_1",
    name="random",
),

The nodes above have different inputs but point to the same output.

Context

I just need Kedro-Viz to visualize data lineage, and this scenario would benefit from the feature.

@CapTen101 CapTen101 added the Issue: Feature Request New feature or improvement to existing feature label Jun 24, 2021
@antonymilne
Contributor

Hello @CapTen101. This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

What exactly are you trying to achieve here? Maybe there is another way it's possible while working within the constraints of kedro's graph structure.

If you're not interested in running the DAG through kedro run but just looking to visualise a graph using kedro viz then this might already be possible through manually editing the json file that kedro viz uses to generate the graph.
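For instance, here is a hedged sketch of what editing such an exported graph might look like. It assumes a JSON layout with top-level "nodes" and "edges" lists whose edges use "source"/"target" ids; the exact schema varies across Kedro-Viz versions, and the node ids below are taken from the example in this issue.

```python
# Hedged sketch: add an extra edge to a graph JSON exported by Kedro-Viz
# (the schema shown here is assumed and varies by Viz version).
import json

graph = {
    "nodes": [{"id": "job_id_x2"}, {"id": "random"}, {"id": "target_1"}],
    "edges": [{"source": "job_id_x2", "target": "target_1"}],
}

# Point a second node at the same output, something kedro run would reject.
graph["edges"].append({"source": "random", "target": "target_1"})

serialized = json.dumps(graph, indent=2)
```

Feeding the edited JSON back to Viz would then render both incoming edges, without Kedro core ever validating the pipeline.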

@datajoely
Contributor

As discussed on Discord here, I think we can close this issue.

@CapTen101
Author

CapTen101 commented Jun 24, 2021

@AntonyMilneQB

Hello @CapTen101. This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

What exactly are you trying to achieve here? Maybe there is another way it's possible while working within the constraints of kedro's graph structure.

If you're not interested in running the DAG through kedro run but just looking to visualise a graph using kedro viz then this might already be possible through manually editing the json file that kedro viz uses to generate the graph.

It is basically a table in a database which is being read from or written to by different SQL jobs, and that is a real-world scenario which doesn't introduce any cycles into the DAG.

@CapTen101
Author

@AntonyMilneQB It is achievable with a small modification to the Kedro source code! I commented out line 178 in kedro/pipeline/pipeline.py and it worked.
[Screenshot from 2021-06-24 showing the modified pipeline.py]

@datajoely
Contributor

Well done! Since the constraint is on the Kedro core side rather than the Viz side, perhaps you'll want to look into generating the JSON yourself in the longer term?

You can see what Kedro generates using this command:
[Screenshot of the command]

I'm going to close this issue since it's an intended constraint on the core side - but Viz doesn't really care!

@datajoely
Contributor

datajoely commented Jun 24, 2021

@CapTen101 one other suggestion by my colleague @limdauto was to monkey patch out _validate_unique_outputs() so that you don't need to fork the library, but it is still a bit hacky.
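To illustrate the monkey-patching idea, here is a hedged, self-contained sketch. The real check is assumed to be a module-level helper in kedro/pipeline/pipeline.py (the name _validate_unique_outputs comes from the comment above); a stand-in module is used here so the pattern runs without Kedro installed.

```python
# Hedged sketch of monkey-patching out the duplicate-output check.
# kedro_pipeline is a stand-in for the real kedro.pipeline.pipeline module.
import types

kedro_pipeline = types.ModuleType("kedro.pipeline.pipeline")  # stand-in

def _validate_unique_outputs(nodes):
    """Stand-in for Kedro's validator: reject duplicate outputs."""
    outputs = [out for n in nodes for out in n["outputs"]]
    duplicates = {o for o in outputs if outputs.count(o) > 1}
    if duplicates:
        raise ValueError(
            f"Output(s) {sorted(duplicates)} are returned by more than one node."
        )

kedro_pipeline._validate_unique_outputs = _validate_unique_outputs

def build(nodes):
    """Mimics Pipeline.__init__ calling the module-level validator."""
    kedro_pipeline._validate_unique_outputs(nodes)
    return nodes

nodes = [{"outputs": ["target_1"]}, {"outputs": ["target_1"]}]

# The patch: replace the validator with a no-op before building the pipeline.
kedro_pipeline._validate_unique_outputs = lambda nodes: None
pipeline = build(nodes)  # no longer raises
```

Against the real library, the patch would be one line applied before the pipeline is constructed, with the caveat that the helper is private and may move between Kedro versions.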

@CapTen101
Author

@datajoely Actually, I went into my virtual environment directory and commented out the line of code directly there. That way it works for all visualization projects created inside that virtual environment.

@benniedp

benniedp commented Jul 13, 2023

Sorry to dig up an old closed issue, but I'm also running into this problem.

This is a constraint which is forced by the nature of the DAG that kedro builds, which is supposed to provide a deterministic structure for a reproducible pipeline run. If two nodes have the same output then it's not clear what should happen once node 1 has written the output and then node 2 is run: does it overwrite the output? Append to it in some way?

I'm trying to understand the above constraint (from @antonymilne's comment above). Why does Kedro care what happens when a function's result is given to an output DataSet? Is it not up to the DataSet to decide how to manage this?

For example, given that node functions are expected to be pure functions, it might be useful to implement a write-only DataSet that simply writes a node's result to standard output. There's no reason such a class shouldn't be reusable as an output. Am I missing something?
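A minimal sketch of that idea, without Kedro installed: Kedro's real base class is kedro.io.AbstractDataset (with _save/_load hooks), so a stand-in base is used here to keep the example self-contained, and StdOutDataset is a hypothetical name.

```python
# Hedged sketch: a write-only dataset that just prints the node's result.
# AbstractDataset here is a stand-in for kedro.io.AbstractDataset.
class AbstractDataset:
    def save(self, data):
        return self._save(data)

class StdOutDataset(AbstractDataset):
    """Write-only: accepts any node output and prints it."""
    def _save(self, data):
        line = str(data)
        print(line)
        return line

    def _load(self):
        raise IOError("StdOutDataset is write-only")

StdOutDataset().save({"metric": 0.93})
```

Since such a dataset discards nothing and keeps no state, it is not obvious why two nodes could not share it as an output.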

@astrojuanlu
Member

Hi @benniedp, I'm re-reading this old conversation and it's not entirely clear to me how the DAG properties prevent a dataset from being connected to two incoming nodes. Also, I cannot open the Discord link @datajoely shared because I was never part of the server.

I understand though that, from a Kedro perspective (not from a mathematical/DAG perspective) it's problematic to have two nodes return the same output with certain types of datasets.

To your original question:

For example, given that node functions are expected to be pure functions, it might be useful to implement a write-only DataSet that simply writes a node's result to standard output. There's no reason such a class shouldn't be reusable as an output.

Let's call the nodes A and B. How should we decide whether the output should be AB or BA?
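The ambiguity can be made concrete with a small sketch (not Kedro code): with no ordering constraint between two writers to the same dataset, the final value depends on which node the runner happens to execute last.

```python
# Why two writers to one output break reproducibility: "last writer wins",
# and the scheduler is free to pick either order.
def run(schedule):
    dataset = {}
    for node_name, value in schedule:
        dataset["shared_output"] = value  # last writer wins
    return dataset["shared_output"]

# Two legal schedules of the same pipeline give different results.
result_ab = run([("A", "from_A"), ("B", "from_B")])
result_ba = run([("B", "from_B"), ("A", "from_A")])
assert result_ab != result_ba
```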

@deepyaman
Member

I'm trying to understand the above constraint (from @antonymilne's comment above). Why does Kedro care what happens when a function's result is given to an output DataSet? Is it not up to the DataSet to decide how to manage this?

@benniedp I think Kedro cares because it doesn't want users unknowingly/accidentally creating race conditions/unpredictable behavior. If Kedro quietly overwrote data, it could be extremely painful to debug. One can also argue, in 95+% of situations, you don't want to overwrite a dataset during a run--why would it be useful?

An appendable, (possibly) write-only dataset is one situation where this is useful, as you've raised. The pandas.ExcelDataSet is an example of this, when used with append mode. Kedro expects you to be explicit and create a dataset for each output, even though you're modifying the same file. Thinking about it, I'm not sure how it would play with a non-sequential runner. A drawback of this approach is, if you want to use the dataset as an input downstream, you have to include all appended output datasets as input to the consuming node (to ensure execution order).
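That execution-order point can be sketched without Kedro: declaring every appended output as an input of the consumer is exactly what pins the order in any topological scheduler. The node names below are illustrative.

```python
# Minimal sketch: why listing all appended outputs as inputs to the consumer
# guarantees the consumer runs last in a DAG scheduler.
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must complete before it.
deps = {
    "write_part_a": set(),                        # appends to report.xlsx
    "write_part_b": set(),                        # appends to report.xlsx
    "consume": {"write_part_a", "write_part_b"},  # declares both as inputs
}

order = list(TopologicalSorter(deps).static_order())
# "consume" is guaranteed to come after both writers.
assert order.index("consume") > order.index("write_part_a")
assert order.index("consume") > order.index("write_part_b")
```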

Another situation where this is useful, from my experience, is with PartitionedDataSets. I think it's quite possible to want to generate some set of partition outputs using some logic in one node, and another set for different partitions and with different logic in another node, but to consume them together. (Or maybe use different modular pipelines to all produce partitions in the same dataset, and consume them in a final node without a fake no-op "merge" step.)
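The PartitionedDataSet case reduces to the same shape; here is a plain-Python sketch of the idea (partition names and producers are made up), where each producer returns a dict of partition-name to data and the consumer reads the union.

```python
# Sketch of the partitioned case without Kedro: two producers each return a
# dict of partitions; under the uniqueness rule each must write to its own
# partitioned dataset, and the consumer merges them explicitly.
def produce_europe():
    return {"de": [1, 2], "fr": [3]}

def produce_asia():
    return {"jp": [4], "in": [5, 6]}

partitions = {**produce_europe(), **produce_asia()}
```

The explicit merge step is exactly the "fake no-op 'merge'" the comment above would like to avoid.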

In both of these situations, Kedro stipulates additional restrictions--namely that you can't use versioning. This is because, as soon as you enable versioning, you're writing to a different filepath, and therefore a different dataset. This could even come up for a single intended run, in case of failure/having to resume; you'll get a new timestamped version on the run. There's no way that I'm aware of to solve this, without changing the way versioning in Kedro works.

My guess is that, for these reasons--increased likelihood of wrong behavior, the relative infrequency of this need, the fact that there is a way to achieve the goal by defining additional datasets, and potential limitations of enabling this--the additional restriction on the Kedro DAG exists.

@stefano-brambilla-venchi

stefano-brambilla-venchi commented Jul 4, 2024

Hi everybody,
I came across this issue when facing a specific use case. I understand the reason behind these restrictions, but I am not sure what the best practice, according to Kedro, would be in my use case.

I have a parquet table with columns ["date", "country", "store", "some_metrics"].
I have specific functions that compute those "some metrics" daily for each store in each country.

I initially designed the process by creating a node for each store, all writing to a table defined in the catalog as:

my_table:
    filepath: my_path
    type: spark.SparkDataset
    file_format: parquet
    save_args:
        mode: append
        partitionBy: ["date", "country", "store"]

node(
    compute_metric_country_store,
    inputs=None,
    outputs="my_table",
    name="my_node",
)

That way, with a daily trigger, each node runs and adds its rows to the correct partition. The "append" mode guarantees that whatever order Kedro chooses for the nodes, the result is consistent.
Having a node for each store (hundreds of them) allows us to see clearly which nodes failed.
Having only one parquet table allows us to keep the process clear and avoid a monster catalog.

This architecture, though, goes against the rule that each output belongs to only one node; in this ETL use case (which I believe is not that unique!), what would the best practice in Kedro be?

I am open to any suggestion, thank you.
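One pattern that might reconcile this use case with the uniqueness rule is Kedro's dataset factories (available from Kedro 0.18.12 onward). This is a hedged sketch, with `{store}` as an illustrative placeholder: each store's node writes to its own logical dataset name, while a single catalog pattern keeps the catalog small.

```yaml
# Hedged sketch: one catalog pattern instead of hundreds of entries.
# Any dataset name matching "<store>.daily_metrics" resolves to this entry.
"{store}.daily_metrics":
    filepath: my_path
    type: spark.SparkDataset
    file_format: parquet
    save_args:
        mode: append
        partitionBy: ["date", "country", "store"]
```

Each node's outputs would then be, e.g., "store_001.daily_metrics", so outputs stay unique while every write still appends to the same parquet path; whether sharing one filepath across factory-resolved datasets behaves well with non-sequential runners is worth verifying.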
