fix(airflow): Fix nodes grouping #664

ElenaKhaustova · 2024-05-03T13:27:41Z

Description

Fix #655

Development notes

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

Signed-off-by: Ankita Katiyar <[email protected]>

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova · 2024-05-03T19:20:46Z

kedro-airflow/kedro_airflow/plugin.py

            for node, parent_nodes in pipeline.node_dependencies.items():
                for parent in parent_nodes:
                    dependencies[parent.name].append(node.name)

-        # Sort both parent and child nodes to make sure it's deterministic


This sorting was replaced with the original topological order obtained from kedro; it is deterministic. Otherwise nodes and dependencies have a different order which is confusing.
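A minimal sketch of the idea, using plain strings instead of kedro nodes. `topo_order` and `node_parents` are hypothetical stand-ins for `pipeline.nodes` and `pipeline.node_dependencies`; this is an illustration under those assumptions, not the plugin's code.

```python
# Hypothetical stand-ins: node names already in topological order,
# and a mapping from each node to its parents.
topo_order = ["split", "train", "evaluate", "report"]
node_parents = {
    "split": [],
    "train": ["split"],
    "evaluate": ["split", "train"],
    "report": ["evaluate"],
}

# Seed the keys in topological order so the dict's key order matches the node order.
dependencies = {name: [] for name in topo_order}

# Append children while walking the nodes in the same topological order.
for name in topo_order:
    for parent in node_parents[name]:
        dependencies[parent].append(name)

# Alphabetical sorting is also deterministic, but it no longer follows execution order.
sorted_view = dict(sorted(dependencies.items()))
```

Both variants give a stable result; the difference is that the first keeps nodes and their dependencies in the same order.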

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova · 2024-05-03T20:56:46Z

@ankatiyar and I have also discussed whether we need to add grouping logic to the e2e test. IMO, grouping logic is currently covered well by unit tests, and we can go without it. The unit tests already run the kedro airflow create --group-in-memory command, which is close to an e2e test. But happy to hear others' thoughts.

ankatiyar

Manually tested, works well! 💯
ETA: About the e2e test, I think it's still worth adding one, because in the unit tests we are mocking the catalog etc. The bug would have been evident earlier if there had been an e2e test in place. I think we can do it as a follow-up task though; I want to explore other grouping strategies as part of the airflow milestone too.

DimedS

Great work, @ElenaKhaustova ! It looks good to me. I just have a few small questions.

DimedS · 2024-05-09T11:15:45Z

kedro-airflow/kedro_airflow/plugin.py

+            dependencies = {}
+            for node in pipeline.nodes:
+                nodes[node.name] = [node]
+                dependencies[node.name] = []
            for node, parent_nodes in pipeline.node_dependencies.items():


Why do you believe that this line will be deterministic? I mean, it affects the order of .append(node.name).

We are iterating nodes in the order they were resolved with topological sort, which is deterministic, and nodes and dependencies dictionaries are ordered.

dependencies[parent.name].append(node.name)
I understand that the parents will be deterministic because you initialised an empty dictionary previously. However, could you explain in more detail why the dictionary values containing the children will also be deterministic?

They're deterministic because all the traversals inside group_memory_nodes function are deterministic as well and based on pipeline.nodes order which is sorted already.

Can we add some comments about the intent of keeping it deterministic? It is not so obvious from the code. Ideally there should be tests for it, but AFAIK it's a bit tricky to reproduce the randomness. This is only manually tested, I assume.

They're deterministic because all the traversals inside group_memory_nodes function are deterministic as well and based on pipeline.nodes order which is sorted already.

Does this mean that the code was previously not deterministic (and sorting was used) because of the defaultdict?

Yes, exactly
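One way to read this exchange, sketched with plain strings: a `defaultdict` creates keys lazily, in whatever order edges happen to be visited, while a dictionary seeded from the topological order fixes the key order up front. The names `topo_order` and `edges` are hypothetical and the example is illustrative only.

```python
from collections import defaultdict

topo_order = ["a", "b", "c"]
# (parent, child) pairs, deliberately visited out of topological order.
edges = [("b", "c"), ("a", "b")]

# Lazy keys: a key is created the first time a parent is seen.
lazy = defaultdict(list)
for parent, child in edges:
    lazy[parent].append(child)

# Seeded keys: every node gets an entry first, in topological order.
seeded = {name: [] for name in topo_order}
for parent, child in edges:
    seeded[parent].append(child)

lazy_keys = list(lazy)      # follows the order in which edges were visited
seeded_keys = list(seeded)  # follows topo_order and includes the leaf node
```

With the seeded dictionary, the key order no longer depends on how the edges were iterated, so no sorting is needed afterwards.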

We tested manually, yes.

Theoretically, we can randomise n test cases and run each test k times to ensure we get the same results within runs. But given that there's no randomness, the results will be the same, so I don't think we need it.

The randomness cannot be simulated with n use cases; that is another story, but I agree we don't need it here.

Randomness - no, but you can make a stress test for determinism, which will not guarantee it with 100% chance, but for large n and k, it will be close to it. Anyway, creating an adjacency list is deterministic, and dfs is deterministic; except for them, we have other loops which rely on Pipeline API, so there's no place from where randomness can come. So, I agree on not doing anything with it. 🙂
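The stress test described here could look roughly like the sketch below. `random_dag`, `is_repeatable` and `edge_list` are hypothetical helpers written for this illustration, and `edge_list` merely stands in for the function under test. Note that string hashing is randomised per interpreter process unless `PYTHONHASHSEED` is set, so repeating calls inside one process would not expose ordering that depends on set iteration; that would require separate processes.

```python
import random


def random_dag(n_nodes, rng):
    """Build a hypothetical DAG as {node: [children]}; edges only go to later nodes."""
    names = [f"n{i}" for i in range(n_nodes)]
    return {
        name: [names[j] for j in range(i + 1, n_nodes) if rng.random() < 0.3]
        for i, name in enumerate(names)
    }


def is_repeatable(func, cases, k):
    """Run func k times on every case and report whether all runs agree."""
    for case in cases:
        first = func(case)
        if any(func(case) != first for _ in range(k - 1)):
            return False
    return True


def edge_list(dag):
    """Stand-in for the function under test: flatten the DAG into (parent, child) pairs."""
    return [(parent, child) for parent, children in dag.items() for child in children]


rng = random.Random(0)  # fixed seed so the generated cases are reproducible
cases = [random_dag(rng.randint(2, 8), rng) for _ in range(20)]
stable = is_repeatable(edge_list, cases, k=5)
```

As the comment says, a passing run only raises confidence; it does not prove determinism.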

DimedS · 2024-05-09T11:18:09Z

kedro-airflow/kedro_airflow/grouping.py

+                if node_input in memory_datasets:
+                    adj_matrix[node.name].add(output_to_node[node_input].name)
+                    adj_matrix[output_to_node[node_input].name].add(node.name)
+                parents[output_to_node[node_input].name].add(node.name)


To me, it seems like that's a children dictionary, because the values are children, as I understand it.

Yeah, a key is a parent's name, while value is the children's list. Will rename it to parent_to_children for clarity. Thank you!

Signed-off-by: Elena Khaustova <[email protected]>

noklam · 2024-05-09T13:19:26Z

kedro-airflow/kedro_airflow/grouping.py

+        for node_input in node.inputs:
+            if node_input in output_to_node:
+                if node_input in memory_datasets:
+                    adj_matrix[node.name].add(output_to_node[node_input].name)
+                    adj_matrix[output_to_node[node_input].name].add(node.name)
+                parent_to_children[output_to_node[node_input].name].add(node.name)


Suggested change

-        for node_input in node.inputs:
-            if node_input in output_to_node:
-                if node_input in memory_datasets:
-                    adj_matrix[node.name].add(output_to_node[node_input].name)
-                    adj_matrix[output_to_node[node_input].name].add(node.name)
-                parent_to_children[output_to_node[node_input].name].add(node.name)
+        for node_input in node.inputs:
+            if node_input in output_to_node:
+                parent_to_children[output_to_node[node_input].name].add(node.name)
+                if node_input in memory_datasets:
+                    adj_matrix[node.name].add(output_to_node[node_input].name)
+                    adj_matrix[output_to_node[node_input].name].add(node.name)

If rearranging it does not change the logic, I prefer to keep the related code closer.
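A self-contained sketch of the loop under discussion, with plain dictionaries in place of kedro nodes. The `nodes` list and the contents of `memory_datasets` are invented for illustration, and the undirected structure is called `adj_list` here; this is not the plugin's code.

```python
# Hypothetical stand-ins for kedro nodes: a name plus input and output dataset names.
nodes = [
    {"name": "load", "inputs": ["raw"], "outputs": ["clean"]},
    {"name": "features", "inputs": ["clean"], "outputs": ["feats"]},
    {"name": "train", "inputs": ["feats"], "outputs": ["model"]},
]
memory_datasets = {"feats"}  # assumption: only "feats" is kept in memory

output_to_node = {out: n["name"] for n in nodes for out in n["outputs"]}
adj_list = {n["name"]: set() for n in nodes}
parent_to_children = {n["name"]: set() for n in nodes}

for n in nodes:
    for node_input in n["inputs"]:
        if node_input in output_to_node:
            producer = output_to_node[node_input]
            # Every producer/consumer pair is a dependency edge.
            parent_to_children[producer].add(n["name"])
            # Only in-memory datasets tie two nodes into the same group.
            if node_input in memory_datasets:
                adj_list[n["name"]].add(producer)
                adj_list[producer].add(n["name"])
```

The two branches are independent, so moving the `parent_to_children` update above the memory check leaves the result unchanged.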

noklam · 2024-05-09T13:20:43Z

kedro-airflow/kedro_airflow/grouping.py

+    """
+    memory_datasets = get_memory_datasets(catalog, pipeline)
+
+    adj_matrix: dict[str, set] = {node.name: set() for node in pipeline.nodes}


Is this really an adjacency matrix? I was thinking https://en.wikipedia.org/wiki/Adjacency_matrix but it seems like it's just another dictionary

It's an adjacency list: https://en.wikipedia.org/wiki/Adjacency_list. Will rename it
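For comparison, the same small undirected graph in both representations. The node names and edges are made up for the example.

```python
names = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]  # undirected edges

# Adjacency list: each node maps to the set of its neighbours.
adj_list = {name: set() for name in names}
for u, v in edges:
    adj_list[u].add(v)
    adj_list[v].add(u)

# Adjacency matrix: a square table of 0/1 values indexed by node position.
index = {name: i for i, name in enumerate(names)}
adj_matrix = [[0] * len(names) for _ in names]
for u, v in edges:
    adj_matrix[index[u]][index[v]] = 1
    adj_matrix[index[v]][index[u]] = 1
```

A `dict[str, set]` keyed by node name, as in the diff above, matches the first form.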

noklam

Left a couple of small comments. Is this tested manually?

The only concern for me is implementing a new dfs here. I recall I had to implement a bfs for the SoftFailRunner; in the end I ended up using some Pipeline API, which is actually doing the search already. I just want to make sure this has been considered already, and whether we see a need to extend the Pipeline API side instead.

example of bfs:
https://github.com/noklam/kedro-softfail-runner/blob/970036ea8d5c969d02ed9150bfd4a2dc4baf967a/kedro_softfail_runner/core.py#L92

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova · 2024-05-09T14:40:36Z

Left a couple of small comments. Is this tested manually?

The only concern for me is implementing a new dfs here. I recall I had to implement a bfs for the SoftFailRunner; in the end I ended up using some Pipeline API, which is actually doing the search already. I just want to make sure this has been considered already, and whether we see a need to extend the Pipeline API side instead.

example of bfs: https://github.com/noklam/kedro-softfail-runner/blob/970036ea8d5c969d02ed9150bfd4a2dc4baf967a/kedro_softfail_runner/core.py#L92

Thank you for the comments, I've addressed them.

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.
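To illustrate the point that dfs and bfs find the same connected components, here is a sketch that uses an explicit frontier instead of recursion, so the recursion limit (1000 by default in CPython, see `sys.getrecursionlimit()`) does not come into play. The `components` function and the sample graph are hypothetical and not taken from the PR.

```python
from collections import deque


def components(adj, depth_first=True):
    """Return connected components of an undirected graph given as {node: set_of_neighbours}."""
    seen, result = set(), []
    for start in adj:  # dict order decides the order of the components
        if start in seen:
            continue
        seen.add(start)
        frontier, component = deque([start]), set()
        while frontier:
            # Taking from the right end gives dfs, from the left end gives bfs.
            node = frontier.pop() if depth_first else frontier.popleft()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append(neighbour)
        result.append(component)
    return result


adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
dfs_groups = components(adj, depth_first=True)
bfs_groups = components(adj, depth_first=False)
```

Only the visiting order inside a component differs between the two traversals; the grouping itself is the same.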

Signed-off-by: Elena Khaustova <[email protected]>

noklam · 2024-05-10T11:53:27Z

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.

@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.

kedro-org/kedro#3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.

ElenaKhaustova · 2024-05-10T13:33:42Z

Not sure I fully understand the concern about using dfs vs bfs. We cannot rely just on Pipeline API since we need to adjust the graph by adding new edges to group connected MemoryDatasets into one node. Thus we need an updated adjacency matrix and make one traversal to find connected components. There's no difference whether it's a dfs or bfs. The only issue which might happen with dfs is if the number of nodes is more than the default size of the recursion stack which is around 1000.

@ElenaKhaustova I haven't dived deep into this enough to understand it fully. I will approve this now as I don't want to block the PR as it works perfectly fine.

kedro-org/kedro#3758 It's another topic but a way to adjust the graph. The API is obviously bad but just want to show that there may be potential to open this up a little bit.

To be honest, I don't think the API is bad; it's just out of the scope of the Pipeline API.

* Update memory dataset checking
  Signed-off-by: Ankita Katiyar <[email protected]>
* Built adjacency matrix
  Signed-off-by: Elena Khaustova <[email protected]>
* Implemented connectivity components search
  Signed-off-by: Elena Khaustova <[email protected]>
* Replaced sort with topological order
  Signed-off-by: Elena Khaustova <[email protected]>
* Removed debug output
  Signed-off-by: Elena Khaustova <[email protected]>
* Fixed pre-commit errors
  Signed-off-by: Elena Khaustova <[email protected]>
* Updated unit tests for node grouping
  Signed-off-by: Elena Khaustova <[email protected]>
* Refactored grouping function
  Signed-off-by: Elena Khaustova <[email protected]>
* Added clarification comments
  Signed-off-by: Elena Khaustova <[email protected]>
* Updated unit test
  Signed-off-by: Elena Khaustova <[email protected]>
* Added missed return types
  Signed-off-by: Elena Khaustova <[email protected]>
* Linter errors fix
  Signed-off-by: Elena Khaustova <[email protected]>
* Fixed mypy errors
  Signed-off-by: Elena Khaustova <[email protected]>
* Fixing docs build
  Signed-off-by: Elena Khaustova <[email protected]>
* Fixing docs build
  Signed-off-by: Elena Khaustova <[email protected]>
* Renamed parent dictionary
  Signed-off-by: Elena Khaustova <[email protected]>
* Added comments to clarify the resulting order nodes
  Signed-off-by: Elena Khaustova <[email protected]>
* Renamed matrix -> list
  Signed-off-by: Elena Khaustova <[email protected]>
* Applied suggested change
  Signed-off-by: Elena Khaustova <[email protected]>
* Added missed renamings
  Signed-off-by: Elena Khaustova <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Signed-off-by: tgoelles <[email protected]>

ankatiyar and others added 8 commits April 30, 2024 17:34

Update memory dataset checking

eb75c6a

Signed-off-by: Ankita Katiyar <[email protected]>

Built adjacency matrix

49e5246

Signed-off-by: Elena Khaustova <[email protected]>

Implemented connectivity components search

fa4590d

Signed-off-by: Elena Khaustova <[email protected]>

Replaced sort with topological order

21d0f32

Signed-off-by: Elena Khaustova <[email protected]>

Removed debug output

bfff338

Signed-off-by: Elena Khaustova <[email protected]>

Fixed pre-commit errors

c6a5391

Signed-off-by: Elena Khaustova <[email protected]>

Updated unit tests for node grouping

15ea3f4

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into airflow-bug-nodes-grouping

aa1f28e

ElenaKhaustova self-assigned this May 3, 2024

ElenaKhaustova added 4 commits May 3, 2024 19:58

Refactored grouping function

581d689

Signed-off-by: Elena Khaustova <[email protected]>

Added clarification comments

08178e2

Signed-off-by: Elena Khaustova <[email protected]>

Updated unit test

03cbe60

Signed-off-by: Elena Khaustova <[email protected]>

Added missed return types

80c69e3

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova added 4 commits May 3, 2024 20:24

Linter errors fix

fd73a9a

Signed-off-by: Elena Khaustova <[email protected]>

Fixed mypy errors

6f54529

Signed-off-by: Elena Khaustova <[email protected]>

Fixing docs build

e342e96

Signed-off-by: Elena Khaustova <[email protected]>

Fixing docs build

a04fecf

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova marked this pull request as ready for review May 3, 2024 20:49

ElenaKhaustova requested review from sbrugman, ankatiyar and noklam May 3, 2024 20:56

ankatiyar approved these changes May 7, 2024

ElenaKhaustova requested a review from DimedS May 8, 2024 10:43

Merge branch 'main' into airflow-bug-nodes-grouping

db458c7

DimedS approved these changes May 9, 2024

ElenaKhaustova added 2 commits May 9, 2024 14:02

Merge branch 'main' into airflow-bug-nodes-grouping

c5026f1

Renamed parent dictionary

0b0f9f3

Signed-off-by: Elena Khaustova <[email protected]>

noklam reviewed May 9, 2024

ElenaKhaustova added 4 commits May 9, 2024 14:32

Added comments to clarify the resulting order nodes

256c96f

Signed-off-by: Elena Khaustova <[email protected]>

Renamed matrix -> list

a8fb957

Signed-off-by: Elena Khaustova <[email protected]>

Applied suggested change

3a7fe2e

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into airflow-bug-nodes-grouping

daf25a8

ElenaKhaustova requested a review from noklam May 9, 2024 19:38

Added missed renamings

9fea4b0

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova merged commit 624ace1 into main May 10, 2024
20 checks passed

ElenaKhaustova deleted the airflow-bug-nodes-grouping branch May 10, 2024 20:34
