Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: when key not present assume NodeRunning. Fixes 11843 #11847

Merged
merged 1 commit into from
Sep 21, 2023

Conversation

isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Sep 20, 2023

Fixes #11843

When a node is not present in the nodes map, this implies that the DAG is still running.
We errored out which is what causes #11843, this fixes that by assuming NodeRunning.

Verification

Tested on the workflow posted in #11843

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent logic as before, as you wrote in your analysis in #11843 (comment), so I'm fine approving.

But I am wondering if there is a deeper bug (that silently existed before) that we should try to fix here? Since the current logic means it got a nodeId that doesn't exist in the nodes map??
boundaryNode.children should all be valid nodes, so my only guess is that getRetryNodeChildrenIds might be getting an incorrect id, although the logic of that function seemed fine at a glance.

@isubasinghe
Copy link
Member Author

@agilgur5 I think its somewhat impossible to say if the default initialisation was intended or not.
But here is the argument for why this change makes sense:

The nodeId doesn't exist yet because the controller didn't create it yet, this implies code is yet to run.
Therefore the NodeRunning state is implied.

It makes sense to me given the overall design of the controller.

I do not like the fact that the controller even attempts to run this code, because it should know that these nodes have not been created. Making the controller more "context aware" would be nice, it wouldn't need to try paths that it knows it doesn't need to run yet, would probably reduce bugs as well.

@isubasinghe
Copy link
Member Author

To make a more general point, this is why default initialised values are bad, they do not scale (with respect to #users).

When a programmer uses default initialised values, they make an implicit assumption that this behaviour is fine/valid. The
problem here is that default initialised values are not always valid. How do you actually know the programmers intention? One way is via explicit checks, the other is via comments.

So yeah I think we should only allow for code relying on default initialised values, if it has been commented at least.
Explicit is always better though.

@agilgur5
Copy link
Member

To make a more general point, this is why default initialised values are bad
Explicit is always better though.

Oh for sure, agreed on that. We're fighting the language design a bit clearly. I do wonder if there's a linter check for this scenario as well though.
(which does feel ironic to me as someone who writes a lot more TS and Python, which are not as respected in the community)

@agilgur5
Copy link
Member

The nodeId doesn't exist yet because the controller didn't create it yet, this implies code is yet to run.
Therefore the NodeRunning state is implied.

Wait so does the id exist but the struct around it wasn't created? I actually didn't expect that scenario initially but I suppose that could happen. 😵‍💫

I do not like the fact that the controller even attempts to run this code, because it should know that these nodes have not been created.

So as far as my current understanding of the controller goes, it seems that speculation like this is primarily limited to DAGs as in the assessDAGPhase function. There are still good chunks of the controller I'm not super familiar with yet though.

But yea agreed that making things more deterministic is usually better. Some of the templating changes I'd like to make are conceptually similar (e.g. erroring out earlier at validation with type-checking etc instead of run-time issues)

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving since, as I mentioned above, this fixes the regression and is equivalent behavior to the previous version.

But would still like to continue the conversation here, both for my own edification, and to help with finding root cause

@terrytangyuan terrytangyuan merged commit 0fde680 into argoproj:master Sep 21, 2023
26 checks passed
@isubasinghe
Copy link
Member Author

To make a more general point, this is why default initialised values are bad
Explicit is always better though.

Oh for sure, agreed on that. We're fighting the language design a bit clearly. I do wonder if there's a linter check for this scenario as well though. (which does feel ironic to me as someone who writes a lot more TS and Python, which are not as respected in the community)

I definitely prefer mypy and TS over Go, Go has weird footguns like this that it invented for no real reason.

Wait so does the id exist but the struct around it wasn't created? I actually didn't expect that scenario initially but I suppose that could happen. 😵‍💫

Yup the node id exists because it is something that is generated at runtime (which is also not ideal), but its possible the struct wasn't created yet.

So as far as my current understanding of the controller goes, it seems that speculation like this is primarily limited to DAGs as in the assessDAGPhase function.

I maybe wrong, but my assessment was that this kind of speculation is actually relatively common. Feel like a lot of issues I've dealt with were at least indirectly because of this. Might just be my bias.

@agilgur5
Copy link
Member

but its possible the struct wasn't created yet.

well that would explain why I couldn't seem to find any deletions in #11451 (comment) -- it's not that they were deleted, it's that they didn't exist yet 😵‍💫

I maybe wrong, but my assessment was that this kind of speculation is actually relatively common.

You probably know this part of the codebase better than me, to be fair. But as a piece of anecdotal evidence, I never encountered these issues while using Workflows and most of our usage was with steps. If this is DAG-specific, that would add up. steps don't need a graph traversal either.

terrytangyuan pushed a commit that referenced this pull request Oct 19, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this pull request May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/templates/dag
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Get the error "key was not found for ..." after upgrade from v3.4.7 to v3.4.11
3 participants