Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Skip execution control for agent pod during pod reconciliation. Fixes #12726 #12732

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jswxstw
Copy link
Member

@jswxstw jswxstw commented Mar 4, 2024

Fixes #12726

Motivation

Obtain nonexistent node with agent pod in applyExecutionControl.

Modifications

Skip execution control for agent pod during pod reconciliation, it will be done here now:

// Execution control has been applied to the nodes with created pods after pod reconciliation.
// However, pending and suspended nodes do not have created pods, and taskset nodes use the agent pod.
// Apply execution control to these nodes now since pod reconciliation does not take effect on them.
woc.failNodesWithoutCreatedPodsAfterDeadlineOrShutdown()

Verification

local test

workflow/controller/exec_control.go Outdated Show resolved Hide resolved
Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Looks like the test failures are the common ones which have been happening recently.

@jswxstw
Copy link
Member Author

jswxstw commented Mar 4, 2024

LGTM.

Looks like the test failures are the common ones which have been happening recently.

Related to: #12730 (comment)

@agilgur5 agilgur5 added area/controller Controller issues, panics area/executor area/agent Argo Agent that runs for HTTP and Plugin templates labels May 4, 2024
@agilgur5

This comment was marked as resolved.

@agilgur5
Copy link
Member

agilgur5 commented May 4, 2024

/retest

Hmm that might have failed because the job run is 2 months old. I didn't have a button to rerun the job either in the UI.

To trigger a re-run I updated the branch via the button on the GH UI, selected rebase instead of merge, annnd it broke DCO 😕

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this has a use-case for Agent pods -- the Workflow's deadline should still apply to them.

woc.wf.NodeId("agent") seems to work for them in agent.go's getAgentPodName() func

@jswxstw
Copy link
Member Author

jswxstw commented May 4, 2024

To trigger a re-run I updated the branch via the button on the GH UI, selected rebase instead of merge, annnd it broke DCO

I fixed it, but there are some new failed tests that appear frequently recently.

I'm wondering if this has a use-case for Agent pods -- the Workflow's deadline should still apply to them.

I've also thought about it. The workflow's deadline can not apply to agent pods since they do not support being terminated now.

This is somewhat related to another issue I raised, see #12692 (comment).

@agilgur5

This comment was marked as resolved.

@agilgur5
Copy link
Member

agilgur5 commented May 4, 2024

I've also thought about it. The workflow's deadline can not apply to agent pods since they do not support being terminated now.

This is somewhat related to another issue I raised, see #12692 (comment).

Ah I see, because it doesn't support the kill command etc (which is pretty hacky as-is). Could you leave a TODO comment above this line then, e.g. // TODO: run for Agent Pod as well once it supports termination by the Controller?

Also when you've thought through some of these situations, it's useful to write them down in the PR description, both for historical context and so that reviewers know without even asking. I use a "### Notes to Reviewers" section in my PRs where there are some changes that require an explanation

Also just wanted to say, I do appreciate all the investigation and work you've done for the Agent and Plugins!

@agilgur5
Copy link
Member

agilgur5 commented May 4, 2024

I fixed it, but there are some new failed tests that appear frequently recently.

Yea these seem to have been really recent, I only noticed them earlier today. I'm not sure which PR caused them though from a look through of the commit history on main 😕 Seem to have happened at least two days ago though per #13002's test run

@jswxstw
Copy link
Member Author

jswxstw commented May 4, 2024

Ah I see, because it doesn't support the kill command etc (which is pretty hacky as-is). Could you leave a TODO comment above this line then, e.g. // TODO: run for Agent Pod as well once it supports termination by the Controller?

The agent pod will be deleted once the workflow is completed now, so the pod termination may be unnecessary.

This PR #12441 has support plugin nodes shutdown, but there are a few unresolved issues.

func (woc *wfOperationCtx) failSuspendedAndPendingNodesAfterDeadlineOrShutdown() {
for _, node := range woc.wf.Status.Nodes {
// fail suspended nodes or plugin nodes when shuting down
if woc.GetShutdownStrategy().Enabled() && (node.IsActiveSuspendNode() || node.IsActivePluginNode()) {
message := fmt.Sprintf("Stopped with strategy '%s'", woc.GetShutdownStrategy())
woc.markNodePhase(node.Name, wfv1.NodeFailed, message)
continue
}
// fail all pending and suspended nodes when exceeding deadline
deadlineExceeded := woc.workflowDeadline != nil && time.Now().UTC().After(*woc.workflowDeadline)
if deadlineExceeded && (node.Phase == wfv1.NodePending || node.IsActiveSuspendNode()) {
message := "Step exceeded its deadline"
woc.markNodePhase(node.Name, wfv1.NodeFailed, message)
continue
}
}
}

  1. Active HTTP nodes should also be shut down here since they also use the agent pod.
  2. The workflow's deadline should also apply to all active taskset nodes(HTTP/Plugin) here.

I will submit a new PR to fix it. EDIT: See #13016

@agilgur5

This comment was marked as resolved.

@agilgur5
Copy link
Member

agilgur5 commented May 4, 2024

The agent pod will be deleted once the workflow is completed now, so the pod termination may be unnecessary.

For now I suppose. It's a little bit different timing-wise, but close enough. A comment would still be helpful as it is relevant to #12692 / must change if that is implemented

  1. Active HTTP nodes should also be shut down here since they also use the agent pod.

Agreed

2. The workflow's ttl should also apply to all active taskset nodes(HTTP/Plugin) here.

How is TTL related there? Since it's only deadline related

@jswxstw
Copy link
Member Author

jswxstw commented May 5, 2024

How is TTL related there? Since it's only deadline related

Sorry, I made a mistake and I actually meant the workflow's deadline.
I actually want to say that it is reasonable to skip execution control in applyExecutionControl for the agent pod since it will or should be done in failSuspendedAndPendingNodesAfterDeadlineOrShutdown.
EDIT: See #13016

@agilgur5 agilgur5 self-assigned this May 8, 2024
@jswxstw
Copy link
Member Author

jswxstw commented Aug 26, 2024

@agilgur5 Can this PR be merged?

@agilgur5
Copy link
Member

agilgur5 commented Aug 26, 2024

Could you leave a TODO comment above this line then, e.g. // TODO: run for Agent Pod as well once it supports termination by the Controller?

A comment would still be helpful as it is relevant to #12692 / must change if that is implemented

This was the only thing I was waiting for. Has actually been on my to-do list to add this myself since I have that permission these days

@jswxstw jswxstw changed the title fix: Skip execution control for agent pod. Fixes #12726 fix: Skip execution control for agent pod during pod reconciliation. Fixes #12726 Aug 27, 2024
@jswxstw
Copy link
Member Author

jswxstw commented Aug 27, 2024

This was the only thing I was waiting for. Has actually been on my to-do list to add this myself since I have that permission these days

I updated the PR information, execution control will apply for nodes of the agent pod here now:

// Execution control has been applied to the nodes with created pods after pod reconciliation.
// However, pending and suspended nodes do not have created pods, and taskset nodes use the agent pod.
// Apply execution control to these nodes now since pod reconciliation does not take effect on them.
woc.failNodesWithoutCreatedPodsAfterDeadlineOrShutdown()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent Argo Agent that runs for HTTP and Plugin templates area/controller Controller issues, panics area/executor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Obtain nonexistent node with agent pod generating many error logs
3 participants