Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node Drain Logic: Allow users to force node drain #197

Merged
merged 1 commit into from
Jul 14, 2023

Conversation

jescarri
Copy link
Contributor

@jescarri jescarri commented Jul 7, 2023

[Allow users to force node drain]

If pods can't be drained due to a PDB the controller fails to drain the node and crashes.

F0706 23:17:17.114069 3022288 main.go:88] Error running agent: processing: getting pods for deletion: [cannot delete Pods declare no controller (use --force to override): test/test-iscsi]

This PR allows users to force drains.

How to use

Run the update-agent with -force-drain=true

Testing done

Deployed the patched update agent into a cluster that had workloads with PDBs that could not be disrupted and the update finished.

I0707 18:42:57.900984 4100321 agent.go:266] Marking node as unschedulable
I0707 18:42:57.963429 4100321 agent.go:277] Getting pod list for deletion
I0707 18:42:58.533578 4100321 agent.go:284] Deleting/Evicting 28 pods
I0707 18:42:58.536832 4100321 agent.go:602] evicting pod  XXXXX
I0707 18:42:58.536866 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.536891 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.536912 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537070 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537104 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537131 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537071 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537180 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537195 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537274 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537248 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.536835 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537160 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537182 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537198 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537107 4100321 agent.go:602] evicting pod foo/test-iscsi
I0707 18:42:58.536871 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.536844 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:58.537170 4100321 agent.go:602] evicting pod XXXXX
I0707 18:42:59.540454 4100321 request.go:690] Waited for 1.002213s due to client-side throttling, not priority and fairness, request: POST:[https://10.3.0.1:443/api/v1/namespaces/XXXXX/pods/XXXXX-1/eviction](https://10.3.0.1/api/v1/namespaces/XXXXX/pods/XXXXX-1/eviction)
E0707 18:43:01.760442 4100321 agent.go:602] error when evicting pods/"XXXXX" -n "XXXXX" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0707 18:43:06.761669 4100321 agent.go:602] evicting pod XXXXX/XXXXX
E0707 18:43:06.761797 4100321 agent.go:291] Ignoring node drain error and proceeding with reboot: [error when waiting for pod "XXXXX" terminating: global timeout reached: 1s, error when waiting for pod "XXXXX" 
I0707 18:43:06.762091 4100321 agent.go:294] Node drained, rebooting
  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, kernel modules, etc.

Closes #196

@jescarri
Copy link
Contributor Author

jescarri commented Jul 7, 2023

Related to: #196

@pothos
Copy link
Member

pothos commented Jul 10, 2023

Thank you!
Apart from using a config flag, could we make the default behavior better to not crash and either try draining again and force after a certain timeout? Then we could make the forcing opt-out with your flag and by default we would do it, ensuring that everything works and the nodes get updates.

cmd/update-agent/main.go Outdated Show resolved Hide resolved
@invidian
Copy link
Member

Apart from using a config flag, could we make the default behavior better to not crash and either try draining again and force after a certain timeout? Then we could make the forcing opt-out with your flag and by default we would do it, ensuring that everything works and the nodes get updates.

We could retry draining I think, but I wouldn't force it by default, as it will bypass PDB, which may cause production issues for people. And I don't think such breaking change would be justified here.

Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @jescarri! The PR is simple and looks good. I suggested some changes to CLI flag description and to add a test case for it which I would like to be added, otherwise it should be fine for merging.

pkg/agent/agent.go Show resolved Hide resolved
cmd/update-agent/main.go Outdated Show resolved Hide resolved
@jescarri
Copy link
Contributor Author

@invidian thanks for taking a look at this, all comments have been resolved.

cmd/update-agent/main.go Outdated Show resolved Hide resolved
If pods can't be drained due to a PDB the controller fails to drain the node and crashes.

F0706 23:17:17.114069 3022288 main.go:88] Error running agent: processing: getting pods for deletion: [cannot delete Pods declare no controller (use --force to override): test/test-iscsi]

This PR allows users to force drains.
@invidian invidian merged commit 7a1b0ff into flatcar:master Jul 14, 2023
@invidian
Copy link
Member

Thanks again for your contribution @jescarri! Is this something you would like to see released soon? We usually do FLUO releases on demand, hence the question :)

@jescarri
Copy link
Contributor Author

@invidian releasing a new version would be great.

Appreciate the flexibility on the ad-hoc release!.

@jescarri jescarri deleted the add_force_option branch July 14, 2023 19:11
@invidian
Copy link
Member

I'll do a round of updates then and work on a release, thanks again for feedback.

@invidian
Copy link
Member

v0.10.0-rc1 is out for the time being if you would like to test it. Official release will follow by the end of this week I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFE] Add flag to force Node Drain
3 participants