-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add termination condition to deleting nodes #492
Add termination condition to deleting nodes #492
Conversation
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like something you missed out while testing.
f7b5173
to
8c3b2b1
Compare
Thanks @guydaichs for the great PR, and sorry for the delay in review :) |
8c3b2b1
to
f65d33f
Compare
f65d33f
to
0a93d9a
Compare
0a93d9a
to
2a1a1f0
Compare
Thanks @guydaichs for the changes, I'd be testing this today quickly and would like to go ahead merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more minor comments.
I could test it quickly and works pretty well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guydaichs Thanks for the changes!
Hi Colleagues, Something we have probably forgotten to consider is that this logic is currently maintained in two code paths also here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/util/provider/machinecontroller/machine_util.go#L817-L951. Ideally the change has to be in both code paths until we have completely finished the OOT migration. CC: @amshuman-kr @guydaichs - Really nice work with the changes and tests. Really appreciate it. Also, I have updated the release notes to external users to update the RBACs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good, thanks for the PR @guydaichs .
As per the discussion with Prashanth, we realized, it's better to have the related OOT changes be done in a subsequent PR.
/lgtm
Changes missed out by PR #492
|
||
// If machine was created on the cloud provider | ||
machineID, _ := driver.GetExisting() | ||
|
||
// update node with the machine's state prior to termination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @guydaichs ,
Looks like this change in moving the machine phase to termination is affecting the drain logic making it trigger a force deletion (as force deletion is dependent on the last update time whicch is to be set on machine termination) many of times. Refer - #563. I will have to move the setting machine Terminating phase to before updating the node conditions.
The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletionTimestamp - Reverted logic from PR gardener#492 to first set machine Termination phase
The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletionTimestamp - Reverted logic from PR gardener#492 to first set machine Termination phase
The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletionTimestamp - Reverted logic from PR gardener#492 to first set machine Termination phase
…eletionTimestamp (#564) * Fixes issues with machines being force drained The machines were being force drained during race conditions due to PR #492. This commit fixes that by - Make the drain calculation based on deletion timestamp - Reverted logic from PR #492 to first set machine Termination phase * Update pkg/util/time/time_test.go Fixed typo Co-authored-by: Hardik Dodiya <[email protected]>
…eletionTimestamp (gardener#564) * Fixes issues with machines being force drained The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletion timestamp - Reverted logic from PR gardener#492 to first set machine Termination phase
…eletionTimestamp (gardener#564) * Fixes issues with machines being force drained The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletion timestamp - Reverted logic from PR gardener#492 to first set machine Termination phase * Update pkg/util/time/time_test.go Fixed typo Co-authored-by: Hardik Dodiya <[email protected]>
…eletionTimestamp (gardener#564) * Fixes issues with machines being force drained The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletion timestamp - Reverted logic from PR gardener#492 to first set machine Termination phase * Update pkg/util/time/time_test.go Fixed typo Co-authored-by: Hardik Dodiya <[email protected]>
* Prepare Next Dev Cycle v0.35.0-dev * Wait also for CSI PVs during serial eviction Signed-off-by: ialidzhikov <[email protected]> * add servergroup option * [azure] Verify tags Signed-off-by: Andrey Klimentyev <[email protected]> * Allow deletion of machines where modify instance fails on deletion * [OOT] Re-enabled support for bootstrap token injection Signed-off-by: Andrey Klimentyev <[email protected]> * Azure MachineSet: VM ScaleSet or AvailabilitySet Add generic MachineSet configuration in AzureMachineClass. The machineSet field can be used to configure either a VMO or an AvailabilitySet for a MachineClass. The availabilitySets field will be kept for a while but is now deprecated in favour of the machineset field. The Azure compute sdk is also updated to support VMSS/VMO. * Introduce constant backoff while enqueuing machine objects on creation and deletion failures Co-authored-by: Prashanth <[email protected]> * Bugfix: Finalizers are added by default on MC * Added event handlers for MachCls Fixed minor typo * Removed multiple machineClass adds when pointing to same class.kind * Fix typo in documentation The documentation was pointing to the wrong branch. This change fixes it. * Improve documentation with FAQ Co-authored-by: Prashanth <[email protected]> Co-authored-by: Samarth Deyagond <[email protected]> * Enable NetworkUnavailable as node condition for machine health (gardener#543) Add NetworkUnavailable as node condition to check while the machine is unhealthy * Removing excessive spaces in FAQ ToC links in the "Basics" section were broken due to this. * OOT: Fixes drain failure timeout issues (gardener#548) * OpenStack driver support to create VMs in a given subnet. * Adapt integration tests to handle possibly orphaned resources * Fix the OpenAPI validation of the machine-deployment crd * Reduced default drain timeout to 2hours * Bugfix: maxEvictRetries derived from drainTimeout * OOT: Enqueue machine only when conditions change * Split retry into Short Medium & Long periods * Delete the orphan resources * Backoff on machine creation/deletion faliures. - Introdcued CrashLoopBackOff on machine creation failure * Use listers to get the machine-object during reconciliation * Allow migration to continue when machineClass with same name as ProviderMachineClass is found (gardener#559) * Release v0.35.0 * Prepare Next Dev Cycle v0.36.0-dev * Fix Azure machine deletion if resource group is gone * Add FAQ for paused field * Set Machine Phase to Terminating & make drain calculations based on deletionTimestamp (gardener#564) * Fixes issues with machines being force drained The machines were being force drained during race conditions due to PR gardener#492. This commit fixes that by - Make the drain calculation based on deletion timestamp - Reverted logic from PR gardener#492 to first set machine Termination phase * Update pkg/util/time/time_test.go Fixed typo Co-authored-by: Hardik Dodiya <[email protected]> * Update docker images to use latest gcr copy - Updated from golang:1.13.5 to eu.gcr.io/gardener-project/3rd/golang:1.15.5 - Updated from alpine:3.11.2 to eu.gcr.io/gardener-project/3rd/alpine:3.12.1 * Update docs/FAQ.md Co-authored-by: Dieter Guendisch <[email protected]> * Prevent panic when machine template hash length is < 5 Signed-off-by: ialidzhikov <[email protected]> * In-tree drivers work with credentials data map They are no longer working with the whole secret because they don't need it. All of them are evaluating the data map only. * Introduce `.spec.credentialsSecretRef` for in-tree machine classes * Introduce `.spec.credentialsSecretRef` for out-of-tree machine class * Support Gardener credentials data keys for Alicloud, AWS, Azure, GCP * bump aws sdk version to v1.23.13 * Migration note for OOT providers (gardener#581) * Updated Readme.md * Release v0.36.0 * Prepare Next Dev Cycle v0.37.0-dev * Update pull_request_template.md (gardener#582) Signed-off-by: ialidzhikov <[email protected]> * Correcting the variable substitution * docforge manifest (gardener#587) * Check for misconfigured PDBs on Node drain and set proper error description Signed-off-by: ialidzhikov <[email protected]> * Fix for issue gardener#583 with UT * Release v0.37.0 * Prepare next Dev Cycle v0.38.0-dev * add component-descriptor-trait to head-update job * Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.12.0 * Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.10.0 * Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.11.0 * Run update before fetching on CI image (gardener#596) * Improved NIC creation and deletion logic - NIC creation now, adopts any existing NICs with matching name - NIC deletion confirms deletion by performing GET on deletion - Improved creation logs to give more details - Revendored additional libraries * Release v0.38.0 * Prepare next Dev Cycle v0.39.0-dev * Updated links to various MCM OOT providers * Fixed minor typo * Updated link for metal stack * skip drain readOnlyFileSystem logic and test case added (gardener#605) Co-authored-by: Samarth Deyagond <[email protected]> * Improved log details to include node name and provider ID (gardener#607) Co-authored-by: Prashanth <[email protected]> * Fixed formatting directive error * Fix panic when secretRef cannot be found (gardener#609) It would assign nil to secretRef, and this would panic when trying to log out the original namespace/name. * Fixes issue adding on finalizers on existing mach (gardener#611) * Release v0.39.0 * Feature openstack volume type & dualstack customs Fix podNetworkCidrs * add openstack volumeType for machine Co-authored-by: Maximilian Geberl <[email protected]> updated VERSION after rebase updated driver_openstack.go changed order of applied services added allowed address pairs Change make to einfachnuralex Split podnetwork and run update port n times changed version and repository added more v6 stuff * VERSION v0.39.0-ske-1 Co-authored-by: gardener-robot-ci-3 <[email protected]> Co-authored-by: ialidzhikov <[email protected]> Co-authored-by: Hardik Dodiya <[email protected]> Co-authored-by: Konstantinos Angelopoulos <[email protected]> Co-authored-by: Andrey Klimentyev <[email protected]> Co-authored-by: prashanth26 <[email protected]> Co-authored-by: dkistner <[email protected]> Co-authored-by: Samarth Deyagond <[email protected]> Co-authored-by: Anthony Comtois <[email protected]> Co-authored-by: Dmitry Shurupov <[email protected]> Co-authored-by: Thomas Buchner <[email protected]> Co-authored-by: Kistner, Dominic <[email protected]> Co-authored-by: Dieter Guendisch <[email protected]> Co-authored-by: Rafael Franzke <[email protected]> Co-authored-by: 郑佳金 <[email protected]> Co-authored-by: gardener-robot-ci-2 <[email protected]> Co-authored-by: Samarth Deyagond <[email protected]> Co-authored-by: Georgi Pavlov <[email protected]> Co-authored-by: Andreas Burger <[email protected]> Co-authored-by: gardener-robot-ci-1 <[email protected]> Co-authored-by: Himanshu Sharma <[email protected]> Co-authored-by: James Ravn <[email protected]> Co-authored-by: Manuel Ganter <[email protected]>
What this PR does / why we need it:
Helps users monitor and react to ongoing node terminations.
Which issue(s) this PR fixes:
Fixes #488
Special notes for your reviewer:
Assumes that termination is final in MCM, and a terminating node cannot be restored. So, only situations where the termination condition is true are handled.
Note that some RBAC changes are required to allow MCM updates of the node's status.
Release note: