Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add termination condition to deleting nodes #492

Merged
merged 1 commit into from
Aug 19, 2020

Conversation

guydc
Copy link
Contributor

@guydc guydc commented Jul 21, 2020

What this PR does / why we need it:
Helps users monitor and react to ongoing node terminations.

Which issue(s) this PR fixes:
Fixes #488

Special notes for your reviewer:
Assumes that termination is final in MCM, and a terminating node cannot be restored. So, only situations where the termination condition is true are handled.

Note that some RBAC changes are required to allow MCM updates of the node's status.

Release note:

RBAC policies have to be updated to allow updating of `node/status` resources. 
Node condition is added to the status of terminating nodes indicating the termination start time and reason (Unhealthy|ScaleDown)

@guydc guydc requested review from ggaurav10 and a team as code owners July 21, 2020 19:06
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 21, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 21, 2020
@hardikdr hardikdr added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 24, 2020
@hardikdr
Copy link
Member

/assign

@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 29, 2020
Copy link
Contributor

@prashanth26 prashanth26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like something you missed out while testing.

pkg/controller/machine_util_test.go Outdated Show resolved Hide resolved
@guydc guydc force-pushed the feature/annotate-on-termination branch from f7b5173 to 8c3b2b1 Compare July 30, 2020 05:45
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 30, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 30, 2020
pkg/controller/machine.go Outdated Show resolved Hide resolved
@hardikdr
Copy link
Member

Thanks @guydaichs for the great PR, and sorry for the delay in review :)

@guydc guydc force-pushed the feature/annotate-on-termination branch from 8c3b2b1 to f65d33f Compare August 2, 2020 16:36
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@guydc guydc force-pushed the feature/annotate-on-termination branch from f65d33f to 0a93d9a Compare August 2, 2020 17:26
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@guydc guydc force-pushed the feature/annotate-on-termination branch from 0a93d9a to 2a1a1f0 Compare August 2, 2020 17:46
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 2, 2020
@hardikdr
Copy link
Member

hardikdr commented Aug 3, 2020

Thanks @guydaichs for the changes, I'd be testing this today quickly and would like to go ahead merging.
@prashanth26 @ggaurav10 Can you please take a final look if possible.

Copy link
Member

@hardikdr hardikdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more minor comments.

pkg/controller/machine_util.go Outdated Show resolved Hide resolved
pkg/controller/machine.go Outdated Show resolved Hide resolved
@hardikdr
Copy link
Member

hardikdr commented Aug 3, 2020

I could test it quickly and works pretty well.

@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 15, 2020
Copy link

@amshuman-kr amshuman-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guydaichs Thanks for the changes!

@prashanth26
Copy link
Contributor

prashanth26 commented Aug 16, 2020

Hi Colleagues,

Something we have probably forgotten to consider is that this logic is currently maintained in two code paths also here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/util/provider/machinecontroller/machine_util.go#L817-L951. Ideally the change has to be in both code paths until we have completely finished the OOT migration.

CC: @amshuman-kr

@guydaichs - Really nice work with the changes and tests. Really appreciate it. Also, I have updated the release notes to external users to update the RBACs.

Copy link
Member

@hardikdr hardikdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good, thanks for the PR @guydaichs .

As per the discussion with Prashanth, we realized, it's better to have the related OOT changes be done in a subsequent PR.
/lgtm

@gardener-robot gardener-robot added the reviewed/lgtm Has approval for merging label Aug 19, 2020

// If machine was created on the cloud provider
machineID, _ := driver.GetExisting()

// update node with the machine's state prior to termination
Copy link
Contributor

@prashanth26 prashanth26 Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @guydaichs ,

Looks like this change in moving the machine phase to termination is affecting the drain logic making it trigger a force deletion (as force deletion is dependent on the last update time whicch is to be set on machine termination) many of times. Refer - #563. I will have to move the setting machine Terminating phase to before updating the node conditions.

prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 23, 2020
The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletionTimestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase
prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 23, 2020
The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletionTimestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase
prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 23, 2020
The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletionTimestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase
prashanth26 added a commit that referenced this pull request Nov 24, 2020
…eletionTimestamp (#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR #492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR #492 to first set machine Termination phase

* Update pkg/util/time/time_test.go

Fixed typo

Co-authored-by: Hardik Dodiya <[email protected]>
prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 24, 2020
…eletionTimestamp (gardener#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase
prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 24, 2020
…eletionTimestamp (gardener#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase

* Update pkg/util/time/time_test.go

Fixed typo

Co-authored-by: Hardik Dodiya <[email protected]>
prashanth26 added a commit to prashanth26/machine-controller-manager that referenced this pull request Nov 24, 2020
…eletionTimestamp (gardener#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase

* Update pkg/util/time/time_test.go

Fixed typo

Co-authored-by: Hardik Dodiya <[email protected]>
einfachnuralex added a commit to stackitcloud/machine-controller-manager that referenced this pull request Aug 20, 2021
* Prepare Next Dev Cycle v0.35.0-dev

* Wait also for CSI PVs during serial eviction

Signed-off-by: ialidzhikov <[email protected]>

* add servergroup option

* [azure] Verify tags

Signed-off-by: Andrey Klimentyev <[email protected]>

* Allow deletion of machines where modify instance fails on deletion

* [OOT] Re-enabled support for bootstrap token injection

Signed-off-by: Andrey Klimentyev <[email protected]>

* Azure MachineSet: VM ScaleSet or AvailabilitySet

Add generic MachineSet configuration in AzureMachineClass.
The machineSet field can be used to configure either a VMO or an AvailabilitySet for a MachineClass.

The availabilitySets field will be kept for a while but is now deprecated in favour of the machineset field.
The Azure compute sdk is also updated to support VMSS/VMO.

* Introduce constant backoff while enqueuing machine objects on creation and deletion failures

Co-authored-by: Prashanth <[email protected]>

* Bugfix: Finalizers are added by default on MC

* Added event handlers for MachCls

Fixed minor typo

* Removed multiple machineClass adds when pointing to same class.kind

* Fix typo in documentation

The documentation was pointing to the wrong branch. This change fixes it.

* Improve documentation with FAQ

Co-authored-by: Prashanth <[email protected]>
Co-authored-by: Samarth Deyagond <[email protected]>

* Enable NetworkUnavailable as node condition for machine health (gardener#543)

Add NetworkUnavailable as node condition to check while the machine is unhealthy

* Removing excessive spaces in FAQ

ToC links in the "Basics" section were broken due to this.

* OOT: Fixes drain failure timeout issues (gardener#548)

* OpenStack driver support to create VMs in a given subnet.

* Adapt integration tests to handle possibly orphaned resources

* Fix the OpenAPI validation of the machine-deployment crd

* Reduced default drain timeout to 2hours

* Bugfix: maxEvictRetries derived from drainTimeout

* OOT: Enqueue machine only when conditions change

* Split retry into Short Medium & Long periods

* Delete the orphan resources

* Backoff on machine creation/deletion faliures.

- Introdcued CrashLoopBackOff on machine creation failure

* Use listers to get the machine-object during reconciliation

* Allow migration to continue when machineClass with same name as ProviderMachineClass is found (gardener#559)

* Release v0.35.0

* Prepare Next Dev Cycle v0.36.0-dev

* Fix Azure machine deletion if resource group is gone

* Add FAQ for paused field

* Set Machine Phase to Terminating & make drain calculations based on deletionTimestamp (gardener#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase

* Update pkg/util/time/time_test.go

Fixed typo

Co-authored-by: Hardik Dodiya <[email protected]>

* Update docker images to use latest gcr copy

  - Updated from golang:1.13.5 to eu.gcr.io/gardener-project/3rd/golang:1.15.5
  - Updated from alpine:3.11.2 to eu.gcr.io/gardener-project/3rd/alpine:3.12.1

* Update docs/FAQ.md

Co-authored-by: Dieter Guendisch <[email protected]>

* Prevent panic when machine template hash length is < 5

Signed-off-by: ialidzhikov <[email protected]>

* In-tree drivers work with credentials data map

They are no longer working with the whole secret because they don't need it. All of them are evaluating the data map only.

* Introduce `.spec.credentialsSecretRef` for in-tree machine classes

* Introduce `.spec.credentialsSecretRef` for out-of-tree machine class

* Support Gardener credentials data keys for Alicloud, AWS, Azure, GCP

* bump aws sdk version to v1.23.13

* Migration note for OOT providers (gardener#581)

* Updated Readme.md

* Release v0.36.0

* Prepare Next Dev Cycle v0.37.0-dev

* Update pull_request_template.md (gardener#582)

Signed-off-by: ialidzhikov <[email protected]>

* Correcting the variable substitution

* docforge manifest (gardener#587)

* Check for misconfigured PDBs on Node drain and set proper error description

Signed-off-by: ialidzhikov <[email protected]>

* Fix for issue gardener#583 with UT

* Release v0.37.0

* Prepare next Dev Cycle v0.38.0-dev

* add component-descriptor-trait to head-update job

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.12.0

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.10.0

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.11.0

* Run update before fetching on CI image (gardener#596)

* Improved NIC creation and deletion logic

- NIC creation now, adopts any existing NICs with matching name
- NIC deletion confirms deletion by performing GET on deletion
- Improved creation logs to give more details
- Revendored additional libraries

* Release v0.38.0

* Prepare next Dev Cycle v0.39.0-dev

* Updated links to various MCM OOT providers

* Fixed minor typo

* Updated link for metal stack

* skip drain readOnlyFileSystem logic and test case added (gardener#605)

Co-authored-by: Samarth Deyagond <[email protected]>

* Improved log details to include node name and provider ID (gardener#607)

Co-authored-by: Prashanth <[email protected]>

* Fixed formatting directive error

* Fix panic when secretRef cannot be found (gardener#609)

It would assign nil to secretRef, and this would panic when trying
to log out the original namespace/name.

* Fixes issue adding on finalizers on existing mach (gardener#611)

* Release v0.39.0

* Feature openstack volume type & dualstack customs

Fix podNetworkCidrs

* add openstack volumeType for machine

Co-authored-by: Maximilian Geberl <[email protected]>

updated VERSION after rebase

updated driver_openstack.go

changed order of applied services

added allowed address pairs

Change make to einfachnuralex

Split podnetwork and run update port n times

changed version and repository

added more v6 stuff

* VERSION v0.39.0-ske-1

Co-authored-by: gardener-robot-ci-3 <[email protected]>
Co-authored-by: ialidzhikov <[email protected]>
Co-authored-by: Hardik Dodiya <[email protected]>
Co-authored-by: Konstantinos Angelopoulos <[email protected]>
Co-authored-by: Andrey Klimentyev <[email protected]>
Co-authored-by: prashanth26 <[email protected]>
Co-authored-by: dkistner <[email protected]>
Co-authored-by: Samarth Deyagond <[email protected]>
Co-authored-by: Anthony Comtois <[email protected]>
Co-authored-by: Dmitry Shurupov <[email protected]>
Co-authored-by: Thomas Buchner <[email protected]>
Co-authored-by: Kistner, Dominic <[email protected]>
Co-authored-by: Dieter Guendisch <[email protected]>
Co-authored-by: Rafael Franzke <[email protected]>
Co-authored-by: 郑佳金 <[email protected]>
Co-authored-by: gardener-robot-ci-2 <[email protected]>
Co-authored-by: Samarth Deyagond <[email protected]>
Co-authored-by: Georgi Pavlov <[email protected]>
Co-authored-by: Andreas Burger <[email protected]>
Co-authored-by: gardener-robot-ci-1 <[email protected]>
Co-authored-by: Himanshu Sharma <[email protected]>
Co-authored-by: James Ravn <[email protected]>
Co-authored-by: Manuel Ganter <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MCM to mark terminating nodes
9 participants