[Proposal] gke deployer supports retrying when cluster creation failures happen #88

Closed
chizhg opened this issue Feb 1, 2021 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@chizhg
Contributor

chizhg commented Feb 1, 2021

Background

The kubetest2 gke deployer is now used by a few projects to create GKE clusters in the CI environment for test automation, and we've seen a few different errors that caused cluster creation to fail.

The most common errors we've seen are:

  1. The zone xxx does not have enough resources available to fulfill the request. It corresponds to the GCE_STOCKOUT error. This error happens when there are not enough machines/resources in that data center to provision the requested nodes, and there is no way to predict the error since the stock in each region/zone is always changing dynamically.

  2. All cluster resources were brought up, but: component "kube-apiserver" from endpoint xxx is unhealthy. This appears to be a legitimate cluster bring-up issue, as discussed in Allow setting backup regions and zones when creating GKE clusters and a few other fixes #85 (comment), and the root cause is unknown. But usually when it happens, similar cluster creation requests have a large chance (~50%) of hitting the same error.

  3. only x nodes out of y have registered; this is likely due to Nodes failing to start correctly. This is similar to the 2nd error, and its root cause is also unknown.
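
For illustration, here is a minimal Go sketch of how these three messages might be recognized. The regular expressions below are rough approximations of the messages quoted above, not patterns taken from the actual deployer, and they would need to be validated against real gcloud output:

    import "regexp"

    // retriableCreationErrors approximates the three error messages above.
    // These patterns are illustrative only.
    var retriableCreationErrors = []*regexp.Regexp{
        // 1. GCE_STOCKOUT: the zone ran out of capacity.
        regexp.MustCompile(`does not have enough resources available to fulfill the request`),
        // 2. The control plane came up unhealthy.
        regexp.MustCompile(`component "kube-apiserver" from endpoint .* is unhealthy`),
        // 3. Some nodes never registered with the API server.
        regexp.MustCompile(`only \d+ nodes out of \d+ have registered`),
    }

    // isRetriable reports whether a cluster creation error message matches
    // one of the known transient failure patterns.
    func isRetriable(errMsg string) bool {
        for _, re := range retriableCreationErrors {
            if re.MatchString(errMsg) {
                return true
            }
        }
        return false
    }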

When these errors happen, they usually have a wide impact: multiple PRs are blocked because the presubmit test jobs fail, and simply rerunning the jobs won't fix them, which directly hurts developer productivity.

Solution

We've seen these errors in multiple projects that use the kubetest2 gke deployer, so we would prefer a solution in the deployer itself rather than each project having its own workaround. Since these errors are neither predictable nor preventable, we propose a mechanism to retry the cluster creation when they happen.

Details

These three errors happen during different stages of GKE cluster creation. The GCE_STOCKOUT error happens before the nodes are created, and the other two errors happen after the nodes are created.

For the GCE_STOCKOUT error, since we cannot predict when the region/zone will have enough resources to fulfill the request, the best strategy is to retry the cluster creation request in a different region/zone. For the other two errors, since the cluster has already been created, retrying with the same request would result in a duplicate cluster name error; for ease of implementation and efficiency (deleting the broken cluster first takes time), it is also preferable to retry in a different region/zone.

Updates to the Up function

To support retrying in Up, two new flags --backup-regions and --backup-zones can be added. When cluster creation fails, the error message is checked against the error patterns that warrant a retry. If there is a match, the cluster creation request is retried in the next backup region/zone, as sketched below.
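
A rough Go sketch of what the retry loop in Up could look like with the proposed --backup-regions/--backup-zones flags. All type, field, and helper names here are hypothetical, and isRetriable refers to the pattern-matching sketch earlier in this proposal:

    // deployer fields used by the sketches in this proposal (hypothetical).
    type deployer struct {
        clusterNames    []string            // names of the clusters to create
        createdClusters map[string][]string // location -> clusters created there, for Down
    }

    // createWithRetries tries the primary location first and then each backup
    // location in order, retrying only when the failure matches a known
    // retriable error pattern. create is whatever actually runs the
    // gcloud container clusters create commands for one location and
    // returns their combined output.
    func (d *deployer) createWithRetries(locations []string, create func(location string) (output string, err error)) error {
        var lastErr error
        for _, loc := range locations {
            out, err := create(loc)
            // Record what was attempted in this location so Down can clean it up.
            d.createdClusters[loc] = append(d.createdClusters[loc], d.clusterNames...)
            if err == nil {
                return nil
            }
            lastErr = err
            if !isRetriable(out) {
                // Unknown failure: surface it instead of masking it by retrying.
                return err
            }
            // Otherwise fall through and retry in the next backup region/zone.
        }
        return lastErr
    }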

Updates to the Down function

To ensure proper cleanup, a map[string][]string data structure needs to be added to track which clusters have been created in each region/zone during the Up stage. The Down function will then iterate over the map to delete all of these clusters, as sketched below.
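
A corresponding sketch of Down iterating that map, reusing the hypothetical deployer fields from the previous sketch. Deletion here shells out to gcloud purely for illustration, and the --region vs --zone distinction is glossed over:

    import (
        "fmt"
        "os/exec"
    )

    // Down deletes every cluster recorded during Up, across all of the
    // regions/zones that were attempted.
    func (d *deployer) Down() error {
        var errs []string
        for location, clusters := range d.createdClusters {
            for _, name := range clusters {
                cmd := exec.Command("gcloud", "container", "clusters", "delete",
                    name, "--zone", location, "--quiet")
                if err := cmd.Run(); err != nil {
                    errs = append(errs, fmt.Sprintf("deleting %s in %s: %v", name, location, err))
                }
            }
        }
        if len(errs) > 0 {
            return fmt.Errorf("failed to delete some clusters: %v", errs)
        }
        return nil
    }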

Multiple clusters in different regions/zones

Currently, for multiple cluster creation, the clusters are all created in the same region/zone, since --region and --zone only accept a single string. If in the future we need the clusters to be created in different regions/zones, we can change --region and --zone to accept a list of strings, and --backup-regions/--backup-zones can correspondingly become a list of lists, as sketched below.
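
Purely for illustration, the flag shapes before and after such a change might look roughly like this; the sketch assumes spf13/pflag for flag parsing, and only the --backup-zones name proposed above:

    import "github.com/spf13/pflag"

    type locationOptions struct {
        // Today: a single zone (or region) shared by all clusters.
        Zone string
        // Proposed: ordered fallbacks tried when creation fails with a
        // retriable error.
        BackupZones []string
    }

    func (o *locationOptions) bindFlags(fs *pflag.FlagSet) {
        fs.StringVar(&o.Zone, "zone", "", "zone to create the cluster(s) in")
        fs.StringSliceVar(&o.BackupZones, "backup-zones", nil,
            "ordered list of zones to retry cluster creation in")
        // If --zone later becomes a list (one entry per cluster),
        // --backup-zones would correspondingly become a list of lists.
    }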

Other concerns

For the multiple cluster creation scenario, there is a debate on whether we should retry only the failed cluster creation requests or all of them. Under the current gke deployer implementation, since we always want the clusters to be created in the same region/zone, and we can ensure proper cleanup with the planned updates to Down, it is preferable to retry all the cluster creation requests. In the future, when we support creating them in different regions/zones, this logic can be changed to retry only the failed requests.

@amwat
Contributor

amwat commented Feb 1, 2021

/cc @BenTheElder @spiffxp

@chizhg
Contributor Author

chizhg commented Feb 1, 2021

/cc @albertomilan

@amwat
Contributor

amwat commented Feb 3, 2021

Thanks for opening up a proposal issue as requested!

Yes, currently we assume that all clusters have the same configuration.

Since recreating in different zones essentially makes the creation request more dynamic, my recommendation would be to start supporting a config spec instead of trying to maintain individual cluster->zone mappings for Down etc.
(This will also make multiple cluster profiles in different zones/regions easier.)

Even though your specific use case is limited to retry_locations (or backup_locations), enabling the spec-based config will make it much easier to manage the lifecycles.

high level:

We should expose the existing flags as an Up config, which will make each spec more customizable and more robust to passing through the kubetest2 lifecycle.

conceptually (not final) something like:

Up: # list of (multi-)cluster profiles (with independent lifecycles)
- count: 2 # multi-cluster spec that should follow the same spec (e.g. sticking to a same location)
  version: 1.X.Y
  name: qwerty # used as prefix for multi cluster
  location: foo # zone or region
  retry_locations: [bar, qwe, foo]
- count: 1
  name: asdfg
  version: 1.A.B
  zone: baz
  enable-workload-identity: true # and all the other flags we support

After conversion, we can internally represent this as a subset of the fields we support from https://pkg.go.dev/google.golang.org/api/container/v1#Cluster as required (which might also help in migrating from the gcloud CLI to the SDK in the future), and pass it around through the lifecycle. That way, Down only needs to know about the clusters that were actually created.
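
For illustration, a minimal Go sketch of what such an internal spec could look like after parsing, mirroring the conceptual YAML above. The struct and its field names are hypothetical; mapping it onto a subset of container/v1 Cluster fields would be a separate step:

    // clusterProfile mirrors one entry of the conceptual Up config above.
    type clusterProfile struct {
        Count          int      `yaml:"count"`           // clusters sharing this spec
        Name           string   `yaml:"name"`            // used as a prefix for multi-cluster
        Version        string   `yaml:"version"`         // requested GKE version
        Location       string   `yaml:"location"`        // zone or region
        RetryLocations []string `yaml:"retry_locations"` // ordered backup locations
        // ...plus the other flags the deployer already supports,
        // e.g. enable-workload-identity.
    }

    // upConfig is the full Up spec: a list of (multi-)cluster profiles,
    // each with an independent lifecycle.
    type upConfig struct {
        Up []clusterProfile `yaml:"Up"`
    }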

We can also persist that info under ARTIFACTS for tracking (which will also enable decoupled up/down).

Additionally, successful clusters whose creation is later retried elsewhere should be cleaned up (during Up itself) even if Down is not specified, instead of just leaking them and relying on the janitor, since they are actually redundant clusters.

@amwat
Contributor

amwat commented Feb 3, 2021

fyi @mm4tt (just to be aware), since this will affect how multiple clusters work with kubetest2

@mm4tt

mm4tt commented Feb 3, 2021

Thanks @amwat, ACK

@BenTheElder
Member

How do you propose to approach cleanup in the case where we fail in a zone? Is it immediate? Does Down need to remember to clean up all of the zones?

If cluster bring up fails due to a bug in Kubernetes rather than a stock out, are we going to waste time and resources attempting bring up in another zone? Or only for detected stockout?

I feel like perhaps if stockouts specifically are an ongoing concern, it would be better to identify them and switch everything to the new zone to avoid time spent across N jobs on failed bring up in a known unavailable zone.

@amwat
Contributor

amwat commented Feb 3, 2021

also adding refs to previous discussions on this topic: #13, #85

@chizhg
Contributor Author

chizhg commented Feb 4, 2021

How do you propose to approach cleanup in the case that we fail in a zone? Is it immediate? Does down need to remember to cleanup all of the zones?

For efficiency reasons, I'm proposing to have a map that records all the clusters created in each region/zone, and clean them all up at the end. But if you think that would be a waste of resources, we can have double insurance - in the Up function, if a cluster fails to be created, delete it asynchronously (without checking the error) and record it in a map. And in the Down function, iterate over all the recorded clusters and try to delete them again. Would this option work better for you?
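
For illustration, a minimal Go sketch of that "double insurance" idea, reusing the hypothetical createdClusters map from the sketches in the proposal above. All names are hypothetical; the asynchronous delete is best-effort and its error is deliberately ignored, with Down retrying the deletion later:

    import "os/exec"

    // onCreateFailure records the clusters that failed to come up in a
    // location (so Down will retry their deletion) and kicks off a
    // best-effort asynchronous delete right away.
    func (d *deployer) onCreateFailure(location string, clusters []string) {
        d.createdClusters[location] = append(d.createdClusters[location], clusters...)
        for _, name := range clusters {
            go func(name string) {
                // Error intentionally ignored; Down iterates createdClusters
                // and tries the deletion again.
                _ = exec.Command("gcloud", "container", "clusters", "delete",
                    name, "--zone", location, "--quiet").Run()
            }(name)
        }
    }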

If cluster bring up fails due to a bug in Kubernetes rather than a stock out, are we going to waste time and resources attempting bring up in another zone? Or only for detected stockout?

Here I'm only proposing to retry on the 3 errors I mentioned in this proposal: the first one is the GCE_STOCKOUT error, and the other two are errors we see quite often. The last two errors happen quite randomly; they do not seem to be caused by any major GKE outages, and based on our previous experience in Knative, retrying usually solves the problem. These errors make up a large portion of the test flakiness for all our jobs that test against GKE (I can send you links separately if that would be useful), so we really want a common solution for mitigation.

I feel like perhaps if stockouts specifically are an ongoing concern, it would be better to identify them and switch everything to the new zone to avoid time spent across N jobs on failed bring up in a known unavailable zone.

I don't think there can be a region/zone that never gets a stockout, since the stock in each region/zone is always changing dynamically, and we cannot control how many users are creating/deleting clusters in these regions/zones since it's a public cloud :-( So switching to a different region/zone is not a permanent solution.
That's why we want to add some failover here by introducing the retry mechanism, with which there will be a much lower chance of hitting a stockout in all the regions/zones we try (1 primary region/zone + multiple backup regions/zones).

@BenTheElder
Member

For efficiency reasons, I'm proposing to have a map that records all the clusters created in each region/zone, and clean them all up at the end. But if you think that would be a waste of resources, we can have double insurance - in the Up function, if a cluster fails to be created, delete it asynchronously (without checking the error) and record it in a map. And in the Down function, iterate over all the recorded clusters and try to delete them again. Would this option work better for you?

My only concerns are that we consider the following:

  • we should not clean up things we did not create (kubetest2 can be used without boskos)
  • we should make sure we do not miss cleaning up resources we did create

An in memory map tracking creation calls may be sufficient.

Here I'm only proposing to retry on the 3 errors I mentioned in this proposal: the first one is the GCE_STOCKOUT error, and the other two are errors we see quite often. The last two errors happen quite randomly; they do not seem to be caused by any major GKE outages, and based on our previous experience in Knative, retrying usually solves the problem. These errors make up a large portion of the test flakiness for all our jobs that test against GKE (I can send you links separately if that would be useful), so we really want a common solution for mitigation.

To be clear: I don't think we should mask bringup flakiness; that's something that should be fixed in the underlying tool. Stockout seems like a special case worth retrying around, as it's not a property of the tool.

If Knative is interested in retrying anyway for something other than a stockout, ignoring flaky bringup tools (it doesn't seem to be the case, but confirming), that behavior should probably be opt-in; when testing Kubernetes, we likely don't want to mask most bringup flakes by trying again.

I don't think there can be a region/zone that never gets a stockout, since the stock in each region/zone is always changing dynamically, and we cannot control how many users are creating/deleting clusters in these regions/zones since it's a public cloud :-( So switching to a different region/zone is not a permanent solution.

I'm not suggesting that. I'm suggesting that during a stockout we should probably migrate everything out of the zone for some period, rather than have each individual run discover it (slowly). That can be done by some other automation approach, rather than partially bringing up in the zone and then having to clean it up repeatedly (since the stockout will not be of all resources).

E.g. a naive approach: when a stockout is detected, we can flag it (opt-in behavior to report it to some simple API?) and a bot can update the configs, especially since we still have job-level retry.

@amwat
Contributor

amwat commented Mar 19, 2021

/lifecycle active
/priority important-soon
/assign @joshua-bone

@k8s-ci-robot
Contributor

@amwat: GitHub didn't allow me to assign the following users: joshua-bone.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/lifecycle active
/priority important-soon
/assign @joshua-bone

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 19, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Jun 18, 2021
@amwat
Contributor

amwat commented Jun 18, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 18, 2021
@amwat
Contributor

amwat commented Jun 18, 2021

/lifecycle active
/assign @joshua-bone

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 18, 2021
@k8s-ci-robot
Contributor

@amwat: GitHub didn't allow me to assign the following users: joshua-bone.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/lifecycle active
/assign @joshua-bone

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp spiffxp added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Aug 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Nov 15, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 16, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
