[Proposal] gke deployer supports retrying when cluster creation failures happen #88
Comments
/cc @BenTheElder @spiffxp
/cc @albertomilan
Thanks for opening up a proposal issue as requested! Yes, currently we assume that all clusters have the same configuration. Since recreating in different zones essentially makes the creation request more dynamic, my recommendation would be to start supporting a config spec instead of trying to maintain individual cluster->zone mappings for `Down` etc. Even though your specific use case is limited to retry_locations (or backup_locations), enabling the spec-based config will make it much easier to manage the lifecycles.

High level: we should expose the existing flags as a spec, conceptually (not final) something like:

```yaml
Up: # list of (multi-)cluster profiles (with independent lifecycles)
  - count: 2 # multi-cluster spec that should follow the same spec (e.g. sticking to a same location)
    version: 1.X.Y
    name: qwerty # used as prefix for multi cluster
    location: foo # zone or region
    retry_locations: bar, qwe, foo
  - count: 1
    name: asdfg
    version: 1.A.B
    zone: baz
    enable-workload-identity: true # and all the other flags we support
```

After conversion we can internally represent these as a subset of the fields we support from https://pkg.go.dev/google.golang.org/api/container/v1#Cluster as required (might also help in migrating from the gcloud CLI to the SDK in the future), which we can pass around through the lifecycle. That way `Down` only needs to know about the clusters that were finally created. We can also persist that info under ARTIFACTS for tracking (which will also enable decoupled up/down). Additionally, successful clusters that are then later retried should be cleaned up (during …).
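For illustration only, a minimal sketch (all type and file names here are hypothetical, not the deployer's actual API) of how such a spec could be represented and parsed in Go, assuming the spec is plain YAML as above:

```go
// Sketch of a spec-based config for the gke deployer; names are hypothetical.
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// ClusterProfile models one (multi-)cluster profile from the conceptual spec above.
type ClusterProfile struct {
	Count          int    `yaml:"count"`
	Version        string `yaml:"version"`
	Name           string `yaml:"name"`               // used as a prefix for multi-cluster
	Location       string `yaml:"location,omitempty"` // zone or region
	Zone           string `yaml:"zone,omitempty"`
	RetryLocations string `yaml:"retry_locations,omitempty"` // comma-separated, as in the sketch above
}

// UpSpec is the top-level spec: a list of cluster profiles with independent lifecycles.
type UpSpec struct {
	Up []ClusterProfile `yaml:"Up"`
}

func main() {
	data, err := os.ReadFile("clusters.yaml") // hypothetical spec file
	if err != nil {
		panic(err)
	}
	var spec UpSpec
	if err := yaml.Unmarshal(data, &spec); err != nil {
		panic(err)
	}
	for _, p := range spec.Up {
		fmt.Printf("profile %q: %d cluster(s) in %s%s\n", p.Name, p.Count, p.Location, p.Zone)
	}
}
```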
fyi @mm4tt (just to be aware) since this will affect how multi clusters work with kubetest2
Thanks @amwat, ACK
How do you propose to approach cleanup in the case that we fail in a zone? Is it immediate? Does Down need to remember to clean up all of the zones? If cluster bring-up fails due to a bug in Kubernetes rather than a stockout, are we going to waste time and resources attempting bring-up in another zone, or only for a detected stockout? I feel like perhaps, if stockouts specifically are an ongoing concern, it would be better to identify them and switch everything to the new zone, to avoid time spent across N jobs on failed bring-up in a known unavailable zone.
For efficiency reasons, I'm proposing to have a map that records all the clusters created in each region/zone, and clean them all up at the end. But if you think that would waste resources, we can have double insurance: in the Up function, if the cluster fails to be created, delete the cluster asynchronously (without checking the error) and record it in a map; and in the Down function, iterate over all the recorded clusters and try to delete them again. Would this option work better for you?
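As a rough illustration of this "double insurance" idea (names are hypothetical, not the gke deployer's actual code, and `--zone` vs `--region` handling is omitted):

```go
package gkeretry

import "os/exec"

// tracker records which clusters were created (or attempted) in each
// region/zone so that Down can delete them later.
type tracker struct {
	createdClusters map[string][]string // location -> cluster names
}

func newTracker() *tracker {
	return &tracker{createdClusters: map[string][]string{}}
}

func (t *tracker) record(location, name string) {
	t.createdClusters[location] = append(t.createdClusters[location], name)
}

// onCreateFailure is the "double insurance": fire a best-effort asynchronous
// delete immediately, and also keep the cluster in the map so Down can try
// deleting it again.
func (t *tracker) onCreateFailure(location, name string) {
	t.record(location, name)
	go func() {
		// The error is deliberately ignored here; Down retries the deletion.
		_ = exec.Command("gcloud", "container", "clusters", "delete", name,
			"--zone", location, "--quiet").Run()
	}()
}
```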
Here I'm only proposing retrying on the 3 errors I mentioned in this proposal; the first one is the GCE_STOCKOUT error.
I don't think there can be a region/zone that never gets a stockout, since the stock for each region/zone is always changing dynamically, and we cannot control how many users are creating/deleting clusters in these regions/zones since it's public cloud :-( So switching to a different region/zone is not a permanent solution.
My only concerns are that we consider this, including:
An in-memory map tracking creation calls may be sufficient.
To be more clear: I don't think we should mask flakiness of bring-up; that's something that should be fixed in the underlying tool. Stockout seems like a special case worth retrying around, as it's not a property of the tool. If Knative is interested in retrying anyhow for something other than a stockout and ignoring flaky bring-up tools (it doesn't seem to be the case, but confirming), that behavior should probably be opt-in; when testing Kubernetes we likely don't want to mask most bring-up flakes by trying again.
I'm not suggesting that. I'm suggesting that during a stockout we should probably have everything migrated out of the zone for some period, rather than each individual run discovering this (slowly). That can be done by some other automation approach, rather than partially bringing up in the zone and then having to clean it up repeatedly (since the stockout will not be of all resources). E.g., a naive approach: when a stockout is detected we can flag that (opt-in behavior to report it to some simple API?) and a bot can update the configs. Especially since we still have job-level retry.
/lifecycle active
@amwat: GitHub didn't allow me to assign the following users: joshua-bone. Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Background
The kubetest2 gke deployer is now used by a few projects to create GKE clusters in the CI environment for test automation, and we've seen a few different errors that cause cluster creation to fail.
The most common errors we've seen are:
1. `The zone xxx does not have enough resources available to fulfill the request`. It corresponds to the `GCE_STOCKOUT` error. This error happens when there are not enough machines/resources in that data center to provision the requested nodes, and there is no way to predict it since the stock in each region/zone is always changing dynamically.
2. `All cluster resources were brought up, but: component "kube-apiserver" from endpoint xxx is unhealthy`. It is a legitimate cluster bring-up issue(?) as discussed in #85 ("Allow setting backup regions and zones when creating GKE clusters and a few other fixes") (comment), and the reason for it is unknown. But usually when it happens, similar cluster creation requests have a large chance (~50%) of hitting the same error.
3. `only x nodes out of y have registered; this is likely due to Nodes failing to start correctly.` This is similar to the 2nd error, and the reason for it is likewise unknown.
When these errors happen, they usually have a wide impact: multiple PRs are blocked since the presubmit test jobs fail, and simply rerunning the jobs won't fix them, which directly impacts developer productivity.
Solution
We've seen these errors in multiple projects that use the kubetest2 gke deployer, so we would prefer a solution in the gke deployer itself rather than each project having its own workaround. Since these errors are neither predictable nor preventable, a mechanism to retry cluster creation when they happen is proposed.
Details
These three errors happen during different stages of GKE cluster creation. The `GCE_STOCKOUT` error happens before the nodes are created, and the other two errors happen after the nodes are created.
For the `GCE_STOCKOUT` error, since we cannot predict when the region/zone will have enough resources to fulfill the request, the best strategy is to retry the cluster creation request in a different region/zone. For the other two errors, since the cluster is already created, retrying with the same request would result in a `duplicate cluster name` error, so for ease of implementation and efficiency (deleting the broken cluster first takes some time), it is also preferable to retry in a different region/zone.
Updates to the `Up` function
To support retrying in `Up`, two new flags `--backup-regions` and `--backup-zones` can be added. When cluster creation fails, the error message is checked against the error patterns that warrant a retry. If there is a match, the cluster creation request is retried in the next backup region/zone.
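For illustration, a minimal sketch of what this retry flow could look like, assuming hypothetical helper functions and error patterns that would still need to be confirmed against real creation output:

```go
package gkeretry

import "regexp"

// Hypothetical patterns for the three retriable errors described above; the
// exact strings would need to be verified against real cluster-creation output.
var retriableErrors = []*regexp.Regexp{
	regexp.MustCompile(`does not have enough resources available to fulfill the request`),
	regexp.MustCompile(`component "kube-apiserver" from endpoint .+ is unhealthy`),
	regexp.MustCompile(`only \d+ nodes out of \d+ have registered`),
}

func isRetriable(output string) bool {
	for _, re := range retriableErrors {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

// createWithRetry tries the primary location first, then each backup location
// from --backup-regions/--backup-zones. createCluster runs the actual cluster
// creation and returns its combined output; recordCluster tracks the cluster
// (including failed attempts) so that Down can clean it up.
func createWithRetry(name string, locations []string,
	createCluster func(name, location string) (output string, err error),
	recordCluster func(location, name string)) error {
	var lastErr error
	for _, loc := range locations {
		output, err := createCluster(name, loc)
		recordCluster(loc, name)
		if err == nil {
			return nil
		}
		lastErr = err
		if !isRetriable(output) {
			return err // non-retriable failures should surface immediately
		}
	}
	return lastErr
}
```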
Updates to the `Down` function
To ensure proper cleanup, a `map[string][]string` data structure needs to be added to track which clusters have been created in each region/zone during the `Up` stage. The `Down` function will then iterate over the map to delete all of these clusters.
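Correspondingly, a sketch (simplified, with hypothetical names and `--zone`/`--region` handling omitted) of how `Down` could drain that tracking map:

```go
package gkeretry

import (
	"fmt"
	"os/exec"
)

// deleteTrackedClusters iterates the location -> cluster-names map populated
// during Up and deletes every cluster it finds, returning the first error.
func deleteTrackedClusters(createdClusters map[string][]string) error {
	var firstErr error
	for location, names := range createdClusters {
		for _, name := range names {
			cmd := exec.Command("gcloud", "container", "clusters", "delete", name,
				"--zone", location, "--quiet")
			if out, err := cmd.CombinedOutput(); err != nil && firstErr == nil {
				firstErr = fmt.Errorf("deleting cluster %s in %s: %v: %s", name, location, err, out)
			}
		}
	}
	return firstErr
}
```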
Multiple clusters in different regions/zones
Currently, when multiple clusters are created, they are all supposed to be created in the same region/zone, since `--region` and `--zone` only accept a single string. In the future, if we need the clusters to be created in different regions/zones, we can change `--region` and `--zone` to accept a list of strings, and `--backup-regions`/`--backup-zones` can be changed to a list of lists correspondingly.
Other concerns
For the multiple-cluster creation scenario, there is a debate on whether we should retry only the failed cluster creation requests or all of them. Under the current gke deployer implementation, since we always want the clusters to be created in the same region/zone, and we can ensure proper cleanup with the planned updates to `Down`, retrying all the cluster creation requests is preferred. In the future, when we support creating the clusters in different regions/zones, this logic can be changed to retry only the failed requests.