kubetest2 gke retries creating the cluster if it fails due to specific reasons #13

Closed

chizhg commented Aug 3, 2020

  1. As we discussed offline, kubetest2 gke now supports retrying cluster creation when it fails for reasons that cannot be prevented in advance, e.g. running out of quota. This is done by adding an extra back-regions flag, which enables the retry logic when it is not empty. IMO this approach is better than allowing region to be set to auto, since it lets users restrict which region(s) the cluster can be created in, and it is also backward compatible.

  2. Add a new flag, require-gcp-ssh-key, that allows the GCP SSH key to be empty. With it we can get rid of the hack in Knative: https://github.com/knative/test-infra/blob/master/scripts/e2e-tests.sh#L229-L233 (we are still using kubetest in Knative but will likely switch to kubetest2 soon).

/cc @amwat

amwat commented Aug 3, 2020

We also need to handle zonal clusters.

I'd like to reconsider the use case for this functionality.
I think the referenced issues should be solved in different ways.

  1. Stockout isn't something we want to hide under the covers (i.e. jobs failing and the prowjobs needing to be reconfigured seems acceptable in such adverse cases).

  2. Quota issues should definitely not be hidden by retries (a quota failure means the job isn't configured with the right region in the first place, so it will always fail on the first attempt and then fall through to the backup regions).

  3. Different versions being available in different regions should be fixed by improving the "extract" behavior, so that the latest available GKE version is determined using the https://cloud.google.com/sdk/gcloud/reference/container/get-server-config API instead of hardcoding a version in the job config.
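
As a rough illustration of item 3, the version could be picked from the JSON printed by `gcloud container get-server-config --format=json` instead of being hardcoded. The sketch below makes several assumptions: that the output contains a `validMasterVersions` list ordered newest first, that the caller has already captured that output, and that `latestVersion` and `serverConfig` are hypothetical names. The sample JSON is made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serverConfig mirrors only the one field used here. The field name is
// assumed to match the get-server-config JSON output.
type serverConfig struct {
	ValidMasterVersions []string `json:"validMasterVersions"`
}

// latestVersion returns the first valid version that starts with minorPrefix
// (e.g. "1.16."), relying on the assumption that the list is sorted newest
// first. An empty prefix selects the newest version overall.
func latestVersion(raw []byte, minorPrefix string) (string, error) {
	var cfg serverConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", fmt.Errorf("parsing server config: %w", err)
	}
	for _, v := range cfg.ValidMasterVersions {
		if strings.HasPrefix(v, minorPrefix) {
			return v, nil
		}
	}
	return "", fmt.Errorf("no valid version with prefix %q", minorPrefix)
}

func main() {
	// Made-up sample data standing in for real command output.
	sample := []byte(`{"validMasterVersions": ["1.17.9-gke.1", "1.16.13-gke.1", "1.16.11-gke.5"]}`)
	v, err := latestVersion(sample, "1.16.")
	fmt.Println(v, err) // prints: 1.16.13-gke.1 <nil>
}
```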

chizhg commented Aug 9, 2020

The stockout issue, which corresponds to the error "does not have enough resources available to fulfill", is not a configuration error: the number of machines in a region (or a data center) is limited, and machines are continually acquired and released by all users, so the number of available machines is dynamic and any region can run into a stockout at any time. This error just introduces some flakiness to the infra, since in most cases users do not care where the clusters are created.

But I agree the other issues are preventable without blindly retrying, and some kubetest2 users might want the clusters to be created in a fixed region (e.g. for the multi-cluster profile), so I'll close this PR for now and will probably handle this in the place where we run kubetest2 commands.
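
To make the distinction drawn in this thread concrete, a wrapper that runs kubetest2 could retry elsewhere only when the failure output looks like a stockout, and surface everything else. This is a minimal sketch under assumptions: `shouldRetryElsewhere` is a hypothetical helper, matching on the message fragment quoted above is a guess at how such a check might be done, and the sample messages are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// stockoutMarker is the message fragment quoted in the comment above.
const stockoutMarker = "does not have enough resources available to fulfill"

// shouldRetryElsewhere reports whether the captured output of a failed
// creation attempt looks like a stockout. Quota or version failures do not
// match and should be fixed in the job configuration instead.
func shouldRetryElsewhere(output string) bool {
	return strings.Contains(output, stockoutMarker)
}

func main() {
	// Both messages are invented for the example.
	fmt.Println(shouldRetryElsewhere("the region does not have enough resources available to fulfill the request")) // prints: true
	fmt.Println(shouldRetryElsewhere("quota exceeded for the project"))                                            // prints: false
}
```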

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.