Kubernetes CI Policy: critical jobs must be Guaranteed Pod QOS #18530
Comments
/cc
Additionally: this should be enforced by a test-infra presubmit, to prevent regressions.
If I can find time I will try to generate a list of jobs that need updates and their paths, and throw it up in a sheet with checkboxes so that people can claim them and avoid duplicating efforts.
Though not yet on community-owned infra. Example from
I added some notes to the OP on how to identify which jobs we're talking about here.
/cc
@BenTheElder I'm a bit confused here. For #18159 you said that we don't really want presubmits in testgrid. Could you explain more? Thanks
@ZhiFeng1993 that's not related to this issue. Currently, because they are in testgrid, that's a quick way to find a lot of the jobs. Let's try to keep discussion here on-topic: lots of people are interested in this issue, it's going to take a lot of work, and GitHub does not handle lots of comments well. 😅 @tpepper you can subscribe to a GitHub issue by clicking "subscribe" on the right-hand side of the web UI 🙃
sig-release-master-blocking jobs are in:
config/jobs/kubernetes-sigs/kind/kind-release-blocking.yaml
config/jobs/kubernetes/sig-cli/sig-cli-config.yaml
config/jobs/kubernetes/sig-cloud-provider/gcp/gce-conformance.yaml
config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml
config/jobs/kubernetes/sig-cloud-provider/gcp/gpu/gpu-gce.yaml
config/jobs/kubernetes/sig-network/sig-network-misc.yaml
config/jobs/kubernetes/sig-node/node-kubelet.yaml
config/jobs/kubernetes/sig-release/kubernetes-builds.yaml
config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml
config/jobs/kubernetes/sig-testing/bazel-build-test.yaml
config/jobs/kubernetes/sig-testing/conformance-e2e.yaml
config/jobs/kubernetes/sig-testing/integration.yaml
config/jobs/kubernetes/sig-testing/verify.yaml
sig-release-1.19-blocking jobs are in:
config/jobs/kubernetes/generated/generated.yaml
config/jobs/kubernetes/sig-release/release-branch-jobs/1.19.yaml
sig-release-1.18-blocking jobs are in:
config/jobs/kubernetes/generated/generated.yaml
config/jobs/kubernetes/sig-release/release-branch-jobs/1.18.yaml
sig-release-1.17-blocking jobs are in:
config/jobs/kubernetes/generated/generated.yaml
config/jobs/kubernetes/sig-release/release-branch-jobs/1.17.yaml
sig-release-1.16-blocking jobs are in:
config/jobs/kubernetes/generated/generated.yaml
config/jobs/kubernetes/sig-release/release-branch-jobs/1.16.yaml
Thanks for the lists, Tim. I tried to consolidate them in the description.
/cc
#18556 enforces the policy in test form, but for now it only logs violations instead of failing.
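Not the actual test added in #18556, but a minimal sketch of the kind of check involved: a pod gets Guaranteed QoS only when every container sets cpu and memory limits, with requests either equal to the limits or unset (in which case Kubernetes defaults requests to the limits). Function and field access here are illustrative, not real test-infra code.

```python
# Hypothetical sketch of a Guaranteed-QoS policy check, not the real
# test-infra test. A pod is Guaranteed only if every container sets
# cpu and memory limits, and requests either match limits or are unset.
def is_guaranteed(pod_spec):
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                return False  # missing limit -> Burstable or BestEffort
            if resource in requests and requests[resource] != limits[resource]:
                return False  # request != limit -> Burstable
    return True

good = {"containers": [{"resources": {
    "limits": {"cpu": "2", "memory": "4Gi"},
    "requests": {"cpu": "2", "memory": "4Gi"}}}]}
bad = {"containers": [{"resources": {
    "limits": {"cpu": "2", "memory": "4Gi"},
    "requests": {"cpu": "1", "memory": "4Gi"}}}]}
print(is_guaranteed(good), is_guaranteed(bad))  # -> True False
```

A real enforcement pass would walk every job's decorated pod spec and report the offending container and file, but the core predicate is this small.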
Set resources limits and requests as suggested. Ref: kubernetes#18591 Part of: kubernetes#18530 Signed-off-by: Arnaud Meukam <[email protected]>
Alright, we've got limits set on all of the release-blocking jobs! I'm going to flip this test to fail instead of just logging.

The merge-blocking situation is still pretty incomplete. I suspect we've missed at least a few release-branch jobs.
Looking at this test this morning (filtered on CPU not being zero, so as to count the number of job file edits required to finish this out). Will ping @spiffxp later about me doing this work.
@RobertKielty you'll need to address the comments on #18668, and then that should take care of the node jobs. #18691 is in flight for the kind jobs.
I mentioned this earlier today during the SIG Testing meeting, but I suspect any of the issues that have been held open for soak time (making sure things are still running ok, etc.) can now probably be closed. @RobertKielty mentioned he was going to take a look at some. I will take a pass at some point, but it may not be until Thursday at the rate I'm going.
Anecdotally, while attempting to push some last-minute PRs through the door for patch releases, it sure seems like merge-blocking presubmits are still flaking pretty badly.
@spiffxp can you include your anecdata?
It's not as straightforward as jobs hitting "error" state, though there are those (I'm just as willing to chalk that up to "now that we're asking for resources, we're discovering they're not available, instead of finding out the hard way"). I'll see if I can find a better way to measure/express this. But it's the fact that humans have sat on PRs hitting "/test" or "/retest" continually. Here's a quick scan of PRs that have merged recently in release-1.16, release-1.17, release-1.18 and master. Is this worse or better than before? I'm not sure. Is this sort of thing worth scripting and generating a report/metric? Maybe.
kubernetes/kubernetes#93927
kubernetes/kubernetes#93813
kubernetes/kubernetes#93924
kubernetes/kubernetes#93696
kubernetes/kubernetes#93812
kubernetes/kubernetes#93754
kubernetes/kubernetes#93695
kubernetes/kubernetes#93811
kubernetes/kubernetes#93929
kubernetes/kubernetes#93829
kubernetes/kubernetes#93857
kubernetes/kubernetes#93907
kubernetes/kubernetes#93521
kubernetes/kubernetes#93895
kubernetes/kubernetes#93893
kubernetes/kubernetes#93831
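One hypothetical way to turn this kind of anecdata into a metric would be to count manual retest commands on merged PRs. This is only a sketch of that idea, not an existing test-infra tool; fetching the real comment bodies from the GitHub API is deliberately left out, so the input is just a list of strings.

```python
# Hypothetical sketch: count manual "/test" and "/retest" commands in PR
# comments as a rough flakiness signal. Fetching real comments via the
# GitHub API is omitted; input is a list of comment body strings.
import re

# Prow commands must start a line, e.g. "/retest" or "/test pull-...".
RETEST_RE = re.compile(r"^/(re)?test\b", re.MULTILINE)

def count_retests(comment_bodies):
    return sum(len(RETEST_RE.findall(body)) for body in comment_bodies)

comments = ["/retest", "lgtm", "/test pull-kubernetes-e2e-gce\n/retest"]
print(count_retests(comments))  # -> 3
```

Aggregated per PR and per branch, a count like this could back up (or refute) the impression that presubmits got flakier after the resource changes.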
Test flake fixes aren't always backported to older releases, and I had some concerns about some recent CPU limits being set lower on older branches... The first one I sampled had kubernetes/kubernetes#93929 (comment)
Alright, we've got limits set on all of the merge-blocking jobs! I'm going to flip this test to fail instead of log
What remains is:
CSV report generated by https://github.com/kubernetes/test-infra/tree/master/experiment/prowjob-report, then imported into Google Sheets.
/close
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Part of #18551
This is a policy action item out of "Policies to improve Kubernetes CI", discussed at SIG Testing yesterday.
Checklist of release-blocking jobs: (h/t @tpepper)
Checklist of merge-blocking jobs (suggestions are based on metrics explorer, check against resource requests too!)
- pull-kubernetes-cross (was made optional via Cleanup cross #18612)
- pull-kubernetes-files-remake (was made optional)
- pull-kubernetes-godeps (aged out, was removed via Old release branch test cleaning #18509)
- pull-kubernetes-kubemark-e2e-gce-big (was made optional via Make pull-kubernetes-kubemark-e2e-gce-big manually triggered, optional #18788)

For release-blocking jobs:
For merge-blocking jobs:
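For reference, a hedged sketch of what a Guaranteed-QoS container spec looks like in a job config: requests and limits must both be set for cpu and memory, and must be equal. The image and the resource values below are made up for illustration, not recommendations for any particular job.

```yaml
# Illustrative only: image and values are placeholders, not recommendations.
# Guaranteed QoS requires requests == limits for cpu and memory on every
# container in the pod.
spec:
  containers:
  - image: gcr.io/k8s-testimages/kubekins-e2e:example-tag
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
```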
(Punted "decide how we're going to measure success" to #18785)
How to make a guess:
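The original notes under this heading aren't reproduced here; as one hypothetical heuristic (not necessarily what was actually suggested in this issue), you could take the observed peak usage from metrics explorer, add some headroom, and round up to a tidy value:

```python
# Hypothetical heuristic, not the actual guidance from this issue:
# take observed peak cpu usage (in cores), add headroom, and round up
# to the nearest half core to pick a request/limit value.
import math

def suggest_cpu(observed_peak_cores, headroom=1.2):
    return math.ceil(observed_peak_cores * headroom * 2) / 2

print(suggest_cpu(1.3))  # -> 2.0
```

The same shape of calculation works for memory; whatever heuristic is used, the point is to set requests equal to limits afterward so the job lands in the Guaranteed QoS class.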
How to see resources (note: this only works for jobs that are running in k8s-infra-prow-build)
Once the above has been completed, we can move on to the next step: migrating everything to a dedicated cluster.