Make critical jobs Guaranteed Pod QOS: ci-kubernetes-build #18577

Closed
spiffxp opened this issue Aug 1, 2020 · 10 comments
Labels: area/jobs, area/release-eng, kind/cleanup, sig/release, sig/testing

Comments

@spiffxp
Member

spiffxp commented Aug 1, 2020

What should be cleaned up or changed:

This is part of #18530

The following jobs should be Guaranteed Pod QOS, meaning they should have CPU and memory resource limits, and matching resource requests:

  • ci-kubernetes-build
  • ci-kubernetes-build-1-19
  • ci-kubernetes-build-stable1
  • ci-kubernetes-build-stable2
  • ci-kubernetes-build-stable3

These jobs run on (google.com only) k8s-prow-build, so @spiffxp has provided the following guess:

  • 7 cpu (or as near 8 as you can get), slightly above 32 Gi mem (34?)
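To illustrate what "Guaranteed Pod QOS" means in practice, here is a minimal Python sketch of the classification rule for a single-container pod. It is not taken from the Kubernetes source: the function name `qos_class` and the dictionary layout are hypothetical, and quantities are compared as plain strings, whereas Kubernetes compares parsed quantities (so `"7"` and `"7000m"` would be treated as equal there but not here). The values are the guess from above, not measured numbers.

```python
def qos_class(resources: dict) -> str:
    """Classify a single-container pod's QoS from its `resources` mapping.

    Simplified sketch: real pods can have several containers, and every
    container must satisfy the rule for the pod to be Guaranteed.
    """
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # No requests and no limits at all: BestEffort.
    if not requests and not limits:
        return "BestEffort"

    for name in ("cpu", "memory"):
        limit = limits.get(name)
        # Kubernetes defaults an omitted request to the limit.
        request = requests.get(name, limit)
        # Assumption: quantities are compared as strings, not parsed.
        if limit is None or request != limit:
            return "Burstable"

    # CPU and memory limits are both set and match their requests.
    return "Guaranteed"


# The guess from this issue: 7 cpu and 34Gi memory, requests == limits.
guessed = {
    "requests": {"cpu": "7", "memory": "34Gi"},
    "limits": {"cpu": "7", "memory": "34Gi"},
}
print(qos_class(guessed))
```

A job that sets only requests, or only one of CPU and memory limits, would land in Burstable under this rule, which is why both limits and matching requests are asked for.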

General steps to follow:

/sig testing
/sig release
/area jobs
/area release-eng

@spiffxp added the kind/cleanup label Aug 1, 2020
@k8s-ci-robot added the sig/testing, sig/release, area/jobs, and area/release-eng labels Aug 1, 2020
@spiffxp added the help wanted label Aug 1, 2020
@helenfeng737
Contributor

/assign

@spiffxp
Member Author

spiffxp commented Aug 3, 2020

/remove-help
since @ZhiFeng1993 has it

@k8s-ci-robot removed the help wanted label Aug 3, 2020
@helenfeng737
Contributor

/close

I think it's safe to close this one. cc. @RobertKielty

@k8s-ci-robot
Contributor

@ZhiFeng1993: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close

I think it's safe to close this one. cc. @RobertKielty

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Member Author

spiffxp commented Aug 12, 2020

Looking at testgrids

https://testgrid.k8s.io/sig-release-master-blocking#build-master&graph-metrics=test-duration-minutes&width=20 - looks mostly ok

https://testgrid.k8s.io/sig-release-1.19-blocking#build-1.19&graph-metrics=test-duration-minutes&width=20 - same

https://testgrid.k8s.io/sig-release-1.18-blocking#build-1.18&graph-metrics=test-duration-minutes&width=20 - this concerns me
[Screenshot of the build-1.18 test-duration graph, 2020-08-12]

The two peaks on the right are what I think is "good" behavior... the job fails due to timeout for Reasons™, but there is a followup run that doesn't, and the build is available for use.

The two peaks on the left aren't followed by any OK runs. Did incomplete builds get published?

More importantly... is this new behavior? Or has this been happening prior to us setting resource constraints?

@spiffxp
Member Author

spiffxp commented Aug 12, 2020

The resource limit changes merged 2020-08-03 4:20pm PDT

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-build-stable1/1289866689743163392 is an example of a build falling prey to this problem before the change was deployed, so I don't think this is new behavior.

@spiffxp
Member Author

spiffxp commented Aug 12, 2020

I opened #18808 to address the bad build job behavior

Given that it's pre-existing bad behavior, I'm willing to call this done

/close

@spiffxp
Member Author

spiffxp commented Aug 12, 2020

Thanks @ZhiFeng1993 !

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

I opened #18808 to address the bad build job behavior

Given that it's pre-existing bad behavior, I'm willing to call this done

/close
