
Raise k8s-infra-prow-build quotas in anticipation of handling merge-blocking jobs #1132

Closed

spiffxp opened this issue Aug 11, 2020 · 18 comments

Labels:
  • area/prow: Setting up or working with prow in general, prow.k8s.io, prow build clusters
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments


spiffxp commented Aug 11, 2020

The node pool is currently set up as 3 * (6 to 30) n1-highmem-8's, i.e. a regional node pool spread across three zones, each autoscaling between 6 and 30 nodes. We don't have enough quota to hit the max node pool size.

In terms of resources, we need at least:

  • 3 * 30 * 8 = 720 CPUs
  • 3 * 30 * 250 = 22500 Gi SSD capacity
  • 3 * 30 = 90 in-use IP addresses

If we want to match the size of the k8s-prow-builds cluster, which has 160 nodes, we should ask for more (the sketch below re-runs this math).

/wg k8s-infra
/area prow
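
As a sanity check, a minimal sketch (editorial, not part of the issue) that re-runs the arithmetic above; the zone count, per-zone maximum, CPUs per node, and per-node SSD figure are taken straight from the bullets in this comment:

```python
# Re-run the quota math from the comment above.
ZONES = 3                 # regional node pool spread across three zones
MAX_NODES_PER_ZONE = 30   # autoscaling ceiling per zone
CPUS_PER_NODE = 8         # n1-highmem-8
SSD_PER_NODE = 250        # per-node SSD, as in the bullets above

max_nodes = ZONES * MAX_NODES_PER_ZONE       # 90
print("CPUs:", max_nodes * CPUS_PER_NODE)    # 720
print("SSD:", max_nodes * SSD_PER_NODE)      # 22500
print("in-use IPs:", max_nodes)              # one external IP per node -> 90
```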

@k8s-ci-robot added the wg/k8s-infra and area/prow labels on Aug 11, 2020

spiffxp commented Aug 11, 2020

Submitted requests for:

  • 1024 CPUs in us-central1
  • 100 in-use IP addresses in us-central1


spiffxp commented Aug 11, 2020

Well, the 1024 CPU request went through just fine.

The 100 in-use IPs...

Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history.

So, I'll hold this open and see what comes back in two days. Quota is 69 in-use IP addresses until then.


spiffxp commented Aug 11, 2020

Part of kubernetes/test-infra#18550


spiffxp commented Aug 21, 2020

/priority critical-urgent
/assign @idvoretskyi @thockin

I have repeatedly tried to file for 100 in-use IPs and been rejected every time. We bumped into the IP quota yesterday while autoscaling to handle PR traffic:

[Screenshot: in-use IP address quota, 2020-08-21 9:35 AM]

I'm escalating because, in the grand scheme of things, our PR load looks pretty low, and I anticipate we will bump into this more once we see real traffic (opening up for v1.20):

[Screenshot: PR traffic, 2020-08-21 9:37 AM]

There are some things we can do to work around or address this:

  • migrate jobs back to the google.com k8s-prow-builds cluster
  • migrate to a node pool of larger nodes
    • this seems like the wrong direction; I suspect we want more, smaller nodes for i/o isolation
  • put the squeeze on job resources
    • our cluster-level graphs make the cluster look really underutilized
    • raising utilization may lead to more flakiness in the jobs that have migrated over
  • see if regions other than us-central1 would give us the IP quota we need (a quota-check sketch follows this list)
    • may need to balance with cost?
  • set up more small build clusters
    • we can't share a boskos across build clusters, so we'd dedicate one cluster to e2e's that need GCP projects
    • could do a greenhouse instance per cluster (TODO: see what the bazel/non-bazel breakdown of jobs is)
    • would probably take this opportunity to move away from regional clusters
  • try setting up a private cluster (https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters) to avoid an external IP per node
    • this would deviate from k8s-prow-builds' behavior
    • NAT for outbound access might present some challenges

It would be really nice to be able to just raise our quota.
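
For the cross-region option above, a rough sketch of how one might compare current usage against limits for the relevant quotas. This is editorial, not part of any existing k8s-infra tooling; it assumes the google-api-python-client library and read access to the project:

```python
# Sketch: compare the quotas discussed in this issue across candidate regions.
# Assumes `pip install google-api-python-client` and application-default credentials.
from googleapiclient import discovery

PROJECT = "k8s-infra-prow-build"
REGIONS = ["us-central1", "us-west1", "us-east1"]
METRICS = {"CPUS", "SSD_TOTAL_GB", "IN_USE_ADDRESSES"}

compute = discovery.build("compute", "v1")
for region in REGIONS:
    info = compute.regions().get(project=PROJECT, region=region).execute()
    for quota in info.get("quotas", []):
        if quota["metric"] in METRICS:
            print(f'{region:12} {quota["metric"]:18} {quota["usage"]:>8.0f} / {quota["limit"]:.0f}')
```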

@k8s-ci-robot added the priority/critical-urgent label on Aug 21, 2020

thockin commented Aug 21, 2020

I will see what else I can learn internally, but as to the mitigations, I think some should be considered:

migrate to a nodepool of larger nodes

16-core is a good sweet spot; I think we should try it

put the squeeze on job resources
our cluster-level graphs make the cluster look really underutilized

We should try this (slowly) - we don't want to be wasteful

see if regions other than us-central1 would give us the ip quota we need

I don't see why we would not do this anyway, just for sanity in case of failure.

would probably take this opportunity to move away from regional clusters

How would this affect the quota?

try setting up a private cluster (https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters) to avoid external ip per node

I think this is the real solution. I don't think we really need each node to have an IP anyway?
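
For context on the private-cluster option, an editorial sketch of the relevant knob (assumed API usage via the google-cloud-container client; nothing here is already set up): with private nodes, nodes get only internal IPs, so the per-node in-use external IP quota stops being the constraint, and outbound access needs Cloud NAT instead.

```python
# Sketch of the private-cluster setting (see the GKE docs linked above; assumptions noted).
from google.cloud import container_v1

private_config = container_v1.PrivateClusterConfig(
    enable_private_nodes=True,       # nodes get no external IPs
    enable_private_endpoint=False,   # keep a public control-plane endpoint
    master_ipv4_cidr_block="172.16.0.0/28",  # illustrative CIDR, not a real allocation
)

# This would be supplied as Cluster(private_cluster_config=private_config, ...) when
# creating a cluster; it is typically a cluster-creation-time choice, which is part of
# why this option involves more unknowns than just raising quota.
```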


spiffxp commented Aug 21, 2020

Submitted a request for 40960 GB SSD in us-central1 (the quota page claims we were hitting our 20480 GB limit), which was approved.

16 core is a good sweet-spot, I think we should try it

I'll see if I can set up a pool2 node pool on the existing cluster and shift things over during some quiet time (roughly sketched at the end of this comment).

I don't see why we would not do this anyway, just for sanity in case of failure.

Tried asking for 100 IPs in us-west1 and us-east1; both rejected.

I think this is the real solution. I don't think we really need each node to have an IP anyway?

I agree. I just anticipate it could bump into the most unknowns along the way, and my bandwidth is currently limited.
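
Purely as an illustration of the "pool2" idea above (the real node pools are managed through this repo's infra config, not ad-hoc API calls), a sketch of adding an autoscaled n1-highmem-16 pool to the existing regional cluster; the cluster name, disk size, and autoscaling bounds are assumptions:

```python
# Illustrative sketch only; assumes `pip install google-cloud-container` and that the
# cluster is named "prow-build" in us-central1 (both assumptions, not confirmed here).
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
parent = "projects/k8s-infra-prow-build/locations/us-central1/clusters/prow-build"

pool2 = container_v1.NodePool(
    name="pool2",
    initial_node_count=1,  # per zone, since the cluster is regional
    config=container_v1.NodeConfig(
        machine_type="n1-highmem-16",
        disk_type="pd-ssd",
        disk_size_gb=250,  # mirrors the 250 per-node figure used earlier; assumed
    ),
    autoscaling=container_v1.NodePoolAutoscaling(
        enabled=True,
        min_node_count=1,   # per zone; illustrative
        max_node_count=15,  # per zone; illustrative
    ),
)

op = client.create_node_pool(request={"parent": parent, "node_pool": pool2})
print(op.name, op.status)
```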

helenfeng737 commented:

@spiffxp do we need to move some jobs out of that cluster while waiting?


thockin commented Aug 21, 2020 via email


spiffxp commented Aug 21, 2020

Maybe we can't get 100 IPs in each, but could we spread the load between regions, so we get 50 in each?

This is basically the "set up more small build clusters" option, since each region would need its own regional build cluster anyway. It does avoid setting up a new GCP project for each cluster, though.

I'll look into it; we might be able to split up jobs in a way that makes sense.


spiffxp commented Aug 21, 2020

@ZhiFeng1993

do we need to move some jobs out of that cluster while waiting?

I would like to hold off on moving things away from community-accessible infra for now. Flipping back to k8s-prow-builds is a pretty quick change if we decide we have to move quickly and/or are out of options.


spiffxp commented Aug 24, 2020

set up more small build clusters

I tried raising CPU and SSD quota in us-west1 to be able to create an equivalently sized build cluster over there, in the same k8s-infra-prow-build project. Both requests were automatically rejected.


spiffxp commented Aug 24, 2020

There is suspicion that moving to n1-highmem-16's has actually increased flakiness, specifically for certain jobs across release branches.

I have opened #1172 to start rolling back.


spiffxp commented Aug 24, 2020

Opened #1173 to track the rollback


spiffxp commented Aug 25, 2020

set up more small build clusters

I was able to raise CPU quota in us-east1 to 1024, but was rejected for SSD and IP quota requests.

Next step would be to try raising quotas for a different GCP project, in case k8s-infra-prow-build has gotten flagged for some reason.


spiffxp commented Aug 25, 2020

In terms of resources, we need at least:

  • 3 * 30 * 8 = 720 CPUs
  • 3 * 30 * 250 = 22500 Gi SSD capacity
  • 3 * 30 = 90 in-use IP addresses

OK, the quota changes came through (thank you @thockin); I'm feeling better about our immediate capacity requirements being met in us-central1:

  • 1024 CPUs
  • 40960 GB SSD
  • 150 in-use IP addresses

So now we'll at least be able to bump into our autoscaling limits.

/remove-priority critical-urgent
/priority important-soon

@k8s-ci-robot added the priority/important-soon label and removed the priority/critical-urgent label on Aug 25, 2020

spiffxp commented Aug 25, 2020

try setting up a private cluster

I have broken this out into its own issue #1178


spiffxp commented Aug 27, 2020

Quotas for us-central1 are now at:

  • 1440 CPUs
  • 81920 GB SSD
  • 150 in-use IP addresses

Based on how things have been behaving today with v1.20 merges, I'm comfortable calling this done. We can open further issues as our needs evolve.

/close

k8s-ci-robot commented:

@spiffxp: Closing this issue.

In response to the /close above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
