GKE clusters get two core nodes without being CPU / Memory constrained #2199

Closed
consideRatio opened this issue Feb 13, 2023 · 6 comments

consideRatio commented Feb 13, 2023

It seems that at least some GKE clusters, like linked-earth, get two core nodes instead of one, which I think is a waste of cloud resources. I believe this happens because the cluster-autoscaler scales up for konnectivity-agent so that it can run three pods in a 2+1 configuration instead of putting them all on the same node.

This isn't suitable for us, and I'm not sure how we ought to avoid it. I think we could kubectl edit resources like the konnectivity-agent Deployment, or influence the konnectivity-agent-autoscaler that runs gke.gcr.io/cluster-proportional-autoscaler. But how do we make such a change so that it isn't reverted by GKE at a later point in time?

The problem stems from the use of a cluster-proportional-autoscaler that adds one pod per node, while the konnectivity-agent pods don't tolerate the user nodes and therefore stack up on the core nodes. In a situation with two core nodes and one user node, we currently have 3+1 konnectivity-agent pods running on the core nodes.

kind: Deployment
metadata:
  annotations:
    components.gke.io/layer: addon
    credential-normal-mode: "true"
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: konnectivity-agent
  name: konnectivity-agent
  namespace: kube-system
spec:
  replicas: 3
  # konnectivity-agent's pod spec includes this
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        k8s-app: konnectivity-agent
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        k8s-app: konnectivity-agent
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
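
For reference, one way to see where the konnectivity-agent-autoscaler reads its scaling parameters from could be to inspect its Deployment for referenced ConfigMaps. This is only a sketch; the grep pattern is a guess and the exact resource layout on GKE may differ.

# Sketch: find which ConfigMap (if any) the autoscaler deployment references
kubectl get deploy -n kube-system konnectivity-agent-autoscaler -o yaml | grep -i -A5 configmap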



consideRatio commented Feb 13, 2023

Aha, maybe it was always only calico-typha-8759b9648-h4smt doing this.

Looking at events after draining a core node, I saw that it's a calico-typha pod that triggers the scale-up: it can't be scheduled on the other core node because of a lack of free host ports, and not on the user nodes because it doesn't tolerate their taint.

kubectl get events --sort-by='.metadata.creationTimestamp' -n kube-system | grep -B5 -A5 TriggeredScale

13m         Normal    TriggeredScaleUp       pod/calico-typha-8759b9648-rtfj4                                  pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/linked-earth-hubs/zones/us-central1-c/instanceGroups/gke-linked-earth-cluster-core-pool-903b36c0-grp 2->3 (max: 5)}]
kubectl describe pod calico-typha-8759b9648-h4smt

Warning  FailedScheduling  33s   default-scheduler   0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable. preemption: 0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 Preemption is not helpful for scheduling.
    ports:
    - containerPort: 5473
      hostPort: 5473
      name: calico-typha
      protocol: TCP
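
The scheduling failure on the second core node comes from that hostPort: at most one pod binding hostPort 5473 can run per node. A quick way to confirm which ports calico-typha requests could be something like this (a sketch; the jsonpath expression is mine and the output formatting may vary):

# Sketch: list the ports (incl. hostPort) requested by calico-typha's containers
kubectl get deploy -n kube-system calico-typha -o jsonpath='{.spec.template.spec.containers[*].ports}'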

Hmmm... one can't force two calico-typha replicas onto the same node, I guess. Its replica count and resources are also controlled by dedicated autoscalers:

calico-typha-8759b9648-9l85z                                  1/1     Running   0          3h8m
calico-typha-8759b9648-h4smt                                  0/1     Pending   0          15s
calico-typha-horizontal-autoscaler-69dd48c655-r4v9x           1/1     Running   0          123m
calico-typha-vertical-autoscaler-84dbc54cd5-z9h6w             1/1     Running   0          23m

The horizontal autoscaler decides on the pod count and the vertical autoscaler on the resources granted. The horizontal autoscaler is configured like this:

kubectl get cm -o yaml calico-typha-horizontal-autoscaler
apiVersion: v1
data:
  ladder: |-
    {
      "coresToReplicas": [],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [100, 3],
        [250, 4],
        [500, 5],
        [1000, 6],
        [1500, 7],
        [2000, 8]
      ]
    }

Editing the configmap to, for example, the following made us get only one pod, which we will keep until the cluster has 10 nodes in total. At that point it may not be so problematic if another core node is added.

data:
   ladder: |-
     {
       "coresToReplicas": [],
       "nodesToReplicas":
       [
         [1, 1],
-        [2, 2],
+        [10, 2],
         [100, 3],
         [250, 4],
         [500, 5],
         [1000, 6],
         [1500, 7],
         [2000, 8]
       ]
     }


consideRatio commented Feb 13, 2023

Action points

Figure out what makes sense to do with this.

  • Disable calico on GKE and stop getting NetworkPolicy enforcement?
  • Re-configure calico-typha via the configmap to only have one pod until there are 10 nodes or similar?
  • Allow calico-typha to also schedule on a user node (then it must fit on any given user node, but it's also vertically scaled, so we can't know the resource requests ahead of time)
  • Lower calico-typha's priority to -10 or lower so it can't trigger a scale-up, but then also risk it getting evicted and ending up with 0 replicas available?

I'm not sure, and I don't really like any of the options. I lean towards adjusting the configmap with a kubectl patch command that writes the multiline JSON string to the ladder key, along the lines of the sketch below.
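
A sketch of what such a patch could look like, assuming the whole ladder document is written as one escaped JSON string (this isn't necessarily the exact command to use):

# Sketch, not the exact command used: write the adjusted ladder as one JSON string
kubectl patch configmap calico-typha-horizontal-autoscaler -n kube-system \
  --type merge \
  -p '{"data":{"ladder":"{\n  \"coresToReplicas\": [],\n  \"nodesToReplicas\":\n  [\n    [1, 1],\n    [10, 2],\n    [100, 3],\n    [250, 4],\n    [500, 5],\n    [1000, 6],\n    [1500, 7],\n    [2000, 8]\n  ]\n}"}}'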

@consideRatio

To resolve issues, I think we should go for two core nodes with a 2:16 CPU:RAM ratio. I don't want that to be tracked in this issue though; I've opened #2212 for it.

@consideRatio

I've applied the patch in #2199 (comment) to reduce costs for the callysto cluster, which can run with a single n2-highmem-2 core node and only ended up with two because of calico-typha's horizontal autoscaler.

So for now, I've made the horizontal autoscaler only add a second replica once we reach 10 nodes.

@yuvipanda

@consideRatio was this manually done? I am curious if it'll just come back on cluster upgrade.

@consideRatio

@yuvipanda this was a manually applied change to the calico-typha-horizontal-autoscaler's configuration in a configmap, and it was not overridden by either a node pool upgrade or a k8s control plane upgrade!

So, it seems like a change like this is quite robust!
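
One thing that might still be worth checking (a sketch; the jsonpath label escaping is mine): whether the configmap carries an addonmanager.kubernetes.io/mode label set to Reconcile, since the addon manager could then in principle reconcile manual edits away, unlike EnsureExists.

# Sketch: check if the configmap is addon-managed in Reconcile mode
kubectl get cm -n kube-system calico-typha-horizontal-autoscaler \
  -o jsonpath='{.metadata.labels.addonmanager\.kubernetes\.io/mode}'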
