GKE clusters get two core nodes without being CPU / Memory constrained #2199

Closed
consideRatio opened this issue Feb 13, 2023 · 6 comments

consideRatio commented Feb 13, 2023

It seems that at least some GKE clusters, like linked-earth, get two core nodes instead of one, which I think is a waste of cloud resources. I believe this happens because the cluster-autoscaler scales up for konnectivity-agent so that it can run three pods in a 2+1 configuration instead of putting them all on the same node.

This isn't suitable for us, and I'm not sure how we ought to avoid it. I think we could kubectl edit resources like the konnectivity-agent Deployment, or influence the konnectivity-agent-autoscaler that runs gke.gcr.io/cluster-proportional-autoscaler. But how do we make such a change so that it isn't reverted by GKE at a later point in time?

The problem stems from the use of a cluster-proportional-autoscaler that adds one pod per node, while the konnectivity-agent pods don't tolerate the user nodes and therefore stack up on the core nodes. In a situation with two core nodes and one user node, we currently have 3+1 konnectivity-agent pods running on the core nodes.

kind: Deployment
metadata:
  annotations:
    components.gke.io/layer: addon
    credential-normal-mode: "true"
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: konnectivity-agent
  name: konnectivity-agent
  namespace: kube-system
spec:
  replicas: 3
  # konnectivity-agent's pod spec includes this
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        k8s-app: konnectivity-agent
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        k8s-app: konnectivity-agent
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
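
For reference, one way to see where the konnectivity-agent-autoscaler reads its scaling parameters from could be to inspect its Deployment for referenced ConfigMaps. This is only a sketch; the grep pattern is a guess and the exact resource layout on GKE may differ.

# Sketch: find which ConfigMap (if any) the autoscaler deployment references
kubectl get deploy -n kube-system konnectivity-agent-autoscaler -o yaml | grep -i -A5 configmap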



consideRatio commented Feb 13, 2023

Aha, maybe it was always only calico-typha-8759b9648-h4smt doing this.

Looking at events after draining a core node, I saw that it's a calico-typha pod that triggers the scale-up: it can't be scheduled on the other core node because of a lack of free host ports, and not on the user nodes because it doesn't tolerate their taint.

kubectl get events --sort-by='.metadata.creationTimestamp' -n kube-system | grep -B5 -A5 TriggeredScale

13m         Normal    TriggeredScaleUp       pod/calico-typha-8759b9648-rtfj4                                  pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/linked-earth-hubs/zones/us-central1-c/instanceGroups/gke-linked-earth-cluster-core-pool-903b36c0-grp 2->3 (max: 5)}]
kubectl describe pod calico-typha-8759b9648-h4smt

Warning  FailedScheduling  33s   default-scheduler   0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable. preemption: 0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 Preemption is not helpful for scheduling.
    ports:
    - containerPort: 5473
      hostPort: 5473
      name: calico-typha
      protocol: TCP
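
The scheduling failure on the second core node comes from that hostPort: at most one pod binding hostPort 5473 can run per node. A quick way to confirm which ports calico-typha requests could be something like this (a sketch; the jsonpath expression is mine and the output formatting may vary):

# Sketch: list the ports (incl. hostPort) requested by calico-typha's containers
kubectl get deploy -n kube-system calico-typha -o jsonpath='{.spec.template.spec.containers[*].ports}'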

Hmmm... one can't force two calico-typha replicas onto the same node, I guess. Its replica count and resources are also controlled by dedicated autoscalers:

calico-typha-8759b9648-9l85z                                  1/1     Running   0          3h8m
calico-typha-8759b9648-h4smt                                  0/1     Pending   0          15s
calico-typha-horizontal-autoscaler-69dd48c655-r4v9x           1/1     Running   0          123m
calico-typha-vertical-autoscaler-84dbc54cd5-z9h6w             1/1     Running   0          23m

The horizontal autoscaler decides on the pod count and the vertical autoscaler on the resources granted. The horizontal autoscaler is configured like this:

kubectl get cm -o yaml calico-typha-horizontal-autoscaler
apiVersion: v1
data:
  ladder: |-
    {
      "coresToReplicas": [],
      "nodesToReplicas":
      [
        [1, 1],
        [2, 2],
        [100, 3],
        [250, 4],
        [500, 5],
        [1000, 6],
        [1500, 7],
        [2000, 8]
      ]
    }

Editing the configmap to, for example, the following made us get only one pod, which we will keep until the cluster has 10 nodes in total. At that point it may not be so problematic if another core node is added.

data:
   ladder: |-
     {
       "coresToReplicas": [],
       "nodesToReplicas":
       [
         [1, 1],
-        [2, 2],
+        [10, 2],
         [100, 3],
         [250, 4],
         [500, 5],
         [1000, 6],
         [1500, 7],
         [2000, 8]
       ]
     }


consideRatio commented Feb 13, 2023

Action points

Figure out what makes sense to do with this.

  • Disable calico on GKE and stop getting NetworkPolicy enforcement?
  • Re-configure calico-typha via the configmap to only have one pod until there are 10 nodes or similar?
  • Allow calico-typha to also schedule on a user node (then it must fit on any given user node, but it's also vertically scaled, so we can't know the resource requests ahead of time)
  • Lower calico-typha's priority to -10 or lower so it can't trigger a scale-up, but then also risk it getting evicted and ending up with 0 replicas available?

I'm not sure, and I don't really like any of the options. I lean towards adjusting the configmap with a kubectl patch command that writes the multiline JSON string to the ladder key, along the lines of the sketch below.
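
A sketch of what such a patch could look like, assuming the whole ladder document is written as one escaped JSON string (this isn't necessarily the exact command to use):

# Sketch, not the exact command used: write the adjusted ladder as one JSON string
kubectl patch configmap calico-typha-horizontal-autoscaler -n kube-system \
  --type merge \
  -p '{"data":{"ladder":"{\n  \"coresToReplicas\": [],\n  \"nodesToReplicas\":\n  [\n    [1, 1],\n    [10, 2],\n    [100, 3],\n    [250, 4],\n    [500, 5],\n    [1000, 6],\n    [1500, 7],\n    [2000, 8]\n  ]\n}"}}'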

@consideRatio

To resolve issues, I think we should go for two core nodes with a 2:16 CPU:RAM ratio. I don't want that to be tracked in this issue though; I've opened #2212 for it.

@consideRatio

I've applied the patch in #2199 (comment) to reduce costs for the callysto cluster, which can run with a single n2-highmem-2 core node and only ended up with two because of calico-typha's horizontal autoscaler.

So for now, I've made the horizontal autoscaler only add a second replica once we reach 10 nodes.

@yuvipanda

@consideRatio was this manually done? I am curious if it'll just come back on cluster upgrade.

@consideRatio

@yuvipanda this was a manually applied change to the calico-typha-horizontal-autoscaler's configuration in a configmap, and it was not overridden by either a node pool upgrade or a k8s control plane upgrade!

So, it seems like a change like this is quite robust!
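
One thing that might still be worth checking (a sketch; the jsonpath label escaping is mine): whether the configmap carries an addonmanager.kubernetes.io/mode label set to Reconcile, since the addon manager could then in principle reconcile manual edits away, unlike EnsureExists.

# Sketch: check if the configmap is addon-managed in Reconcile mode
kubectl get cm -n kube-system calico-typha-horizontal-autoscaler \
  -o jsonpath='{.metadata.labels.addonmanager\.kubernetes\.io/mode}'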
