
set kubelet defaults for --cgroups-per-qos & --enforce-node-allocatable #277

Closed · angrox opened this issue Mar 28, 2018 · 13 comments

angrox commented Mar 28, 2018

Disclaimer: This issue references the following issue in acs-engine:
Azure/acs-engine#2263
with the corresponding fix in this pull request: Azure/acs-engine#2310

In my AKS cluster (1.9.2) in westeurope I am seeing the same issue as described in the acs-engine issue:

Warning FailedNodeAllocatableEnforcement 1m (x106 over 1h) kubelet, aks-nodepool1-77770737-0 Failed to update Node Allocatable Limits "": failed to set supported cgroup subsystems for cgroup : Failed to set config for supported subsystems : failed to write 8342003712 to memory.limit_in_bytes: write /var/lib/docker/overlay2/7428add845f7e87ff8620731e8d9ef63a703255de49fb2a2f1d8a867f491f420/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: invalid argument

When will the AKS clusters be updated with the fix?
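
A quick way to check whether a given cluster is hit by this is to look for the FailedNodeAllocatableEnforcement events directly; a minimal sketch (the node name is just an example, and --field-selector requires a reasonably recent kubectl):

# List FailedNodeAllocatableEnforcement events across the cluster
$ kubectl get events --all-namespaces --field-selector reason=FailedNodeAllocatableEnforcement

# Or check a single node's event stream (node name is an example)
$ kubectl describe node aks-nodepool1-77770737-0 | grep FailedNodeAllocatableEnforcement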

angrox (Author) commented Mar 28, 2018

Official Support Call is also open: 118032817900562

slack (Contributor) commented Apr 3, 2018

The AKS rollout this week includes acs-engine 0.14.6. This patch will be available in all AKS regions by the end of the week, and will be applied to existing clusters via az aks upgrade.
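
For reference, applying this to an existing cluster boils down to something like the following; the resource group, cluster name, and target version below are placeholders:

# See which Kubernetes versions the cluster can be upgraded to
$ az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table

# Upgrade; nodes are replaced during the upgrade and pick up the new kubelet defaults
$ az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.9.6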

angrox (Author) commented Apr 3, 2018

Thanks! I will close this issue as soon as the patch is available and we've tested it!

bbrosemer commented
@slack Does az aks upgrade imply it will be aligned with a Kubernetes upgrade too?

slack (Contributor) commented Apr 4, 2018

@bbrosemer yeah, the updated configuration will be applied as the nodes are replaced during upgrade.

guesslin commented Apr 9, 2018

@slack az aks upgrade is not working for us; our AKS cluster ends up in the Failed state:

$ az aks list
Name    Location    ResourceGroup    KubernetesVersion    ProvisioningState    Fqdn
------  ----------  ---------------  -------------------  -------------------  -------------------------------------------------
stage1  eastus      stage            1.8.10               Failed               stage1-stage-2f4d48-39f0e898.hcp.eastus.azmk8s.io
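
One way to pull more detail on a cluster stuck in Failed, using the resource group and cluster name from the listing above, is roughly:

# Show just the provisioning state
$ az aks show --resource-group stage --name stage1 --query provisioningState

# Dump the full resource for inspection
$ az aks show --resource-group stage --name stage1 --output json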

angrox (Author) commented Apr 9, 2018

Currently I am upgrading from 1.9.2 to 1.9.6. When I log in to an updated machine I do not see the change in the kubelet configuration:

$ kubectl get nodes
NAME                       STATUS    ROLES     AGE       VERSION
[...]
aks-nodepool1-77770xxx-5   Ready     agent     2h        v1.9.6

Errors:

$ kubectl describe node aks-nodepool1-77770xxx-5
[...]
Events:
  Type     Reason                            Age                From                               Message
  ----     ------                            ----               ----                               -------
  Warning  FailedNodeAllocatableEnforcement  2m (x136 over 2h)  kubelet, aks-nodepool1-77770737-5  Failed to update Node Allocatable Limits "": failed to set supported cgroup subsystems for cgroup : Failed to set config for supported subsystems : failed to write 8342003712 to memory.limit_in_bytes: write /var/lib/docker/overlay2/5df472c59f31fb8272481b920fe782e5310d622f77f90d069a8f85ef05277cc7/merged/sys/fs/cgroup/memory/memory.limit_in_bytes: invalid argument

Arguments for kubelet on the node (in /etc/default/kubelet)

KUBELET_CONFIG=--address=0.0.0.0 --allow-privileged=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cadvisor-port=0 --cgroups-per-qos=false --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --enforce-node-allocatable= --event-qps=0 --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=Accelerators=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --keep-terminated-pod-volumes=false --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --network-plugin=kubenet --node-status-update-frequency=10s --non-masquerade-cidr=10.0.0.0/8 --pod-infra-container-image=k8s-gcrio.azureedge.net/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests

Edit: The upgrade failed ("provisioningState": "Failed").

Edit2: Detailed error message:

   "properties": {
        "statusCode": "Conflict",
        "statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceDeploymentFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"VMExtensionProvisioningError\",\"message\":\"VM has reported a failure when processing extension 'cse6'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=5\\n[stdout]\\n\\n[stderr]\\n\\\".\"}]}}",
        "serviceRequestId": "7aae6d60-5c6d-429c-9ab7-6237d29e640c"
    },
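
The VMExtensionProvisioningError above points at the custom script extension (cse6) on one of the VMs; assuming acs-engine-style provisioning, the underlying error is usually visible in the provisioning logs on the affected node (the exact path is an assumption and may vary by image):

# Provisioning / custom script extension output on the node (path is an assumption)
$ sudo tail -n 100 /var/log/azure/cluster-provision.log
$ sudo ls /var/log/azure/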

jackfrancis (Member) commented
@angrox I infer you don't see --cgroups-per-qos=true in your kubelet runtime config?

ps auxfww | grep /usr/local/bin/kubelet is one way to grok that
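
A slightly more targeted variant that prints only the two flags in question, assuming the kubelet command line is visible from the host, could be:

# Extract just the two relevant flags from the running kubelet's command line
$ ps auxfww | grep '[k]ubelet' | grep -o -- '--cgroups-per-qos=[^ ]*\|--enforce-node-allocatable=[^ ]*' | sort -u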

angrox (Author) commented Apr 10, 2018

@jackfrancis yeah, it is not there.

root 12008 0.0 0.0 161276 7000 ? Ssl Apr09 0:03 /usr/bin/docker run --net=host --pid=host --privileged --rm --volume=/:/rootfs:ro,shared --volume=/dev:/dev --volume=/sys:/sys:ro --volume=/var/run:/var/run:rw --volume=/var/lib/cni/:/var/lib/cni:rw --volume=/sbin/apparmor_parser/:/sbin/apparmor_parser:rw --volume=/var/lib/docker/:/var/lib/docker:rw,shared --volume=/var/lib/containers/:/var/lib/containers:rw --volume=/var/lib/kubelet/:/var/lib/kubelet:rw,shared --volume=/var/log:/var/log:rw --volume=/etc/kubernetes/:/etc/kubernetes:ro --volume=/srv/kubernetes/:/srv/kubernetes:ro --volume=/var/lib/waagent/ManagedIdentity-Settings:/var/lib/waagent/ManagedIdentity-Settings:ro --volume=/etc/kubernetes/volumeplugins:/etc/kubernetes/volumeplugins:rw k8s-gcrio.azureedge.net/hyperkube-amd64:v1.9.6 /hyperkube kubelet --containerized --enable-server --node-labels=kubernetes.io/role=agent,agentpool=nodepool1,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/cluster=MC_cn-kubernetes-dev_cn-kubernetes-dev_westeurope --v=2 --non-masquerade-cidr=10.0.0.0/8 --volume-plugin-dir=/etc/kubernetes/volumeplugins --address=0.0.0.0 --allow-privileged=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cadvisor-port=0

--cgroups-per-qos=false

--cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --enforce-node-allocatable= --event-qps=0 --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=Accelerators=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --keep-terminated-pod-volumes=false --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=110 --network-plugin=kubenet --node-status-update-frequency=10s --non-masquerade-cidr=10.0.0.0/8 --pod-infra-container-image=k8s-gcrio.azureedge.net/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests

@slack So the fix was not applied?
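
For comparison, based on the issue title and the acs-engine fix referenced at the top, the patched flags are expected to carry the upstream defaults instead of the values shown above, i.e. roughly:

# Expected values after the fix (an assumption based on the referenced acs-engine PR)
--cgroups-per-qos=true --enforce-node-allocatable=pods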

angrox (Author) commented Apr 10, 2018

Installing a new cluster fixes the issue; the patch is in place there. Upgrading (see posts above) does not update the config.
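
For anyone hitting the same thing, creating a fresh cluster on the patched tooling is one workaround; a minimal sketch with placeholder names:

# Create a new cluster (resource group, name, version, and node count are placeholders)
$ az aks create --resource-group myResourceGroup --name myNewAKSCluster --kubernetes-version 1.9.6 --node-count 3 --generate-ssh-keys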

jackfrancis (Member) commented
@slack we should follow up on this, as @angrox's experience does not match our expectations. Thanks @angrox for your stamina here!

angrox (Author) commented Apr 11, 2018

@jackfrancis I am also in contact with one of the Microsoft escalation managers and have given them access to our defective clusters. If you need more information, please PM me.

jnoller (Contributor) commented Apr 3, 2019

Closing stale/resolved

@jnoller jnoller closed this as completed Apr 3, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 9, 2020