
VM has reported a failure when processing extension 'cse0' #1806

Closed
IvanCaro opened this issue Nov 21, 2017 · 97 comments

@IvanCaro

IvanCaro commented Nov 21, 2017

Is this a request for help?:


Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
Version: canary
GitCommit: 8db990b
GitTreeState: clean

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=5\n[stdout]\n\n[stderr]\nstat: cannot stat '/opt/azure/containers/provision.complete': No such file or directory\n\"."
      }
    ]
  }
}

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

@jackfrancis
Member

@IvanCaro could you paste the api model that you passed as input to acs-engine?

@JackQuincy I wonder if cse has a non-idiomatic execution context for this change in some cases:

b5eb43b#diff-95c1c34f292e829cdcc06906aaf5c4f1

Does it make sense that a stat failure like the above would ever short-circuit as currently implemented?

@JackQuincy
Contributor

The only reason I could see this failing is if we called set -x or something of the sort earlier in the line/command,
which would be in parameters, common, etc.

@jackfrancis
Member

@JackQuincy I can reproduce consistently when specifying "orchestratorRelease": "1.6". @IvanCaro Were you attempting to build a 1.6 cluster when you received this error?

@IvanCaro
Author

Hey @jackfrancis, this happened with a custom VNET and custom DNS servers. I created the cluster without custom DNS (using the Azure-provided DNS) and changed it afterwards, and that works.

@tamilmani1989
Member

@jackfrancis @JackQuincy I have faced this issue before as well; when I changed the location and redeployed, it worked. The issue is not consistent and doesn't happen every time.

@jackfrancis
Member

@IvanCaro Do you agree this is an indeterminate, ephemeral issue and we should close this?

@mpluhar

mpluhar commented Jan 8, 2018

Same issue here using Kubernetes 1.6 as the orchestrator and a custom VNET. Is there a workaround for this?

Update: I had to add a maxPods value to the api model; otherwise the value would be empty and prevent the kubelet from starting.
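
For anyone hitting the same symptom, a quick check on an affected node (a sketch only; /etc/default/kubelet is an assumed location for the rendered kubelet flags and may differ between acs-engine versions) is:

# Check whether --max-pods was rendered with a value; an empty value prevents the kubelet from starting.
grep -o -e '--max-pods=[^ ]*' /etc/default/kubelet
# If that file isn't present on your image, the kubelet journal shows the failing startup instead:
sudo journalctl -u kubelet --no-pager | tail -n 30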

@ducas

ducas commented Jan 19, 2018

I found that I was having this because my nodes (master and agents) were not able to reach k8s.gcr.io to download kubectl. I discovered this by logging into the master and looking at /var/log/cluster-provision.log, which ended with:

+ echo 'kubernetes did not start'
kubernetes did not start
+ exit 3

I traced this back to here - https://github.com/Azure/acs-engine/edit/master/parts/k8s/kubernetesmastercustomscript.sh#L571

This indicated it was having trouble running kubectl, so I tried invoking it from the ssh terminal. Lo and behold - command not found. That file led me to the fact that it's installed using a service called kubectl-extract. Looking at its logs using sudo journalctl -n -u kubectl-extract I found the following output:

Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: Failed to start Kubectl extraction.
Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Unit entered failed state.
Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Failed with result 'exit-code'.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Service hold-off time over, scheduling restart.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: Stopped Kubectl extraction.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: Starting Kubectl extraction...
Jan 17 05:16:35 k8s-master-42756516-0 docker[45966]: Error response from daemon: Get https://k8s-gcrio.azureedge.net/v2/hyperkube-amd64/manifests/v1.7.9: Get https://k8s.gcr.io/v2/token?scope=repository%3Ahyperkube-amd64%3Apull&service=k8s.gcr.io: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Jan 17 05:16:35 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Control process exited, code=exited status=1

So there was a problem downloading kubectl from k8s.gcr.io. Turns out it was a DNS problem, but that's just my network... Hope this helps someone debug a related issue.
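
If you suspect the same failure mode, the checks below (a sketch based on the steps described above; the service, registry, and mirror names are the ones from this comment) confirm whether a node can resolve and reach the registry:

# Check the kubectl-extract service logs for registry/DNS errors.
sudo journalctl -u kubectl-extract --no-pager | tail -n 20
# Verify the node can resolve and reach the registries used during provisioning.
nslookup k8s.gcr.io
nslookup k8s-gcrio.azureedge.net
curl -sSI https://k8s-gcrio.azureedge.net/v2/ | head -n 1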

@jsturtevant
Collaborator

I ran into this as well. It failed when run in westus2. I changed to eastus and it worked.

@nakah

nakah commented Jan 25, 2018

I'm having the same issue when deploying to a custom VNET in West Europe using acs-engine 0.12.4 and Kubernetes 1.9.1.
I've backed up all logs from /var/log/azure and /var/log/containers if that can help.

@msorby

msorby commented Jan 30, 2018

Just to acknowledge this: I get the exact same thing with a custom VNET in West Europe; North Europe works just fine.

Update: This is flaky somehow; now I can't deploy even without a custom VNET without it getting stuck on the extension for the master node.
It's a mixed cluster with Windows and Linux agent pools. This is a setup that worked last week.
Taking the exact same output from acs-engine and deploying it to North Europe works fine.

@rodrigoffonseca

I got the same error, and the problem was DNS resolution in the VNET. After I fixed my custom DNS servers to resolve internet names, everything worked fine.

@msorby

msorby commented Jan 30, 2018

I'm getting this error with or without a custom VNET.
I get it from the simplest of configurations in West Europe; then I deploy the same generated ARM template to North Europe and it works. I'm guessing it's related to #2162.

And here is the build from a pull request with a potential fix that still failed, https://circleci.com/gh/Azure/acs-engine/14298?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

//Morten

@jackfrancis
Member

cse* errors are generally a result of the provisioning process on the host failing. I'd like to keep this ticket open to encourage folks to share (bad) experiences. We're working on (1) improving logging around this and (2) hunting down transient errors (such as a lack of DNS access would incur) and trying, where appropriate, to introduce additional resilience.

@feiskyer
Member

feiskyer commented Feb 3, 2018

Also met the same problem in eastus. cse0 timed out and the master VM can't be reached over SSH.

@jackfrancis
Member

Just identified one cause of transient cse errors (DNS availability race condition on cluster provisioning), added some retry resiliency and am hoping that eliminates that symptom. @feiskyer please try to repro using master next week and let me know if you can, thanks!
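
For context, the general shape of that kind of retry resiliency looks like the hypothetical sketch below (not the actual acs-engine change):

# Retry a flaky, DNS-dependent command a fixed number of times before giving up.
retry() {
  local attempts=$1 wait_sec=$2
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0    # success
    sleep "$wait_sec"   # transient failure, e.g. DNS not yet available
  done
  return 1              # still failing; CSE surfaces the non-zero exit code
}

# Example: wait for external DNS to become available during provisioning.
retry 10 5 nslookup k8s.gcr.io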

@feiskyer
Member

feiskyer commented Feb 3, 2018

@jackfrancis sure

@ilyalukyanov

ilyalukyanov commented Feb 5, 2018

@jackfrancis Thanks for pushing the fix. I've tested with the latest master, but unfortunately it didn't help in my case.

I consistently face this issue if networking is set to azure. However, even before your fixes I was able to deploy at least several times in a row with networking set to none (for some reason networking was then extremely unreliable, so that's not a solution) and never saw it fail.

In case this is useful, here's my template:

{
    "apiVersion": "vlabs",
    "properties": {
        "orchestratorProfile": {
            "orchestratorType": "Kubernetes",
            "orchestratorRelease": "1.9",
            "orchestratorVersion": "1.9.2",
            "kubernetesConfig": {
                "networkPolicy": "azure"
            }
        },
        "masterProfile": {
            "count": 1,
            "dnsPrefix": "my-prefix",
            "vnetSubnetId": "<value>",
            "firstConsecutiveStaticIP": "172.19.5.100",
            "vmSize": "Standard_D2_v2"
        },
        "agentPoolProfiles": [{
            "availabilityProfile": "AvailabilitySet",
            "count": 2,
            "name": "pool1",
            "OSDiskSizeGB": 400,
            "storageProfile" : "ManagedDisks",
            "vmSize": "Standard_D2_v2",
            "vnetSubnetId": "<value>",
            "osType": "Linux",
            "distro": "ubuntu"
         }],
        "linuxProfile": {
            "adminUsername": "<value>",
            "ssh": {
                "publicKeys": [{ "keyData": "<value>" }]
            },
            "secrets": []
        },
        "servicePrincipalProfile": {
            "clientId": "<value>",
            "secret": "<value>"
        },
        "certificateProfile": {}
    }
}

And my acs-engine version output:

Version: canary
GitCommit: 7923b960
GitTreeState: clean

Just tried with calico and it worked fine. Seems to be just azure that's affected in my case.

@msorby

msorby commented Feb 5, 2018

I've been testing a lot today with acs-engine 0.12.5 and I've yet to run into this issue. Last week with 0.12.4 and 0.12.2 I got it all the time. So it seems to be much better 👍

@jackfrancis
Member

@ilyalukyanov This PR also moves the ball forward:

#2196

That is aimed to land in master today and should further reduce CSE flakiness.

@msorby

msorby commented Feb 6, 2018

I've literally deployed 20 times today without issues. Then suddenly the extension error popped up again, for the exact same generated template. This was a template generated with acs-engine 0.12.5.
So there is still some flakiness left ;-)

@ilyalukyanov

@jackfrancis thanks for the prompt fixes! I'll give them a go later this week and update this thread.

@idanshahar

This is still happening in West Europe.
I'm using acs-engine 0.12.5

@Jarlotee

Jarlotee commented Feb 7, 2018

Is there a workaround to get the partial deployment into a healthy state?

@jackfrancis
Member

@idanshahar Are you able to build from master? Much of the work post v0.12 has been identifying transient issues with provision scripts (and dependencies), which is where CSE deployment errors originate.

@Jarlotee Depending on the scenario, you could cherry-pick through the provision script /opt/azure/containers/provision.sh and manually execute the failed commands, but that would be tedious for a new cluster. The easiest path forward is to re-build, again using a built-from-master binary, if possible.
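
For reference, that manual path looks roughly like this (a sketch; substitute the admin user and master FQDN from your own apimodel):

# SSH to the failed master and see where provisioning stopped.
ssh <adminUsername>@<master-fqdn>
sudo tail -n 100 /var/log/azure/cluster-provision.log

# Inspect the provision script the CSE ran, then re-run it (or just the failed steps) by hand.
sudo less /opt/azure/containers/provision.sh
sudo bash /opt/azure/containers/provision.sh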

Thanks for your endurance, all. :)

@Jarlotee

Jarlotee commented Feb 8, 2018

@jackfrancis and anyone else who gets bitten by this:

My issue turned out to be the SPN password, which had a % in it!

The password was truncated at the %, which caused the subsequent failure.
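
If you suspect the same problem, one way to confirm it is to compare the secret stored on a node with the one you passed in (a sketch; /etc/kubernetes/azure.json is where acs-engine-provisioned nodes typically keep the cloud provider credentials, so treat the path as an assumption):

# On a master or agent node: check whether aadClientSecret was truncated at the '%'.
sudo grep -o '"aadClientSecret": *"[^"]*"' /etc/kubernetes/azure.json
# If it is shorter than the secret in your apimodel, regenerate the service principal
# secret without '%' characters and redeploy.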

@jackfrancis
Member

Ugh. See #1208

We'll prioritize this in the next release cycle. Thanks for sharing @Jarlotee !

@idanshahar

idanshahar commented Feb 11, 2018

@jackfrancis Yes, I can do so, but I still need a patch for a customer. When is the next version supposed to be released? BTW, there is another issue in the master branch... #2198

UPDATE


After building from master, this is the error I've got:

Deployment failed. Correlation ID: 95b11df4-e602-4e31-97a1-7ace41350afe. {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."
      }
    ]
  }
}

@khaldoune

@CecileRobertMichon Good news, thanks.
I will send cluster-provision and cloud-init-output logs as soon as I have access to these VMs.

@CecileRobertMichon
Contributor

@khaldoune I think I have the cause. There seems to be a regression with Calico. I'm trying to find out which commit introduced the regression. In the meantime, if it's an option for you, the same apimodel will work if you remove the line "networkPolicy": "calico".

@khaldoune

@CecileRobertMichon I've already tried starting from 0.14.0 and replacing calico 0.7 with 0.1 (I updated the tgz URL); the provisioning still failed.

I've also seen something strange in calico's manifest: cniVersion: 0.1 instead of 0.7; changing it to 0.7 did not change anything.

I hope it helps.

@CecileRobertMichon
Contributor

+@dtzar, who is working on a PR to upgrade Calico (#2521) and might be able to provide insight on the above.

@CecileRobertMichon
Contributor

To clarify, the regression is not a general Calico regression, as deployments using "networkPolicy": "calico" in our regression tests are succeeding; rather, it is a regression with this particular apimodel, most likely involving another custom property that became incompatible with Calico in v0.14.0.

@dtzar
Contributor

dtzar commented Mar 26, 2018

The version of Calico being deployed from the master branch is quite old (2.6.3); see releases. Could you check whether the latest version in my PR referenced above resolves your problem?

As mentioned, the script extension will fail if the nodes are not ready. The calico-node daemonset needs to be operational in order for the scripts to finish/pass and the nodes to become ready.

I haven't done any digging, but one suspect is that the kubeClusterCidr is different from the vnetCidr listed in your apimodel. The kubeClusterCidr is the value used in the calico network configuration here.
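
A quick way to check whether this is what is blocking the script extension (a sketch; run it wherever your kubeconfig lives, labels as in the standard Calico manifests):

# Nodes stay NotReady until the CNI is functional, which in turn keeps the CSE from finishing.
kubectl get nodes -o wide
kubectl -n kube-system get daemonset calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide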

@CecileRobertMichon
Contributor

Thanks @dtzar! @khaldoune can you try changing the value of "clusterSubnet" to match the value of "vnetCidr"?

@dtzar
Contributor

dtzar commented Mar 27, 2018

To clarify and record, clusterSubnet translates into kubeClusterCidr in engine.go Line 649 :)

@dtzar
Contributor

dtzar commented Mar 27, 2018

@khaldoune It would be good to understand what's going on with your network topology/configuration. Per your above configuration, I see:
"dnsServiceIP": "192.168.1.10"
"serviceCidr": "192.168.1.0/24"
"clusterSubnet": "10.10.0.0/16"
Master - "vnetCidr": "198.18.184.0/22"

@khaldoune

khaldoune commented Mar 28, 2018

Hi,

Thanks all for your assistance, I was out of the office yesterday...

@CecileRobertMichon, we need Calico because we are using it for project/namespace/network isolation.

@dtzar: I had worked around issue #2202 by disabling Encryption at Rest.

If my understanding is correct, we should have clusterSubnet = kubeClusterCidr = the pod CIDR.

From a design point of view, the pods' CIDR should be private (not directly addressable from outside the k8s cluster), and thus we should be able to use something other than the masters' and workers' CIDRs as the pod CIDR. That's what I'm trying to achieve.

In Azure, a VNET can have several address spaces, so if we read here: https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/features.md#feat-custom-vnet

"Additionally, to prevent source address NAT'ing within the VNET, we assign to the vnetCidr property in masterProfile the CIDR block that represents the usable address space in the existing VNET"

I understand that I just need to add another address space (10.10.0.0/16) to my k8s VNET (198.18.184.0/22) and the magic should happen.

I've just successfully deployed a K8s 1.9.6 cluster using a modified version of acs-engine 0.13.0:

$ k get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-35332392-0 Ready master 2h v1.9.6
k8s-master-35332392-1 Ready master 2h v1.9.6
k8s-master-35332392-2 Ready master 2h v1.9.6
k8s-master-35332392-3 Ready master 2h v1.9.6
k8s-master-35332392-4 Ready master 2h v1.9.6
k8s-wbronze-35332392-0 Ready agent 2h v1.9.6
k8s-wdiamond-35332392-0 Ready agent 2h v1.9.6
k8s-wgold-35332392-0 Ready agent 2h v1.9.6
k8s-wplatin-35332392-0 Ready agent 2h v1.9.6
k8s-wsilver-35332392-0 Ready agent 2h v1.9.6

122/125 of the Sonobuoy tests on this cluster are passing (I will analyse the 3 failures later).

Here is the complete configuration of the subnet:
vpod0a-k8s-prd-1-vnet.zip

An excerpt:
"properties": { "provisioningState": "Succeeded", "resourceGuid": "xxxxxxxxxxxxxxxxxx", "addressSpace": { "addressPrefixes": [ "198.18.184.0/21", "172.16.0.0/16" ] },

I've replaced 10.0.0.0/16 with 172.16.0.0/16 because the former is already in use.

As you can see, I've 2 address spaces in my VNET.
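
For anyone reproducing this setup, the CLI step to add the second address space is roughly the following (a sketch; note that the update sets the full prefix list, so the existing prefix must be included):

# Add a second address space to the VNET so the pod CIDR is routable inside it.
az network vnet update \
  --resource-group vpod0a-apps-prd-rg \
  --name vpod0a-k8s-prd-1-vnet \
  --address-prefixes 198.18.184.0/21 172.16.0.0/16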

@CecileRobertMichon I've also been able to deploy a k8s 1.9.3 cluster using acs-engine 0.13.1:

k get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-35332392-0 Ready master 6m v1.9.3
k8s-master-35332392-1 Ready master 6m v1.9.3
k8s-master-35332392-2 Ready master 7m v1.9.3
k8s-master-35332392-3 Ready master 6m v1.9.3
k8s-master-35332392-4 Ready master 6m v1.9.3
k8s-wbronze-35332392-0 Ready agent 7m v1.9.3
k8s-wdiamond-35332392-0 Ready agent 7m v1.9.3
k8s-wgold-35332392-0 Ready agent 7m v1.9.3
k8s-wplatin-35332392-0 Ready agent 7m v1.9.3
k8s-wsilver-35332392-0 Ready agent 7m v1.9.3

I will try to provision a cluster with PR #2521. I will keep you updated.

@khaldoune

@CecileRobertMichon @dtzar

The provisioning with PR #2521 has failed.

$ acs-engine version
Version: canary
GitCommit: ae4590a
GitTreeState: clean

Here are logs (provision, cloud init, cloud init output):
azurelogs.zip

Thanks

@khaldoune

@jackfrancis @CecileRobertMichon @dtzar

The deployment fails even with acs-engine 0.14.5 and with a routable CIDR for pods (in the VNET address space):

vnet_prefix: 198.18.184.0/21
vnet_master_subnet: 198.18.190.0/24
vnet_worker_subnet: 198.18.189.0/24
vnet_master_first_ip: 198.18.190.50
k8s_pod_cidr: 198.18.184.0/22
k8s_service_cidr: 198.18.188.0/23
k8s_dns_service: 198.18.188.10

Here is the cluster definition:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "orchestratorVersion": "1.9.6",
      "kubernetesConfig": {
        "networkPolicy": "calico",
        "etcdDiskSizeGB": "16",
        "enableAggregatedAPIs": true,
        "enablePodSecurityPolicy": true,
        "EnableRbac": true,
        "clusterSubnet": "198.18.184.0/22",
        "serviceCidr": "198.18.188.0/23",
        "dnsServiceIP": "198.18.188.10",
        "kubeletConfig": {
          "--event-qps": "0",
          "--non-masquerade-cidr": "198.18.184.0/22",
          "--authentication-token-webhook": "true"
        },
        "controllerManagerConfig": {
          "--address": "0.0.0.0",
          "--profiling": "false",
          "--terminated-pod-gc-threshold": "100",
          "--node-cidr-mask-size": "27",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "60s",
          "--horizontal-pod-autoscaler-use-rest-clients": "true"
        },
        "cloudControllerManagerConfig": {
          "--profiling": "false"
        },
        "apiServerConfig": {
          "--profiling": "false",
          "--repair-malformed-updates": "false",
          "--endpoint-reconciler-type": "lease"
        },
        "addons": [
          {
            "name": "tiller",
            "enabled": false
          }
        ]
      }
    },
    "masterProfile": {
      "dnsPrefix": "k8s-noprd",
      "vnetCidr": "198.18.190.0/24",
      "count": 5,
      "vmSize": "Standard_D2_v2",
      "OSDiskSizeGB": 80,
      "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/master_subnet",
      "firstConsecutiveStaticIP": "198.18.190.50",
      "preProvisionExtension": {
        "name": "setup"
      }
    },
    "agentPoolProfiles": [
      {
        "name": "wbronze",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wsilver",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wgold",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wplatin",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wdiamond",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      }
    ],
    "linuxProfile": {
      "adminUsername": "k8s",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa xxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxx"
    },
    "extensionProfiles": [
      {
        "name": "setup_node",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      },
      {
        "name": "setup",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      }
    ]
  }
}

@khaldoune

@jackfrancis @idanshahar @CecileRobertMichon @dtzar

My /etc/cni is empty. Where/when does acs-engine create its content?

Thanks.

@CecileRobertMichon
Contributor

@khaldoune /etc/cni should contain net.d:

setDockerOpts " --volume=/etc/cni/:/etc/cni:ro --volume=/opt/cni/:/opt/cni:ro"


Since you were able to deploy the same api model with two VNETs in v0.13.1 and see ready nodes, this might be a regression. I suspect it could be linked to issue #2476. Could you please open a new issue, since I think we are outside the scope of this current issue, for better tracking of the bug/fix? Thank you for your patience, let's get this resolved ASAP! cc @jackfrancis
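
A couple of quick checks on an affected node (a sketch; the paths are the ones referenced above plus the standard CNI plugin directory):

# CNI config and plugin binaries that the mounts above expect to exist.
ls -la /etc/cni/net.d /opt/cni/bin
# The kubelet logs usually say explicitly when the CNI config is missing.
sudo journalctl -u kubelet --no-pager | grep -i cni | tail -n 20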

@khaldoune

@CecileRobertMichon @jackfrancis

Provisioning using Azure CNI instead of Calico with acs-engine 0.14.5 works fine.

Provisioning with Calico and a single subnet for both masters and workers fails.

I've also double-checked whether Encryption at Rest is enabled by default in 0.14.5; it is not.

I've just created a new issue: #2607

Thanks for your help.

@marty2bell

I got this error yesterday using acs-engine 0.15.2 with the distro set to coreos. Removing this from the template and reverting to Ubuntu mitigated the issue, but it means we can't provision CoreOS VMs.

Marty

@rocketraman
Contributor

I just upgraded a cluster from 1.7.5 to 1.8.10 via acs-engine 0.15.2 and ran into this issue. The cluster uses Azure CNI and Ubuntu.

The resource group Deployment is still showing the Failure if more details are needed.

Ignoring the error and resuming the upgrade seems to have worked fine, but the cse0 extension on the master VM is still showing the status "Provisioning failed". I don't know what the implications of this are, but as I said, everything seems to be working.
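
If you want to double-check the extension state after a resumed upgrade, something like the following works (a sketch using the Azure CLI; substitute your own resource group and master VM name):

# List extensions and their provisioning state on the master VM.
az vm extension list --resource-group <resource-group> --vm-name <k8s-master-vm-name> --output table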

@BrendanThompson

I am seeing this same issue with the following:

acs-engine: v0.16.0
k8s: 1.10

The cluster is trying to use Ubuntu with Azure CNI

@CecileRobertMichon
Contributor

@rocketraman and @BrendanThompson please share the apimodel you used to generate the template/deploy the cluster as well as the exact error message (what was the error code?).

@rocketraman
Contributor

rocketraman commented Apr 24, 2018

@CecileRobertMichon Here is my API model, with private information elided:

apimodel.json

Here is the error (operation status was "Conflict", Provisioning state is "Failed"):

{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."
      }
    ]
  }
}

Same exact error on two different clusters.

@rocketraman
Contributor

@CecileRobertMichon I think I understand what happened in my case.

Looking at /var/log/azure/cluster-provision.log, it looks like it failed because it couldn't connect to etcd. This is probably because of some customizations I had made to the pre-upgrade cluster due to some other weirdness [1]. This prevented the upgraded cluster's etcd from starting up and caused the cse0 script to fail.

[1] In my previous cluster, I was experiencing an issue in which etcd wasn't starting up because it was choking on the lost+found directory in /var/lib/etcddisk/. To fix this, I had manually moved the etcd data directory on the older cluster into a subdirectory. Thankfully, I don't have this issue with etcd on my current cluster.
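
For anyone who hits the same lost+found problem, the workaround described in [1] looks roughly like this (a hypothetical sketch; the unit name, mount point, and data layout are assumptions, so verify them on your own cluster first):

# Stop etcd, move its data out of the mount root (where lost+found lives),
# point --data-dir at the new subdirectory, then restart.
sudo systemctl stop etcd
sudo mkdir -p /var/lib/etcddisk/data
sudo mv /var/lib/etcddisk/member /var/lib/etcddisk/data/   # 'member' is etcd's usual data layout
# ...update the etcd unit/config so that --data-dir points at /var/lib/etcddisk/data...
sudo systemctl start etcd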

@hensilva

I'm facing similar issues today, with Kubernetes 1.9 or 1.10 and acs-engine 0.16.1.

@hmarcelodn

Same issue with k8s 1.6.6 and acs-engine 1.16.2

@dennis-benzinger-hybris
Contributor

In our case, the apt package indexes in /var/lib/apt/lists got corrupted somehow and Docker couldn't be installed. Many of the files there are empty, but apt-get update still doesn't re-download them. Only after removing the files manually did apt-get update download them again, after which you can (for example) install docker-engine again.
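
The recovery that worked in that case amounts to the steps below (a sketch of what is described above):

# Clear the corrupted package indexes, re-fetch them, then retry the Docker install.
sudo rm -rf /var/lib/apt/lists/*
sudo apt-get update
sudo apt-get install -y docker-engine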

@CecileRobertMichon
Contributor

For everyone here, https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout has been added to help troubleshoot VM extension errors. Please follow the instructions if you encounter one of those.

@Navlesh

Navlesh commented Jun 15, 2018

@CecileRobertMichon I too face the VMExtensionProvisioningTimeout error all the time when I have 3 masters.
I am using acs-engine v0.18.6.
The following is a sample input file:
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "egsms",
      "vmSize": "Standard_D2s_v3",
      "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/vnet/subnets/frontend",
      "firstConsecutiveStaticIP": "10.0.0.45",
      "vnetCidr": "10.0.0.0/24"
    },
    "agentPoolProfiles": [
      {
        "name": "egsagent",
        "count": 1,
        "vmSize": "Standard_D2s_v3",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/vnet/subnets/frontend",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "useradmin",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx="
    }
  }
}

@CecileRobertMichon
Contributor

@Navlesh please take a look at https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout if you haven't already, and open a new issue with the title "CSE error: exit code <INSERT_YOUR_EXIT_CODE>" and include the following in the description (see the sketch after this list for one way to collect these):

  • The apimodel json used to deploy the cluster (aka your cluster config). Please make sure you remove all secrets and keys before posting it on GitHub (what you pasted above)
  • The output of kubectl get nodes
  • The content of /var/log/azure/cluster-provision.log and /var/log/cloud-init-output.log
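
A quick way to collect those items before opening the new issue (a sketch; run kubectl wherever your kubeconfig lives and the copy commands on the master node):

# Node status for the new issue.
kubectl get nodes -o wide > nodes.txt
# On the master node, grab the two provisioning logs.
sudo cp /var/log/azure/cluster-provision.log /var/log/cloud-init-output.log ~/
# Remember to remove secrets and keys from the apimodel json before attaching it.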
