
VM has reported a failure when processing extension 'cse0' #1806

Closed
IvanCaro opened this issue Nov 21, 2017 · 97 comments

@IvanCaro

IvanCaro commented Nov 21, 2017

Is this a request for help?:


Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
Version: canary
GitCommit: 8db990b
GitTreeState: clean

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes

What happened:
{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=5\n[stdout]\n\n[stderr]\nstat: cannot stat '/opt/azure/containers/provision.complete': No such file or directory\n\"."
      }
    ]
  }
}

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

@jackfrancis
Member

@IvanCaro could you paste the api model that you passed as input to acs-engine?

@JackQuincy I wonder if cse has a non-idiomatic execution context for this change in some cases:

b5eb43b#diff-95c1c34f292e829cdcc06906aaf5c4f1

Does it make sense that a stat failure like the above would ever short-circuit as currently implemented?

@JackQuincy
Contributor

The only reason I could see this failing is if we called set -x or something of the sort earlier in the line/command,
which would be in parameters, common, etc.

@jackfrancis
Member

@JackQuincy I can reproduce consistently when specifying "orchestratorRelease": "1.6". @IvanCaro Were you attempting to build a 1.6 cluster when you received this error?

@IvanCaro
Author

Hey @jackfrancis, this happened with a custom VNET and custom DNS servers. I created the cluster without custom DNS (using the Azure-provided DNS) and changed it afterwards, and that works.

@tamilmani1989
Member

@jackfrancis @JackQuincy I have faced this issue before as well; when I changed the location and redeployed, it worked. The issue is not consistent and doesn't happen every time.

@jackfrancis
Member

@IvanCaro Do you agree this is an indeterminate, ephemeral issue and we should close this?

@mpluhar

mpluhar commented Jan 8, 2018

Same issue here using Kubernetes 1.6 as the orchestrator and a custom VNET. Is there a workaround for this?

Update: I had to add a maxPods value to the api model; otherwise the value would be empty and prevent the kubelet from starting.
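
For anyone hitting the same symptom, a quick check on an affected node (a sketch only; /etc/default/kubelet is an assumed location for the rendered kubelet flags and may differ between acs-engine versions) is:

# Check whether --max-pods was rendered with a value; an empty value prevents the kubelet from starting.
grep -o -e '--max-pods=[^ ]*' /etc/default/kubelet
# If that file isn't present on your image, the kubelet journal shows the failing startup instead:
sudo journalctl -u kubelet --no-pager | tail -n 30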

@ducas

ducas commented Jan 19, 2018

I found that I was having this because my nodes (master and agents) were not able to reach k8s.gcr.io to download kubectl. I discovered this by logging into the master and looking at /var/log/cluster-provision.log, which ended with:

+ echo 'kubernetes did not start'
kubernetes did not start
+ exit 3

I traced this back to here - https://github.com/Azure/acs-engine/edit/master/parts/k8s/kubernetesmastercustomscript.sh#L571

This indicated it was having trouble running kubectl, so I tried invoking it from the ssh terminal. Lo and behold - command not found. That file led me to the fact that it's installed using a service called kubectl-extract. Looking at its logs using sudo journalctl -n -u kubectl-extract I found the following output:

Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: Failed to start Kubectl extraction.
Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Unit entered failed state.
Jan 17 05:16:15 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Failed with result 'exit-code'.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Service hold-off time over, scheduling restart.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: Stopped Kubectl extraction.
Jan 17 05:16:20 k8s-master-42756516-0 systemd[1]: Starting Kubectl extraction...
Jan 17 05:16:35 k8s-master-42756516-0 docker[45966]: Error response from daemon: Get https://k8s-gcrio.azureedge.net/v2/hyperkube-amd64/manifests/v1.7.9: Get https://k8s.gcr.io/v2/token?scope=repository%3Ahyperkube-amd64%3Apull&service=k8s.gcr.io: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Jan 17 05:16:35 k8s-master-42756516-0 systemd[1]: kubectl-extract.service: Control process exited, code=exited status=1

So there was a problem downloading kubectl from k8s.gcr.io. Turns out it was a DNS problem, but that's just my network... Hope this helps someone debug a related issue.
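
If you suspect the same failure mode, the checks below (a sketch based on the steps described above; the service, registry, and mirror names are the ones from this comment) confirm whether a node can resolve and reach the registry:

# Check the kubectl-extract service logs for registry/DNS errors.
sudo journalctl -u kubectl-extract --no-pager | tail -n 20
# Verify the node can resolve and reach the registries used during provisioning.
nslookup k8s.gcr.io
nslookup k8s-gcrio.azureedge.net
curl -sSI https://k8s-gcrio.azureedge.net/v2/ | head -n 1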

@jsturtevant
Collaborator

I ran into this as well. It failed when run in westus2. I changed to eastus and it worked.

@nakah

nakah commented Jan 25, 2018

I'm having the same issue when deploying to a custom VNET in West Europe using acs-engine 0.12.4 and Kubernetes 1.9.1.
I've backed up all logs from /var/log/azure and /var/log/containers if that can help.

@msorby

msorby commented Jan 30, 2018

Just to acknowledge this: I get the exact same thing with a custom VNET in West Europe; North Europe works just fine.

Update: This is flaky somehow; now I can't deploy even without a custom VNET without it getting stuck on the extension for the master node.
It's a mixed cluster with Windows and Linux agent pools. This is a setup that worked last week.
Taking the exact same output from acs-engine and deploying it to North Europe works fine.

@rodrigoffonseca

I got the same error, and the problem was DNS resolution in the VNET. After I fixed my custom DNS servers to resolve internet names, everything worked fine.

@msorby

msorby commented Jan 30, 2018

I'm getting this error with or without a custom VNET.
I get it from the simplest of configurations in West Europe; then I deploy the same generated ARM template to North Europe and it works. I'm guessing it's related to #2162.

And here is the build from a pull request with a potential fix that still failed, https://circleci.com/gh/Azure/acs-engine/14298?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

//Morten

@jackfrancis
Member

cse* errors are generally a result of the provisioning process on the host failing. I'd like to keep this ticket open to encourage folks to share (bad) experiences. We're working on (1) improving logging around this and (2) hunting down transient errors (such as a lack of DNS access would incur) and trying, where appropriate, to introduce additional resilience.

@feiskyer
Member

feiskyer commented Feb 3, 2018

Also met the same problem in eastus. cse0 timed out and the master VM can't be reached over SSH.

@jackfrancis
Member

Just identified one cause of transient cse errors (DNS availability race condition on cluster provisioning), added some retry resiliency and am hoping that eliminates that symptom. @feiskyer please try to repro using master next week and let me know if you can, thanks!
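
For context, the general shape of that kind of retry resiliency looks like the hypothetical sketch below (not the actual acs-engine change):

# Retry a flaky, DNS-dependent command a fixed number of times before giving up.
retry() {
  local attempts=$1 wait_sec=$2
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0    # success
    sleep "$wait_sec"   # transient failure, e.g. DNS not yet available
  done
  return 1              # still failing; CSE surfaces the non-zero exit code
}

# Example: wait for external DNS to become available during provisioning.
retry 10 5 nslookup k8s.gcr.io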

@feiskyer
Member

feiskyer commented Feb 3, 2018

@jackfrancis sure

@ilyalukyanov

ilyalukyanov commented Feb 5, 2018

@jackfrancis Thanks for pushing the fix. I've tested with the latest master, but unfortunately it didn't help in my case.

I consistently face this issue if networking is set to azure. However, even before your fixes I was able to deploy at least several times in a row with networking set to none (for some reason networking was then extremely unreliable, so that's not a solution) and never saw it fail.

In case this is useful, here's my template:

{
    "apiVersion": "vlabs",
    "properties": {
        "orchestratorProfile": {
            "orchestratorType": "Kubernetes",
            "orchestratorRelease": "1.9",
            "orchestratorVersion": "1.9.2",
            "kubernetesConfig": {
                "networkPolicy": "azure"
            }
        },
        "masterProfile": {
            "count": 1,
            "dnsPrefix": "my-prefix",
            "vnetSubnetId": "<value>",
            "firstConsecutiveStaticIP": "172.19.5.100",
            "vmSize": "Standard_D2_v2"
        },
        "agentPoolProfiles": [{
            "availabilityProfile": "AvailabilitySet",
            "count": 2,
            "name": "pool1",
            "OSDiskSizeGB": 400,
            "storageProfile" : "ManagedDisks",
            "vmSize": "Standard_D2_v2",
            "vnetSubnetId": "<value>",
            "osType": "Linux",
            "distro": "ubuntu"
         }],
        "linuxProfile": {
            "adminUsername": "<value>",
            "ssh": {
                "publicKeys": [{ "keyData": "<value>" }]
            },
            "secrets": []
        },
        "servicePrincipalProfile": {
            "clientId": "<value>",
            "secret": "<value>"
        },
        "certificateProfile": {}
    }
}

And my acs-engine version output:

Version: canary
GitCommit: 7923b960
GitTreeState: clean

Just tried with calico and it worked fine. Seems to be just azure that's affected in my case.

@msorby

msorby commented Feb 5, 2018

I've been testing a lot today with acs-engine 0.12.5 and I've yet to run into this issue. Last week with 0.12.4 and 0.12.2 I got it all the time. So it seems to be much better 👍

@jackfrancis
Member

@ilyalukyanov This PR also moves the ball forward:

#2196

That is aimed to land in master today and should further reduce CSE flakiness.

@msorby

msorby commented Feb 6, 2018

I've literally deployed 20 times today without issues. Then suddenly the extension error popped up again, for the exact same generated template. This was a template generated with acs-engine 0.12.5.
So there is still some flakiness left ;-)

@ilyalukyanov

@jackfrancis thanks for the prompt fixes! I'll give them a go later this week and update this thread.

@idanshahar

This is still happening in West Europe.
I'm using acs-engine 0.12.5

@Jarlotee

Jarlotee commented Feb 7, 2018

Is there a workaround to get the partial deployment into a healthy state?

@jackfrancis
Member

@idanshahar Are you able to build from master? Much of the work post v0.12 has been identifying transient issues with provision scripts (and dependencies), which is where CSE deployment errors originate.

@Jarlotee Depending on the scenario, you could cherry-pick through the provision script /opt/azure/containers/provision.sh and manually execute the failed commands, but that would be tedious for a new cluster. The easiest path forward is to re-build, again using a built-from-master binary, if possible.
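
For reference, that manual path looks roughly like this (a sketch; substitute the admin user and master FQDN from your own apimodel):

# SSH to the failed master and see where provisioning stopped.
ssh <adminUsername>@<master-fqdn>
sudo tail -n 100 /var/log/azure/cluster-provision.log

# Inspect the provision script the CSE ran, then re-run it (or just the failed steps) by hand.
sudo less /opt/azure/containers/provision.sh
sudo bash /opt/azure/containers/provision.sh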

Thanks for your endurance, all. :)

@Jarlotee

Jarlotee commented Feb 8, 2018

@jackfrancis and anyone else who gets bitten by this:

My issue turned out to be the SPN password, which had a % in it!

The password was truncated at the %, which caused the subsequent failure.
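
If you suspect the same problem, one way to confirm it is to compare the secret stored on a node with the one you passed in (a sketch; /etc/kubernetes/azure.json is where acs-engine-provisioned nodes typically keep the cloud provider credentials, so treat the path as an assumption):

# On a master or agent node: check whether aadClientSecret was truncated at the '%'.
sudo grep -o '"aadClientSecret": *"[^"]*"' /etc/kubernetes/azure.json
# If it is shorter than the secret in your apimodel, regenerate the service principal
# secret without '%' characters and redeploy.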

@jackfrancis
Member

Ugh. See #1208

We'll prioritize this in the next release cycle. Thanks for sharing @Jarlotee !

@idanshahar

idanshahar commented Feb 11, 2018

@jackfrancis Yes, I can do so, but I still need a patch for a customer. When is the next version supposed to be released? BTW, there is another issue in the master branch... #2198

UPDATE


After building from master, this is the error I've got:

Deployment failed. Correlation ID: 95b11df4-e602-4e31-97a1-7ace41350afe. {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."
      }
    ]
  }
}

@khaldoune

@CecileRobertMichon Good news, thanks.
I will send cluster-provision and cloud-init-output logs as soon as I have access to these VMs.

@CecileRobertMichon
Contributor

@khaldoune I think I have the cause. There seems to be a regression with Calico. I'm trying to find out which commit introduced the regression. In the meantime, if it's an option for you, the same apimodel will work if you remove the line "networkPolicy": "calico".

@khaldoune

@CecileRobertMichon I've already tried starting from 0.14.0 and replacing calico 0.7 with 0.1 (I updated the tgz URL); the provisioning still failed.

I've also seen something strange in calico's manifest: cniVersion: 0.1 instead of 0.7; changing it to 0.7 did not change anything.

I hope it helps.

@CecileRobertMichon
Contributor

+@dtzar, who is working on a PR to upgrade Calico (#2521) and might be able to provide insight on the above.

@CecileRobertMichon
Contributor

To clarify, the regression is not a general Calico regression, as deployments using "networkPolicy": "calico" in our regression tests are succeeding; rather, it is a regression with this particular apimodel, most likely involving another custom property that became incompatible with Calico in v0.14.0.

@dtzar
Contributor

dtzar commented Mar 26, 2018

The version of Calico being deployed from the master branch is quite old (2.6.3); see releases. Could you check whether the latest version in my PR referenced above resolves your problem?

As mentioned, the script extension will fail if the nodes are not ready. The calico-node daemonset needs to be operational in order for the scripts to finish/pass and the nodes to become ready.

I haven't done any digging, but one suspect is that the kubeClusterCidr is different from the vnetCidr listed in your apimodel. The kubeClusterCidr is the value used in the calico network configuration here.
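
A quick way to check whether this is what is blocking the script extension (a sketch; run it wherever your kubeconfig lives, labels as in the standard Calico manifests):

# Nodes stay NotReady until the CNI is functional, which in turn keeps the CSE from finishing.
kubectl get nodes -o wide
kubectl -n kube-system get daemonset calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide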

@CecileRobertMichon
Contributor

Thanks @dtzar! @khaldoune can you try changing the value of "clusterSubnet" to match the value of "vnetCidr"?

@dtzar
Contributor

dtzar commented Mar 27, 2018

To clarify and record, clusterSubnet translates into kubeClusterCidr in engine.go Line 649 :)

@dtzar
Contributor

dtzar commented Mar 27, 2018

@khaldoune It would be good to understand what's going on with your network topology/configuration. Per your above configuration, I see:
"dnsServiceIP": "192.168.1.10"
"serviceCidr": "192.168.1.0/24"
"clusterSubnet": "10.10.0.0/16"
Master - "vnetCidr": "198.18.184.0/22"

@khaldoune

khaldoune commented Mar 28, 2018

Hi,

Thanks all for your assistance, I was out of the office yesterday...

@CecileRobertMichon, we need Calico because we are using it for project/namespace/network isolation.

@dtzar: I had worked around issue #2202 by disabling Encryption at Rest.

If my understanding is correct, we should have clusterSubnet = kubeClusterCidr = the pod CIDR.

From a design point of view, the pods' CIDR should be private (not directly addressable from outside the k8s cluster), and thus we should be able to use something other than the masters' and workers' CIDRs as the pod CIDR. That's what I'm trying to achieve.

In Azure, a VNET can have several address spaces, so if we read here: https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/features.md#feat-custom-vnet

"Additionally, to prevent source address NAT'ing within the VNET, we assign to the vnetCidr property in masterProfile the CIDR block that represents the usable address space in the existing VNET"

I understand that I just need to add another address space (10.10.0.0/16) to my k8s VNET (198.18.184.0/22) and the magic should happen.

I've just successfully deployed a K8s 1.9.6 cluster using a modified version of acs-engine 0.13.0:

$ k get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-35332392-0 Ready master 2h v1.9.6
k8s-master-35332392-1 Ready master 2h v1.9.6
k8s-master-35332392-2 Ready master 2h v1.9.6
k8s-master-35332392-3 Ready master 2h v1.9.6
k8s-master-35332392-4 Ready master 2h v1.9.6
k8s-wbronze-35332392-0 Ready agent 2h v1.9.6
k8s-wdiamond-35332392-0 Ready agent 2h v1.9.6
k8s-wgold-35332392-0 Ready agent 2h v1.9.6
k8s-wplatin-35332392-0 Ready agent 2h v1.9.6
k8s-wsilver-35332392-0 Ready agent 2h v1.9.6

122/125 of the Sonobuoy tests on this cluster are passing (I will analyse the 3 failures later).

Here is the complete configuration of the subnet:
vpod0a-k8s-prd-1-vnet.zip

An excerpt:
"properties": { "provisioningState": "Succeeded", "resourceGuid": "xxxxxxxxxxxxxxxxxx", "addressSpace": { "addressPrefixes": [ "198.18.184.0/21", "172.16.0.0/16" ] },

I've replaced 10.0.0.0/16 with 172.16.0.0/16 because the former is already in use.

As you can see, I've 2 address spaces in my VNET.
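
For anyone reproducing this setup, the CLI step to add the second address space is roughly the following (a sketch; note that the update sets the full prefix list, so the existing prefix must be included):

# Add a second address space to the VNET so the pod CIDR is routable inside it.
az network vnet update \
  --resource-group vpod0a-apps-prd-rg \
  --name vpod0a-k8s-prd-1-vnet \
  --address-prefixes 198.18.184.0/21 172.16.0.0/16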

@CecileRobertMichon I've also been able to deploy a k8s 1.9.3 cluster using acs-engine 0.13.1:

k get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-35332392-0 Ready master 6m v1.9.3
k8s-master-35332392-1 Ready master 6m v1.9.3
k8s-master-35332392-2 Ready master 7m v1.9.3
k8s-master-35332392-3 Ready master 6m v1.9.3
k8s-master-35332392-4 Ready master 6m v1.9.3
k8s-wbronze-35332392-0 Ready agent 7m v1.9.3
k8s-wdiamond-35332392-0 Ready agent 7m v1.9.3
k8s-wgold-35332392-0 Ready agent 7m v1.9.3
k8s-wplatin-35332392-0 Ready agent 7m v1.9.3
k8s-wsilver-35332392-0 Ready agent 7m v1.9.3

I will try to provision a cluster with PR #2521. I will keep you updated.

@khaldoune

@CecileRobertMichon @dtzar

The provisioning with PR #2521 has failed.

$ acs-engine version
Version: canary
GitCommit: ae4590a
GitTreeState: clean

Here are logs (provision, cloud init, cloud init output):
azurelogs.zip

Thanks

@khaldoune

@jackfrancis @CecileRobertMichon @dtzar

The deployment fails even with acs-engine 0.14.5 and with a routable CIDR for pods (in the VNET address space):

vnet_prefix: 198.18.184.0/21
vnet_master_subnet: 198.18.190.0/24
vnet_worker_subnet: 198.18.189.0/24
vnet_master_first_ip: 198.18.190.50
k8s_pod_cidr: 198.18.184.0/22
k8s_service_cidr: 198.18.188.0/23
k8s_dns_service: 198.18.188.10

Here is the cluster definition:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "orchestratorVersion": "1.9.6",
      "kubernetesConfig": {
        "networkPolicy": "calico",
        "etcdDiskSizeGB": "16",
        "enableAggregatedAPIs": true,
        "enablePodSecurityPolicy": true,
        "EnableRbac": true,
        "clusterSubnet": "198.18.184.0/22",
        "serviceCidr": "198.18.188.0/23",
        "dnsServiceIP": "198.18.188.10",
        "kubeletConfig": {
          "--event-qps": "0",
          "--non-masquerade-cidr": "198.18.184.0/22",
          "--authentication-token-webhook": "true"
        },
        "controllerManagerConfig": {
          "--address": "0.0.0.0",
          "--profiling": "false",
          "--terminated-pod-gc-threshold": "100",
          "--node-cidr-mask-size": "27",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "60s",
          "--horizontal-pod-autoscaler-use-rest-clients": "true"
        },
        "cloudControllerManagerConfig": {
          "--profiling": "false"
        },
        "apiServerConfig": {
          "--profiling": "false",
          "--repair-malformed-updates": "false",
          "--endpoint-reconciler-type": "lease"
        },
        "addons": [
          {
            "name": "tiller",
            "enabled": false
          }
        ]
      }
    },
    "masterProfile": {
      "dnsPrefix": "k8s-noprd",
      "vnetCidr": "198.18.190.0/24",
      "count": 5,
      "vmSize": "Standard_D2_v2",
      "OSDiskSizeGB": 80,
      "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/master_subnet",
      "firstConsecutiveStaticIP": "198.18.190.50",
      "preProvisionExtension": {
        "name": "setup"
      }
    },
    "agentPoolProfiles": [
      {
        "name": "wbronze",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wsilver",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wgold",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wplatin",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wdiamond",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      }
    ],
    "linuxProfile": {
      "adminUsername": "k8s",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa xxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxx"
    },
    "extensionProfiles": [
      {
        "name": "setup_node",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      },
      {
        "name": "setup",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      }
    ]
  }
}

@khaldoune

@jackfrancis @idanshahar @CecileRobertMichon @dtzar

My /etc/cni is empty. Where/when does acs-engine create its content?

Thanks.

@CecileRobertMichon
Contributor

@khaldoune /etc/cni should contain net.d:

setDockerOpts " --volume=/etc/cni/:/etc/cni:ro --volume=/opt/cni/:/opt/cni:ro"


Since you were able to deploy the same api model with two VNETs in v0.13.1 and see ready nodes, this might be a regression. I suspect it could be linked to issue #2476. Could you please open a new issue, since I think we are outside the scope of this current issue, for better tracking of the bug/fix? Thank you for your patience, let's get this resolved ASAP! cc @jackfrancis
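
A couple of quick checks on an affected node (a sketch; the paths are the ones referenced above plus the standard CNI plugin directory):

# CNI config and plugin binaries that the mounts above expect to exist.
ls -la /etc/cni/net.d /opt/cni/bin
# The kubelet logs usually say explicitly when the CNI config is missing.
sudo journalctl -u kubelet --no-pager | grep -i cni | tail -n 20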

@khaldoune

@CecileRobertMichon @jackfrancis

Provisioning using Azure CNI instead of Calico with acs-engine 0.14.5 works fine.

Provisioning with Calico and a single subnet for both masters and workers fails.

I've also double-checked whether Encryption at Rest is enabled by default in 0.14.5; it is not.

I've just created a new issue: #2607

Thanks for your help.

@marty2bell

I got this error yesterday using acs-engine 0.15.2 with the distro set to coreos. Removing this from the template and reverting to Ubuntu mitigated the issue, but it means we can't provision CoreOS VMs.

Marty

@rocketraman
Contributor

I just upgraded a cluster from 1.7.5 to 1.8.10 via acs-engine 0.15.2 and ran into this issue. The cluster uses Azure CNI and Ubuntu.

The resource group Deployment is still showing the Failure if more details are needed.

Ignoring the error and resuming the upgrade seems to have worked fine, but the cse0 extension on the master VM is still showing the status "Provisioning failed". I don't know what the implications of this are, but as I said, everything seems to be working.
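
If you want to double-check the extension state after a resumed upgrade, something like the following works (a sketch using the Azure CLI; substitute your own resource group and master VM name):

# List extensions and their provisioning state on the master VM.
az vm extension list --resource-group <resource-group> --vm-name <k8s-master-vm-name> --output table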

@BrendanThompson

I am seeing this same issue with the following:

acs-engine: v0.16.0
k8s: 1.10

The cluster is trying to use Ubuntu with Azure CNI

@CecileRobertMichon
Contributor

@rocketraman and @BrendanThompson please share the apimodel you used to generate the template/deploy the cluster as well as the exact error message (what was the error code?).

@rocketraman
Contributor

rocketraman commented Apr 24, 2018

@CecileRobertMichon Here is my API model, with private information elided:

apimodel.json

Here is the error (operation status was "Conflict", Provisioning state is "Failed"):

{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n\"."
      }
    ]
  }
}

Same exact error on two different clusters.

@rocketraman
Contributor

@CecileRobertMichon I think I understand what happened in my case.

Looking at /var/log/azure/cluster-provision.log, it looks like it failed because it couldn't connect to etcd. This is probably because of some customizations I had made to the pre-upgrade cluster due to some other weirdness [1]. This prevented the upgraded cluster's etcd from starting up and caused the cse0 script to fail.

[1] In my previous cluster, I was experiencing an issue in which etcd wasn't starting up because it was choking on the lost+found directory in /var/lib/etcddisk/. To fix this, I had manually moved the etcd data directory on the older cluster into a subdirectory. Thankfully, I don't have this issue with etcd on my current cluster.
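
For anyone who hits the same lost+found problem, the workaround described in [1] looks roughly like this (a hypothetical sketch; the unit name, mount point, and data layout are assumptions, so verify them on your own cluster first):

# Stop etcd, move its data out of the mount root (where lost+found lives),
# point --data-dir at the new subdirectory, then restart.
sudo systemctl stop etcd
sudo mkdir -p /var/lib/etcddisk/data
sudo mv /var/lib/etcddisk/member /var/lib/etcddisk/data/   # 'member' is etcd's usual data layout
# ...update the etcd unit/config so that --data-dir points at /var/lib/etcddisk/data...
sudo systemctl start etcd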

@hensilva

I'm facing similar issues today, with Kubernetes 1.9 or 1.10 and acs-engine 0.16.1.

@hmarcelodn

Same issue with k8s 1.6.6 and acs-engine 1.16.2

@dennis-benzinger-hybris
Contributor

In our case, the apt package indexes in /var/lib/apt/lists got corrupted somehow and Docker couldn't be installed. Many of the files there are empty, but apt-get update still doesn't re-download them. Only after removing the files manually did apt-get update download them again, after which you can (for example) install docker-engine again.
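
The recovery that worked in that case amounts to the steps below (a sketch of what is described above):

# Clear the corrupted package indexes, re-fetch them, then retry the Docker install.
sudo rm -rf /var/lib/apt/lists/*
sudo apt-get update
sudo apt-get install -y docker-engine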

@CecileRobertMichon
Contributor

For everyone here, https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout has been added to help troubleshoot VM extension errors. Please follow the instructions if you encounter one of those.

@Navlesh

Navlesh commented Jun 15, 2018

@CecileRobertMichon I too face the VMExtensionProvisioningTimeout error all the time when I have 3 masters.
I am using acs-engine v0.18.6.
The following is a sample input file:
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "egsms",
      "vmSize": "Standard_D2s_v3",
      "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/vnet/subnets/frontend",
      "firstConsecutiveStaticIP": "10.0.0.45",
      "vnetCidr": "10.0.0.0/24"
    },
    "agentPoolProfiles": [
      {
        "name": "egsagent",
        "count": 1,
        "vmSize": "Standard_D2s_v3",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxxxxxxxxxx/resourceGroups/MyRG/providers/Microsoft.Network/virtualNetworks/vnet/subnets/frontend",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "useradmin",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx="
    }
  }
}

@CecileRobertMichon
Contributor

@Navlesh please take a look at https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout if you haven't already, and open a new issue with the title "CSE error: exit code <INSERT_YOUR_EXIT_CODE>" and include the following in the description (see the sketch after this list for one way to collect these):

  • The apimodel json used to deploy the cluster (aka your cluster config). Please make sure you remove all secrets and keys before posting it on GitHub (what you pasted above)
  • The output of kubectl get nodes
  • The content of /var/log/azure/cluster-provision.log and /var/log/cloud-init-output.log
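
A quick way to collect those items before opening the new issue (a sketch; run kubectl wherever your kubeconfig lives and the copy commands on the master node):

# Node status for the new issue.
kubectl get nodes -o wide > nodes.txt
# On the master node, grab the two provisioning logs.
sudo cp /var/log/azure/cluster-provision.log /var/log/cloud-init-output.log ~/
# Remember to remove secrets and keys from the apimodel json before attaching it.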
