Error creating vxlan interface: file exists #1765

discotroy · 2017-05-18T20:33:48Z

Previous related threads:

network sandbox join failed: error creating vxlan interface: file exists #562
What do I do with "subnet sandbox join failed for "10.0.0.0/24": error creating vxlan interface: file exists" #751
Cannot start container: subnet sandbox join failed for "10.0.0.0/24": error creating vxlan interface: file exists #945
Containers on overlay network and "error creating vxlan interface: file exists" moby#21482
[1.13.0-rc1][Intermittent] docker: Error response from daemon: subnet sandbox join failed for "10.0.0.0/16": error creating vxlan interface: file exists. moby#28559

Comment at the current tail-end of #945 recommends opening a new ticket. I couldn't find one opened by the original poster, so here we go.

I've been using swarm for the past couple of months, and frequently hit upon this problem. I have a modest swarm (~8-9 nodes) all running Ubuntu 16.04, now with Docker 17.05-ce on. There is not a great amount of container churn, but I do use a stack yaml file to deploy ~20 services across ~20 encrypted overlay networks.

I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the "Error creating vxlan: file exists" error. This prevents the containers coming up on a host and forces them to attempt to relocate, which may / may not work.

I have noted in the above issues that the problems are, several times over, thought to have been rectified, but yet always creep back in for various users.

To rectify the issue, I have tried rebooting the node, restarting iptables, removing the stack and re-creating, all of which work to varying degrees but are most definitely workarounds and not solutions.

I cannot think how I can attempt to reproduce this error, but if anyone wants to suggest ways to debug, I am at your service.

The text was updated successfully, but these errors were encountered:

mpepping · 2017-05-22T16:40:08Z

We're suffering from the same issue on RHEL7 /w Docker 17.03-ee and are able to reproduce the issue by adding a Service on a swarm-node where the overlay-network isn't active yet.
Tried about the same level of troubleshooting as @discotroy and can confirm the rebooting or restarting docker-engine fixes the issue up to some level, with fluctuating results. Also open for suggestions on how to debug this issue.

fcrisciani · 2017-05-25T22:48:31Z

Do you guys have some logs to share? would be super helpful to have a way to reproduce and grab logs with the engine in debug mode

Engine in debug:
echo '{"debug": true}' > /etc/docker/daemon.json
then: sudo kill -HUP <pid of dockerd>

mpepping · 2017-05-26T07:10:46Z

Will collect more logging. Here's some debug /w the error-message: https://gist.github.com/mpepping/50cb9b71b5535b318c6a548d4e8ba97b

fcrisciani · 2017-05-26T15:25:53Z

@mpepping thanks, the error message is clear. The current suspect that I have is a race condition during the deletion of the sandbox that leak the vxlan interface behind it. When a new container comes up tries to create the vxlan interface again and instead finds that there is already one and errors out. The more interesting part now of the logs would be the block where there is suppose to be the interface deletion and figure out why that is not happening properly.

fcrisciani · 2017-05-26T15:27:47Z

I'm also already trying to reproduce it locally, but if you guys narrow down a specific set of steps that are able to reproduce with high probability let me know

mpepping · 2017-05-27T07:06:43Z

@fcrisciani indeed it seems a race condition running into a locking issue. A breakdown of the steps, with debug output, is available at https://gist.github.com/mpepping/739e9a486b6c3266093a8af712869e90 .

Basically, the command-set for us to reproduce the following .. but the gist provides more detail:

docker swarm init
docker network create -d overlay  ucp-hrm
docker stack deploy -c stack.yml test
docker service ls #OK
docker stack rm test
docker service ls
docker stack deploy -c stack.yml test
docker service ls #NOK

Also, we're running into this issue on RHEL7 /w Docker 1703-ee on VMware vSphere virtuals. We were thus far unable to reproduce the issue on Virtualbox or VMware Fusion, using the same stack. Our next steps would be to run an other OS on VMware vSphere to reproduce the issue, and debug the vxlan config.

pjutard · 2017-05-29T23:23:34Z

Same problem here. Same scenario: multiple stacks deployed, each with its own network, after some docker stack rm and docker stack deploy, we get the "Error creating vxlan: file exists" error msg.
We have a swarm in this state right now...

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 21:43:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Using Docker4Azure

mpepping · 2017-05-30T14:31:21Z

Allright, some extensive tests led to interesting results. We're running RHEL7 /w docker-1703-ee.
The issue was direct reproducible when running the 3.10.0-327.10.1.el7.x86_64 kernel with iptables (firewalld removed). A docker swarm deploy/rm/deploy combo fails every test run on this setup.
After bumping the kernel (3.10.0-514.6.1.el7.x86_64) and installing+enabling the firewalld service, the results are much more reliable .. but still can break after 200+ or 800+ docker swarm deploy/rm/deploy runs .. after which rebooting the host is the only reliable way to fix this. Note that just bumping the kernel, or enabling firewalld isn't sufficient .. the combination of both made the difference in our use case.

jgeyser · 2017-05-31T14:32:52Z

As per #562

You can correct this by running:

sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*

Not sure if this is a long term solution.

mavenugo · 2017-05-31T15:57:54Z

@jgeyser thats a workaround to get out of issue. But that is not a solution. We have to narrow down the RC and fix it in the code.

dcrystalj · 2017-06-01T15:09:49Z

@jgeyser this is not working for me. sometimes i also get issue Unable to complete atomic operation, key modified

I have tried removing docker-engine and leaving docker swarm, but it didn't work.

update:
Needed full machine restart as well

dang3r · 2017-06-04T22:36:27Z

We are encountering the same problem with the following configuration:

Linux hostname 4.9.0-0.bpo.2-amd64 #1 SMP Debian 4.9.13-1~bpo8+1 (2017-02-27) x86_64 GNU/Linux

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64
 Experimental: false

lnshi · 2017-06-09T04:03:38Z

Exactly same problem experienced, quite randomly.

mpepping · 2017-06-09T20:05:36Z

@lnshi Care to share about your environment .. os, docker version, using virtualisation?

sanimej · 2017-06-12T00:48:28Z

@dang3r @lnshi Can you add the details on what triggers this issue for you ? Have you able to find any pattern or a way to recreate this issue ? If your host is a VM, what hypervisor are you using ?

lnshi · 2017-06-12T01:22:35Z

@sanimej Maybe I just misreported this, I just figured out that my actual problem is like this: I reported there: Issue #33626, it is also subnet overlaps problem, but seems different reason. Can you help on that also? thanks.

sanimej · 2017-06-14T22:54:21Z

@dang3r @dcrystalj @discotroy If you are still having this issue can you check if your host has any udev rules that might rename interface names that start with vx. ?

For overlay networks, docker daemon creates a vxlan device with the name like vx-001001-a12eme where 001001 is the VNI id in hex, followed by shortened network id. This device then gets moved to a overlay network specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before its deleted. If there is a udev rule that could rename these interfaces and if the rename happens before docker daemon can delete it, the host will end up with an orphaned interface with that vni id. So subsequent attempts to create that interface will fail.

adpjay · 2017-07-11T16:13:54Z

@mpepping Were you able to get the error message ("ERRO[0143] fatal task error error="starting container failed:...") to show up in the docker daemon logs? My swarm is in a state right now where several containers are in this condition. When I try to start one of the containers, the client returns an error message:
Error response from daemon: Error response from daemon: subnet sandbox join failed for "10.0.8.0/24": error creating vxlan interface: operation not supported
But I don't see any corresponding docker daemon API log message in my swarm.

I would like to forward all the daemon messages to splunk so that I can create an event recognizing when this condition occurs so we can execute a workaround to keep people moving forward and to validate that we aren't seeing it anymore when we get a fix.

mpepping · 2017-07-11T18:18:43Z

@adpjay Messages with the loglevel ERROR are logged by the daemon by default. Syslog should them up and send them to something like /var/log/messages or journald.
I our case, the exact error was: ERRO[0143] fatal task error error="starting container failed: subnet sandbox join failed for \"10.0.2.0/24\": error creating vxlan interface: file exists". The file exists message differs from the operation not supported message in your error-message. I our case a custom udev-rule for renaming network-interfaces was part of the issue. Maybe something worth checking out.

adpjay · 2017-07-11T18:52:24Z

@mpepping Thanks for responding. I see lots of messages for docker API GET calls in /var/log/messages and when I run docker logs -f <ucp-manager> (for any of the 3 managers in our swarm) but I don't see the error reported by the client. I wonder if there is a specific node in the swarm they show up on.
I did notice that the exact error was different (file exists vs. operation not supported) but I was thinking it could be due to the fact that our host has a different OS than you (we're running SUSE).
Thanks for the hint about the udev-rule I'll check it out.

mpepping · 2017-07-11T19:08:10Z

@adpjay You should be able to see the error at the daemon level (not in the UCP container logging). You should be able to see the error running something like journalctl -u docker.service. Good luck troubleshooting!

RAKedz · 2017-07-18T19:08:27Z

I am getting this error on one of my nodes, I have 5 total. Any service trying to run on it will get his error:

error creating vxlan interface: file exists

I tried to do a 'docker system prune' and even booted the server, but it didn't fix it. Then someone mentioned it could be the network and I thought that could be it because I was heavily messing with the network because I have having issues with the encrypted network I created. I ended up creating a new non-encrypted network and using it for my services, abandoning the previous.

I began to examine the network on my working nodes and noticed that the encrypted network I was using was either removed or was still listed. But on the node not working the encrypted network was there but it was showing a scope of local unlike the others. (not sure how/why it was changed to local)

Bad node:
iw3w9kdywnay jupiter overlay local

Good nodes:
iw3w9kdywnay jupiter overlay swarm

When I tried to remove the network on the bad node I received this message:

Error response from daemon: network jupiter has active endpoints

Which is why the 'docker system prune' couldn't remove it.

I removed it by doing the following:

Looked up its endpoint
docker network inspect jupiter
Remove it
docker network disconnect -f jupiter ingress-endpoint
docker network rm jupiter

Then I created a service to run on that node, and it started working for me.

This is my docker version:

Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64
Experimental: false

Working on Digital Ocean - Ubuntu 16.04

gitbensons · 2017-09-15T09:06:34Z

Found a workaround for this issue, without the need of rebooting or restarting docker daemon.
As @sanimej mentioned

For overlay networks, docker daemon creates a vxlan device with the name like vx-001001-a12eme where 001001 is the VNI id in hex, followed by shortened network id. This device then gets moved to a overlay network specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before its deleted

So once you know which vxlan id fails to be created (did a strace of the docker daemon process, which is overkill for sure, but I was in a hurry)
4993 15:01:04.640588 recvfrom(30, "\254\0\0\0\2\0\0\0\267\273\0\0\212\265\372\377\357\377\377\377\230\0\0\0\20\0\5\6\267\273\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-000105-1158f\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0\5\1\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 172
So 000105-1158f aka 0x105 aka vxlan id 261 in my case.

Build a list of active network namespaces and its vxlan's on the failing host.
For example:
# for i in $(ls /var/run/docker/netns/*); do echo ":::: $ns" >> ip.link.show; nsenter -m -t <PID of docker daemon> nsenter --net=$ns ip -d link show ; done >> ip.link.show

Now that you know the affected network namespace, double nsenter into it
# nsenter -m -t <PID of docker daemon> bash
# nsenter --net=/var/run/docker/netns/<affected namespace> bash
# ip link delete vxlan1

After that, the error is gone. Pretty sure Docker Inc. knows about that workaround, why they don't share it is up to the imagination of the reader.
Hope this helps.

lukewendling · 2018-04-25T19:35:52Z

I was this getting error on a docker swarm stack (docker v18.03) and finally removed the entire stack (docker stack rm) and re-created with docker stack deploy and the problem resolved.

ctelfer · 2018-04-27T21:47:39Z

So far have not been able to reproduce this locally. I've tried the steps described above and have also scripted them to run them repeatedly. No dice so far. Will try with larger #s of networks next.

Having said that, while inspecting the code I definitely found several race conditions. I think that one in particular could cause this issue, but without reproduction its hard to prove. Will issue a PR shortly.

tafelpootje · 2019-02-15T10:33:25Z

Same issue here.
the
sudo umount /var/run/docker/netns/* sudo rm /var/run/docker/netns/*
fix did not work
Removing the stack and readding seems to have worked (for one stack it worked directly, the other stack I had to redo the steps)

Docker version:
Docker version 18.09.2
Ubuntu 16.04

fendo64 · 2019-02-15T22:16:18Z

Same issue here.
the
sudo umount /var/run/docker/netns/* sudo rm /var/run/docker/netns/*
fix did not work
Removing the stack and readding seems to have worked (for one stack it worked directly, the other stack I had to redo the steps)

Next time, can you check if you have "vx-" interface on host:
ip link show | grep vx

If so, delete them, it worked for me:
ip link delete vx-xxxx

The correction that I propose is after reading the code, I do not have the environment to test.
If a good soul, has a test environment, could he test my correction proposal.

ryandaniels · 2019-03-18T17:35:48Z

Ran into the same issue.
Docker version: 18.06.1-ce

Fixed after @sanimej / @fendo64 's work-around:
ip link delete vx-xxxx

hannseman · 2019-04-01T13:09:58Z

Ran into the same issue.
Docker version 18.03.0-ce, build 0520e24

ip link delete vx-xxxx resolved it.

rnickle · 2019-04-03T15:18:08Z

Same problem, Docker CE 18.09.4, using swarm.

I deleted the ip links but a simple docker service update did not restart.

I was able to do it using docker service update in conjunction with '--with-registry-auth' and '--force'

leojonathanoh · 2019-05-01T17:32:56Z

Same issue here.
the
sudo umount /var/run/docker/netns/* sudo rm /var/run/docker/netns/*
fix did not work
Removing the stack and readding seems to have worked (for one stack it worked directly, the other stack I had to redo the steps)

Next time, can you check if you have "vx-" interface on host:
ip link show | grep vx

If so, delete them, it worked for me:
ip link delete vx-xxxx

The correction that I propose is after reading the code, I do not have the environment to test.
If a good soul, has a test environment, could he test my correction proposal.

That resolved it for me on docker stack deploy on Docker 18.06.1-ce Swarm

sderungs · 2019-06-12T14:23:34Z

Same happens to our environment:
# docker -v

Docker version 18.09.6, build 481bc77156

# cat /proc/version

Linux version 3.10.0-957.21.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Wed Jun 5 14:26:44 UTC 2019`

xys2015 · 2019-08-02T11:42:10Z

I do this below:
sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*
stop and start docker
rebuild docker swarm
change docker network ip
but all not work
then I do this
docker network rm mysystem
docker network create --driver overlay newsystem
It worked, I change the overlay network name

hexmode · 2020-02-22T17:16:03Z

Next time, can you check if you have "vx-" interface on host:
ip link show | grep vx

If so, delete them, it worked for me:
ip link delete vx-xxxx

This worked for me. Thanks, @fendo64 !

beckyjmcdabq · 2020-02-26T21:09:13Z

@hexmode - I see you did this just 4 days ago. Do you have a set of steps that you did? Did you leave your existing deployments running when you ran this command? Is this command dangerous in any way? Do we need to make sure we are deleting only the problem interfaces and if so, how do we do that?

hexmode · 2020-02-26T22:01:37Z

@beckyjmcdabq The steps are there: two commands. I really can't answer those questions since I'm only doing this on a development box for now. I only saw one interface. Is it safe? Well, if something goes wrong, you can redeploy.

wolfgangpfnuer · 2020-02-27T08:38:33Z

@beckyjmcdabq essentially, if everything is correct, ip link show | grep vx is empty.

Only when I got the error this issue is all about, did I ever see a result on any of my machines (double digits)
When deleting the network with ip link delete the problem was solved. other than doing this, a full restart of the node (not just docker, the machine) solved the problem as well, but of course takes longer and might have other side-effects.

I assume that the deletion of those networks is side-effect free, as they do not exist if the problem is not there.

you could probably go all willy-nilly by running the command with xargs I guess, but do so at your own risk:
# use at your own risk: ip link show | grep vx | xargs -rn1 ip link delete

elthariel · 2020-03-12T11:58:49Z

Happened to me on a single node swarm on Ubuntu 16.04.6 LTS host / 4.4.0-169-generic, tried with Docker 18.09.1, 18.09.9 and 19.03.7.

@fendo64 trick worked for me (i.e. ip link delete vx-xxx)

arturslogins · 2020-06-01T13:15:12Z

Same issue here.
the
sudo umount /var/run/docker/netns/* sudo rm /var/run/docker/netns/*
fix did not work
Removing the stack and readding seems to have worked (for one stack it worked directly, the other stack I had to redo the steps)

Next time, can you check if you have "vx-" interface on host:
ip link show | grep vx

If so, delete them, it worked for me:
ip link delete vx-xxxx

The correction that I propose is after reading the code, I do not have the environment to test.
If a good soul, has a test environment, could he test my correction proposal.

Thanks for solution !

Find this issue on Docker version 19.03.5, build 633a0ea838

reachworld · 2020-06-17T09:52:35Z

You can find full information and "easy" resolution on docker.

In brief:

Check each node for any vx-* interfaces in /sys/class/net:
$ ls -l /sys/class/net/ | grep vx
Once we have interface id's pull more details:
$ udevadm info /sys/class/net/
If these interfaces exist we should be able to safely remove them. Replace vx-000000-xxxxx with the interface id from Step 2:
$ sudo ip -d link show vx-000000-xxxxx
$ sudo ip link delete vx-000000-xxxxx
etc.
Redeploy the service.

stephMiotke · 2020-08-06T01:30:20Z

Confirmed this is still occurring in Docker EE version 19.03.11, build 0da829ac52. Linked instructions here do indeed solve the problem for us.

belfo · 2020-08-25T11:32:32Z

Had the issue on 19.03.12. removing the ip link fixed the issue too

ProteanCode · 2020-11-26T18:52:15Z

You can find full information and "easy" resolution on docker.

In brief:

Check each node for any vx-* interfaces in /sys/class/net:
$ ls -l /sys/class/net/ | grep vx

Once we have interface id's pull more details:
$ udevadm info /sys/class/net/

If these interfaces exist we should be able to safely remove them. Replace vx-000000-xxxxx with the interface id from Step 2:
$ sudo ip -d link show vx-000000-xxxxx
$ sudo ip link delete vx-000000-xxxxx
etc.

Redeploy the service.

This overall fixed problem, but it may be dangerous if the removed network is shared, ie. servers as a traefik proxy...
How can I check what service use which interface?

Abhishek-Govula · 2021-03-12T06:31:01Z

Next time, can you check if you have "vx-" interface on host:
ip link show | grep vx

If so, delete them, it worked for me:
ip link delete vx-xxxx

This worked for me on Docker CE 19.03.12. Thanks @fendo64 for your fix

bobf · 2021-04-14T18:08:28Z

If it's helpful to anybody else I can confirm that this solution also worked for me - I iterated through the list of devices and did:

ip -d link show "{device}" && ip link delete "${device}"

We were able to bring the cluster back to a happy state once this had been applied - thank you very much for sharing the solution, it solved a big headache at the end of a very stressful day.

matusnovak · 2021-05-02T18:45:50Z

I am also having this issue.

Removing ip links did not work for me.

What worked for me was to remove all services that depend on the overlay network. Remove the network. Reboot. Then re-create everything but this time I have changed the network name.

bdoublet91 · 2022-03-10T14:51:55Z

I think we find the solution.
When could we hope a fix for docker swarm overlay network ?

Thank you

albertvveld · 2022-06-23T10:03:04Z

Removing IP links does fix the problem however, please fix this permanently please.

dberardo-com · 2022-08-02T11:48:18Z

how about removing the node from the cluster and then let it join again? would this workaround work too?

Stainless5792 · 2022-10-14T08:49:32Z

Got the same issue suddenly in docker swarm mode(1 master + 1 worker) with docker version 20.10.17

$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 32
  Running: 9
  Paused: 0
  Stopped: 23
 Images: 578
 Server Version: 20.10.17

error message


Oct 14 08:01:15 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:15.703465453Z" level=warning msg="Health check for container 0b2233152912ebb25fe518413b1f7d712b9c0b4107ec9eb27a82887542c72447 error: context deadline exceeded: unknown"
Oct 14 08:01:16 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:16.973267829Z" level=warning msg="rmServiceBinding 56da2a4a026ea6073e56c0a1ddd7dc60964920195ac3583fa9af522cd498bf37 possible transient state ok:false entries:0 set:false "
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.297022988Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.297161300Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.301269053Z" level=warning msg="Health check for container b94b333102363c5b5e4e3743e3a94bcb2b5dfedf5fe28a88ff5e282b866c2752 error: context deadline exceeded"
Oct 14 08:01:18 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:18.232215740Z" level=info msg="initialized VXLAN UDP port to 4789 "
Oct 14 08:01:18 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:18.413119164Z" level=warning msg="failed to deactivate service binding for container hh_prod_backend-nginx.1.jflf98y83j8s00zblju2rf0sz" error="No such container: hh_prod_backend-nginx.1.jflf98y83j8s00zblju2rf0sz" module=node/agent node.id=f4c87y0fn18gq8jlmcjvn74f2
Oct 14 08:01:19 ip-10-44-2-202 kernel: br0: renamed from ov-001007-pwdcn
Oct 14 08:01:19 ip-10-44-2-202 kernel: br1: renamed from ov-001007-pwdcn
Oct 14 08:01:19 ip-10-44-2-202 kernel: ov-001007-pwdcn: renamed from br1
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.497912332Z" level=error msg="moving interface ov-001007-pwdcn to host ns failed, invalid argument, after config error error setting interface \"ov-001007-pwdcn\" IP to 20.0.7.1/24: cannot program address 20.0.7.1/24 in sandbox interface because it conflicts with existing route {Ifindex: 2 Dst: 20.0.7.0/24 Src: 20.0.7.1 Gw: <nil> Flags: [] Table: 254}"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.597860439Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"20.0.7.0/24\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=f4c87y0fn18gq8jlmcjvn74f2 service.id=sljpybp0vgh5bpf78uvwimztd task.id=xe7fihzdr1dco9o1req3eicdf
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598356846Z" level=error msg="failed adding service bindingfor 4566d70127e1a41749bda34b8b0588a7ffb51c5f61c58b524c1a2248a3774ea6 epRec:{hh_prod_mobile.1.o24aonnnvpea5z4xhtgmq781m hh_prod_mobile ksdvp0uyapg39npoqodcmspn6 20.0.7.8 20.0.7.175 [] [mobile] [b5778ef16d92] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 not found"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598539819Z" level=error msg="failed adding service bindingfor 4f0e4759b1eac4a1e5f8ce5b3aef3b86a362f3c62ebbaf0112e2dd20898df4bf epRec:{hh_prod_auth.1.loq5llju3sn267b2ybmbznkwd hh_prod_auth 0tpkzhq0rdyhsc0rfhs4wmsyh 20.0.7.103 20.0.7.4 [] [auth] [48e59cb68f4e] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 notfound"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598811672Z" level=error msg="failed adding service bindingfor 65146bc83fc0fccd86ce38fb6de78ffd6a43dcd3274ef79561242f5e1448f0fe epRec:{hh_prod_backend-server.1.dp5da1ojjg8vucmt3n1ibrsz5 hh_prod_backend-server 0d9x9kmd34155liter9w6j0ml 20.0.7.2 20.0.7.186 [] [backend-server] [6c315162586e] false} err:networkpwdcn1sb3j4zwvhd1of3vudz6 not found"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.599159518Z" level=error msg="failed adding service bindingfor 94bacf56fda675999a4c863d1466bea62f4a8cbc2960a41a091452193433fc07 epRec:{hh_prod_adminui.1.siplb7h3zz9guzew1vr68y72f hh_prod_adminui 0whft1xsmxxe8tmarqhljpxy1 20.0.7.10 20.0.7.145 [] [adminui] [66dd4c3a8173] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 not found"

jerrac · 2022-11-30T17:33:42Z

I'm 90% sure I'm running into this issue. I see the error="network sandbox join failed: subnet sandbox join failed for \"172.20.15.0/24\": error creating vxlan interface: file exists" messages in my logs and I was able to fix a node by deleting the vx-* interface.

Docker 20.10.21 on Ubuntu 22.04 in swarm mode.

I'm also pretty sure it occurs when something causes Docker to not be able to clean up after a docker rm < container > command. I have been testing my stack by running a script that will randomly delete a container or run systemctl stop docker. And the vxlan error has shown up multiple times now.

Ruppsn · 2023-01-16T09:17:25Z

We also encounterd this today on a production setup. After a Service failed an the restart policy kicked in the service wasnt able to start anymore with the error:

Rejected 10 seconds ago "network sandbox join failed: subnet sandbox join failed for "10.0.101.0/24": error creating vxlan interface: file exists

After a reboot, the devices under ls -l /sys/class/net/ | grep vx were gone and the services could start in a second.

Version: 20.10.22
containerd version: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Kernel Version: 5.4.17-2136.315.5.el8uek.x86_64

t0mtaylor · 2023-06-28T16:52:09Z

This works for me on CentOS - calling this alot with cron jobs to ensure they dont cause more stress!

I take no responsibility if this doesn't work or it removes all of your vxlan interfaces so be careful and test first!

/sbin/ip -d link show | grep vx | grep DOWN | xargs -rn1 ip link delete

discotroy mentioned this issue May 18, 2017

Cannot start container: subnet sandbox join failed for "10.0.0.0/24": error creating vxlan interface: file exists #945

Closed

mpepping mentioned this issue Jun 8, 2017

Starting container failed: Address already in use when deploying service moby/moby#31698

Open

sanimej mentioned this issue Jun 10, 2017

Cleanup interfaces properly when vxlan plumbing fails #1800

Closed

jcmcote mentioned this issue Feb 20, 2018

overlay network stops working after stack down/up cycles (possible race condition or locking issue) #2081

Open

fcrisciani assigned ctelfer Apr 25, 2018

fendo64 unassigned ctelfer Feb 2, 2019

Fred-DTV mentioned this issue Aug 4, 2023

Agent install fails - problem with mounting and vxlan interface portainer/portainer#9993

Open

2 tasks

Error creating vxlan interface: file exists #1765

Error creating vxlan interface: file exists #1765

Comments

discotroy commented May 18, 2017

mpepping commented May 22, 2017

fcrisciani commented May 25, 2017

mpepping commented May 26, 2017

fcrisciani commented May 26, 2017

fcrisciani commented May 26, 2017

mpepping commented May 27, 2017

pjutard commented May 29, 2017 • edited Loading

mpepping commented May 30, 2017 • edited Loading

jgeyser commented May 31, 2017

mavenugo commented May 31, 2017

dcrystalj commented Jun 1, 2017 • edited Loading

dang3r commented Jun 4, 2017

lnshi commented Jun 9, 2017

mpepping commented Jun 9, 2017

sanimej commented Jun 12, 2017

lnshi commented Jun 12, 2017

sanimej commented Jun 14, 2017

adpjay commented Jul 11, 2017

mpepping commented Jul 11, 2017

adpjay commented Jul 11, 2017

mpepping commented Jul 11, 2017

RAKedz commented Jul 18, 2017

gitbensons commented Sep 15, 2017 • edited Loading

lukewendling commented Apr 25, 2018

ctelfer commented Apr 27, 2018

tafelpootje commented Feb 15, 2019

fendo64 commented Feb 15, 2019

ryandaniels commented Mar 18, 2019 • edited Loading

hannseman commented Apr 1, 2019

rnickle commented Apr 3, 2019

leojonathanoh commented May 1, 2019

sderungs commented Jun 12, 2019

xys2015 commented Aug 2, 2019

hexmode commented Feb 22, 2020

beckyjmcdabq commented Feb 26, 2020

hexmode commented Feb 26, 2020

wolfgangpfnuer commented Feb 27, 2020 • edited Loading

elthariel commented Mar 12, 2020

arturslogins commented Jun 1, 2020

reachworld commented Jun 17, 2020

stephMiotke commented Aug 6, 2020

belfo commented Aug 25, 2020

ProteanCode commented Nov 26, 2020 • edited Loading

Abhishek-Govula commented Mar 12, 2021

bobf commented Apr 14, 2021

matusnovak commented May 2, 2021

bdoublet91 commented Mar 10, 2022

albertvveld commented Jun 23, 2022 • edited Loading

dberardo-com commented Aug 2, 2022

Stainless5792 commented Oct 14, 2022

jerrac commented Nov 30, 2022

Ruppsn commented Jan 16, 2023

t0mtaylor commented Jun 28, 2023 • edited Loading

pjutard commented May 29, 2017 •

edited

Loading

mpepping commented May 30, 2017 •

edited

Loading

dcrystalj commented Jun 1, 2017 •

edited

Loading

gitbensons commented Sep 15, 2017 •

edited

Loading

ryandaniels commented Mar 18, 2019 •

edited

Loading

wolfgangpfnuer commented Feb 27, 2020 •

edited

Loading

ProteanCode commented Nov 26, 2020 •

edited

Loading

albertvveld commented Jun 23, 2022 •

edited

Loading

t0mtaylor commented Jun 28, 2023 •

edited

Loading