
Error creating vxlan interface: file exists #1765

Open
discotroy opened this issue May 18, 2017 · 57 comments

@discotroy

Previous related threads:

A comment at the current tail end of #945 recommends opening a new ticket. I couldn't find one opened by the original poster, so here we go.

I've been using swarm for the past couple of months and frequently hit upon this problem. I have a modest swarm (~8-9 nodes), all running Ubuntu 16.04, now with Docker 17.05-ce. There is not a great amount of container churn, but I do use a stack yaml file to deploy ~20 services across ~20 encrypted overlay networks.

I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the "Error creating vxlan: file exists" error. This prevents the containers from coming up on a host and forces them to attempt to relocate, which may or may not work.

I have noted in the issues above that the problem has, several times over, been thought to be fixed, yet it always creeps back in for various users.

To rectify the issue, I have tried rebooting the node, restarting iptables, and removing and re-creating the stack, all of which work to varying degrees but are most definitely workarounds rather than solutions.

I cannot think how I can attempt to reproduce this error, but if anyone wants to suggest ways to debug, I am at your service.

@mpepping

We're suffering from the same issue on RHEL7 with Docker 17.03-ee and are able to reproduce it by adding a service on a swarm node where the overlay network isn't active yet.
I've tried about the same level of troubleshooting as @discotroy and can confirm that rebooting or restarting the docker engine fixes the issue to some extent, with fluctuating results. Also open to suggestions on how to debug this issue.
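
A sketch of that repro, with placeholder names (the network, service, image, and node hostname below are all hypothetical); the constraint pins the task onto a node that has not yet instantiated the overlay network:

docker network create -d overlay --opt encrypted testnet
docker service create --name probe --network testnet \
  --constraint node.hostname==fresh-node nginx:alpine   # "fresh-node" has no task on testnet yet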

@fcrisciani

Do you guys have some logs to share? It would be super helpful to have a way to reproduce this and grab logs with the engine in debug mode.

Engine in debug:
echo '{"debug": true}' > /etc/docker/daemon.json
then: sudo kill -HUP <pid of dockerd>
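
Note that the echo above replaces any existing /etc/docker/daemon.json. A slightly more careful variant (a sketch; it still rewrites the file, so merge the setting by hand if other options are already configured there):

sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true   # keep a backup if one exists
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
sudo kill -HUP "$(pidof dockerd)"          # dockerd reloads its configuration on SIGHUP
docker info 2>/dev/null | grep -i debug    # should now report debug mode as enabled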

@mpepping

Will collect more logging. Here's some debug output with the error message: https://gist.github.com/mpepping/50cb9b71b5535b318c6a548d4e8ba97b

@fcrisciani

@mpepping thanks, the error message is clear. My current suspect is a race condition during the deletion of the sandbox that leaks the vxlan interface behind it. When a new container comes up, it tries to create the vxlan interface again, finds that one already exists, and errors out. The more interesting part of the logs now would be the block where the interface deletion is supposed to happen, to figure out why that is not happening properly.

@fcrisciani

I'm also already trying to reproduce it locally, but if you guys narrow down a specific set of steps that reproduce it with high probability, let me know.

@mpepping

@fcrisciani indeed, it seems to be a race condition running into a locking issue. A breakdown of the steps, with debug output, is available at https://gist.github.com/mpepping/739e9a486b6c3266093a8af712869e90 .

Basically, the command set for us to reproduce is the following (the gist provides more detail, and a minimal sketch of the stack file is shown after the commands):

docker swarm init
docker network create -d overlay  ucp-hrm
docker stack deploy -c stack.yml test
docker service ls #OK
docker stack rm test
docker service ls
docker stack deploy -c stack.yml test
docker service ls #NOK
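
For completeness, a minimal sketch of the kind of stack file used here (not the one from the gist; the image, service name, and replica count are placeholders):

cat > stack.yml <<'EOF'
version: "3.3"
services:
  web:
    image: nginx:alpine      # placeholder image
    deploy:
      replicas: 2
    networks:
      - ucp-hrm
networks:
  ucp-hrm:
    external: true           # the overlay network created beforehand
EOF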

Also, we're running into this issue on RHEL7 with Docker 17.03-ee on VMware vSphere virtual machines. We were thus far unable to reproduce the issue on VirtualBox or VMware Fusion using the same stack. Our next steps would be to run another OS on VMware vSphere to reproduce the issue, and to debug the vxlan config.

@pjutard

pjutard commented May 29, 2017

Same problem here. Same scenario: multiple stacks deployed, each with its own network; after some docker stack rm and docker stack deploy cycles, we get the "Error creating vxlan: file exists" error message.
We have a swarm in this state right now...

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 21:43:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

Using Docker4Azure

@mpepping

mpepping commented May 30, 2017

Alright, some extensive tests led to interesting results. We're running RHEL7 with docker-17.03-ee.
The issue was directly reproducible when running the 3.10.0-327.10.1.el7.x86_64 kernel with iptables (firewalld removed); a deploy/rm/deploy combo failed every test run on this setup.
After bumping the kernel (3.10.0-514.6.1.el7.x86_64) and installing and enabling the firewalld service, the results are much more reliable, but things can still break after 200+ or 800+ deploy/rm/deploy runs, after which rebooting the host is the only reliable fix. Note that just bumping the kernel or enabling firewalld isn't sufficient; the combination of both made the difference in our use case.
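
For reference, a loop along these lines can drive that kind of repeated deploy/rm run (a sketch; the stack file and sleep intervals are placeholders):

for i in $(seq 1 1000); do
  docker stack deploy -c stack.yml test
  sleep 15    # give tasks time to start
  docker stack rm test
  sleep 15    # give networks/sandboxes time to be cleaned up
done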

@jgeyser

jgeyser commented May 31, 2017

As per #562

You can correct this by running:

sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*

Not sure if this is a long-term solution.

@mavenugo
Contributor

@jgeyser that's a workaround to get out of the issue, but it is not a solution. We have to narrow down the root cause and fix it in the code.

@dcrystalj

dcrystalj commented Jun 1, 2017

@jgeyser this is not working for me. Sometimes I also get the error Unable to complete atomic operation, key modified.

I have tried removing docker-engine and leaving docker swarm, but it didn't work.

Update: a full machine restart was needed as well.

@dang3r

dang3r commented Jun 4, 2017

We are encountering the same problem with the following configuration:

Linux hostname 4.9.0-0.bpo.2-amd64 #1 SMP Debian 4.9.13-1~bpo8+1 (2017-02-27) x86_64 GNU/Linux

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 10:53:29 2017
 OS/Arch:      linux/amd64
 Experimental: false

@lnshi

lnshi commented Jun 9, 2017

Exactly the same problem experienced here, quite randomly.

@mpepping

mpepping commented Jun 9, 2017

@lnshi Care to share some details about your environment: OS, Docker version, and whether you're using virtualisation?

@sanimej

sanimej commented Jun 12, 2017

@dang3r @lnshi Can you add details on what triggers this issue for you? Have you been able to find any pattern or a way to recreate it? If your host is a VM, which hypervisor are you using?

@lnshi

lnshi commented Jun 12, 2017

@sanimej Maybe I just misreported this. I just figured out that my actual problem is the one I reported in Issue #33626; it is also a subnet overlap problem, but it seems to have a different cause. Can you help with that as well? Thanks.

@sanimej

sanimej commented Jun 14, 2017

@dang3r @dcrystalj @discotroy If you are still having this issue, can you check whether your host has any udev rules that might rename interface names that start with vx-?

For overlay networks, the docker daemon creates a vxlan device with a name like vx-001001-a12eme, where 001001 is the VNI ID in hex, followed by the shortened network ID. This device then gets moved into an overlay-network-specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before it is deleted. If a udev rule renames one of these interfaces before the docker daemon can delete it, the host will end up with an orphaned interface with that VNI ID, so subsequent attempts to create the interface will fail.
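
A quick way to check for both conditions (a sketch, assuming the usual udev rule locations):

grep -rn 'NAME=' /etc/udev/rules.d/ 2>/dev/null           # custom rules that rename network interfaces
ip -o link show | awk -F': ' '{print $2}' | grep '^vx-'   # orphaned vxlan devices left in the host namespace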

@adpjay

adpjay commented Jul 11, 2017

@mpepping Were you able to get the error message ("ERRO[0143] fatal task error error="starting container failed:...") to show up in the docker daemon logs? My swarm is in a state right now where several containers are in this condition. When I try to start one of the containers, the client returns an error message:
Error response from daemon: Error response from daemon: subnet sandbox join failed for "10.0.8.0/24": error creating vxlan interface: operation not supported
But I don't see any corresponding docker daemon API log message in my swarm.

I would like to forward all the daemon messages to Splunk so that I can create an event that recognizes when this condition occurs. That way we can execute a workaround to keep people moving forward, and validate that we aren't seeing it anymore once we get a fix.

@mpepping

@adpjay Messages with log level ERROR are logged by the daemon by default. Syslog should pick them up and send them to something like /var/log/messages or journald.
In our case, the exact error was: ERRO[0143] fatal task error error="starting container failed: subnet sandbox join failed for \"10.0.2.0/24\": error creating vxlan interface: file exists". The file exists message differs from the operation not supported message in your error. In our case a custom udev rule for renaming network interfaces was part of the issue; maybe that's worth checking out.

@adpjay

adpjay commented Jul 11, 2017

@mpepping Thanks for responding. I see lots of messages for docker API GET calls in /var/log/messages and when I run docker logs -f <ucp-manager> (for any of the 3 managers in our swarm), but I don't see the error reported by the client. I wonder if there is a specific node in the swarm they show up on.
I did notice that the exact error was different (file exists vs. operation not supported), but I was thinking it could be because our host runs a different OS than yours (we're running SUSE).
Thanks for the hint about the udev rule; I'll check it out.

@mpepping

@adpjay You should be able to see the error at the daemon level (not in the UCP container logging), for example by running something like journalctl -u docker.service. Good luck troubleshooting!
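
If the goal is to catch this condition for forwarding to Splunk, a simple journald filter is one option (a sketch, assuming the daemon runs under systemd as docker.service):

journalctl -u docker.service -f | grep --line-buffered 'error creating vxlan interface'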

@RAKedz

RAKedz commented Jul 18, 2017

I am getting this error on one of my nodes (I have 5 in total). Any service trying to run on it gets this error:

error creating vxlan interface: file exists

I tried a docker system prune and even rebooted the server, but it didn't fix it. Then someone mentioned it could be the network, and I thought that could be it, because I had been heavily messing with the network: I was having issues with the encrypted network I had created. I ended up creating a new non-encrypted network and using it for my services, abandoning the previous one.

I began to examine the networks on my working nodes and noticed that the encrypted network I had been using was either removed or still listed. But on the broken node the encrypted network was there, showing a scope of local, unlike the others (not sure how/why it was changed to local).

Bad node:
iw3w9kdywnay jupiter overlay local

Good nodes:
iw3w9kdywnay jupiter overlay swarm

When I tried to remove the network on the bad node I received this message:

Error response from daemon: network jupiter has active endpoints

Which is why the 'docker system prune' couldn't remove it.

I removed it by doing the following:

  1. Looked up its endpoint
    docker network inspect jupiter
  2. Remove it
    docker network disconnect -f jupiter ingress-endpoint
    docker network rm jupiter

Then I created a service to run on that node, and it started working for me.

This is my docker version:

Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 3a232c8
Built: Tue Feb 28 08:01:32 2017
OS/Arch: linux/amd64
Experimental: false

Working on Digital Ocean - Ubuntu 16.04

@gitbensons

gitbensons commented Sep 15, 2017

Found a workaround for this issue, without needing to reboot or restart the docker daemon.
As @sanimej mentioned:

For overlay networks, docker daemon creates a vxlan device with the name like vx-001001-a12eme where 001001 is the VNI id in hex, followed by shortened network id. This device then gets moved to a overlay network specific namespace. When the overlay network is deleted, the device is moved back to the host namespace before its deleted

So, once you know which vxlan ID fails to be created (I did an strace of the docker daemon process, which is overkill for sure, but I was in a hurry):
4993 15:01:04.640588 recvfrom(30, "\254\0\0\0\2\0\0\0\267\273\0\0\212\265\372\377\357\377\377\377\230\0\0\0\20\0\5\6\267\273\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-000105-1158f\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0\5\1\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 172
So 000105-1158f aka 0x105 aka vxlan id 261 in my case.
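
For the hex-to-decimal step:
printf '%d\n' 0x000105    # prints 261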

Build a list of active network namespaces and their vxlans on the failing host.
For example:
# for ns in /var/run/docker/netns/*; do echo ":::: $ns"; nsenter -m -t <PID of docker daemon> nsenter --net=$ns ip -d link show; done >> ip.link.show

Now that you know the affected network namespace, double nsenter into it
# nsenter -m -t <PID of docker daemon> bash
# nsenter --net=/var/run/docker/netns/<affected namespace> bash
# ip link delete vxlan1

After that, the error is gone. Pretty sure Docker Inc. knows about this workaround; why they don't share it is left to the imagination of the reader.
Hope this helps.

@lukewendling

I was getting this error on a docker swarm stack (docker v18.03), finally removed the entire stack (docker stack rm) and re-created it with docker stack deploy, and the problem resolved itself.

@ctelfer
Copy link
Contributor

ctelfer commented Apr 27, 2018

So far I have not been able to reproduce this locally. I've tried the steps described above and have also scripted them to run repeatedly. No dice so far. Will try with larger numbers of networks next.

Having said that, while inspecting the code I definitely found several race conditions. I think one in particular could cause this issue, but without a reproduction it's hard to prove. Will issue a PR shortly.

@tafelpootje

Same issue here.
The
sudo umount /var/run/docker/netns/*; sudo rm /var/run/docker/netns/*
fix did not work.
Removing the stack and re-adding it seems to have worked (for one stack it worked directly; for the other I had to redo the steps).

Docker version:
Docker version 18.09.2
Ubuntu 16.04

@fendo64

fendo64 commented Feb 15, 2019

(In reply to @tafelpootje's comment above.)

Next time, can you check whether you have any "vx-" interfaces on the host:
ip link show | grep vx

If so, delete them; it worked for me:
ip link delete vx-xxxx

The fix I propose comes from reading the code; I do not have an environment to test it.
If a good soul with a test environment could try my proposed fix, that would help.

@ryandaniels

ryandaniels commented Mar 18, 2019

Ran into the same issue.
Docker version: 18.06.1-ce

Fixed with @sanimej's / @fendo64's workaround:
ip link delete vx-xxxx

@hannseman

Ran into the same issue.
Docker version 18.03.0-ce, build 0520e24

ip link delete vx-xxxx resolved it.

@rnickle

rnickle commented Apr 3, 2019

Same problem, Docker CE 18.09.4, using swarm.

I deleted the ip links, but a simple docker service update did not restart the service.

I was able to do it using docker service update in conjunction with '--with-registry-auth' and '--force'.

@leojonathanoh

(Quoting @fendo64's suggestion above.)

That resolved it for me for docker stack deploy, on Docker 18.06.1-ce in Swarm mode.

@sderungs

The same happens in our environment:
# docker -v

Docker version 18.09.6, build 481bc77156

# cat /proc/version

Linux version 3.10.0-957.21.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Wed Jun 5 14:26:44 UTC 2019

@xys2015

xys2015 commented Aug 2, 2019

I tried the following:
sudo umount /var/run/docker/netns/*
sudo rm /var/run/docker/netns/*
stop and start docker
rebuild the docker swarm
change the docker network IP
but none of it worked.
Then I did this:
docker network rm mysystem
docker network create --driver overlay newsystem
It worked; I changed the overlay network name.

@hexmode

hexmode commented Feb 22, 2020

(Quoting @fendo64's suggestion above.)

This worked for me. Thanks, @fendo64 !

@beckyjmcdabq

@hexmode - I see you did this just 4 days ago. Do you have a set of steps that you did? Did you leave your existing deployments running when you ran this command? Is this command dangerous in any way? Do we need to make sure we are deleting only the problem interfaces and if so, how do we do that?

@hexmode

hexmode commented Feb 26, 2020

@beckyjmcdabq The steps are there: two commands. I really can't answer those questions since I'm only doing this on a development box for now. I only saw one interface. Is it safe? Well, if something goes wrong, you can redeploy.

@wolfgangpfnuer

wolfgangpfnuer commented Feb 27, 2020

@beckyjmcdabq essentially, if everything is correct, ip link show | grep vx is empty.

Only when I got the error this issue is all about did I ever see any output, on any of my machines (double digits).
When I deleted the interface with ip link delete, the problem was solved. Other than doing this, a full restart of the node (not just docker, the machine) solved the problem as well, but of course that takes longer and might have other side effects.

I assume that deleting those interfaces is side-effect free, as they do not exist when the problem is not present.

You could probably go all willy-nilly by running the command with xargs, I guess, but do so at your own risk:
# use at your own risk: ip -o link show | awk -F': ' '$2 ~ /^vx-/ {print $2}' | xargs -rn1 ip link delete

@elthariel

Happened to me on a single-node swarm on an Ubuntu 16.04.6 LTS host / 4.4.0-169-generic; tried with Docker 18.09.1, 18.09.9 and 19.03.7.

@fendo64's trick worked for me (i.e. ip link delete vx-xxx).

@arturslogins

(Quoting @fendo64's suggestion above.)

Thanks for the solution!

Found this issue on Docker version 19.03.5, build 633a0ea838.

@reachworld

You can find full information and an "easy" resolution from Docker.

In brief:

  1. Check each node for any vx-* interfaces in /sys/class/net:
    $ ls -l /sys/class/net/ | grep vx

  2. Once we have the interface IDs, pull more details:
    $ udevadm info /sys/class/net/

  3. If these interfaces exist, we should be able to safely remove them. Replace vx-000000-xxxxx with the interface ID from Step 2:
    $ sudo ip -d link show vx-000000-xxxxx
    $ sudo ip link delete vx-000000-xxxxx
    etc.

  4. Redeploy the service.

@stephMiotke

Confirmed this is still occurring in Docker EE version 19.03.11, build 0da829ac52. Linked instructions here do indeed solve the problem for us.

@belfo

belfo commented Aug 25, 2020

Had the issue on 19.03.12. Removing the ip link fixed the issue too.

@ProteanCode

ProteanCode commented Nov 26, 2020

(Quoting @reachworld's resolution steps above.)

This overall fixed the problem, but it may be dangerous if the removed network is shared, i.e. serves as a traefik proxy network...
How can I check which service uses which interface?
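
There is no single command for that as far as I know, but the interface name encodes the VNI and the first characters of the network ID (see @sanimej's comment above), so a sketch like the following, run on a manager node, can map an interface back to a network and the services attached to it (the interface name here is just an example):

# example interface: vx-001007-pwdcn -> the network ID starts with "pwdcn"
docker network ls --no-trunc --filter driver=overlay | grep pwdcn
docker network inspect --verbose <network-name>   # on a manager, --verbose lists the services using the network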

@Abhishek-Govula

(Quoting @fendo64's suggestion above.)

This worked for me on Docker CE 19.03.12. Thanks @fendo64 for your fix

@bobf

bobf commented Apr 14, 2021

If it's helpful to anybody else, I can confirm that this solution also worked for me - I iterated through the list of devices and did:

ip -d link show "${device}" && ip link delete "${device}"
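
The surrounding loop might look something like this (a sketch, gathering the devices from the host namespace):

for device in $(ip -o link show | awk -F': ' '{print $2}' | grep '^vx-'); do
  ip -d link show "${device}" && ip link delete "${device}"
done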

We were able to bring the cluster back to a happy state once this had been applied - thank you very much for sharing the solution, it solved a big headache at the end of a very stressful day.

@matusnovak

I am also having this issue.

Removing ip links did not work for me.

What worked for me was to remove all services that depend on the overlay network, remove the network, and reboot. Then I re-created everything, but this time I changed the network name.

@bdoublet91

I think we've found the solution.
When can we hope for a fix for the docker swarm overlay network?

Thank you

@albertvveld

albertvveld commented Jun 23, 2022

Removing the IP links does fix the problem; however, please fix this permanently.

@dberardo-com

How about removing the node from the cluster and then letting it join again? Would this workaround work too?

@Stainless5792

Got the same issue suddenly in docker swarm mode (1 master + 1 worker) with docker version 20.10.17.

$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 32
  Running: 9
  Paused: 0
  Stopped: 23
 Images: 578
 Server Version: 20.10.17

error message


Oct 14 08:01:15 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:15.703465453Z" level=warning msg="Health check for container 0b2233152912ebb25fe518413b1f7d712b9c0b4107ec9eb27a82887542c72447 error: context deadline exceeded: unknown"
Oct 14 08:01:16 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:16.973267829Z" level=warning msg="rmServiceBinding 56da2a4a026ea6073e56c0a1ddd7dc60964920195ac3583fa9af522cd498bf37 possible transient state ok:false entries:0 set:false "
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.297022988Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.297161300Z" level=error msg="stream copy error: reading from a closed fifo"
Oct 14 08:01:17 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:17.301269053Z" level=warning msg="Health check for container b94b333102363c5b5e4e3743e3a94bcb2b5dfedf5fe28a88ff5e282b866c2752 error: context deadline exceeded"
Oct 14 08:01:18 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:18.232215740Z" level=info msg="initialized VXLAN UDP port to 4789 "
Oct 14 08:01:18 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:18.413119164Z" level=warning msg="failed to deactivate service binding for container hh_prod_backend-nginx.1.jflf98y83j8s00zblju2rf0sz" error="No such container: hh_prod_backend-nginx.1.jflf98y83j8s00zblju2rf0sz" module=node/agent node.id=f4c87y0fn18gq8jlmcjvn74f2
Oct 14 08:01:19 ip-10-44-2-202 kernel: br0: renamed from ov-001007-pwdcn
Oct 14 08:01:19 ip-10-44-2-202 kernel: br1: renamed from ov-001007-pwdcn
Oct 14 08:01:19 ip-10-44-2-202 kernel: ov-001007-pwdcn: renamed from br1
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.497912332Z" level=error msg="moving interface ov-001007-pwdcn to host ns failed, invalid argument, after config error error setting interface \"ov-001007-pwdcn\" IP to 20.0.7.1/24: cannot program address 20.0.7.1/24 in sandbox interface because it conflicts with existing route {Ifindex: 2 Dst: 20.0.7.0/24 Src: 20.0.7.1 Gw: <nil> Flags: [] Table: 254}"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.597860439Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"20.0.7.0/24\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=f4c87y0fn18gq8jlmcjvn74f2 service.id=sljpybp0vgh5bpf78uvwimztd task.id=xe7fihzdr1dco9o1req3eicdf
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598356846Z" level=error msg="failed adding service bindingfor 4566d70127e1a41749bda34b8b0588a7ffb51c5f61c58b524c1a2248a3774ea6 epRec:{hh_prod_mobile.1.o24aonnnvpea5z4xhtgmq781m hh_prod_mobile ksdvp0uyapg39npoqodcmspn6 20.0.7.8 20.0.7.175 [] [mobile] [b5778ef16d92] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 not found"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598539819Z" level=error msg="failed adding service bindingfor 4f0e4759b1eac4a1e5f8ce5b3aef3b86a362f3c62ebbaf0112e2dd20898df4bf epRec:{hh_prod_auth.1.loq5llju3sn267b2ybmbznkwd hh_prod_auth 0tpkzhq0rdyhsc0rfhs4wmsyh 20.0.7.103 20.0.7.4 [] [auth] [48e59cb68f4e] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 notfound"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.598811672Z" level=error msg="failed adding service bindingfor 65146bc83fc0fccd86ce38fb6de78ffd6a43dcd3274ef79561242f5e1448f0fe epRec:{hh_prod_backend-server.1.dp5da1ojjg8vucmt3n1ibrsz5 hh_prod_backend-server 0d9x9kmd34155liter9w6j0ml 20.0.7.2 20.0.7.186 [] [backend-server] [6c315162586e] false} err:networkpwdcn1sb3j4zwvhd1of3vudz6 not found"
Oct 14 08:01:19 ip-10-44-2-202 dockerd: time="2022-10-14T08:01:19.599159518Z" level=error msg="failed adding service bindingfor 94bacf56fda675999a4c863d1466bea62f4a8cbc2960a41a091452193433fc07 epRec:{hh_prod_adminui.1.siplb7h3zz9guzew1vr68y72f hh_prod_adminui 0whft1xsmxxe8tmarqhljpxy1 20.0.7.10 20.0.7.145 [] [adminui] [66dd4c3a8173] false} err:network pwdcn1sb3j4zwvhd1of3vudz6 not found"

@jerrac

jerrac commented Nov 30, 2022

I'm 90% sure I'm running into this issue. I see the error="network sandbox join failed: subnet sandbox join failed for \"172.20.15.0/24\": error creating vxlan interface: file exists" messages in my logs and I was able to fix a node by deleting the vx-* interface.

Docker 20.10.21 on Ubuntu 22.04 in swarm mode.

I'm also pretty sure it occurs when something prevents Docker from cleaning up after a docker rm <container> command. I have been testing my stack by running a script that randomly deletes a container or runs systemctl stop docker, and the vxlan error has shown up multiple times now.
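
A rough sketch of that kind of chaos loop (the intervals are placeholders):

while true; do
  if (( RANDOM % 2 )); then
    docker ps -q | shuf -n1 | xargs -r docker rm -f                        # remove a random running container
  else
    sudo systemctl stop docker && sleep 30 && sudo systemctl start docker  # bounce the daemon
  fi
  sleep $(( RANDOM % 60 + 30 ))
done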

@Ruppsn

Ruppsn commented Jan 16, 2023

We also encountered this today on a production setup. After a service failed and the restart policy kicked in, the service wasn't able to start anymore, with the error:

Rejected 10 seconds ago "network sandbox join failed: subnet sandbox join failed for "10.0.101.0/24": error creating vxlan interface: file exists

After a reboot, the devices under ls -l /sys/class/net/ | grep vx were gone and the services could start in a second.

Version: 20.10.22
containerd version: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Kernel Version: 5.4.17-2136.315.5.el8uek.x86_64

@t0mtaylor

t0mtaylor commented Jun 28, 2023

This works for me on CentOS - I'm calling it a lot from cron jobs to make sure the stale interfaces don't cause more stress!

I take no responsibility if this doesn't work or if it removes all of your vxlan interfaces, so be careful and test first!

/sbin/ip -o link show | awk -F': ' '$2 ~ /^vx-/ && /state DOWN/ {print $2}' | xargs -rn1 /sbin/ip link delete
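
For reference, the kind of cron entry this could live in (a sketch; the path and schedule are made up):

# /etc/cron.d/cleanup-stale-vxlan (hypothetical file), running every 10 minutes as root
*/10 * * * * root /sbin/ip -o link show | awk -F': ' '$2 ~ /^vx-/ && /state DOWN/ {print $2}' | xargs -rn1 /sbin/ip link delete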
