Error creating vxlan interface: file exists #1765
We're suffering from the same issue on RHEL7 with Docker 17.03-ee and can reproduce it by adding a service on a swarm node where the overlay network isn't active yet.
Do you guys have some logs to share? It would be super helpful to have a way to reproduce and grab logs with the engine in debug mode. Engine in debug:
Will collect more logging. Here's some debug output with the error message: https://gist.github.com/mpepping/50cb9b71b5535b318c6a548d4e8ba97b
@mpepping thanks, the error message is clear. The current suspect that I have is a race condition during the deletion of the sandbox that leaks the vxlan interface behind it. When a new container comes up, it tries to create the vxlan interface again, finds that there is already one, and errors out. The more interesting part of the logs now would be the block where the interface deletion is supposed to happen, to figure out why that is not happening properly.
I'm also already trying to reproduce it locally, but if you guys narrow down a specific set of steps that reproduce it with high probability, let me know.
@fcrisciani indeed it seems a race condition running into a locking issue. A breakdown of the steps, with debug output, is available at https://gist.github.com/mpepping/739e9a486b6c3266093a8af712869e90 . Basically, the following command set reproduces it for us, but the gist provides more detail:
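The actual commands live in the gist above; a rough sketch of that kind of reproduction loop (the network, service, and node names here are placeholders, not taken from the gist) would be:

```bash
# create an overlay network and a service pinned to a node that has not
# instantiated the network yet, then cycle it a few times to hit the race
docker network create --driver overlay repro-net
docker service create --name repro-svc --network repro-net \
  --constraint 'node.hostname == worker-1' nginx:alpine
docker service rm repro-svc
docker network rm repro-net
```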
Also, we're running into this issue on RHEL7 with Docker 17.03-ee on VMware vSphere virtual machines. We were thus far unable to reproduce the issue on VirtualBox or VMware Fusion, using the same stack. Our next steps would be to run another OS on VMware vSphere to reproduce the issue, and to debug the vxlan config.
Same problem here. Same scenario: multiple stacks deployed, each with its own network, after some
Using Docker4Azure
Alright, some extensive tests led to interesting results. We're running RHEL7 with docker-17.03-ee.
As per #562, you can correct this by running: sudo umount /var/run/docker/netns/*. Not sure if this is a long-term solution.
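For context (this is an addition, not part of the referenced issue), a way to see what that command is cleaning up before running it:

```bash
# stale network-namespace mounts left behind by the daemon show up here
grep /var/run/docker/netns /proc/mounts
# the workaround from #562, at your own risk on a live node
sudo umount /var/run/docker/netns/*
```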
@jgeyser that's a workaround to get out of the issue, but it is not a solution. We have to narrow down the root cause and fix it in the code.
@jgeyser this is not working for me. Sometimes I also get this issue. I have tried removing docker-engine and leaving the docker swarm, but it didn't work. Update:
We are encountering the same problem with the following configuration:
Exactly the same problem experienced, quite randomly.
@lnshi Care to share some details about your environment: OS, docker version, whether you're using virtualisation?
@sanimej Maybe I just misreported this; I just figured out that my actual problem is like this: I reported it in issue #33626. It is also a subnet overlap problem, but it seems to have a different cause. Can you help with that one as well? Thanks.
@dang3r @dcrystalj @discotroy If you are still having this issue, can you check if your host has any udev rules that might rename interface names that start with vx-? For overlay networks, the docker daemon creates a vxlan device with a name using that vx- prefix.
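A hedged way to check both things, assuming the usual udev rule locations (the exact device-name format was lost from the comment above, so the vx- prefix here is inferred from the rest of the thread):

```bash
# look for udev rules that could rewrite interface names (NAME= assignments)
grep -rn "NAME" /etc/udev/rules.d/ /lib/udev/rules.d/ 2>/dev/null | grep -i net
# list vxlan devices visible in the host namespace; overlay devices use a vx- prefix
ip -d link show type vxlan
```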
@mpepping Were you able to get the error message ("ERRO[0143] fatal task error error="starting container failed:...") to show up in the docker daemon logs? My swarm is in a state right now where several containers are in this condition. When I try to start one of the containers, the client returns an error message. I would like to forward all the daemon messages to splunk so that I can create an event recognizing when this condition occurs, so we can execute a workaround to keep people moving forward and validate that we aren't seeing it anymore once we get a fix.
@adpjay Messages with loglevel ERROR are logged by the daemon by default. Syslog should pick them up and send them to something like /var/log/messages.
@mpepping Thanks for responding. I see lots of messages for docker API GET calls in /var/log/messages, and when I run
@adpjay You should be able to see the error at the daemon level (not in the UCP container logging), e.g. by running something like the command below.
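The exact command was not preserved in this thread; a plausible equivalent, assuming the engine runs under systemd as docker.service, is:

```bash
# show daemon-level errors around the failure, independent of UCP's container logs
journalctl -u docker.service --since "1 hour ago" | grep -i "vxlan"
# or follow the daemon log live while reproducing the failure
journalctl -u docker.service -f
```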
I am getting this error on one of my nodes; I have 5 total. Any service trying to run on it will get this error:
I tried a 'docker system prune' and even rebooted the server, but it didn't fix it. Then someone mentioned it could be the network, and I thought that could be it because I had been heavily messing with the network while having issues with the encrypted network I created. I ended up creating a new non-encrypted network and using it for my services, abandoning the previous one.

I began to examine the network on my working nodes and noticed that the encrypted network I had been using was either removed or still listed. But on the node that wasn't working, the encrypted network was there showing a scope of local, unlike the others (not sure how or why it was changed to local).

Bad node:

Good nodes:

When I tried to remove the network on the bad node I received this message:
Which is why the 'docker system prune' couldn't remove it. I removed it by doing the following:
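The commenter's exact removal steps were not preserved; a sketch of one common way to clear a stale local-scope copy of a swarm network (the network name is an example) would be:

```bash
# confirm the scope mismatch: good nodes report "swarm", the bad node "local"
docker network inspect --format '{{.Name}}: {{.Scope}}' my-encrypted-net
# on the bad node, drop the stale local copy so swarm can recreate it on demand
docker network rm my-encrypted-net
# if removal is refused, unmounting the stale namespaces (see the earlier
# umount workaround) before retrying is the usual next step
```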
Then I created a service to run on that node, and it started working for me. This is my docker version:

Client:
Server:

Working on Digital Ocean - Ubuntu 16.04
Found a workaround for this issue, without the need to reboot or restart the docker daemon.
So once you know which vxlan id fails to be created (I did a strace of the docker daemon process, which is overkill for sure, but I was in a hurry), build a list of active network namespaces and their vxlans on the failing host. Now that you know the affected network namespace, double nsenter into it. After that, the error is gone. Pretty sure Docker Inc. knows about that workaround; why they don't share it is up to the imagination of the reader.
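A rough sketch of that namespace inspection, with placeholder namespace and interface names (the commenter's exact commands were not shared):

```bash
# list the daemon's network namespaces and the vxlan devices inside each one
for ns in /var/run/docker/netns/*; do
    echo "== $ns"
    nsenter --net="$ns" ip -d link show type vxlan
done
# once the namespace holding the conflicting vxlan id is found, enter it and
# delete the stale device so the daemon can recreate it
nsenter --net=/var/run/docker/netns/1-abcdef1234 ip link delete vx-001001-abcde
```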
I was getting this error on a docker swarm stack (docker v18.03) and finally removed the entire stack (
So far I have not been able to reproduce this locally. I've tried the steps described above and have also scripted them to run repeatedly. No dice so far. Will try with larger numbers of networks next. Having said that, while inspecting the code I definitely found several race conditions. I think that one in particular could cause this issue, but without a reproduction it's hard to prove. Will open a PR shortly.
Same issue here. Docker version:
Next time, can you check if you have any "vx-" interfaces on the host? If so, delete them; that worked for me. The correction that I propose comes from reading the code; I do not have the environment to test it.
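The commands that went with this comment were lost in formatting; the check and deletion would look roughly like this (the interface name is a placeholder; match it to the failing network):

```bash
# check the host namespace for leaked overlay vxlan devices
ip -d link show type vxlan
# delete the leaked device so the daemon can recreate it inside its namespace
sudo ip link delete vx-001002-3f7bk
```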
Ran into the same issue.
Same problem, Docker CE 18.09.4, using swarm. I deleted the ip links, but a simple docker service update did not restart it. I was able to do it using docker service update in conjunction with '--with-registry-auth' and '--force'.
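For reference, the update call described above would look like this (the service name is an example):

```bash
# force a redeploy of the affected service after the stale links are gone
docker service update --with-registry-auth --force mystack_myservice
```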
That resolved it for me on
Same happens to our environment:
I do this below:
This worked for me. Thanks, @fendo64!
@hexmode - I see you did this just 4 days ago. Do you have a set of steps that you followed? Did you leave your existing deployments running when you ran this command? Is this command dangerous in any way? Do we need to make sure we are deleting only the problem interfaces, and if so, how do we do that?
@beckyjmcdabq The steps are there: two commands. I really can't answer those questions since I'm only doing this on a development box for now. I only saw one interface. Is it safe? Well, if something goes wrong, you can redeploy.
@beckyjmcdabq essentially, if everything is correct, that check returns nothing. Only when I got the error this issue is all about did I ever see a result on any of my machines (double digits). I assume that the deletion of those interfaces is side-effect free, as they do not exist if the problem is not there. You could probably go all willy-nilly by running the command with xargs, I guess, but do so at your own risk:
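The one-liner referred to above was not preserved; an equivalent at-your-own-risk version would be something like:

```bash
# list leaked vx- devices in the host namespace; on a healthy node this prints nothing
ip -o link show | awk -F': ' '{sub(/@.*/, "", $2)} $2 ~ /^vx-/ {print $2}'
# pipe the same list through xargs to bulk-delete them
ip -o link show | awk -F': ' '{sub(/@.*/, "", $2)} $2 ~ /^vx-/ {print $2}' | xargs -r -n1 sudo ip link delete
```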
Happened to me on a single-node swarm on an Ubuntu 16.04.6 LTS host / 4.4.0-169-generic; tried with Docker 18.09.1, 18.09.9 and 19.03.7. @fendo64's trick worked for me (i.e.
Thanks for the solution! Found this issue on Docker version 19.03.5, build 633a0ea838.
You can find full information and "easy" resolution on docker. In brief:
Confirmed this is still occurring in Docker EE version 19.03.11, build 0da829ac52. Linked instructions here do indeed solve the problem for us.
Had the issue on 19.03.12. Removing the ip link fixed the issue too.
This overall fixed the problem, but it may be dangerous if the removed network is shared, i.e. serves as a traefik proxy...
This worked for me on Docker CE 19.03.12. Thanks @fendo64 for your fix.
If it's helpful to anybody else I can confirm that this solution also worked for me - I iterated through the list of devices and did:
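The loop itself was stripped from the comment; it presumably looked something like this (my reconstruction, deleting every vxlan device leaked into the host namespace):

```bash
# healthy overlay vxlan devices live inside the daemon's namespaces, so anything
# of type vxlan visible in the host namespace is assumed to be a leftover
for dev in $(ip -o link show type vxlan | awk -F': ' '{print $2}' | cut -d@ -f1); do
    sudo ip link delete "$dev"
done
```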
We were able to bring the cluster back to a happy state once this had been applied. Thank you very much for sharing the solution; it solved a big headache at the end of a very stressful day.
I am also having this issue. Removing the ip links did not work for me. What worked for me was to remove all services that depend on the overlay network, remove the network, and reboot. Then I re-created everything, but this time with a different network name.
I think we found the solution. Thank you.
Removing the IP links does fix the problem; however, please fix this permanently.
How about removing the node from the cluster and then letting it join again? Would that workaround work too?
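For what it's worth, that drain-and-rejoin approach (untested against this particular failure, and not confirmed in this thread) would go roughly like this:

```bash
# on a manager: stop scheduling work onto the affected node
docker node update --availability drain <node-name>
# on the affected node: leave the swarm
docker swarm leave
# on a manager: print the join command, then run it on the node to rejoin
docker swarm join-token worker
```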
Got the same issue suddenly in docker swarm mode (1 master + 1 worker) with docker version 20.10.17
Error message:
I'm 90% sure I'm running into this issue. I see the same "file exists" error with Docker 20.10.21 on Ubuntu 22.04 in swarm mode. I'm also pretty sure it occurs when something causes Docker to not be able to clean up after a
We also encountered this today on a production setup. After a service failed and the restart policy kicked in, the service wasn't able to start anymore, with the error: Rejected 10 seconds ago "network sandbox join failed: subnet sandbox join failed for "10.0.101.0/24": error creating vxlan interface: file exists". After a reboot, the devices under

Version: 20.10.22
This works for me on CentOS - calling this a lot from cron jobs to ensure they don't cause more stress! I take no responsibility if this doesn't work or if it removes all of your vxlan interfaces, so be careful and test first!
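The script itself was lost from the comment; a hypothetical reconstruction of that kind of cron-driven cleanup might be:

```bash
#!/usr/bin/env bash
# clean-leaked-vxlan.sh - remove vx- devices leaked into the host namespace.
# WARNING: as the commenter says, test first; this deletes every host-level
# interface whose name starts with vx-.
set -euo pipefail
ip -o link show type vxlan | awk -F': ' '{sub(/@.*/, "", $2)} $2 ~ /^vx-/ {print $2}' |
while read -r dev; do
    logger -t clean-vxlan "removing leaked vxlan device $dev"
    ip link delete "$dev"
done
# example crontab entry:
# */5 * * * * /usr/local/sbin/clean-leaked-vxlan.sh
```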
Previous related threads:
A comment at the current tail end of #945 recommends opening a new ticket. I couldn't find one opened by the original poster, so here we go.
I've been using swarm for the past couple of months and frequently hit upon this problem. I have a modest swarm (~8-9 nodes), all running Ubuntu 16.04, now with Docker 17.05-ce. There is not a great amount of container churn, but I do use a stack yaml file to deploy ~20 services across ~20 encrypted overlay networks.
I tend to find that after a couple of stack deploy / stack rm cycles, my containers get killed at startup with the "Error creating vxlan: file exists" error. This prevents the containers coming up on a host and forces them to attempt to relocate, which may or may not work. I have noted in the above issues that the problems are, several times over, thought to have been rectified, but they always creep back in for various users.
To rectify the issue, I have tried rebooting the node, restarting iptables, and removing and re-creating the stack, all of which work to varying degrees but are most definitely workarounds and not solutions.
I cannot think how I can attempt to reproduce this error, but if anyone wants to suggest ways to debug, I am at your service.