Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port already allocated #1790

Open
saiwl opened this issue Jun 1, 2017 · 15 comments
Open

port already allocated #1790

saiwl opened this issue Jun 1, 2017 · 15 comments

Comments

@saiwl
Copy link

saiwl commented Jun 1, 2017

We met "port already allocated" problem in our docker environment. It always happened after docker-daemon restarts abnormally or machine restarts abnormally.
I read the related code, and found a possible bug about this.
In the source code, the process of creating a container using port mapping is like below:

  1. create container
  2. create sandbox
  3. create endpoint
  4. allocate ports
  5. update driver endpoint store
  6. join sandbox
  7. update sandbox store

And the restore process after the daemon restarts is like below:

  1. restore port mapping based on driver endpoint restore
  2. clean up sandbox based on sandbox store
  3. clean up endpoint based on endpoint store

In the creating process, if the docker daemon or the machine restarts abnormally between step 5 and step 7 which truely happened in our environment, after docker daemon restarts, the port mapping would be restored in step 1 of restore process and will not be released in step 2 of restore process because sandbox was not updated, which causing ports leak.

I have made a simple fix which has been tested in our environment. I will make a PR later. Looking forward to your suggestions!

@saiwl
Copy link
Author

saiwl commented Jun 9, 2017

ping @mavenugo

@fcrisciani
Copy link

@saiwl is this a case with live-restore enabled?

@fcrisciani
Copy link

@saiwl the problem that you describe makes sense to me. I actually created this PR #1805 with the objective to simplify a bit the logic there in the sandboxCleanup.

As you are correctly stating in the description it can happen that the driver and the sandbox stores go out of sync. The idea of my patch is to make the network the source of truth, so that we can reconstruct the sandbox endpoints directly from there. That simplify the logic and should ensure that we are not missing endpoints.
By chance do you have the possibility to give it a try?

@saiwl
Copy link
Author

saiwl commented Jun 13, 2017

@fcrisciani Thanks.
'live-restore' is disabled.
I read your patch, actually I almost did the same thing in my PR #1794 a few days ago. The network endpoints should be deleted from store after the driver endpoints, I agree on that, which was ignored in my last PR.
I have done some testing on our machines with my PR. All seems OK except I need to fix the endpoint deleting problem by your PR.

@bdeluca
Copy link

bdeluca commented Jun 13, 2017

I am hitting this issue also, but I can't use live-restore as I am in swarm mode.

testing both your PR's I will let you know how I go.

@fcrisciani
Copy link

Thanks @saiwl yes, I took the part where you fetch all the endpoints from the network, I also aggregated them by sandbox ID so to not iterate through them again later and also I removed the logic that was looping on the ones from the sandbox.
I made sure that the network is actually the only source of truth for the endpoints so that we can avoid multiple loops on different lists. The delete I guess is needed in case the driver has the endpoint saved in its store but the sandbox did not have the chance to receive it.

If @saiwl and @bdeluca can give it a try considering that you were seeing this issue would be a great validation for the patch itself and proceed further. This code path is kind of tricky so we want to avoid introducing some new issues. Thanks!

@bdeluca
Copy link

bdeluca commented Jun 13, 2017

Hi @fcrisciani so your patch doesn't appear to fix my issue,

My issue is
Establish a lot of containers in swarm mode. Reboot. wait.
After start up Containers that start not longer have ports mapped.

I followed a trail of similar issues to @saiwl last PR and I think this might be related.

But if you say this seems like a different issue I will disappear (my issue is trivial to replicate)

@fcrisciani
Copy link

@bdeluca
couple of questions:

  1. are you able to reproduce on single node or only multi node?
  2. "a lot of containers" means? can you give an indication about your test?
  3. is there a specific error message in the logs that can explain why the port is not exposed?

@bdeluca
Copy link

bdeluca commented Jun 13, 2017 via email

@bdeluca
Copy link

bdeluca commented Jun 13, 2017

  1. single node.
  2. the more containers I have the more likely it is to happen.

example
docker service create --name registry0 --constraint node.role==manager --publish 5000:5000 registry:2
docker service create --name registry1 --constraint node.role==manager --publish 5001:5000 registry:2
docker service create --name registry2 --constraint node.role==manager --publish 5002:5000 registry:2
......
docker service create --name registry10 --constraint node.role==manager --publish 5010:5000 registry:2

after is created first time you will be able to attack the ports on localhost 5000-5010.
reboot the machine.
ports will randomly not be available.

Note: I am just using the registry image as an example.

Mostly the swarm cluster isnt down but I was testing failure modes and discovered this one.

  1. I see
    address already in use
    port already in use.
    or just nothing, everything looks like it should be open but it is not.

@bdeluca
Copy link

bdeluca commented Jun 14, 2017

Some where some thing is very confused. swarm thinks things have other ip addresses than they do.

docker service inspect zz_grafana
          "VirtualIPs": [
                {
                    "NetworkID": "xg5ilscw3zfb1ul4pgm3apv9l",
                    "Addr": "10.255.0.7/16"
                }
            ]

docker service inspect zz-registry-browser

            "VirtualIPs": [
                {
                    "NetworkID": "xg5ilscw3zfb1ul4pgm3apv9l",
                    "Addr": "10.255.0.12/16"
                }
            ]


docker network inspect ingress


"Containers": {
    "14b62c140721ba7222467b3ee0112a366ab921b3cf5382093a03c45040ec0a7b": {
        "Name": "zz-registry-browser.1.vqd41eqpe0b6y9zwajf19p2e9",
        "EndpointID": "64cc796198e0f48634a30243ff64051dc3ba638089ededc33794bac9c55624eb",
        "MacAddress": "02:42:0a:ff:00:07",
        "IPv4Address": "10.255.0.7/16",
        "IPv6Address": ""
    },

    "cc70321aba5e2b5c9a9e7e21bd88990d4a57828da8a87413379f02c50b76055a": {
        "Name": "zz_grafana.1.qydyeoebramw6al5xmqtosp78",
        "EndpointID": "8b47b3438023a80d01b793190109422ac14fb84657efc79aebbcb6322c350c4f",
        "MacAddress": "02:42:0a:ff:00:04",
        "IPv4Address": "10.255.0.4/16",
        "IPv6Address": ""
    },
            "7580111e2e3b225d33c1748cf76752265896997bb07bc44930e81ac928305638": {
                "Name": "test_ssh.1.roy8f55ht9w3207fu7eel6c0v",
                "EndpointID": "a8f2374a188bb1be47d74d0a7d4902aaabb571aa131d4cdbcc9f1d93cc2c3b2f",
                "MacAddress": "02:42:0a:ff:00:0c",
                "IPv4Address": "10.255.0.12/16",
                "IPv6Address": ""
            },

@bdeluca
Copy link

bdeluca commented Jun 14, 2017

my simple example with the registry doesnt work because on every container port 5000 is open.

@neelaruban
Copy link

neelaruban commented Aug 30, 2017

@fcrisciani I am also hitting this issue even with live-restore option and also i am using always restart option for all the containers . this only happens when the system is rebooted .

I have given you the logs i receive below .

2017-08-30T15:05:39.455087+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.454162035+10:00" level=warning msg="Failed to allocate and map port 80-80: Bind for 0.0.0.0:80 failed: port is already allocated" 2017-08-30T15:05:39.548964+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.547939639+10:00" level=warning msg="Failed to allocate and map port 8080-8080: Bind for 0.0.0.0:8080 failed: port is already allocated" 2017-08-30T15:05:39.551442+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.550968169+10:00" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /opt/app/docker/containers/09d5818abf3d42153ada33af97faf255a48d3dac10afd00127f1bad04dac1656/shm: invalid argument" 2017-08-30T15:05:39.551701+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.551068223+10:00" level=error msg="Failed to start container 09d5818abf3d42153ada33af97faf255a48d3dac10afd00127f1bad04dac1656: driver failed programming external connectivity on endpoint tyk-gateway (1a41682df361402e2839e605b474195ad6e0d30b42c1a3b573600d63c32442b0): Bind for 0.0.0.0:80 failed: port is already allocated" 2017-08-30T15:05:39.577807+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.576520164+10:00" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /opt/app/docker/containers/8df7c5dd9381d22a5fe7b13f0080a60fb967d0bf18a5dd4b39e556e18bc8101e/shm: invalid argument" 2017-08-30T15:05:39.577954+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.576615265+10:00" level=error msg="Failed to start container 8df7c5dd9381d22a5fe7b13f0080a60fb967d0bf18a5dd4b39e556e18bc8101e: driver failed programming external connectivity on endpoint tyk_dashboard (49c544e72a5bb9dde135450610309aa8a381219350ec623d70f5121a94cfa18d): Bind for 0.0.0.0:8080 failed: port is already allocated" 2017-08-30T15:05:39.578100+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.576801111+10:00" level=info msg="Loading containers: done." 2017-08-30T15:05:39.631765+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.629799087+10:00" level=info msg="Daemon has completed initialization" 2017-08-30T15:05:39.632066+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.629865544+10:00" level=info msg="Docker daemon" commit=f5ec1e2 graphdriver=overlay version=17.03.2-ce 2017-08-30T15:05:39.642335+10:00 [localhost] dockerd: time="2017-08-30T15:05:39.641918337+10:00" level=info msg="API listen on /var/run/docker.sock"

@BSWANG
Copy link

BSWANG commented Dec 1, 2017

@fcrisciani @saiwl I get same issue on my environment. My reproduce script as follow:

~# docker run -d --restart always --net host --name etcd elcolio/etcd
ad1d4571a9da839aec622c7f25185e5cee8b4a293ce3ee44641416de6d17ba3e
~# echo '{"cluster-store":"etcd://127.0.0.1:2379"}' > /etc/docker/daemon.json
~# service docker restart
~# docker network create -d overlay testov
6e7703dac16997d82c8012269c1f2ed080bb63aee50a160cadf3a58240accc6b
~# docker run -itd --name test_1 -p 80:80 --network testov busybox
7c45e0c4d1b0e8e7413a35d098ae964ed88cd42023abddcd757cf6978b64db68
~# docker stop etcd # endpoint will delete fail from kv-store
etcd
~# docker rm -f test_1 # remove container when kv-store unavailable
test_1
~# docker start etcd 
etcd
~# docker run -itd --name test_2 -p 80:80 --network testov busybox
42186e3bdf2dcdbba467a0fc56e8acbb8656a7e001ed13fc4854641552df52d1
docker: Error response from daemon: container 42186e3bdf2dcdbba467a0fc56e8acbb8656a7e001ed13fc4854641552df52d1: endpoint join on GW Network failed: driver failed programming external connectivity on endpoint gateway_42186e3bdf2d (b935a883b3977e1ffef17c37e2f65e8ef453c5db2d5c725acec0913922991d72): Bind for 0.0.0.0:80 failed: port is already allocated.

The #1805 can not resolve this scene.

@louisburton
Copy link

Any chance of merging #1805 ? Would be great to address the original issue here, even if other problem scenarios remain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants