
containers fail to start correctly when networking options are passed on start #1258

Closed
errordeveloper opened this issue Aug 4, 2015 · 20 comments

@errordeveloper
Contributor

It turns out that version 1.18 of the Docker remote API allows a HostConfig attribute to be set when starting a container, and Kubernetes actually uses this.
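For reference, here is a minimal sketch of such a start request against remote API v1.18, talking directly to the Docker socket (this is not from the issue; the container ID and NetworkMode value are made up):

package main

import (
    "bytes"
    "context"
    "fmt"
    "net"
    "net/http"
)

func main() {
    // Dial the Docker daemon's unix socket; the host part of the URL
    // below is a placeholder and is ignored by this dialer.
    client := &http.Client{Transport: &http.Transport{
        DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
            return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
        },
    }}

    // Remote API v1.18 accepts a JSON-encoded HostConfig in the start
    // body "for backwards compatibility"; a NetworkMode given here takes
    // effect even though the container was created without one.
    body := []byte(`{"NetworkMode": "container:someOtherContainer"}`)
    resp, err := client.Post("http://docker/v1.18/containers/someContainerID/start",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status) // 204 No Content on success
}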

I added a few debug messages to the proxy, like this:

diff --git a/proxy/create_container_interceptor.go b/proxy/create_container_interceptor.go
index e6942a5..e737532 100644
--- a/proxy/create_container_interceptor.go
+++ b/proxy/create_container_interceptor.go
@@ -48,6 +48,9 @@ func (i *createContainerInterceptor) InterceptRequest(r *http.Request) error {
                return err
        }

+       Log.Debugf("%#v", container)
+       Log.Debugf("%#v", container.HostConfig)
+
        if cidrs, err := i.proxy.weaveCIDRsFromConfig(container.Config, container.HostConfig); err != nil {
                Log.Infof("Ignoring container due to %s", err)
        } else {
diff --git a/proxy/start_container_interceptor.go b/proxy/start_container_interceptor.go
index 7353c07..f3b8e9c 100644
--- a/proxy/start_container_interceptor.go
+++ b/proxy/start_container_interceptor.go
@@ -21,6 +21,9 @@ func (i *startContainerInterceptor) InterceptResponse(r *http.Response) error {
                return err
        }

+       Log.Debugf("%#v", container)
+       Log.Debugf("%#v", container.HostConfig)
+
        cidrs, err := i.proxy.weaveCIDRsFromConfig(container.Config, container.HostConfig)
        if err != nil {
                Log.Infof("Ignoring container %s due to %s", container.ID, err)

and below is what I got:

INFO: 2015/08/04 14:28:53.636380 POST /containers/create?name=k8s_master.1681ebfb_redis-master-djojn_default_148c42d8-3ab5-11e5-8b8d-080027308234_c5875ebf
DEBU: 2015/08/04 14:28:53.636577 proxy.createContainerRequestBody{Config:(*docker.Config)(0xc209112d80), HostConfig:(*docker.HostConfig)(nil), MacAddress:""}
DEBU: 2015/08/04 14:28:53.636592 (*docker.HostConfig)(nil)
INFO: 2015/08/04 14:28:53.636598 Creating container with WEAVE_CIDR ""

We have just created a container, expecting it to want to be attached to weave:

INFO: 2015/08/04 14:28:53.696899 GET /containers/json
DEBU: 2015/08/04 14:28:53.731191 Response from target: 200 OK map[Content-Length:[1183] Content-Type:[application/json] Date:[Tue, 04 Aug 2015 14:28:53 GMT]]
DEBU: 2015/08/04 14:28:53.745209 Response from target: 201 Created map[Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Length:[90] Content-Type:[application/json]]
INFO: 2015/08/04 14:28:53.763274 POST /containers/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb/start

We are now about to start it:

INFO: 2015/08/04 14:28:53.833493 GET /containers/json
DEBU: 2015/08/04 14:28:53.843715 Response from target: 200 OK map[Content-Type:[application/json] Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Length:[1562]]
DEBU: 2015/08/04 14:28:53.845850 &docker.Container{ID:"33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb", Created:time.Time{sec:63574295333, nsec:711737800, loc:(*time.Location)(0xf82480)}, Path:"/w/w", Args:[]string{"/entrypoint.sh", "redis-server"}, Config:(*docker.Config)(0xc209113200), State:docker.State{Running:true, Paused:false, Restarting:false, OOMKilled:false, Pid:5560, ExitCode:0, Error:"", StartedAt:time.Time{sec:63574295333, nsec:840536648, loc:(*time.Location)(0xf82480)}, FinishedAt:time.Time{sec:0, nsec:0, loc:(*time.Location)(0xf82480)}}, Image:"0ff407d5a7d9ed36acdf3e75de8cc127afecc9af234d05486be2981cdc01a38c", Node:(*docker.SwarmNode)(nil), NetworkSettings:(*docker.NetworkSettings)(0xc209645620), SysInitPath:"", ResolvConfPath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/resolv.conf", HostnamePath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/hostname", HostsPath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/hosts", Name:"/k8s_master.1681ebfb_redis-master-djojn_default_148c42d8-3ab5-11e5-8b8d-080027308234_c5875ebf", Driver:"overlay", Volumes:map[string]string{"/data":"/var/lib/docker/vfs/dir/2137d563501b2eef62c7ce701719056a99c9337d9e26b7c0ca30420192b243e4", "/dev/termination-log":"/var/lib/kubelet/pods/148c42d8-3ab5-11e5-8b8d-080027308234/containers/master/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb", "/hostproc":"/proc", "/var/run":"/run", "/w":"/var/lib/docker/vfs/dir/4c11cc5c5962cfa4b67e8ea74c0efa542ef05e686f98cc6b8ea756be191ad9a7"}, VolumesRW:map[string]bool{"/data":true, "/dev/termination-log":true, "/hostproc":false, "/var/run":false, "/w":false}, HostConfig:(*docker.HostConfig)(0xc209134b40), ExecIDs:[]string(nil), RestartCount:0, AppArmorProfile:""}
DEBU: 2015/08/04 14:28:53.845965 &docker.HostConfig{Binds:[]string{"/var/lib/kubelet/pods/148c42d8-3ab5-11e5-8b8d-080027308234/containers/master/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb:/dev/termination-log"}, CapAdd:[]string(nil), CapDrop:[]string(nil), ContainerIDFile:"", LxcConf:[]docker.KeyValuePair(nil), Privileged:false, PortBindings:map[docker.Port][]docker.PortBinding(nil), Links:[]string(nil), PublishAllPorts:false, DNS:[]string{"10.1.0.3", "10.0.2.3"}, DNSSearch:[]string{"default.svc.kube.local", "svc.kube.local", "kube.local"}, ExtraHosts:[]string(nil), VolumesFrom:[]string(nil), NetworkMode:"container:d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6", IpcMode:"container:d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6", PidMode:"", UTSMode:"", RestartPolicy:docker.RestartPolicy{Name:"", MaximumRetryCount:0}, Devices:[]docker.Device(nil), LogConfig:docker.LogConfig{Type:"json-file", Config:map[string]string(nil)}, ReadonlyRootfs:false, SecurityOpt:[]string(nil), CgroupParent:"", Memory:0, MemorySwap:-1, CPUShares:2, CPUSet:"", CPUQuota:0, CPUPeriod:0, Ulimits:[]docker.ULimit(nil)}

So now we are actually getting HostConfig.NetworkMode passed in, hence the container is being ignored. However, /w/w is already in place and waits for SIGUSR2...

INFO: 2015/08/04 14:28:53.845986 Ignoring container 33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb due to --net option
DEBU: 2015/08/04 14:28:53.845997 Response from target: 204 No Content map[Date:[Tue, 04 Aug 2015 14:28:53 GMT]]
INFO: 2015/08/04 14:28:53.855651 GET /containers/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb/json
DEBU: 2015/08/04 14:28:53.857430 Calling weave[--local ps 33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb]
DEBU: 2015/08/04 14:28:53.916636 Response from target: 200 OK map[Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Type:[application/json]]

The entrypoint was meant to be /entrypoint.sh redis-server, which is expected to turn into the redis-server process; instead, this is what appears in the process table:

core@core-03 ~ $ ps -ef | grep redis-server
root      5560  1038  0 14:28 ?        00:00:00 /w/w /entrypoint.sh redis-server
core     19344  6012  0 14:36 pts/0    00:00:00 grep --colour=auto redis-server
@errordeveloper
Contributor Author

The issue can be reproduced via the API, not the CLI. It's not Kubernetes-specific as such, IMO.

@rade
Member

rade commented Aug 4, 2015

Apparently, for POST /containers/(id)/start: "For backwards compatibility, this endpoint accepts a HostConfig as a JSON-encoded request body. See create a container for details."

If so, the proxy will need to cope with everything in HostConfig being specified on start. In particular, volume binds can be specified that way - though perhaps they are additive, rather than replacing what's there?

@errordeveloper
Contributor Author

Actually, it's pretty clear from reading dockertools.runContainer(): it sets up HostConfig between the CreateContainer() and StartContainer() calls.
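In sketch form, the call pattern is roughly the following, using fsouza/go-dockerclient (the library whose types appear in the debug output above); the name, image, and network mode here are illustrative, not actual kubelet code:

package main

import (
    "log"

    docker "github.com/fsouza/go-dockerclient"
)

func main() {
    client, err := docker.NewClient("unix:///var/run/docker.sock")
    if err != nil {
        log.Fatal(err)
    }

    // Create with only a Config: HostConfig is nil at this point, so the
    // proxy's create interceptor sees no networking options.
    container, err := client.CreateContainer(docker.CreateContainerOptions{
        Name:   "example",
        Config: &docker.Config{Image: "redis"},
    })
    if err != nil {
        log.Fatal(err)
    }

    // The HostConfig, including NetworkMode, only shows up at start time.
    err = client.StartContainer(container.ID, &docker.HostConfig{
        NetworkMode: "container:someOtherContainer",
    })
    if err != nil {
        log.Fatal(err)
    }
}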

@rade rade added the bug label Aug 4, 2015
@rade rade added this to the 1.0.2 milestone Aug 4, 2015
@rade rade changed the title docker allows networking options to be passed on start containers fail to start correctly when networking options are passed on start Aug 4, 2015
@errordeveloper
Contributor Author

I have traced docker and its children; here is the output, filtered for one of the hanging /w/w processes.

@errordeveloper
Contributor Author

I have looked at the docs; one should set the channel size to 1:

sig := make(chan os.Signal, 1)

but weavewait currently does usr2 := make(chan os.Signal)

We already do this in other places:

common/signals.go:      sigs := make(chan os.Signal, 1)
prog/sigproxy/main.go:  sc := make(chan os.Signal, 1)
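For comparison, a minimal sketch of the buffered variant, assuming weavewait registers for SIGUSR2 along these lines:

package main

import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Buffer of size 1, per the signal.Notify docs: delivery is a
    // non-blocking send, so an unbuffered channel can only receive the
    // signal if a goroutine is already blocked on it.
    usr2 := make(chan os.Signal, 1)
    signal.Notify(usr2, syscall.SIGUSR2)
    <-usr2 // wait for the proxy's SIGUSR2
    // ... exec the real entrypoint here ...
}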

@rade
Member

rade commented Aug 12, 2015

According to the signal.Notify docs,

Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate. For a channel used for notification of just one signal value, a buffer of size 1 is sufficient.

I am baffled that it says 1 and not 0. As I suspected, the signal delivery is done with a non-blocking send. But that should still succeed provided the receiver is ready, which should be the case - unless the signal arrives between calling signal.Notify and the <- receive, the window for which ought to be tiny in this case.

I am worried about another race though... what if the proxy manages to send the signal to the weavewait process before that has invoked signal.Notify? I suspect it will get stuck. Should be easy to check by inserting a sleep prior to signal.Notify.

Even then though, I would have thought that it is extremely unlikely to hit that case. weavewait should reach the signal.Notify way faster than Docker sending out the start reply, the proxy receiving it, invoking weave attach, and then docker kill.
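To deliberately widen that window, something like the following sketch should do (the sleep duration is arbitrary; send SIGUSR2, e.g. kill -USR2 <pid>, from another shell during the sleep):

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    fmt.Println("pid:", os.Getpid())

    // Any SIGUSR2 sent during this sleep arrives before signal.Notify
    // has registered the channel; the symptom seen in this issue is
    // that the receive below then blocks forever.
    time.Sleep(10 * time.Second)

    usr2 := make(chan os.Signal, 1)
    signal.Notify(usr2, syscall.SIGUSR2)
    <-usr2
    fmt.Println("got SIGUSR2")
}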

@errordeveloper
Contributor Author

I can see that the initial test was a false positive, I'm afraid.

Sounds like the proxy sending the signal too soon is the most likely explanation. Looking at the gist, I can see two rt_sigaction(SIGUSR2, ...) calls with SIGUSR2 being received in between; however, I'm not sure exactly how to interpret that.

@rade
Member

rade commented Aug 12, 2015

It would be good to get a trace with timings.

@rade
Member

rade commented Aug 12, 2015

It certainly looks like the SIGUSR2 arrives very shortly after the /w/w exec. This is mysterious. Can scheduling really be so extreme as to produce such outcomes regularly?

@rade
Member

rade commented Aug 12, 2015

Does this happen under load? E.g. are there perhaps lots of containers starting simultaneously?

@bboreham
Contributor

I suspect #1210 would improve this, as it doesn't rely on intercepting the start request to attach the network.

@errordeveloper
Contributor Author

Does this happen under load? E.g. are there perhaps lots of containers starting simultaneously?

Yes, it occurs when ~10 containers are being started on a potentially overcommitted VM...

@rade
Member

rade commented Aug 13, 2015

Ok, so perhaps the race I described in #1258 (comment) can indeed occur then.

That's a tricky one to fix. The following sequence would improve matters:

  1. set up signal notification
  2. check whether ethwe is there & up
  3. if not then wait for signal

The problem then is that SIGUSR2 might get delivered to the application process.

Perhaps we should go back to just checking for "ethwe up" in a loop. Or find a way to be notified when the interface comes up. netlink. sigh. That would also address #1209.
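A rough sketch of the polling approach (the interface name is from this thread; the poll interval and the exact up-check are assumptions, not weavewait code):

package main

import (
    "net"
    "time"
)

// waitForEthwe polls until the ethwe interface exists and is up,
// removing the dependency on a well-timed SIGUSR2 from the proxy.
func waitForEthwe() {
    for {
        iface, err := net.InterfaceByName("ethwe")
        if err == nil && iface.Flags&net.FlagUp != 0 {
            return
        }
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    waitForEthwe()
    // ... exec the real entrypoint ...
}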

@rade
Member

rade commented Aug 14, 2015

The proposed fix to #1300 - going back to waiting for the interface to come up, instead of waiting for a signal - will also fix this issue's original problem of containers getting started with different network settings than they were created with, provided that the new network settings are something like "container:otherContainer", where otherContainer has an ethwe interface. That's what I'd expect kubernetes to do, but my reading of their code does not reflect that - afaict the container is started with a network mode of "" or "host". I suspect I've missed something.

@rade
Member

rade commented Aug 14, 2015

"container:otherContainer" ... that's what I'd expect kubernetes to do

found it

@rade rade modified the milestones: current, 1.0.2 Aug 14, 2015
@paulbellamy
Contributor

Based on this, I don't immediately see why they couldn't pass those options to the container create instead of the start.

@rade
Member

rade commented Aug 14, 2015

I don't immediately see why they couldn't pass those options to the container create instead of the start.

correct.

@bboreham
Contributor

GitHub shows a comment "@rade modified the milestone: current, 1.0.2 3 days ago", but GitHub currently shows this issue as only being in 'current'. Should #1210 be targeting 1.0.2 in order to fix this?

@rade
Member

rade commented Aug 17, 2015

It's too intrusive a change for 1.0.2.

@bboreham bboreham self-assigned this Aug 24, 2015