
containers fail to start correctly when networking options are passed on start #1258

Closed
errordeveloper opened this issue Aug 4, 2015 · 20 comments

@errordeveloper
Contributor

It turns out that version 1.18 of the Docker remote API allows a HostConfig attribute to be set when starting a container, and Kubernetes actually uses this.
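For reference, here is a minimal sketch of such a start request against remote API v1.18, talking directly to the Docker socket (this is not from the issue; the container ID and NetworkMode value are made up):

package main

import (
    "bytes"
    "context"
    "fmt"
    "net"
    "net/http"
)

func main() {
    // Dial the Docker daemon's unix socket; the host part of the URL
    // below is a placeholder and is ignored by this dialer.
    client := &http.Client{Transport: &http.Transport{
        DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
            return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
        },
    }}

    // Remote API v1.18 accepts a JSON-encoded HostConfig in the start
    // body "for backwards compatibility"; a NetworkMode given here takes
    // effect even though the container was created without one.
    body := []byte(`{"NetworkMode": "container:someOtherContainer"}`)
    resp, err := client.Post("http://docker/v1.18/containers/someContainerID/start",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status) // 204 No Content on success
}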

I added a few debug messages to the proxy, like this:

diff --git a/proxy/create_container_interceptor.go b/proxy/create_container_interceptor.go
index e6942a5..e737532 100644
--- a/proxy/create_container_interceptor.go
+++ b/proxy/create_container_interceptor.go
@@ -48,6 +48,9 @@ func (i *createContainerInterceptor) InterceptRequest(r *http.Request) error {
                return err
        }

+       Log.Debugf("%#v", container)
+       Log.Debugf("%#v", container.HostConfig)
+
        if cidrs, err := i.proxy.weaveCIDRsFromConfig(container.Config, container.HostConfig); err != nil {
                Log.Infof("Ignoring container due to %s", err)
        } else {
diff --git a/proxy/start_container_interceptor.go b/proxy/start_container_interceptor.go
index 7353c07..f3b8e9c 100644
--- a/proxy/start_container_interceptor.go
+++ b/proxy/start_container_interceptor.go
@@ -21,6 +21,9 @@ func (i *startContainerInterceptor) InterceptResponse(r *http.Response) error {
                return err
        }

+       Log.Debugf("%#v", container)
+       Log.Debugf("%#v", container.HostConfig)
+
        cidrs, err := i.proxy.weaveCIDRsFromConfig(container.Config, container.HostConfig)
        if err != nil {
                Log.Infof("Ignoring container %s due to %s", container.ID, err)

and below is what I got:

INFO: 2015/08/04 14:28:53.636380 POST /containers/create?name=k8s_master.1681ebfb_redis-master-djojn_default_148c42d8-3ab5-11e5-8b8d-080027308234_c5875ebf
DEBU: 2015/08/04 14:28:53.636577 proxy.createContainerRequestBody{Config:(*docker.Config)(0xc209112d80), HostConfig:(*docker.HostConfig)(nil), MacAddress:""}
DEBU: 2015/08/04 14:28:53.636592 (*docker.HostConfig)(nil)
INFO: 2015/08/04 14:28:53.636598 Creating container with WEAVE_CIDR ""

We have just created a container, expecting it to want to be attached to weave:

INFO: 2015/08/04 14:28:53.696899 GET /containers/json
DEBU: 2015/08/04 14:28:53.731191 Response from target: 200 OK map[Content-Length:[1183] Content-Type:[application/json] Date:[Tue, 04 Aug 2015 14:28:53 GMT]]
DEBU: 2015/08/04 14:28:53.745209 Response from target: 201 Created map[Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Length:[90] Content-Type:[application/json]]
INFO: 2015/08/04 14:28:53.763274 POST /containers/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb/start

We are now about to start it:

INFO: 2015/08/04 14:28:53.833493 GET /containers/json
DEBU: 2015/08/04 14:28:53.843715 Response from target: 200 OK map[Content-Type:[application/json] Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Length:[1562]]
DEBU: 2015/08/04 14:28:53.845850 &docker.Container{ID:"33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb", Created:time.Time{sec:63574295333, nsec:711737800, loc:(*time.Location)(0xf82480)}, Path:"/w/w", Args:[]string{"/entrypoint.sh", "redis-server"}, Config:(*docker.Config)(0xc209113200), State:docker.State{Running:true, Paused:false, Restarting:false, OOMKilled:false, Pid:5560, ExitCode:0, Error:"", StartedAt:time.Time{sec:63574295333, nsec:840536648, loc:(*time.Location)(0xf82480)}, FinishedAt:time.Time{sec:0, nsec:0, loc:(*time.Location)(0xf82480)}}, Image:"0ff407d5a7d9ed36acdf3e75de8cc127afecc9af234d05486be2981cdc01a38c", Node:(*docker.SwarmNode)(nil), NetworkSettings:(*docker.NetworkSettings)(0xc209645620), SysInitPath:"", ResolvConfPath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/resolv.conf", HostnamePath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/hostname", HostsPath:"/var/lib/docker/containers/d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6/hosts", Name:"/k8s_master.1681ebfb_redis-master-djojn_default_148c42d8-3ab5-11e5-8b8d-080027308234_c5875ebf", Driver:"overlay", Volumes:map[string]string{"/data":"/var/lib/docker/vfs/dir/2137d563501b2eef62c7ce701719056a99c9337d9e26b7c0ca30420192b243e4", "/dev/termination-log":"/var/lib/kubelet/pods/148c42d8-3ab5-11e5-8b8d-080027308234/containers/master/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb", "/hostproc":"/proc", "/var/run":"/run", "/w":"/var/lib/docker/vfs/dir/4c11cc5c5962cfa4b67e8ea74c0efa542ef05e686f98cc6b8ea756be191ad9a7"}, VolumesRW:map[string]bool{"/data":true, "/dev/termination-log":true, "/hostproc":false, "/var/run":false, "/w":false}, HostConfig:(*docker.HostConfig)(0xc209134b40), ExecIDs:[]string(nil), RestartCount:0, AppArmorProfile:""}
DEBU: 2015/08/04 14:28:53.845965 &docker.HostConfig{Binds:[]string{"/var/lib/kubelet/pods/148c42d8-3ab5-11e5-8b8d-080027308234/containers/master/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb:/dev/termination-log"}, CapAdd:[]string(nil), CapDrop:[]string(nil), ContainerIDFile:"", LxcConf:[]docker.KeyValuePair(nil), Privileged:false, PortBindings:map[docker.Port][]docker.PortBinding(nil), Links:[]string(nil), PublishAllPorts:false, DNS:[]string{"10.1.0.3", "10.0.2.3"}, DNSSearch:[]string{"default.svc.kube.local", "svc.kube.local", "kube.local"}, ExtraHosts:[]string(nil), VolumesFrom:[]string(nil), NetworkMode:"container:d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6", IpcMode:"container:d42839cdb477cec9decb0981cf972f40268b80ea3c0ca0a8ec5878baddef0cb6", PidMode:"", UTSMode:"", RestartPolicy:docker.RestartPolicy{Name:"", MaximumRetryCount:0}, Devices:[]docker.Device(nil), LogConfig:docker.LogConfig{Type:"json-file", Config:map[string]string(nil)}, ReadonlyRootfs:false, SecurityOpt:[]string(nil), CgroupParent:"", Memory:0, MemorySwap:-1, CPUShares:2, CPUSet:"", CPUQuota:0, CPUPeriod:0, Ulimits:[]docker.ULimit(nil)}

So now we are actually getting HostConfig.NetworkMode passed in, hence the container is being ignored. However, /w/w is already in place and waits for SIGUSR2...

INFO: 2015/08/04 14:28:53.845986 Ignoring container 33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb due to --net option
DEBU: 2015/08/04 14:28:53.845997 Response from target: 204 No Content map[Date:[Tue, 04 Aug 2015 14:28:53 GMT]]
INFO: 2015/08/04 14:28:53.855651 GET /containers/33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb/json
DEBU: 2015/08/04 14:28:53.857430 Calling weave[--local ps 33a47c80f51a37f574fc11921b7285839fd003380b01466fcfd24a25c17d37bb]
DEBU: 2015/08/04 14:28:53.916636 Response from target: 200 OK map[Date:[Tue, 04 Aug 2015 14:28:53 GMT] Content-Type:[application/json]]

The entrypoint was meant to be /entrypoint.sh redis-server, which is expected to turn into the redis-server process; instead, this is what appears in the process table:

core@core-03 ~ $ ps -ef | grep redis-server
root      5560  1038  0 14:28 ?        00:00:00 /w/w /entrypoint.sh redis-server
core     19344  6012  0 14:36 pts/0    00:00:00 grep --colour=auto redis-server
@errordeveloper
Contributor Author

The issue can be reproduced via the API, not the CLI. It's not Kubernetes-specific as such, IMO.

@rade
Member

rade commented Aug 4, 2015

Apparently, for POST /containers/(id)/start: "For backwards compatibility, this endpoint accepts a HostConfig as a JSON-encoded request body. See create a container for details."

If so, the proxy will need to cope with everything in HostConfig being specified on start. In particular, volume binds can be specified that way - though perhaps they are additive, rather than replacing what's there?

@errordeveloper
Contributor Author

Actually, it's pretty clear from reading dockertools.runContainer(): it sets up HostConfig between the CreateContainer() and StartContainer() calls.
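In sketch form, the call pattern is roughly the following, using fsouza/go-dockerclient (the library whose types appear in the debug output above); the name, image, and network mode here are illustrative, not actual kubelet code:

package main

import (
    "log"

    docker "github.com/fsouza/go-dockerclient"
)

func main() {
    client, err := docker.NewClient("unix:///var/run/docker.sock")
    if err != nil {
        log.Fatal(err)
    }

    // Create with only a Config: HostConfig is nil at this point, so the
    // proxy's create interceptor sees no networking options.
    container, err := client.CreateContainer(docker.CreateContainerOptions{
        Name:   "example",
        Config: &docker.Config{Image: "redis"},
    })
    if err != nil {
        log.Fatal(err)
    }

    // The HostConfig, including NetworkMode, only shows up at start time.
    err = client.StartContainer(container.ID, &docker.HostConfig{
        NetworkMode: "container:someOtherContainer",
    })
    if err != nil {
        log.Fatal(err)
    }
}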

@rade rade added the bug label Aug 4, 2015
@rade rade added this to the 1.0.2 milestone Aug 4, 2015
@rade rade changed the title docker allows networking options to be passed on start containers fail to start correctly when networking options are passed on start Aug 4, 2015
@errordeveloper
Contributor Author

I have traced docker and its children; here is the output, filtered for one of the hanging /w/w processes.

@errordeveloper
Contributor Author

I have looked at the docs; one should set the channel size to 1:

sig := make(chan os.Signal, 1)

but weavewait currently does usr2 := make(chan os.Signal)

We already do this in other places:

common/signals.go:      sigs := make(chan os.Signal, 1)
prog/sigproxy/main.go:  sc := make(chan os.Signal, 1)
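For comparison, a minimal sketch of the buffered variant, assuming weavewait registers for SIGUSR2 along these lines:

package main

import (
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Buffer of size 1, per the signal.Notify docs: delivery is a
    // non-blocking send, so an unbuffered channel can only receive the
    // signal if a goroutine is already blocked on it.
    usr2 := make(chan os.Signal, 1)
    signal.Notify(usr2, syscall.SIGUSR2)
    <-usr2 // wait for the proxy's SIGUSR2
    // ... exec the real entrypoint here ...
}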

@rade
Member

rade commented Aug 12, 2015

According to the signal.Notify docs,

Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate. For a channel used for notification of just one signal value, a buffer of size 1 is sufficient.

I am baffled that it says 1 and not 0. As I suspected, the signal delivery is done with a non-blocking send. But that should still succeed provided the receiver is ready, which should be the case - unless the signal arrives between calling signal.Notify and the <- receive, the window for which ought to be tiny in this case.

I am worried about another race though... what if the proxy manages to send the signal to the weavewait process before that has invoked signal.Notify? I suspect it will get stuck. Should be easy to check by inserting a sleep prior to signal.Notify.

Even then though, I would have thought that it is extremely unlikely to hit that case. weavewait should reach the signal.Notify way faster than Docker sending out the start reply, the proxy receiving it, invoking weave attach, and then docker kill.
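To deliberately widen that window, something like the following sketch should do (the sleep duration is arbitrary; send SIGUSR2, e.g. kill -USR2 <pid>, from another shell during the sleep):

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    fmt.Println("pid:", os.Getpid())

    // Any SIGUSR2 sent during this sleep arrives before signal.Notify
    // has registered the channel; the symptom seen in this issue is
    // that the receive below then blocks forever.
    time.Sleep(10 * time.Second)

    usr2 := make(chan os.Signal, 1)
    signal.Notify(usr2, syscall.SIGUSR2)
    <-usr2
    fmt.Println("got SIGUSR2")
}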

@errordeveloper
Contributor Author

I can see that the initial test was a false positive, I'm afraid.

Sounds like the proxy sending the signal too soon is the most likely explanation. Looking at the gist, I can see two rt_sigaction(SIGUSR2, ...) calls with SIGUSR2 being received in between; however, I'm not sure exactly how to interpret that.

@rade
Member

rade commented Aug 12, 2015

It would be good to get a trace with timings.

@rade
Member

rade commented Aug 12, 2015

It certainly looks like the SIGUSR2 arrives very shortly after the /w/w exec. This is mysterious. Can scheduling really be so extreme as to produce such outcomes regularly?

@rade
Member

rade commented Aug 12, 2015

Does this happen under load? E.g. are there perhaps lots of containers starting simultaneously?

@bboreham
Contributor

I suspect #1210 would improve this, as it doesn't rely on intercepting the start request to attach the network.

@errordeveloper
Contributor Author

Does this happen under load? E.g. are there perhaps lots of containers starting simultaneously?

Yes, it occurs when ~10 containers are being started on a potentially overcommitted VM...

@rade
Member

rade commented Aug 13, 2015

Ok, so perhaps the race I described in #1258 (comment) can indeed occur then.

That's a tricky one to fix. The following sequence would improve matters:

  1. set up signal notification
  2. check whether ethwe is there & up
  3. if not then wait for signal

The problem then is that SIGUSR2 might get delivered to the application process.

Perhaps we should go back to just checking for "ethwe up" in a loop. Or find a way to be notified when the interface comes up. netlink. sigh. That would also address #1209.
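A rough sketch of the polling approach (the interface name is from this thread; the poll interval and the exact up-check are assumptions, not weavewait code):

package main

import (
    "net"
    "time"
)

// waitForEthwe polls until the ethwe interface exists and is up,
// removing the dependency on a well-timed SIGUSR2 from the proxy.
func waitForEthwe() {
    for {
        iface, err := net.InterfaceByName("ethwe")
        if err == nil && iface.Flags&net.FlagUp != 0 {
            return
        }
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    waitForEthwe()
    // ... exec the real entrypoint ...
}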

@rade
Member

rade commented Aug 14, 2015

The proposed fix to #1300 - going back to waiting for the interface to come up, instead of waiting for a signal - will also fix this issue's original problem of containers getting started with different network settings than they were created with, provided that the new network settings are something like "container:otherContainer", where otherContainer has an ethwe interface. That's what I'd expect kubernetes to do, but my reading of their code does not reflect that - afaict the container is started with a network mode of "" or "host". I suspect I've missed something.

@rade
Member

rade commented Aug 14, 2015

"container:otherContainer" ... that's what I'd expect kubernetes to do

found it

@rade rade modified the milestones: current, 1.0.2 Aug 14, 2015
@paulbellamy
Contributor

Based on this, I don't immediately see why they couldn't pass those options to the container create instead of the start.

@rade
Member

rade commented Aug 14, 2015

I don't immediately see why they couldn't pass those options to the container create instead of the start.

correct.

@bboreham
Contributor

GitHub shows a comment "@rade modified the milestone: current, 1.0.2 3 days ago", but GitHub currently shows this issue as only being in 'current'. Should #1210 be targeting 1.0.2 in order to fix this?

@rade
Member

rade commented Aug 17, 2015

It's too intrusive a change for 1.0.2.

@bboreham bboreham self-assigned this Aug 24, 2015