Containers drop off bridge networks unexpectedly #258

Open
klutchell opened this issue Jul 7, 2021 · 17 comments
Comments

@klutchell
Contributor

klutchell commented Jul 7, 2021

Description

In support we have seen some recent cases where containers are removed from the bridge network unexpectedly.

Steps to reproduce the issue:

TBD

Describe the results you received:

  • balena inspect ${CONTAINER_ID} still shows the container attached to the expected network.
  • balena network inspect ${NETWORK_ID} does not list the container (see the sketch below).
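
A quick way to compare the two views on an affected device (a sketch; ${CONTAINER_ID} and ${NETWORK_ID} are placeholders for the real container and network names):

# The container's view: which networks it believes it is attached to
balena inspect --format '{{json .NetworkSettings.Networks}}' ${CONTAINER_ID}

# The network's view: which containers it actually lists
balena network inspect --format '{{json .Containers}}' ${NETWORK_ID}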

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Issue happens frequently since upgrading from v2.58.4 but can be resolved by restarting the container.
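
The restart workaround mentioned above, for reference (a sketch; ${CONTAINER_ID} is a placeholder):

balena restart ${CONTAINER_ID}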

Output of balena-engine version:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Mon Feb  1 20:12:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Mon Feb  1 20:12:05 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

Output of balena-engine info:

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 331
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.8.18-yocto-standard
 Operating System: balenaOS 2.68.1+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.691GiB
 Name: 40623112549f.videolink.io
 ID: ODX3:BQOU:LFIR:MXUE:WZX4:L37E:BY42:VT3K:6AZL:FDZI:RDVT:UQ2B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (device type, OS, etc.):

ID="balena-os"
NAME="balenaOS"
VERSION="2.68.1+rev1"
VERSION_ID="2.68.1+rev1"
PRETTY_NAME="balenaOS 2.68.1+rev1"
MACHINE="genericx86-64"
VARIANT="Production"
VARIANT_ID=prod
META_BALENA_VERSION="2.68.1"
RESIN_BOARD_REV="cd52766"
META_RESIN_REV="e658a4e"
SLUG="intel-nuc"
@jellyfish-bot

[klutchell] This issue has attached support thread https://jel.ly.fish/a8eee5b3-4cdd-47fd-aaf2-90443a47f2ab

@jellyfish-bot

[klutchell] This issue has attached support thread https://jel.ly.fish/d530e774-0d99-4a90-9c6a-d8646495246b

@20k-ultra
Contributor

I know this is the engine repo and not the Supervisor, but on devices running balenaOS the Supervisor manages the containers on the engine. In principle the Supervisor could be responsible for removing the network from the containers, but that seems unlikely because it would only do so if the target state changed. We can also confirm it's not the Supervisor: the Supervisor never detaches a network in place, it deletes the existing container and creates a new one. So if we can reproduce this issue, deploy your containers and note the created_at field of the container that has access to the network, perform the steps that cause the network to be removed, verify the container is no longer on the network, and then check whether created_at has changed.

The Supervisor logs would also indicate that it's going to recreate the container; recreating it is the only way the Supervisor would remove the network from a container.
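
A minimal check along those lines (a sketch; the creation time surfaces as .Created in balena inspect, and ${CONTAINER_ID} is a placeholder):

# Record the container's creation time before reproducing, then run the same
# command after the container has dropped off the network. An unchanged
# timestamp means the container was not recreated by the Supervisor.
balena inspect --format '{{.Name}} created {{.Created}}' ${CONTAINER_ID}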

@robertgzr
Contributor

This could be happening when a privileged / network: host container with NetworkManager inside is not configured to ignore veth* interfaces on the host.
Those veth interfaces map to containers and are connected to the bridge interfaces that form the "docker network".

A log line like the following means that NM released the container's veth from the bridge, dropping the container off the network:

NetworkManager[1016]: <info>  [1626122797.1098] device (vethe79bead): released from master device br-f43124f982ae

balenaOS' own NetworkManager is configured to ignore those veth interfaces.
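
For a user container that runs its own NetworkManager, a drop-in like the following should mark the host's veth interfaces as unmanaged (a sketch; the file name is illustrative):

# /etc/NetworkManager/conf.d/90-ignore-veth.conf (illustrative path)
[keyfile]
unmanaged-devices=interface-name:veth*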

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/950fec98-fb0b-440a-9c56-8e034833c7c0

@pipex

pipex commented Nov 19, 2021

This could be related to #261. I am looking at a device that is refusing to update with the error (HTTP code 404) -- no such container: sandbox when renaming a container, and the affected container is also missing from the default network.
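
One way to check whether a device is hitting the same failure mode (a sketch; the error string is taken from the message above and may appear in either the Supervisor's or the engine's journal):

journalctl --no-pager | grep -i 'no such container: sandbox'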

balena-engine version

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Wed Mar 17 07:23:03 2021
 OS/Arch:           linux/arm64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Wed Mar 17 07:23:03 2021
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

balena-engine info

Client:
 Debug Mode: false

Server:
 Containers: 5
  Running: 3
  Paused: 0
  Stopped: 2
 Images: 37
 Server Version: 19.03.13-dev
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.4.83-v8
 Operating System: balenaOS 2.73.1+rev1
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 962.2MiB
 Name: f9a83fc
 ID: EDQD:QP5A:T6RN:3W7Z:A7UF:GKSK:UTAM:ESZH:PSR6:XSIW:FFJD:QPXH
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

This happened to them on at least 10% of their fleet of 10,000 devices.

@shawaj

shawaj commented Nov 22, 2021

@pipex just FYI this seems to still be happening with every update to a number of devices.

Pushing an arbitrary fleet variable such as FOO=BAR to all services seems to let them get past this issue, but that's very annoying to have to do manually each time.

@deanMike

deanMike commented Dec 1, 2021

I'm also seeing this behavior randomly on about 10% of our devices in balena. We also have a privileged container in host networking mode that is used for configuring our devices' networking over Bluetooth. This container has the balena socket exposed, and when this occurs the socket is no longer accessible from that container. Another container that also has the docker/balena socket exposed (also privileged, but not in host networking mode) can still access the socket.
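
A quick reachability check from inside such a container (a sketch; the socket path depends on how it is mounted into the container, commonly /var/run/balena.sock when exposed via the io.balena.features.balena-socket label):

# Prints "OK" when the engine API is reachable over the mounted socket
curl --unix-socket /var/run/balena.sock http://localhost/_ping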

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/d9da1684-f2d8-4929-934a-7f738ee4a0da

@jellyfish-bot

[pdcastro] This issue has attached support thread https://jel.ly.fish/41b56e32-5fae-4a2e-b5bb-05f9f5af1f0f

@jellyfish-bot

[zwhitchcox] This issue has attached support thread https://jel.ly.fish/d215c693-4477-4359-b06e-e158be58e837

@jellyfish-bot

[lmbarros] This issue has attached support thread https://jel.ly.fish/3386a82e-c9a9-4a03-8774-b0e617761a22

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/6e8b31bc-cd9a-4d50-8e36-19ef200ded77

@cywang117

In the support ticket above, I observed that during device startup the engine restarts due to "start operation timed out. Terminating.". All other containers are recreated with this engine restart except one. This problematic container survives the restart and, following the restart, shows an IPAddress that conflicts with another container when running balena inspect.

As a result, when inspecting the network the container is not on it, but when inspecting the container it shows as being on the network, which is consistent with the observations originally made in this GitHub issue.

Anyone seeing this behavior in the future: please check for conflicting IPAddresses when running balena inspect, and report here whether a container that's not on the bridge network has an IPAddress that conflicts with another container on the bridge (see the sketch below).
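
A minimal sketch for spotting duplicate addresses across running containers:

# Prints "name: ip ..." for each running container; duplicate IPs indicate the conflict described above
for c in $(balena ps -q); do
  balena inspect --format '{{.Name}}: {{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' "$c"
done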

@lmbarros
Contributor

I saw another case of Engine timing out during startup. Increasing the startup timeout worked around the issue:

  1. Remount the root file system as read-write: mount -o remount,rw /
  2. Edit the balenaEngine service override file: systemctl edit balena.service
  3. Add the following to it (a verification sketch follows the list):
[Service]
TimeoutStartSec=15min
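
To confirm the override took effect (a sketch; systemctl edit reloads the unit definition after saving):

systemctl show balena.service -p TimeoutStartUSec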

I am planning to investigate further over the next days.

@lmbarros
Contributor

lmbarros commented Apr 1, 2022

I reviewed the 7 support tickets attached to this issue. We don't usually have all the data needed to check whether they were cases of Engine startup timeouts, but for 4 tickets (from two different users) there is strong evidence this was indeed the case. Two other tickets (from two other users) were harder to analyze and probably involved more than one issue -- but we noticed unexpected behavior after reboot (in one of them, it caught my attention that the application was using 11 containers, which could translate to a longer Engine startup time).

Still, very importantly, in one ticket we have good evidence that a container got dropped off the network without a reboot or Engine restart (the Engine had a 49-day uptime in that case).

So, startup timeouts seem to be a common cause of this issue, but not the only one.

@lmbarros
Contributor

balenaOS v2.98.4 and later should help with the cases in which this issue is triggered by Engine startup timeouts (see balena-os/meta-balena#2584). We have good evidence that there are still other ways to trigger this error, so we'll keep investigating.
