Containers drop off bridge networks unexpectedly #258

Open
klutchell opened this issue Jul 7, 2021 · 17 comments
Comments

@klutchell
Contributor

klutchell commented Jul 7, 2021

Description

In support we have seen some recent cases where containers are removed from the bridge network unexpectedly.

Steps to reproduce the issue:

TBD

Describe the results you received:

  • balena inspect ${CONTAINER_ID} still shows the container attached to the expected network.
  • balena network inspect ${NETWORK_ID} does not list the container (see the sketch below).
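
A quick way to compare the two views on an affected device (a sketch; ${CONTAINER_ID} and ${NETWORK_ID} are placeholders for the real container and network names):

# The container's view: which networks it believes it is attached to
balena inspect --format '{{json .NetworkSettings.Networks}}' ${CONTAINER_ID}

# The network's view: which containers it actually lists
balena network inspect --format '{{json .Containers}}' ${NETWORK_ID}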

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Issue happens frequently since upgrading from v2.58.4 but can be resolved by restarting the container.
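
The restart workaround mentioned above, for reference (a sketch; ${CONTAINER_ID} is a placeholder):

balena restart ${CONTAINER_ID}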

Output of balena-engine version:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Mon Feb  1 20:12:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Mon Feb  1 20:12:05 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

Output of balena-engine info:

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 331
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.8.18-yocto-standard
 Operating System: balenaOS 2.68.1+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.691GiB
 Name: 40623112549f.videolink.io
 ID: ODX3:BQOU:LFIR:MXUE:WZX4:L37E:BY42:VT3K:6AZL:FDZI:RDVT:UQ2B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (device type, OS, etc.):

ID="balena-os"
NAME="balenaOS"
VERSION="2.68.1+rev1"
VERSION_ID="2.68.1+rev1"
PRETTY_NAME="balenaOS 2.68.1+rev1"
MACHINE="genericx86-64"
VARIANT="Production"
VARIANT_ID=prod
META_BALENA_VERSION="2.68.1"
RESIN_BOARD_REV="cd52766"
META_RESIN_REV="e658a4e"
SLUG="intel-nuc"
@jellyfish-bot

[klutchell] This issue has attached support thread https://jel.ly.fish/a8eee5b3-4cdd-47fd-aaf2-90443a47f2ab

@jellyfish-bot

[klutchell] This issue has attached support thread https://jel.ly.fish/d530e774-0d99-4a90-9c6a-d8646495246b

@20k-ultra
Contributor

I know this is the engine repo and not the Supervisor, but on devices running balenaOS the Supervisor manages the containers on the engine. In principle the Supervisor could be responsible for removing the network from the containers, but that seems unlikely because it would only do so if the target state changed. We can also confirm it's not the Supervisor: the Supervisor never detaches a network in place, it deletes the existing container and creates a new one. So if we can reproduce this issue, deploy your containers and note the created_at field of the container that has access to the network, perform the steps that cause the network to be removed, verify the container is no longer on the network, and then check whether created_at has changed.

The Supervisor logs would also indicate that it's going to recreate the container; recreating it is the only way the Supervisor would remove the network from a container.
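
A minimal check along those lines (a sketch; the creation time surfaces as .Created in balena inspect, and ${CONTAINER_ID} is a placeholder):

# Record the container's creation time before reproducing, then run the same
# command after the container has dropped off the network. An unchanged
# timestamp means the container was not recreated by the Supervisor.
balena inspect --format '{{.Name}} created {{.Created}}' ${CONTAINER_ID}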

@robertgzr
Contributor

This could be happening when a privileged / network: host container with NetworkManager inside is not configured to ignore veth* interfaces on the host.
Those veth interfaces map to containers and are connected to the bridge interfaces that form the "docker network".

A log line like the following means that NM released the container's veth from the bridge, dropping the container off the network:

NetworkManager[1016]: <info>  [1626122797.1098] device (vethe79bead): released from master device br-f43124f982ae

balenaOS' own NetworkManager is configured to ignore those veth interfaces.
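
For a user container that runs its own NetworkManager, a drop-in like the following should mark the host's veth interfaces as unmanaged (a sketch; the file name is illustrative):

# /etc/NetworkManager/conf.d/90-ignore-veth.conf (illustrative path)
[keyfile]
unmanaged-devices=interface-name:veth*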

@jellyfish-bot

[pipex] This issue has attached support thread https://jel.ly.fish/950fec98-fb0b-440a-9c56-8e034833c7c0

@pipex

pipex commented Nov 19, 2021

This could be related to #261. I am looking at a device that is refusing to update with the error (HTTP code 404) -- no such container: sandbox when renaming a container, and the affected container is also missing from the default network.
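
One way to check whether a device is hitting the same failure mode (a sketch; the error string is taken from the message above and may appear in either the Supervisor's or the engine's journal):

journalctl --no-pager | grep -i 'no such container: sandbox'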

balena-engine version

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Wed Mar 17 07:23:03 2021
 OS/Arch:           linux/arm64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Wed Mar 17 07:23:03 2021
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

balena-engine info

Client:
 Debug Mode: false

Server:
 Containers: 5
  Running: 3
  Paused: 0
  Stopped: 2
 Images: 37
 Server Version: 19.03.13-dev
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.4.83-v8
 Operating System: balenaOS 2.73.1+rev1
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 962.2MiB
 Name: f9a83fc
 ID: EDQD:QP5A:T6RN:3W7Z:A7UF:GKSK:UTAM:ESZH:PSR6:XSIW:FFJD:QPXH
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

This happened to them on at least 10% of their fleet of 10,000 devices.

@shawaj

shawaj commented Nov 22, 2021

@pipex just FYI this seems to still be happening with every update to a number of devices.

Pushing an arbitrary fleet variable such as FOO=BAR to all services seems to let them get past this issue, but that's very annoying to have to do manually each time.

@deanMike

deanMike commented Dec 1, 2021

I'm also seeing this behavior randomly on about 10% of our devices in balena. We also have a privileged container in host networking mode that is used for configuring our devices' networking over Bluetooth. This container has the balena socket exposed, and when this occurs the socket is no longer accessible from that container. Another container that also has the docker/balena socket exposed (also privileged, but not in host networking mode) can still access the socket.
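
A quick reachability check from inside such a container (a sketch; the socket path depends on how it is mounted into the container, commonly /var/run/balena.sock when exposed via the io.balena.features.balena-socket label):

# Prints "OK" when the engine API is reachable over the mounted socket
curl --unix-socket /var/run/balena.sock http://localhost/_ping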

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/d9da1684-f2d8-4929-934a-7f738ee4a0da

@jellyfish-bot

[pdcastro] This issue has attached support thread https://jel.ly.fish/41b56e32-5fae-4a2e-b5bb-05f9f5af1f0f

@jellyfish-bot

[zwhitchcox] This issue has attached support thread https://jel.ly.fish/d215c693-4477-4359-b06e-e158be58e837

@jellyfish-bot

[lmbarros] This issue has attached support thread https://jel.ly.fish/3386a82e-c9a9-4a03-8774-b0e617761a22

@jellyfish-bot

[cywang117] This issue has attached support thread https://jel.ly.fish/6e8b31bc-cd9a-4d50-8e36-19ef200ded77

@cywang117

In the support ticket above, I observed that during device startup the engine restarts due to "start operation timed out. Terminating.". All other containers are recreated with this engine restart except one. This problematic container survives the restart and, following the restart, shows an IPAddress that conflicts with another container when running balena inspect.

As a result, when inspecting the network the container is not on it, but when inspecting the container it shows as being on the network, which is consistent with the observations originally made in this GitHub issue.

Anyone seeing this behavior in the future: please check for conflicting IPAddresses when running balena inspect, and report here whether a container that's not on the bridge network has an IPAddress that conflicts with another container on the bridge (see the sketch below).
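
A minimal sketch for spotting duplicate addresses across running containers:

# Prints "name: ip ..." for each running container; duplicate IPs indicate the conflict described above
for c in $(balena ps -q); do
  balena inspect --format '{{.Name}}: {{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' "$c"
done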

@lmbarros
Contributor

I saw another case of Engine timing out during startup. Increasing the startup timeout worked around the issue:

  1. Remount the root file system as read-write: mount -o remount,rw /
  2. Edit the balenaEngine service override file: systemctl edit balena.service
  3. Add the following to it (a verification sketch follows the list):
[Service]
TimeoutStartSec=15min
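
To confirm the override took effect (a sketch; systemctl edit reloads the unit definition after saving):

systemctl show balena.service -p TimeoutStartUSec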

I am planning to investigate further over the next days.

@lmbarros
Contributor

lmbarros commented Apr 1, 2022

I reviewed the 7 support tickets attached to this issue. We don't usually have all the data needed to check whether they were cases of Engine startup timeouts, but for 4 tickets (from two different users) there is strong evidence this was indeed the case. Two other tickets (from two other users) were harder to analyze and probably involved more than one issue -- but we noticed unexpected behavior after reboot (in one of them, it caught my attention that the application was using 11 containers, which could translate to a longer Engine startup time).

Still, very importantly, in one ticket we have good evidence that a container got dropped off the network without a reboot or Engine restart (the Engine had a 49-day uptime in that case).

So, startup timeouts seem to be a common cause of this issue, but not the only one.

@lmbarros
Contributor

balenaOS v2.98.4 and later should help with the cases in which this issue is triggered by Engine startup timeouts (see balena-os/meta-balena#2584). We have good evidence that there are still other ways to trigger this error, so we'll keep investigating.
