EKS - Pods randomly losing network connectivity and a few other problems on AWS EKS #3446

Closed
jsalatiel opened this issue Mar 12, 2022 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/duplicate Indicates an issue is a duplicate of other open issue.

Comments

@jsalatiel

Describe the bug
I am trying to use Antrea to replace Calico on AWS, for network policies only. In other words, I want to keep using the AWS VPC CNI for pod IPs and use Antrea only for network policies.
Unfortunately, after a while, any newly scheduled pods have no working network. Pods that are already running are fine.
When that happens, the only way I can get new pods working again is to create a new node group, then drain and delete the old one.
I tried to capture the antrea-agent logs when I start a pod that gets no network connectivity. The log is full of:

E0310 20:50:50.380578       1 utils.go:164] Skipping invalid IP:
E0310 20:50:50.380583       1 utils.go:164] Skipping invalid IP:
E0310 20:50:50.380662       1 utils.go:164] Skipping invalid IP:
...

Trying to restart the antrea-agent results in a crash loop:

kubectl  logs pod/antrea-agent-mnpct  -c antrea-ovs
[2022-03-10T21:05:07Z INFO antrea-ovs]: Starting ovsdb-server
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * Enabling remote OVSDB managers
[2022-03-10T21:05:07Z INFO antrea-ovs]: Started ovsdb-server
[2022-03-10T21:05:07Z INFO antrea-ovs]: Starting ovs-vswitchd
[2022-03-10T21:05:07Z INFO antrea-ovs]: ovs-vswitchd set hw-offload to false
 * Starting ovs-vswitchd
 * Enabling remote OVSDB managers
[2022-03-10T21:05:07Z INFO antrea-ovs]: Started ovs-vswitchd
[2022-03-10T21:05:07Z INFO antrea-ovs]: Started the loop that checks OVS status every 30 seconds
hostname: Temporary failure in name resolution
----------
 kubectl  logs pod/antrea-agent-mnpct -c antrea-agent
I0310 21:07:59.914106       1 log_file.go:99] Set log file max size to 104857600
I0310 21:07:59.914661       1 agent.go:85] Starting Antrea agent (version v1.5.1)
I0310 21:07:59.914682       1 client.go:96] No kubeconfig file was specified. Falling back to in-cluster config
I0310 21:07:59.916057       1 client.go:96] No kubeconfig file was specified. Falling back to in-cluster config
I0310 21:07:59.916568       1 prometheus.go:171] Initializing prometheus metrics
I0310 21:07:59.916679       1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I0310 21:07:59.916945       1 agent.go:338] Setting up node network
F0311 21:07:59.926541       1 main.go:58] Error running agent: error initializing agent: failed to get local IPNet device with IP &{10.137.2.22 <nil>}: IPs of localIPs should be on the same device
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1021 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x37386a0, 0x3, {0x0, 0x0}, 0xc0003c8700, {0x2aa4c69, 0x1}, 0xc0007fad10, 0x0)
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:970 +0x569
k8s.io/klog/v2.(*loggingT).printf(0xc000464100, 0x5c5da8, {0x0, 0x0}, {0x0, 0x0}, {0x21b49a9, 0x17}, {0xc0007fad10, 0x1, ...})
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:751 +0x1d1
k8s.io/klog/v2.Fatalf(...)
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1509
main.newAgentCommand.func1(0xc000570dc0, {0xc0001b0900, 0x0, 0x8})
  /antrea/cmd/antrea-agent/main.go:58 +0x28e
github.com/spf13/cobra.(*Command).execute(0xc000570dc0, {0xc000130010, 0x8, 0x8})
  /go/pkg/mod/github.com/spf13/[email protected]/command.go:854 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc000570dc0)
  /go/pkg/mod/github.com/spf13/[email protected]/command.go:958 +0x3ad
github.com/spf13/cobra.(*Command).Execute(...)
  /go/pkg/mod/github.com/spf13/[email protected]/command.go:895
main.main()
  /antrea/cmd/antrea-agent/main.go:37 +0x4a

k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1164 +0x6a
created by k8s.io/klog/v2.init.0
  /go/pkg/mod/k8s.io/klog/[email protected]/klog.go:418 +0xfb

goroutine 115 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000514fa0, {0x2488720, 0xc00033cf60}, 0x1, 0xc00010e300)
  /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:167 +0x13b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x22655ee, 0x12a05f200, 0x0, 0xc5, 0x865da5)
  /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
  /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x865d26, 0xc0000fa180)
  /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:81 +0x28
created by k8s.io/component-base/logs.InitLogs
  /go/pkg/mod/k8s.io/[email protected]/logs/logs.go:58 +0x79

goroutine 91 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0xc00036ecc0)
  /go/pkg/mod/k8s.io/[email protected]/util/workqueue/queue.go:204 +0xa7
created by k8s.io/client-go/util/workqueue.newQueue
  /go/pkg/mod/k8s.io/[email protected]/util/workqueue/queue.go:62 +0x1af

goroutine 92 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc00036ee40)
  /go/pkg/mod/k8s.io/[email protected]/util/workqueue/delaying_queue.go:231 +0x34e
created by k8s.io/client-go/util/workqueue.newDelayingQueue
  /go/pkg/mod/k8s.io/[email protected]/util/workqueue/delaying_queue.go:68 +0x23b

goroutine 140 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0005995d0, 0x1)
  /usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x0)
  /usr/local/go/src/sync/cond.go:56 +0x8c
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*Synchronize).WaitError(0xc000138370)
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:119 +0x56
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1()
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:245 +0xaa5
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:167 +0x44f

goroutine 74 [IO wait]:
internal/poll.runtime_pollWait(0x7f26b81a0470, 0x72)
  /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000365700, 0xc0006b6c00, 0x0)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000365700, {0xc0006b6c00, 0x200, 0x200})
  /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000365700, {0xc0006b6c00, 0x7f26e0db3f18, 0x200})
  /usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000c4e78, {0xc0006b6c00, 0x1d8d7e0, 0xc00062fa01})
  /usr/local/go/src/net/net.go:183 +0x45
encoding/json.(*Decoder).refill(0xc0004d3cc0)
  /usr/local/go/src/encoding/json/stream.go:165 +0x17f
encoding/json.(*Decoder).readValue(0xc0004d3cc0)
  /usr/local/go/src/encoding/json/stream.go:140 +0xbb
encoding/json.(*Decoder).Decode(0xc0004d3cc0, {0x1cf9320, 0xc00062fa40})
  /usr/local/go/src/encoding/json/stream.go:63 +0x78
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).decodeWrapper(0xc0007e12c0, 0x0)
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:318 +0x65
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).loop(0xc0007e12c0)
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:335 +0x45
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1
  /go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:228 +0x9f4

goroutine 143 [chan receive]:
antrea.io/antrea/pkg/signals.RegisterSignalHandlers.func1()
  /antrea/pkg/signals/signals.go:38 +0x31
created by antrea.io/antrea/pkg/signals.RegisterSignalHandlers
  /antrea/pkg/signals/signals.go:37 +0x9d

goroutine 145 [syscall]:
os/signal.signal_recv()
  /usr/local/go/src/runtime/sigqueue.go:169 +0x98
os/signal.loop()
  /usr/local/go/src/os/signal/signal_unix.go:24 +0x19
created by os/signal.Notify.func1.1
  /usr/local/go/src/os/signal/signal.go:151 +0x2c

goroutine 72 [IO wait]:
internal/poll.runtime_pollWait(0x7f26b81a0388, 0x72)
  /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000622300, 0xc000848000, 0x0)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000622300, {0xc000848000, 0x235c, 0x235c})
  /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000622300, {0xc000848000, 0xc00084a18f, 0x1a})
  /usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0000c44a8, {0xc000848000, 0x6dcf99, 0xc0005cb7f0})
  /usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc00040a7c8, {0xc000848000, 0x0, 0x40b98d})
  /usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc000634cf8, {0x2486560, 0xc00040a7c8})
  /usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc000634a80, {0x24892c0, 0xc0000c44a8}, 0x1d2)
  /usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc000634a80, 0x0)
  /usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
  /usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc000634a80, {0xc0001a4000, 0x1000, 0xc0003fbd40})
  /usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
bufio.(*Reader).Read(0xc0003fbce0, {0xc0005723b8, 0x9, 0xc000292340})
  /usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x24863c0, 0xc0003fbce0}, {0xc0005723b8, 0x9, 0x9}, 0x9)
  /usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
  /usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0005723b8, 0x9, 0xc000580000}, {0x24863c0, 0xc0003fbce0})
  /go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc000572380)
  /go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0005cbfa0)
  /go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1813 +0x165
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0001adc80)
  /go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1735 +0x79
created by golang.org/x/net/http2.(*Transport).newClientConn
  /go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:699 +0xb45

I can't find a way to easily replicate this because it only starts to happen after a few days.
The only thing I know is that this crash loop is the same one I get whenever I try to update Antrea on EKS, which also only succeeds after creating new node groups and deleting the old ones.

Versions:

  • Antrea version (Docker image tag): 1.5.1
  • EKS: 1.21
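
For reference, the policy-only setup described above corresponds to the networkPolicyOnly traffic mode in the antrea-agent ConfigMap. A minimal sketch of the relevant setting (the antrea-eks.yml manifest is what actually sets this, so the surrounding fields may differ):

antrea-agent.conf: |
  # Policy-only mode: the AWS VPC CNI keeps handling pod IPAM and routing,
  # Antrea only enforces NetworkPolicies.
  trafficEncapMode: networkPolicyOnly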
jsalatiel added the kind/bug label on Mar 12, 2022
@jsalatiel
Author

jsalatiel commented Mar 16, 2022

Any ideas on this one?
The error appears to come from this:

F0311 21:07:59.926541       1 main.go:58] Error running agent: error initializing agent: failed to get local IPNet device with IP &{10.137.2.22 <nil>}: IPs of localIPs should be on the same device

@antoninbas
Contributor

@jsalatiel any chance you could SSH into a Node and run ip addr (or alternatively exec into a hostNetwork Pod and run ip addr, which may be easier on EKS and will yield the same output)?
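
For example, since the antrea-agent Pod already runs in the host network namespace, exec'ing into it should show the Node's interfaces directly. A rough sketch, reusing the agent Pod name from your logs (any antrea-agent Pod on the affected Node will do, assuming the image ships the ip tool):

kubectl -n kube-system exec antrea-agent-mnpct -c antrea-agent -- ip addr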

@jsalatiel
Author

Hi @antoninbas, I am not able to reproduce new pods randomly losing network connectivity, but I can easily replicate the crash loop just by trying to update Antrea from 1.5.0 to 1.5.1 (on a new test cluster). Maybe this relates to #3471.

# kubectl  logs antrea-agent-llf8t antrea-agent
I0317 21:37:27.352976       1 log_file.go:99] Set log file max size to 104857600
I0317 21:37:27.353576       1 agent.go:85] Starting Antrea agent (version v1.5.1)
I0317 21:37:27.353595       1 client.go:96] No kubeconfig file was specified. Falling back to in-cluster config
I0317 21:37:27.354894       1 client.go:96] No kubeconfig file was specified. Falling back to in-cluster config
I0317 21:37:27.355365       1 prometheus.go:171] Initializing prometheus metrics
I0317 21:37:27.355487       1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I0317 21:37:27.355726       1 agent.go:338] Setting up node network
F0317 21:37:27.373620       1 main.go:58] Error running agent: error initializing agent: failed to get local IPNet device with IP &{10.137.2.208 <nil>}: IPs of localIPs should be on the same device
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1021 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x37386a0, 0x3, {0x0, 0x0}, 0xc000463ab0, {0x2aa4c69, 0x1}, 0xc0005f6060, 0x0)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:970 +0x569
k8s.io/klog/v2.(*loggingT).printf(0xc000252100, 0x76fda8, {0x0, 0x0}, {0x0, 0x0}, {0x21b49a9, 0x17}, {0xc0005f6060, 0x1, ...})
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:751 +0x1d1
k8s.io/klog/v2.Fatalf(...)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1509
main.newAgentCommand.func1(0xc0005dc840, {0xc00062f180, 0x0, 0x8})
	/antrea/cmd/antrea-agent/main.go:58 +0x28e
github.com/spf13/cobra.(*Command).execute(0xc0005dc840, {0xc00004e0a0, 0x8, 0x8})
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:854 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005dc840)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:958 +0x3ad
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:895
main.main()
	/antrea/cmd/antrea-agent/main.go:37 +0x4a

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1164 +0x6a
created by k8s.io/klog/v2.init.0
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:418 +0xfb

goroutine 84 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000536fa0, {0x2488720, 0xc000719ef0}, 0x1, 0xc0000ae300)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:167 +0x13b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x22655ee, 0x12a05f200, 0x0, 0xc5, 0x865da5)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x865d26, 0xc00062e900)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:81 +0x28
created by k8s.io/component-base/logs.InitLogs
	/go/pkg/mod/k8s.io/[email protected]/logs/logs.go:58 +0x79

goroutine 45 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0xc00052d980)
	/go/pkg/mod/k8s.io/[email protected]/util/workqueue/queue.go:204 +0xa7
created by k8s.io/client-go/util/workqueue.newQueue
	/go/pkg/mod/k8s.io/[email protected]/util/workqueue/queue.go:62 +0x1af

goroutine 80 [IO wait]:
internal/poll.runtime_pollWait(0x7fb9140cb248, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000119180, 0xc0007f8000, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000119180, {0xc0007f8000, 0x1ebb, 0x1ebb})
	/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000119180, {0xc0007f8000, 0xc0007f8d1d, 0x1a})
	/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000114060, {0xc0007f8000, 0x6dcf99, 0xc0001577f0})
	/usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc00003a2d0, {0xc0007f8000, 0x0, 0x40b98d})
	/usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc00022a978, {0x2486560, 0xc00003a2d0})
	/usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc00022a700, {0x24892c0, 0xc000114060}, 0x11a3)
	/usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc00022a700, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc00022a700, {0xc0007ec000, 0x1000, 0xc0003370e0})
	/usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
bufio.(*Reader).Read(0xc000337080, {0xc0006503b8, 0x9, 0xc0007c2a20})
	/usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x24863c0, 0xc000337080}, {0xc0006503b8, 0x9, 0x9}, 0x9)
	/usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
	/usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0006503b8, 0x9, 0xc000215800}, {0x24863c0, 0xc000337080})
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc000650380)
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:492 +0x95
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc000157fa0)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1813 +0x165
golang.org/x/net/http2.(*ClientConn).readLoop(0xc00057a480)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1735 +0x79
created by golang.org/x/net/http2.(*Transport).newClientConn
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:699 +0xb45

goroutine 46 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc00052db00)
	/go/pkg/mod/k8s.io/[email protected]/util/workqueue/delaying_queue.go:231 +0x34e
created by k8s.io/client-go/util/workqueue.newDelayingQueue
	/go/pkg/mod/k8s.io/[email protected]/util/workqueue/delaying_queue.go:68 +0x23b

goroutine 110 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc00048e890, 0x1)
	/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x0)
	/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*Synchronize).WaitError(0xc000396b40)
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:119 +0x56
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1()
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:245 +0xaa5
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:167 +0x44f

goroutine 130 [IO wait]:
internal/poll.runtime_pollWait(0x7fb9140cb330, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0002e3c00, 0xc000454e00, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002e3c00, {0xc000454e00, 0x200, 0x200})
	/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0002e3c00, {0xc000454e00, 0x7fb93b73c108, 0x200})
	/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000114b40, {0xc000454e00, 0x1d8d7e0, 0xc000637201})
	/usr/local/go/src/net/net.go:183 +0x45
encoding/json.(*Decoder).refill(0xc00063a8c0)
	/usr/local/go/src/encoding/json/stream.go:165 +0x17f
encoding/json.(*Decoder).readValue(0xc00063a8c0)
	/usr/local/go/src/encoding/json/stream.go:140 +0xbb
encoding/json.(*Decoder).Decode(0xc00063a8c0, {0x1cf9320, 0xc0006372c0})
	/usr/local/go/src/encoding/json/stream.go:63 +0x78
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).decodeWrapper(0xc0006390e0, 0x0)
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:318 +0x65
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).loop(0xc0006390e0)
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:335 +0x45
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1
	/go/pkg/mod/github.com/!tom!code!l!v/[email protected]/pkg/ovsdb/client.go:228 +0x9f4

goroutine 113 [chan receive]:
antrea.io/antrea/pkg/signals.RegisterSignalHandlers.func1()
	/antrea/pkg/signals/signals.go:38 +0x31
created by antrea.io/antrea/pkg/signals.RegisterSignalHandlers
	/antrea/pkg/signals/signals.go:37 +0x9d

goroutine 115 [syscall]:
os/signal.signal_recv()
	/usr/local/go/src/runtime/sigqueue.go:169 +0x98
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:24 +0x19
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:151 +0x2c

Output of ip addr

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:5c:5b:49:8e:81 brd ff:ff:ff:ff:ff:ff
    inet 10.137.2.208/24 brd 10.137.2.255 scope global dynamic eth0
       valid_lft 2556sec preferred_lft 2556sec
    inet6 fe80::85c:5bff:fe49:8e81/64 scope link 
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 2e:ad:1f:e4:ad:ac brd ff:ff:ff:ff:ff:ff
4: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 9e:fc:2b:9b:df:f8 brd ff:ff:ff:ff:ff:ff
    inet 10.137.2.208/32 scope global antrea-gw0
       valid_lft forever preferred_lft forever
    inet6 fe80::9cfc:2bff:fe9b:dff8/64 scope link 
       valid_lft forever preferred_lft forever
8: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:b9:18:75:bc:e5 brd ff:ff:ff:ff:ff:ff
    inet 10.137.2.198/24 brd 10.137.2.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::8b9:18ff:fe75:bce5/64 scope link 
       valid_lft forever preferred_lft forever
9: eni6a98c9c4fcc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default 
    link/ether 22:3b:ba:0f:af:c2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::203b:baff:fe0f:afc2/64 scope link 
       valid_lft forever preferred_lft forever
10: eni5fdf8272e8d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default 
    link/ether 12:7e:e4:f5:f4:14 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::107e:e4ff:fef5:f414/64 scope link 
       valid_lft forever preferred_lft forever
11: eni5b85ac344f2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default 
    link/ether be:08:81:33:11:27 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::bc08:81ff:fe33:1127/64 scope link 
       valid_lft forever preferred_lft forever
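
If it helps, the node IP 10.137.2.208 shows up on both eth0 and antrea-gw0 above, which looks consistent with the "IPs of localIPs should be on the same device" error. A quick way to spot that duplication (just a diagnostic sketch):

ip -o -4 addr show | grep 10.137.2.208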

@antoninbas
Contributor

This seems to be a duplicate of #3217. That issue was fixed on the main branch. It would be great if you could try deploying the latest Antrea to confirm that it resolves your issue.

https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml

Unfortunately, it seems the fix was not back-ported to the 1.5 release branch, which is why release 1.5.1 also suffers from the issue. This is a mistake on our part; we should include the patch in 1.5.2.

antoninbas added the triage/duplicate label on Mar 17, 2022
@jsalatiel
Author

jsalatiel commented Mar 17, 2022

BTW, what's the correct way to update Antrea?
I tried kubectl apply -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml
but the antrea-agent pod crash-looped with:

 agent.go:1117] "Failed to patch Node annotation" err="nodes \"ip-10-137-1-66.ec2.internal\" is forbidden: User \"system:serviceaccount:kube-system:antrea-agent\" cannot patch resource \"nodes\" in API group \"\" at the cluster scope" key="node.antrea.io/mac-address" value="12:4f:c8:00:f4:19"
F0317 22:09:55.032756       1 main.go:58] Error running agent: error initializing agent: nodes "ip-10-137-1-66.ec2.internal" is forbidden: User "system:serviceaccount:kube-system:antrea-agent" cannot patch resource "nodes" in API group "" at the cluster scope

So I used kubectl replace -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml instead, restarted the deployment, and it worked.
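
Concretely, something along these lines (a sketch; restarting the antrea-agent DaemonSet is what matters for the crash loop, the controller Deployment restart is just for good measure):

kubectl replace -f https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml
kubectl -n kube-system rollout restart daemonset/antrea-agent deployment/antrea-controller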

This fixes the crash loop, thanks. Do you think it has anything to do with the new pods losing network connectivity?

@antoninbas
Contributor

This fixes the crash loop, thanks. Do you think it has anything to do with the new pods losing network connectivity?

I would think so. This crash happens after any agent restart, and there is no way to recover from it. After a few days, it is quite possible that the agent gets restarted for various reasons.

@jsalatiel
Author

I think I spoke too soon. I restarted one of the nodes and the agent is crash-looping again with:

"Failed to patch Node annotation" err="nodes \"ip-10-137-1-66.ec2.internal\" is forbidden: User \"system:serviceaccount:kube-system:antrea-agent\" cannot patch resource \"nodes\" in API group \"\" at the cluster scope" key="node.antrea.io/mac-address" value="12:4f:c8:00:f4:19"
F0317 22:40:46.268163       1 main.go:58] Error running agent: error initializing agent: nodes "ip-10-137-1-66.ec2.internal" is forbidden: User "system:serviceaccount:kube-system:antrea-agent" cannot patch resource "nodes" in API group "" at the cluster scope
goroutine 1 [running]:

@antoninbas
Contributor

This looks like a separate issue ...

@tnqn this last error looks like it could be related to #3393.
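
One quick sanity check for the RBAC side (assuming the default ServiceAccount name used by the manifest):

kubectl auth can-i patch nodes --as=system:serviceaccount:kube-system:antrea-agent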

@antoninbas
Contributor

@jsalatiel Actually it's pretty silly. The YAML manifest I pointed you to and that you applied (https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml) is primarily meant for development and has imagePullPolicy set to IfNotPresent... When you applied it, the new ServiceAccount definition was used with the old Antrea Docker image, which caused the error you observed.

If you replace all instances of imagePullPolicy: IfNotPresent in the YAML with imagePullPolicy: Always before re-applying it, that will take care of the error.
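
For example, something like this should do it in one shot (an untested sketch):

curl -sL https://raw.githubusercontent.com/antrea-io/antrea/main/build/yamls/antrea-eks.yml | sed 's/imagePullPolicy: IfNotPresent/imagePullPolicy: Always/g' | kubectl apply -f -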

@antoninbas
Contributor

@jsalatiel were you able to confirm that this is resolved with the latest Antrea image?

@jsalatiel
Author

Yes. Closing it.
