SharedInformer does not survive to an API server restart #2992
Comments
@akram you should see reconnects - Line 113 in b91fd7e
kubernetes-client/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/RequestConfig.java, Line 46 in 74cc63d

I believe we have hit a similar situation. If the relist operation fails because the API server is unavailable, it looks like no further reconnects are attempted.
Specifically, we see this in the logs: `ERROR [io.fab.kub.cli.dsl.int.WatchConnectionManager] (OkHttp https://172.30.0.1/...) Unhandled exception encountered in watcher event handler: java.util.concurrent.RejectedExecutionException: Error while doing ReflectorRunnable list`, where the root exception is a timeout.

Relates to: #2010
A `SharedInformer` created to watch `ImageStreams` or `BuildConfig` does not survive a k8s API server restart.

Please note that this is probably related to a bug in the k8s API, as I was able to reproduce the behaviour using the `oc` command. As a user of `oc`, if I restart the API server while watching `imagestreams`, I got the following error:

The same operation using `oc get secrets -w` does not fail.

In the kubernetes-client, this materializes as an `EOFException` caught in `io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener`, which does not restart the WebSocket but instead discards it from the manager. This is silent from the user's point of view.

As a possible fix, we could consider adding an `else` statement here: https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatcherWebSocketListener.java#L105
For an existing, already-started WebSocket it is still possible to get a null response, which may mean the WebSocket was started previously but has since become unavailable.
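The fix proposed above amounts to scheduling a reconnect instead of discarding the WebSocket from the manager. A minimal plain-Java sketch of such a reconnect loop with exponential backoff, in the spirit of the `watchReconnectInterval` / `watchReconnectLimit` settings in `RequestConfig` (all class and method names below are hypothetical illustrations, not the client's actual code):

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: retry a watch connection with exponential backoff
// until it succeeds or a reconnect limit is exhausted.
public class WatchReconnector {
    private final long baseIntervalMillis; // analogous to watchReconnectInterval
    private final int reconnectLimit;      // analogous to watchReconnectLimit (-1 = unlimited)

    public WatchReconnector(long baseIntervalMillis, int reconnectLimit) {
        this.baseIntervalMillis = baseIntervalMillis;
        this.reconnectLimit = reconnectLimit;
    }

    // Delay before the given (0-based) reconnect attempt: base * 2^attempt, capped.
    public long delayForAttempt(int attempt) {
        long delay = baseIntervalMillis << Math.min(attempt, 10);
        return Math.min(delay, 32_000L);
    }

    // Run the supplied connect action until it succeeds or the limit is hit,
    // instead of silently dropping the watch on the first failure.
    public boolean runWithReconnects(BooleanSupplier connect) throws InterruptedException {
        for (int attempt = 0; reconnectLimit < 0 || attempt <= reconnectLimit; attempt++) {
            if (connect.getAsBoolean()) {
                return true; // connected (or re-listed) successfully
            }
            Thread.sleep(delayForAttempt(attempt));
        }
        return false; // gave up after reconnectLimit attempts
    }

    public static void main(String[] args) throws InterruptedException {
        WatchReconnector r = new WatchReconnector(10, 5);
        // Simulate an API server that comes back on the third attempt.
        final int[] calls = {0};
        boolean ok = r.runWithReconnects(() -> ++calls[0] >= 3);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```

The point of the sketch is only that a failed relist or a closed WebSocket should feed back into a retry loop rather than terminate the informer silently.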
edit:
Discussing with the apiserver team, it seems this also impacts core objects, not only OpenShift-specific ones. In my test I was deleting only the OpenShift apiserver part, but the same error is then raised for any other objects as well.