
Need reliable health check for SharedInformer #3101

Closed
paulgeiger-eng opened this issue May 11, 2021 · 13 comments

@paulgeiger-eng

paulgeiger-eng commented May 11, 2021

We are using kubernetes-client version 5.2.1.

We observed an issue where our SharedInformer cache does not keep up with a new resource created in Kubernetes. In this scenario the SharedInformer initially synced correctly and had been running for a few days. The issue appears to originate in the cluster; possibly the cluster is somehow overwhelmed, although the system is not under heavy load. It has only happened in a live customer environment.

The issue appears to be triggered by the creation in Kubernetes of a resource of the same type the SharedInformer is watching. We see a socket/connection exception in the logs that appears to be related to the SharedInformer watch connection. The exception happens on an OkHttp thread and is logged by the fabric8 WatcherWebSocketListener.

The main issue is that the SharedInformer continues to function with stale data even after the exception, i.e. the SharedInformer cache does not get updated with the newly created resource. In fact, we have code in our application that calls the hasSynced method on each access of the SharedInformer cache. hasSynced must be returning true, because we have observed a different scenario where the informer fails to sync, and in that scenario our application fails.

It has been mentioned that SharedInformer has an isRunning method but it's not clear that that method is intended as a reliable health check.

We would like to have a reliable health check for SharedInformer that would detect this exception.
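For reference, a minimal sketch of the access pattern described above (the resource type and method shape are illustrative, not our exact code):

public List<Pod> listFromCache(SharedIndexInformer<Pod> informer) {
    // Only read from the informer cache after the initial sync has completed.
    if (!informer.hasSynced()) {
        throw new IllegalStateException("SharedInformer cache has not synced");
    }
    // The Indexer is the informer's local cache; list() returns its current contents.
    return informer.getIndexer().list();
}

The relevant log excerpt follows: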

Caused by: javax.net.ssl.SSLException: Socket closed
        at sun.security.ssl.Alert.createSSLException(Alert.java:127)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:353)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:296)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:291)
        at sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1522)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:965)
        at okio.Okio$2.read(Okio.java:140)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        ... 38 more
Caused by: java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:183)
        at java.net.SocketInputStream.read(SocketInputStream.java:140)
        at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
        at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:470)
        at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1308)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:949)
        ... 40 more
2021-05-05 14:52:03.621+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
2021-05-05 16:02:03.540+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
2021-05-05 16:07:08.656+0000 |  | ::: | OkHttp WebSocket htt |  |  WARN | io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener | Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 8 successful ping/pongs)
        at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:545)
        at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:529)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:834)
        at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:517)
        at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:192)
2021-05-05 16:07:09.723+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
@rohanKanojia
Member

@shawkins: Could you please share your thoughts on this? Do you think the newly introduced isRunning can serve as a health check?

@shawkins
Contributor

Hopefully this will be addressed in 5.4. With all of the simplifications that will likely be in that release, the isRunning method should be true once synced and stay true until stop is called.

See also #3018 #3048 #2993

My only remaining concern is the health of the watch itself: there's no option for setting a timeout when establishing the watch, and there seems to be an expectation that as long as it's running it's healthy. We're trying to validate that further with our project.

@manusa
Member

manusa commented May 24, 2021

Hi @paulgeiger-eng, 5.4.0 was released last week. Could you please confirm whether the issue persists or has been fixed?

@manusa manusa added the Waiting on feedback label May 24, 2021
@paulgeiger-eng
Author

Thanks @manusa. We will try it out.

I'm a bit confused by the earlier comments on this issue. Currently we are using the hasSynced method as a pre-check when accessing the informer cache. There is some discussion above about using isRunning, but then @shawkins mentioned some doubt about whether it would cover the health of the watch itself. I believe it is precisely an error on the watch connection that is causing our issue.

Is the expectation now that hasSynced will return false after a watch error? Should we update our code to incorporate isRunning as a health check?

@shawkins
Contributor

> Is the expectation now that hasSynced will return false after a watch error?

The 5.3 code will return hasSynced=false when a relist is performed, and will flip back to true once those items have been processed by the DeltaFIFO.

The 5.4 code removed the DeltaFIFO, so there is an even smaller window of time in which hasSynced is false.

I would like to refine things further with #3167

With the changes in that PR, hasSynced will return true after the first time the list operation completes, and will not flip back to false. This matches the behavior of the Go client.

That PR also adds an isWatching method, which will be false anytime the watch is down; that would be of interest to a health check. With the 5.3.1/5.4 changes, though, the watch should indefinitely try to re-connect, so isRunning can still be true while isWatching is false.

> Should we update our code to incorporate isRunning as a health check?

With the 5.4 refactorings, and especially with the changes in 5.5, I don't think there will be a case where isRunning returns false and stop has not been called.
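To illustrate how these could fit together once #3167 is in, a minimal sketch (the wrapping method is hypothetical, not part of the client API):

public boolean informerHealthy(SharedIndexInformer<?> informer) {
    // isRunning is expected to stay true from a successful run/start until stop is called.
    // hasSynced stays true once the first list operation completes.
    // isWatching reflects whether the underlying watch is currently established.
    return informer.isRunning() && informer.hasSynced() && informer.isWatching();
}

Keep in mind that isWatching can legitimately be false for short periods while the watch re-connects, so a health check built directly on it may need to tolerate brief outages.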

@paulgeiger-eng
Author

Thanks @shawkins. It sounds like the isWatching method would be the thing that we need.

The change you mentioned, that the watch will indefinitely try to re-connect, sounds like it will benefit us. However, I understand that until we have a way to detect when the watch is down, the main issue here is still unresolved, i.e. the potential still exists that we will retrieve stale data from the informer cache while the watch is reconnecting.

What we believe is happening is that the connection is going down from the cluster side. I'm not aware of any reason why the re-connect wouldn't work. Therefore if the re-connect happens quickly enough it could significantly reduce the likelihood of our stale data scenario. We will take up 5.4 and keep you posted on our observations.

For our requirements we need to reduce the chance of stale data to zero, which is why we need the isWatching method. I would like to keep this issue open until then. Is it expected for 5.5?

@shawkins
Contributor

> Is it expected for 5.5?

It should be. If anyone objects to the refinements of the Store interface, I'll separate those changes from the PR.

@paulgeiger-eng
Author

Thanks very much @shawkins. We will take it up when it's available.

@manusa manusa added this to the 5.5.0 milestone May 26, 2021
@manusa manusa removed the Waiting on feedback label May 26, 2021
@shawkins
Contributor

Linking to #3177, which describes a condition below the level of the informers where there's no indication that anything is wrong, but the watch is not functioning.

@shawkins
Contributor

shawkins commented Jun 23, 2021

Updated the resolution for 5.5 after #3269. The expectations for 5.5 are:

isRunning - will be true after a successful run call, and will stay true until stop is called. Informer watches are allowed unlimited attempts to resolve an http gone onClose(Exception). There is only one corner case in the code that can call onClose but not as an http gone exception. In practice I don't see that exception as possible - the logic for the http/websocket run is just creating the OkHttp object and associating a handler. In any case, if there is a problem, there should be a log and isRunning/isWatching will be false.

isWatching - will be true if the underlying watch is active. It will be false after the Watch reports onClose until the next Watch is established. For an http gone exception, isWatching will not become false until there is a problem establishing the new Watch. This is the closest to a health check - but it can still recover from false in normal operation given the retries of http gone (see the sketch below).

hasSynced - will be true after the first list operation completes.
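As a rough illustration of the isWatching caveat above, a sketch of a health check that tolerates short isWatching=false windows during re-connects (the grace period and field names are arbitrary):

private volatile long watchDownSince = -1;

public boolean isHealthy(SharedIndexInformer<?> informer, long graceMillis) {
    if (!informer.isRunning() || !informer.hasSynced()) {
        return false;
    }
    if (informer.isWatching()) {
        // Watch is up; clear any previously recorded outage.
        watchDownSince = -1;
        return true;
    }
    // Watch is down; only report unhealthy if it stays down longer than the grace period,
    // since re-connects (e.g. after an http gone) are expected in normal operation.
    if (watchDownSince < 0) {
        watchDownSince = System.currentTimeMillis();
    }
    return System.currentTimeMillis() - watchDownSince < graceMillis;
}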

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Jun 24, 2021
@shawkins shawkins reopened this Jun 24, 2021
@shawkins shawkins self-assigned this Jun 28, 2021
@manusa manusa closed this as completed in 3b53fe2 Jun 29, 2021
@paulgeiger-eng
Author

Thanks for providing the fix. We finally had time to take up a new kubernetes-client version and we are now on 5.7.0. Our health check is implemented as follows:

public boolean isHealthy() {
    final var resourceInformer = resourceHandler.getResourceInformer();
    if (!resourceInformer.isRunning()) {
        return false;
    }
    try {
        if (!resourceInformer.hasSynced()) {
            resourceHandler.waitForSync();
        }
        if (!resourceInformer.isWatching()) {
            resourceHandler.establishWatch();
        }
        return true;
    } catch (Exception e) {
        log.atSevere().withCause(e).log("Health check failed in CRD repository.");
        return false;
    }
}

The resourceHandler is an instance of our own class, and its waitForSync and establishWatch methods throw an exception on timeout while waiting for the corresponding informer method to return true.
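For anyone interested, a rough sketch of what such a waitForSync-style helper could look like (the timeout and polling interval are illustrative; our actual implementation differs in details):

public void waitForSync() throws InterruptedException, TimeoutException {
    final long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(30);
    // Poll hasSynced until it reports true or the timeout expires.
    while (!resourceHandler.getResourceInformer().hasSynced()) {
        if (System.currentTimeMillis() > deadline) {
            throw new TimeoutException("Timed out waiting for the informer cache to sync");
        }
        Thread.sleep(100);
    }
}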

This code is new and I'm not aware that we've had any health check failures so far. I will post again if there are any issues.

@paulgeiger-eng
Author

> Hi @paulgeiger-eng, 5.4.0 was released last week. Could you please confirm whether the issue persists or has been fixed?

Per my understanding, we have still been experiencing the issue since taking up 5.4.0. As mentioned in my previous comment, we have now taken up 5.7.0 and incorporated isWatching into our health check.

@shawkins
Contributor

I think there is still an issue with isWatching. See "Testing in 5.7 isWatching is not working as expected" #3484.
