
Need reliable health check for SharedInformer #3101

Closed
paulgeiger-eng opened this issue May 11, 2021 · 13 comments

@paulgeiger-eng

paulgeiger-eng commented May 11, 2021

We are using kubernetes-client version 5.2.1.

We observed an issue where our SharedInformer cache does not keep up with a new resource created in Kubernetes. In this scenario the SharedInformer initially synced correctly and had been running for a few days. The issue appears to originate in the cluster; possibly the cluster is somehow overwhelmed, although the system is not under heavy load. It has only happened in a live customer environment.

The issue appears to be triggered by the creation in Kubernetes of a resource of the same type the SharedInformer is watching. We see a socket/connection exception in the logs that appears to be related to the SharedInformer watch connection. The exception happens on an OkHttp thread and is logged by the fabric8 WatcherWebSocketListener.

The main issue is that the SharedInformer continues to function with stale data even after the exception, i.e. the SharedInformer cache does not get updated with the newly created resource. In fact, we have code in our application that calls the hasSynced method on each access of the SharedInformer cache. hasSynced must be returning true, because we have observed a different scenario where the informer fails to sync, and in that scenario our application fails.

It has been mentioned that SharedInformer has an isRunning method but it's not clear that that method is intended as a reliable health check.

We would like to have a reliable health check for SharedInformer that would detect this exception.
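For reference, a minimal sketch of the access pattern described above (the resource type and method shape are illustrative, not our exact code):

public List<Pod> listFromCache(SharedIndexInformer<Pod> informer) {
    // Only read from the informer cache after the initial sync has completed.
    if (!informer.hasSynced()) {
        throw new IllegalStateException("SharedInformer cache has not synced");
    }
    // The Indexer is the informer's local cache; list() returns its current contents.
    return informer.getIndexer().list();
}

The relevant log excerpt follows: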

Caused by: javax.net.ssl.SSLException: Socket closed
        at sun.security.ssl.Alert.createSSLException(Alert.java:127)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:353)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:296)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:291)
        at sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1522)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:965)
        at okio.Okio$2.read(Okio.java:140)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        ... 38 more
Caused by: java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:183)
        at java.net.SocketInputStream.read(SocketInputStream.java:140)
        at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:476)
        at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:470)
        at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1308)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:949)
        ... 40 more
2021-05-05 14:52:03.621+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
2021-05-05 16:02:03.540+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
2021-05-05 16:07:08.656+0000 |  | ::: | OkHttp WebSocket htt |  |  WARN | io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener | Exec Failure
java.net.SocketTimeoutException: sent ping but didn't receive pong within 30000ms (after 8 successful ping/pongs)
        at okhttp3.internal.ws.RealWebSocket.writePingFrame(RealWebSocket.java:545)
        at okhttp3.internal.ws.RealWebSocket$PingRunnable.run(RealWebSocket.java:529)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:834)
        at com.oracle.svm.core.thread.JavaThreads.threadStartRoutine(JavaThreads.java:517)
        at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:192)
2021-05-05 16:07:09.723+0000 |  | ::: | OkHttp https://10.96 |  | ERROR | io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher | Watch closing
@rohanKanojia
Member

@shawkins: Could you please share your thoughts on this? Do you think the newly introduced isRunning can serve as a health check?

@shawkins
Contributor

Hopefully this will be addressed in 5.4. With all of the simplifications that will likely be in that release, the isRunning method should be true once synced and stay true until stop is called.

See also #3018 #3048 #2993

My only remaining concern is the health of the watch itself: there's no option for setting a timeout when establishing the watch, and there seems to be an expectation that as long as it's running it's healthy. We're trying to validate that further with our project.

@manusa
Member

manusa commented May 24, 2021

Hi @paulgeiger-eng, 5.4.0 was released last week. Could you please confirm whether the issue persists or has been fixed?

@manusa manusa added the Waiting on feedback label May 24, 2021
@paulgeiger-eng
Author

Thanks @manusa. We will try it out.

I'm a bit confused by the earlier comments on this issue. Currently we are using the hasSynced method as a pre-check when accessing the informer cache. There is some discussion above about using isRunning, but then @shawkins mentioned some doubt about whether it would cover the health of the watch itself. I believe it is precisely an error on the watch connection that is causing our issue.

Is the expectation now that hasSynced will return false after a watch error? Should we update our code to incorporate isRunning as a health check?

@shawkins
Contributor

> Is the expectation now that hasSynced will return false after a watch error?

The 5.3 code will return hasSynced=false when a relist is performed, and will flip back to true once those items have been processed by the DeltaFIFO.

The 5.4 code removed the DeltaFIFO, so there is an even smaller window of time in which hasSynced is false.

I would like to refine things further with #3167

With the changes in that PR, hasSynced will return true after the first time the list operation completes, and will not flip back to false. This matches the behavior of the Go client.

That PR also adds an isWatching method, which will be false anytime the watch is down; that would be of interest to a health check. With the 5.3.1/5.4 changes, though, the watch should indefinitely try to re-connect, so isRunning can still be true while isWatching is false.

> Should we update our code to incorporate isRunning as a health check?

With the 5.4 refactorings, and especially with the changes in 5.5, I don't think there will be a case where isRunning returns false and stop has not been called.
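To illustrate how these could fit together once #3167 is in, a minimal sketch (the wrapping method is hypothetical, not part of the client API):

public boolean informerHealthy(SharedIndexInformer<?> informer) {
    // isRunning is expected to stay true from a successful run/start until stop is called.
    // hasSynced stays true once the first list operation completes.
    // isWatching reflects whether the underlying watch is currently established.
    return informer.isRunning() && informer.hasSynced() && informer.isWatching();
}

Keep in mind that isWatching can legitimately be false for short periods while the watch re-connects, so a health check built directly on it may need to tolerate brief outages.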

@paulgeiger-eng
Author

Thanks @shawkins. It sounds like the isWatching method would be the thing that we need.

The change you mentioned, that the watch will indefinitely try to re-connect, sounds like it will benefit us. However, I understand that until we have a way to detect when the watch is down, the main issue here is still unresolved, i.e. the potential still exists that we will retrieve stale data from the informer cache while the watch is reconnecting.

What we believe is happening is that the connection is going down from the cluster side. I'm not aware of any reason why the re-connect wouldn't work. Therefore if the re-connect happens quickly enough it could significantly reduce the likelihood of our stale data scenario. We will take up 5.4 and keep you posted on our observations.

For our requirements we need to reduce the chance of stale data to zero, which is why we need the isWatching method. I would like to keep this issue open until then. Is it expected for 5.5?

@shawkins
Contributor

> Is it expected for 5.5?

It should be. If anyone objects to the refinements of the Store interface, I'll separate those changes from the PR.

@paulgeiger-eng
Author

Thanks very much @shawkins. We will take it up when it's available.

@manusa manusa added this to the 5.5.0 milestone May 26, 2021
@manusa manusa removed the Waiting on feedback label May 26, 2021
@shawkins
Contributor

Linking to #3177, which describes a condition below the level of the informers where there's no indication that anything is wrong, but the watch is not functioning.

@shawkins
Contributor

shawkins commented Jun 23, 2021

Updated the resolution for 5.5 after #3269. The expectations for 5.5 are:

isRunning - will be true after a successful run call, and will stay true until stop is called. Informer watches are allowed unlimited attempts to resolve an http gone onClose(Exception). There is only one corner case in the code that can call onClose but not as an http gone exception. In practice I don't see that exception as possible - the logic for the http/websocket run is just creating the OkHttp object and associating a handler. In any case, if there is a problem, there should be a log and isRunning/isWatching will be false.

isWatching - will be true if the underlying watch is active. It will be false after the Watch reports onClose until the next Watch is established. For an http gone exception, isWatching will not become false until there is a problem establishing the new Watch. This is the closest to a health check - but it can still recover from false in normal operation given the retries of http gone (see the sketch below).

hasSynced - will be true after the first list operation completes.
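As a rough illustration of the isWatching caveat above, a sketch of a health check that tolerates short isWatching=false windows during re-connects (the grace period and field names are arbitrary):

private volatile long watchDownSince = -1;

public boolean isHealthy(SharedIndexInformer<?> informer, long graceMillis) {
    if (!informer.isRunning() || !informer.hasSynced()) {
        return false;
    }
    if (informer.isWatching()) {
        // Watch is up; clear any previously recorded outage.
        watchDownSince = -1;
        return true;
    }
    // Watch is down; only report unhealthy if it stays down longer than the grace period,
    // since re-connects (e.g. after an http gone) are expected in normal operation.
    if (watchDownSince < 0) {
        watchDownSince = System.currentTimeMillis();
    }
    return System.currentTimeMillis() - watchDownSince < graceMillis;
}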

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Jun 24, 2021
@shawkins shawkins reopened this Jun 24, 2021
@shawkins shawkins self-assigned this Jun 28, 2021
@manusa manusa closed this as completed in 3b53fe2 Jun 29, 2021
@paulgeiger-eng
Author

Thanks for providing the fix. We finally had time to take up a new kubernetes-client version and we are now on 5.7.0. Our health check is implemented as follows:

public boolean isHealthy() {
    final var resourceInformer = resourceHandler.getResourceInformer();
    if (!resourceInformer.isRunning()) {
        return false;
    }
    try {
        if (!resourceInformer.hasSynced()) {
            resourceHandler.waitForSync();
        }
        if (!resourceInformer.isWatching()) {
            resourceHandler.establishWatch();
        }
        return true;
    } catch (Exception e) {
        log.atSevere().withCause(e).log("Health check failed in CRD repository.");
        return false;
    }
}

The resourceHandler is an instance of our own class, and its waitForSync and establishWatch methods throw an exception on timeout while waiting for the corresponding informer method to return true.
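For anyone interested, a rough sketch of what such a waitForSync-style helper could look like (the timeout and polling interval are illustrative; our actual implementation differs in details):

public void waitForSync() throws InterruptedException, TimeoutException {
    final long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(30);
    // Poll hasSynced until it reports true or the timeout expires.
    while (!resourceHandler.getResourceInformer().hasSynced()) {
        if (System.currentTimeMillis() > deadline) {
            throw new TimeoutException("Timed out waiting for the informer cache to sync");
        }
        Thread.sleep(100);
    }
}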

This code is new and I'm not aware that we've had any health check failures so far. I will post again if there are any issues.

@paulgeiger-eng
Author

> Hi @paulgeiger-eng, 5.4.0 was released last week. Could you please confirm whether the issue persists or has been fixed?

Per my understanding, we have still been experiencing the issue since taking up 5.4.0. As mentioned in my previous comment, we have now taken up 5.7.0 and incorporated isWatching into our health check.

@shawkins
Contributor

I think there is still an issue with isWatching. See "Testing in 5.7 isWatching is not working as expected" #3484.
