Need reliable health check for SharedInformer #3101
@shawkins: Could you please share your thoughts on this? Do you think the newly introduced isRunning can serve as a health check?
Hopefully this should be addressed in 5.4. With all of the simplifications that will likely be in for that release, the isRunning method should be true once synced until stop is called. My only remaining concern is about the health of the watch itself - there's no option for setting the timeout when establishing the watch, and there seems to be an expectation that as long as it's running it's healthy. We're trying to validate that further with our project.
Hi @paulgeiger-eng, 5.4.0 was released last week. Could you please confirm whether the issue persists or has been fixed?
Thanks @manusa. We will try it out. I'm a bit confused by the earlier comments on this issue. Currently we are using the hasSynced method as a pre-check when accessing the informer cache. There is some discussion above about using isRunning, but then @shawkins mentioned some doubt about whether it would check the health of the watch itself. I believe that it is precisely an error on the watch connection that is causing our issue. Is the expectation now that hasSynced will return false after a watch error? Should we update our code to incorporate isRunning as a health check?
The 5.3 code returns hasSynced=false while a relist is performed, and flips back to true once those items have been processed by the DeltaFIFO. The 5.4 code removed the DeltaFIFO, so there is an even smaller window of time in which hasSynced is false. I would like to refine things further with #3167. With the changes in that PR, hasSynced will return true after the first time the list operation completes, and will not flip back to false. This matches the behavior of the Go client. That PR also adds an isWatching method that will be false any time the watch is down - that would be of interest to a health check. With the 5.3.1/5.4 changes, though, the watch should indefinitely try to re-connect, so isRunning can still be true when isWatching is false.
With the 5.4 refactorings, and especially with the changes in 5.5, I don't think there will be a case where isRunning returns false and stop has not been called.
Thanks @shawkins. It sounds like the isWatching method would be the thing we need. The change you mentioned, that the watch will indefinitely try to re-connect, sounds like it will benefit us. However, I understand that until we have a way to detect when the watch is down, the main issue here is still unresolved, i.e. the potential still exists that we will retrieve stale data from the informer cache while the watch is reconnecting. What we believe is happening is that the connection is going down from the cluster side. I'm not aware of any reason why the re-connect wouldn't work, so if the re-connect happens quickly enough it could significantly reduce the likelihood of our stale-data scenario. We will take up 5.4 and keep you posted on our observations. For our requirements we need to get the chance of stale data to zero, which requires the isWatching method. I would like to keep this issue open until then. Is it expected for 5.5?
It should be. If anyone objects to the refinements of the Store interface, I'll separate those changes from the PR.
Thanks very much @shawkins. We will take it up when it's available.
Linking to #3177, which is a condition below the level of the informers such that there is no indication that anything is wrong, but the watch is not functioning.
Updated the resolution for 5.5 after #3269. The expectations for 5.5 are:

isRunning - will be true after a successful run call, and will stay true until stop is called. Informer watches are allowed unlimited attempts to resolve an HTTP Gone onClose(Exception). There is only one corner case in the code (Line 144 in 1f5061f).

isWatching - will be true if the underlying watch is active, and false from the time the Watch reports onClose until the next Watch is established. For an HTTP Gone exception, isWatching will not become false until there is a problem establishing the new Watch. This is the closest thing to a health check - but it can still recover from false in normal operation, given the retries of HTTP Gone.

hasSynced - will be true after the first list operation completes.
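To make those 5.5 semantics concrete, here is a minimal sketch of a combined health predicate, assuming a fabric8 SharedIndexInformer; the Pod resource type and the InformerHealth wrapper are illustrative, not from this thread:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public final class InformerHealth {

    // Healthy only when the informer has completed its initial list,
    // has not been stopped, and the underlying watch is currently active.
    // Per the semantics above, isWatching() is the closest to a liveness
    // signal, but it can dip to false transiently during watch retries.
    public static boolean isHealthy(SharedIndexInformer<Pod> informer) {
        return informer.isRunning()   // true from a successful run() until stop()
            && informer.hasSynced()   // true after the first list completes
            && informer.isWatching(); // false while the watch is down
    }
}
```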
Thanks for providing the fix. We finally had time to take up a new kubernetes-client version and we are now on 5.7.0. Our health check is implemented as follows:
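A minimal sketch of what such a health check might look like, assuming a fabric8 SharedIndexInformer; the ResourceHandler class below is reconstructed from the description in the next paragraph, and the polling interval and timeout handling are illustrative:

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class ResourceHandler {

    private final SharedIndexInformer<Pod> informer;
    private final long timeoutMs;

    public ResourceHandler(SharedIndexInformer<Pod> informer, long timeoutMs) {
        this.informer = informer;
        this.timeoutMs = timeoutMs;
    }

    // Throws if the informer cache has not completed its initial sync
    // within the timeout.
    public void waitForSync() throws TimeoutException, InterruptedException {
        awaitTrue(informer::hasSynced, "informer did not sync in time");
    }

    // Throws if the underlying watch is not active within the timeout;
    // isWatching() is false while the watch is down.
    public void establishWatch() throws TimeoutException, InterruptedException {
        awaitTrue(informer::isWatching, "watch was not established in time");
    }

    // Polls the supplied condition until it returns true or the timeout elapses.
    private void awaitTrue(BooleanSupplier condition, String message)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeoutException(message);
            }
            Thread.sleep(100L);
        }
    }
}
```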
The resourceHandler is an instance of our own class, and the waitForSync and establishWatch methods throw an exception on timeout while waiting for the corresponding informer method to return true. This code is new, and I'm not aware of any health check failures so far. I will post again if there are any issues.
Per my understanding, we have still been experiencing the issue since taking up 5.4.0. As mentioned in my previous comment, we have now taken up 5.7.0 and incorporated isWatching into our health check.
I think there is still an issue with isWatching. Testing on 5.7 shows that isWatching is not working as expected: #3484
We are using kubernetes-client version 5.2.1.
We observed an issue where our SharedInformer cache does not keep up with a new resource created in Kubernetes. In this scenario the SharedInformer initially synced correctly and had been running for a few days. The issue appears to originate in the cluster; possibly the cluster is somehow overwhelmed, although the system is not under heavy load. It has only happened in a live customer environment.
The issue appears to be triggered by the creation in Kubernetes of a resource of the same type as the SharedInformer. We see a socket/connection exception in the logs that appears to be related to the SharedInformer watch connection. The exception happens on an okhttp thread and is logged by the fabric8 WatcherWebSocketListener.
The main issue is that the SharedInformer continues to function with stale data even after the exception, i.e. the SharedInformer cache does not get updated with the newly created resource. We have code in our application that calls the hasSynced method on each access of the SharedInformer cache. It appears that hasSynced must be returning true, because we have observed a different scenario where the informer fails to sync, and in that scenario our application fails.
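For illustration, the pre-check pattern described above might look like the following sketch, assuming a fabric8 SharedIndexInformer watching Pods; the method name and resource type are illustrative:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public final class CacheAccess {

    // Reads a Pod from the informer cache, guarded by a hasSynced pre-check.
    // As reported in this issue, hasSynced() can remain true even while the
    // watch is broken, so this guard alone does not detect the stale-cache
    // condition.
    public static Pod getCachedPod(SharedIndexInformer<Pod> informer,
                                   String namespace, String name) {
        if (!informer.hasSynced()) {
            throw new IllegalStateException("Informer cache has not synced");
        }
        // The default informer cache key is "namespace/name".
        return informer.getIndexer().getByKey(namespace + "/" + name);
    }
}
```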
It has been mentioned that SharedInformer has an isRunning method, but it is not clear that it is intended as a reliable health check.
We would like to have a reliable health check for SharedInformer that would detect this exception.