-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More robust system Tier creation / update #6696
More robust system Tier creation / update #6696
Conversation
5564a1c
to
c23379f
Compare
cmd/antrea-controller/controller.go
Outdated
// The apiserver has dependencies on multiple informer listers, which are obtained | ||
// from the networkPolicyController object when configuring the apiserver. Before | ||
// starting serving requests, it makes sense to wait for the informers to have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a potential issue but in theory the full sync of informers may not represent the readiness of the API handlers as the workers may be still processing the items.
Does it make sense to run the apiserver and let each API handlers to check if their own dependecies are ready? Then for InitializeTiers
, it could just wait for its only dependency, TierInformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, but I was mostly concerned about handlers for webhooks, which use the listers directly. In this case, there is no worker, and if the informer cache is synced, the lister should return "up-to-date" (or rather up-to-date enough) data.
You will see below that I already included some logic for InitializeTiers
to wait for TierInformer
(so it's a bit duplicate with the "generic" logic here). Since this patch is kind of specific to Tiers, I am ok with removing the changes to this file. What do you think?
I still think that the logic here is a positive improvement, but it could be done in a separate PR.
Does it make sense to run the apiserver and let each API handlers to check if their own dependecies are ready?
This was the approach I was looking at initially, but I realized it would be quite cumbersome. And because I was specifically targeting webhooks that consume listers directly, this seemed much easier. I am not sure there is a good way to wait for some condition in handlers either, whereas it is easy to do here, before starting the APIserver.
Besides validation webhooks, do you think there are handlers that would benefit from that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this patch is kind of specific to Tiers, I am ok with removing the changes to this file. What do you think?
I had two concerns with the waiting:
- I wasn't sure if it would introduce a dependency loop between kube-apiserver and antrea-controller (as it provides CRD webhooks), but thinking it more, List requests should not rely on webhooks.
- The apiserver would never run if one of the CRDs whose informers have been created is not installed, but that should be a programming error or an installation error, perhaps not make sense as a factor here.
This was the approach I was looking at initially, but I realized it would be quite cumbersome. And because I was specifically targeting webhooks that consume listers directly, this seemed much easier. I am not sure there is a good way to wait for some condition in handlers either, whereas it is easy to do here, before starting the APIserver.
Doing a cache sync check when handling every request does sound inefficient and unnecessary. Doing it once before running apiserver makes sense from this perspective. However, shouldn't we remove the check in InitializeTiers
? Otherwise people might follow it to add such checks, which would be unwanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll remove the check in InitializeTiers
as it is redundant, and I'll improve the comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if it would introduce a dependency loop between kube-apiserver and antrea-controller (as it provides CRD webhooks), but thinking it more, List requests should not rely on webhooks.
That's true for validation webhooks, but maybe not for conversion webhooks? For example, Tiers could be stored in an older version. In order to List the Tiers with the "current" version, the Controller has a dependency on itself. So it seems that your point was actually valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to confirm that this is indeed an issue for conversion webhooks. I used the following steps to reproduce:
- Install Antrea v1.15.1, with AntreaIPAM enabled
- Create an IPPool CR using version v1alpha2
- Upgrade Antrea to my dev version, which includes this patch
The antrea-controller will now fail to start:
E0930 20:31:11.963897 1 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1beta1.IPPool: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post \"https://antrea.kube-system.svc:443/convert/ippool?timeout=30s\": dial tcp 10.96.20.239:443: connect: connection refused"
W0930 20:31:28.532298 1 reflector.go:561] k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post "https://antrea.kube-system.svc:443/convert/ippool?timeout=30s": dial tcp 10.96.20.239:443: connect: connection refused
E0930 20:31:28.532622 1 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1beta1.IPPool: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post \"https://antrea.kube-system.svc:443/convert/ippool?timeout=30s\": dial tcp 10.96.20.239:443: connect: connection refused"
The controller will exit and get into CrashLoopBackOff
.
This is an edge case but it highlights the fact that it is tricky to introduce this dependency. Therefore I will update my patch to focus on Tier initialization. It seems that the best approach long term would be to find a way to declare and enforce dependencies for each handler, even though that means checking a condition before serving every request.
c23379f
to
6b90470
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]> Don't block apiserver when caches are not synced Signed-off-by: Antonin Bas <[email protected]>
6b90470
to
c40f66b
Compare
/test-all |
Thanks for the reviews! |
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to #6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to #6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to #6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister. Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation. Additionally, we improve the logic in system Tier initialization: * After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is. * AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists. Related to antrea-io#6694 Signed-off-by: Antonin Bas <[email protected]>
System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister.
Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation.
Additionally, we improve the logic in system Tier initialization:
Related to #6694