Fix deadlock when accessing dirtyRules in fqdn controller #5566
Conversation
@Dyanngg have you managed to reproduce the issue? You mentioned "a deadlock could occur when rule sync …"
Thanks for the reminder. I reproduced the deadlock in UT when an OVS error is simulated. Now that you mention it, I realized that the same issue could also happen if a single rule has multiple FQDNs: while handling the proactive record updates for these FQDNs, one FQDN response could mark the rule dirty (since the agent has not finished rule sync yet), while the other FQDN response tries to add a subscriber for the same rule, causing a deadlock. I will add a new UT testcase validating this theory, and verify that the deadlock occurs before the fix and does not occur after it.
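To make the scenario above concrete, here is a minimal, self-contained sketch of the kind of lock-ordering deadlock being described. All names (`controller`, `ruleSyncTracker`, etc.) are illustrative assumptions, not Antrea's actual code, and the program hangs by design when run: each goroutine holds the lock the other one needs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type ruleSyncTracker struct {
	mutex      sync.Mutex
	dirtyRules map[string]bool
}

type controller struct {
	mutex   sync.Mutex
	tracker *ruleSyncTracker
}

func main() {
	c := &controller{tracker: &ruleSyncTracker{dirtyRules: map[string]bool{}}}
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine A: the FQDN response with an address update marks the rule dirty.
	go func() {
		defer wg.Done()
		c.mutex.Lock() // controller lock first...
		time.Sleep(10 * time.Millisecond)
		c.tracker.mutex.Lock() // ...then the tracker lock.
		c.tracker.dirtyRules["rule1"] = true
		c.tracker.mutex.Unlock()
		c.mutex.Unlock()
	}()

	// Goroutine B: the FQDN response without an address update adds a
	// subscriber for the same rule, taking the locks in the opposite order.
	go func() {
		defer wg.Done()
		c.tracker.mutex.Lock() // tracker lock first...
		time.Sleep(10 * time.Millisecond)
		c.mutex.Lock() // ...then the controller lock: ABBA deadlock.
		c.mutex.Unlock()
		c.tracker.mutex.Unlock()
	}()

	wg.Wait()            // never returns: both goroutines block forever
	fmt.Println("done")  // unreachable
}
```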
Force-pushed from 7a4571c to 8393f87.
@tnqn Please check the updated PR description and the latest added testcase. While the specific scenario is tricky to reproduce in a real setup (it requires two concurrent DNS record updates, one with an address change and one without), the last UT testcase reproduces the deadlock very reliably: without the change, it hangs every single time after the …
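The PR's actual testcase isn't shown here, but the usual Go pattern for making a deadlock fail a unit test deterministically (rather than hanging the suite) looks like the sketch below; `handleDNSResponse` is a hypothetical stand-in for driving the fqdn controller with two concurrent responses for the same rule.

```go
package fqdn_test

import (
	"sync"
	"testing"
	"time"
)

// handleDNSResponse is a hypothetical stand-in for the controller's
// DNS-response handling; a real test would drive the fqdn controller
// with and without an address update for the same rule.
func handleDNSResponse(ruleID string, addressUpdate bool) {}

func TestConcurrentFQDNResponsesDoNotDeadlock(t *testing.T) {
	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		wg.Add(2)
		// One response carries an address update (marks the rule dirty);
		// the other does not (it only consults the dirty set).
		go func() { defer wg.Done(); handleDNSResponse("rule1", true) }()
		go func() { defer wg.Done(); handleDNSResponse("rule1", false) }()
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.Fatal("timed out: concurrent FQDN response handling deadlocked")
	}
}
```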
@Dyanngg the code and unit test look good to me. However, could we still try to reproduce the issue in a cluster and confirm this completely fixes it, in case there is something else preventing it from working? We have encountered several issues related to NetworkPolicy (especially when FQDN rules are used) in the last few months, so we need to be more cautious before delivering one more patch release. I think in theory it can be reproduced by generating two concurrent DNS resolutions towards the same steady FQDN, the first resolution's … And I just got an update from the reporter: restarting a workload pod alone could fix the issue, though I haven't figured out why.
a nit
I figured this out: it didn't really recover. The reason the new Pod can resolve the domain should be that the realization of the FQDN rule for the new Pod was stuck in …
Trying to reproduce the issue but no luck so far. Domains that I remember having dynamic IP ranges seem to return pretty steady resolved DNS addresses. Test YAMLs used: …
Signed-off-by: Dyanngg <[email protected]>
Force-pushed from 8393f87 to 3a28b93.
/test-all
The test commands you used trigger DNS lookups sequentially, so I think they can't lead to the deadlock. While trying to construct a test case to reproduce the issue, I found that even concurrent DNS lookups can't trigger it, because PacketIn events are processed sequentially. And given there is no error about NetworkPolicy realization, I'm thinking the issue is probably more complicated and this patch will likely not fix it completely. I will share my hypothesis in #5565.
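The serialization point above can be illustrated with a minimal sketch, assuming (as the comment states, though the types here are invented for illustration) that PacketIn events are drained by a single worker: no matter how many lookups produce events, the handlers never run concurrently, so they never contend on locks.

```go
package main

import (
	"fmt"
	"sync"
)

type packetIn struct{ fqdn string }

func main() {
	events := make(chan packetIn, 16)
	var wg sync.WaitGroup
	wg.Add(1)

	// Single consumer: events are handled strictly one at a time, in
	// arrival order, regardless of how many lookups produced them.
	go func() {
		defer wg.Done()
		for ev := range events {
			fmt.Println("handling DNS response for", ev.fqdn)
		}
	}()

	events <- packetIn{"one.example.com"}
	events <- packetIn{"two.example.com"}
	close(events)
	wg.Wait()
}
```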
LGTM
@Dyanngg please backport it to 1.11-1.13.
This commit introduces a data race; #5583 will fix it. The cherry-picking PRs need to include the latter commit as well.
Fixes #5565
As described in the issue above, a deadlock could occur when:
1. A rule sync fails, and the fqdn controller retries it later even though there are no IP changes for the FQDN.
2. A single rule has multiple FQDNs, and the fqdn controller receives updated records for these FQDNs at around the same time. One of the FQDNs has address updates, so the rule needs to be resynced and is thus marked dirty. The other FQDN does not have address updates, but the controller still reads the dirtyRules to make sure that, if the rule previously failed to sync, it gets re-queued even when there are no address updates.
This PR addresses the deadlock by explicitly scoping the lock around the get-dirty-rules operation itself.
Additional UTs for these potential rule sync scenarios are also added.
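A minimal sketch of the fix as described above (the names are assumptions, not Antrea's exact types): the dirty-rules read is wrapped in its own narrowly scoped accessor that takes the tracker's lock, copies the set, and releases the lock, so callers never hold it while acquiring other locks or blocking on rule sync.

```go
package fqdn

import "sync"

// ruleSyncTracker is an illustrative stand-in for the controller's
// sync tracker.
type ruleSyncTracker struct {
	mutex      sync.RWMutex
	dirtyRules map[string]struct{}
}

// getDirtyRules holds the tracker lock only for the duration of the
// copy; the returned snapshot can be inspected freely afterwards.
func (rst *ruleSyncTracker) getDirtyRules() map[string]struct{} {
	rst.mutex.RLock()
	defer rst.mutex.RUnlock()
	out := make(map[string]struct{}, len(rst.dirtyRules))
	for r := range rst.dirtyRules {
		out[r] = struct{}{}
	}
	return out
}
```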