dns: DnsResolverImpl keeps using a "broken" c-ares channel #4543
Comments
I have just experienced a very similar situation to this. Our proxy was deployed in an environment experiencing a lot of DNS failures, and at some point all DNS lookups just stopped working. We fixed the DNS issue, but the envoy instances never recovered and we had to kill them and restart. The new instances worked just fine. We also had logs similar to the above:
It seems like after some number of DNS failures, the async resolver gets into some bad state and is unable to resolve things permanently. We are also using STRICT_DNS with active health checks. Using Envoy 1.7.0 from the published docker image. |
Looking a little more, the sequence of events in our situation is:
I also have full debug logs from one instance while it was in this state if it's helpful. |
@jasonmartens, sorry, are you testing on master now? |
It sounds like there might be a bug here in how we are interacting with c-ares but I'm not sure. I would definitely try on current master and see if we can come up with a repro. |
@dio The problem I described above happened on master, though possibly on a build that was a few weeks old. |
@ramaraochavali got it, I'll take a look at it. |
I was not testing on master, using 1.7.0 from the Envoy docker image repo. |
@dio, @htuch made the DNS resolver use c-ares a long time ago, and the code really hasn't changed since then. The timeout handling is complicated in that library so I would probably start with some auditing of all the timeout code. I suspect there might be some case in which we aren't handling timeouts properly. IIRC c-ares has default timeouts in place, but I would check that also. @htuch might also have some ideas. |
Possibly the timeout handling in |
@mattklein123 @htuch got it. Let me see what I can do to help. |
@dio just a ping. Did you find anything on this? |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
@dio were you able to spend time on this? Did you find anything? |
I am hitting the same problem. Any progress on this? |
@ramaraochavali @gatesking sorry that I haven't got anything. Will update you when I have it. OTOH if you want to help, that will be nice! |
I had the same problem. Envoy static_resources config (service1 can be resolved by DNS, while service2 cannot):
Some logs (it takes almost 75s to start up):
Output of the Envoy admin endpoint /server_info:
|
Is this still an issue for anyone watching this issue? I investigated and I couldn't find anything obviously wrong. It's possible this has been fixed somehow along the way. |
It is possible that it might have been resolved along the way - We can close this and possibly revisit if someone complains about it. |
I just found this issue on one envoy 1.10.0 instance. From what we noticed in the past:
Comparing two instances with identical configuration, here is what I noticed: "Good" instance has only: "Bad" instance has: |
I suspect there is some race condition here potentially within c-ares, but I'm not sure. Reopening and marking help wanted. |
Envoy Mobile has the same issue on iOS. Steps to repro:
Config used:
I am going to be looking at this issue, as the setup above repros it 100% of the time. |
Did some late night digging yesterday and arrived at an explanation: When c-ares initializes a channel (trimming irrelevant details):
Solution:
|
By the way, it is worth noting that this would affect any cluster that uses |
Description: this PR adds logic to the DnsResolverImpl to destroy and re-initialize its c-ares channel under certain circumstances. A better option would require work in c-ares (c-ares/c-ares#301).
Risk Level: med, changes in low-level DNS resolution.
Testing: unit tests
Fixes #4543
Signed-off-by: Jose Nino <[email protected]>
I am using 1.11.0 and still see this issue. |
Yes, 1.11.0 was released before this commit went in. I believe 1.14.0 is the first version where this is fixed. |
I upgraded Ambassador, which now uses 1.15.1, and I still have this problem. |
I have resolved this; it was not an Envoy issue. It was a DNS resolution performance problem in the k8s cluster. |
Envoy checks for the c-ares ARES_ECONNREFUSED status and re-initializes the channel to cover DNS server changes in /etc/resolv.conf. But another question: sometimes the DNS server is down for a while. Once it recovers, can Envoy not recover automatically?
c-ares closes the connection when a request does not succeed and opens a new connection on the next request, so once the DNS server recovers, resolution should complete as well. |
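The behaviour described in the comments above (notice a connection-refused status, then rebuild the channel before the next lookup so it picks up the current server list) can be sketched roughly as follows. This is a minimal illustration and not Envoy's actual code: `SketchResolver`, `FakeChannel`, `LookupStatus`, and the two callbacks are hypothetical names invented for this sketch, and no real c-ares calls are made.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical status codes for this sketch. They mirror the idea of c-ares
// statuses such as ARES_ECONNREFUSED but are not the library's definitions.
enum class LookupStatus { Success, ConnectionRefused, Timeout };

// Hypothetical stand-in for a resolver channel: it snapshots the DNS server
// once, at construction time, and never refreshes it afterwards.
struct FakeChannel {
  explicit FakeChannel(std::string server) : server_(std::move(server)) {}
  std::string server_;
};

class SketchResolver {
public:
  // Returns the DNS server currently configured (e.g. from resolv.conf).
  using ServerSource = std::function<std::string()>;
  // Performs one lookup against the given server and reports the outcome.
  using Transport =
      std::function<LookupStatus(const std::string& server, const std::string& name)>;

  SketchResolver(ServerSource source, Transport transport)
      : source_(std::move(source)), transport_(std::move(transport)),
        channel_(std::make_unique<FakeChannel>(source_())) {}

  LookupStatus resolve(const std::string& name) {
    // If the previous lookup flagged the channel, destroy it and build a new
    // one so that the current server configuration is re-read.
    if (dirty_channel_) {
      channel_ = std::make_unique<FakeChannel>(source_());
      dirty_channel_ = false;
      ++reinit_count_;
    }
    const LookupStatus status = transport_(channel_->server_, name);
    // A refused connection suggests the snapshotted server is unusable, so
    // mark the channel for re-initialization on the next lookup.
    if (status == LookupStatus::ConnectionRefused) {
      dirty_channel_ = true;
    }
    return status;
  }

  int reinitCount() const { return reinit_count_; }

private:
  ServerSource source_;
  Transport transport_;
  std::unique_ptr<FakeChannel> channel_;
  bool dirty_channel_{false};
  int reinit_count_{0};
};
```

In this sketch only a refused connection marks the channel as dirty; a plain timeout leaves the existing channel in place, on the assumption stated in the comment above that the library reconnects on its own once the server is reachable again.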
We have a STRICT_DNS type of cluster defined in the bootstrap config. In one of our test Pods, the membership count of this cluster became zero. This is understandable because the DNS resolution might have resulted in zero hosts. However, it remained like this for quite a long time, and after killing the container, Envoy was able to resolve the DNS name successfully. I took debug logs while Envoy was unable to resolve it. I see the following line:
source/common/network/dns_impl.cc:118] DNS request timed out 4 times
And I see these lines repeatedly:
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 22 milliseconds
So at this point it is not clear to me whether this is an Envoy issue or a container DNS issue, since restarting the container resolved it.
Has anyone seen similar issues with DNS? Another question: is the DNS resolution timer behaviour correct, given that it is being set to 0 milliseconds?
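On the 0 millisecond question, one possible reading is that the logged value comes from converting a `timeval` into whole milliseconds, in which case any remaining time under 1 ms would be logged as 0. That is an assumption about the implementation and is not confirmed anywhere in this thread. The sketch below only illustrates the arithmetic; `timevalToMilliseconds` is a hypothetical helper, not an Envoy or c-ares function.

```cpp
#include <chrono>
#include <sys/time.h>

// Hypothetical helper: convert a timeval into whole milliseconds using
// integer arithmetic. The microsecond part is divided by 1000 and the
// remainder is discarded, so anything below 1000 microseconds becomes 0 ms.
inline std::chrono::milliseconds timevalToMilliseconds(const timeval& tv) {
  return std::chrono::milliseconds(static_cast<long long>(tv.tv_sec) * 1000 +
                                   static_cast<long long>(tv.tv_usec) / 1000);
}
```

Under that assumption, a run of 0 millisecond timers would mean the resolver believes work is due immediately (or within less than a millisecond), and would not necessarily indicate a bug in the timer itself.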