DNS queries with unknown datacenters can cause excessive load on consul servers and force agents to run out of file descriptors #807
Comments
Hmm, interesting. All of this is mostly expected behavior, with the exception of running out of file descriptors. I'm going to tag this as a bug to investigate that issue.
@primal-github I think this was actually caused by an unrelated issue in the connection pooling between servers. If an RPC returned an error, the connection would not be reused. In this case, an invalid domain would always cause an error, so each query would start a new internal connection. This looks to be resolved in master!
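The pooling behaviour described above can be illustrated with a minimal Go sketch. This is not Consul's actual pool code: `conn`, `pool`, `rpc`, `callDiscardOnError`, `callReuse` and `openedAfter` are hypothetical names made up for illustration, and the "connection" is just a counter. It only shows why dropping a connection whenever an RPC returns an application-level error makes every failing query open a new one.

```go
package main

import (
	"errors"
	"fmt"
)

// conn stands in for a pooled server-to-server connection (hypothetical type).
type conn struct{ id int }

// pool is a toy pool that counts how many connections it has had to open.
type pool struct {
	idle   []*conn
	opened int
}

// acquire reuses an idle connection if there is one, otherwise opens a new one.
func (p *pool) acquire() *conn {
	if n := len(p.idle); n > 0 {
		c := p.idle[n-1]
		p.idle = p.idle[:n-1]
		return c
	}
	p.opened++
	return &conn{id: p.opened}
}

// release puts a connection back so a later call can reuse it.
func (p *pool) release(c *conn) { p.idle = append(p.idle, c) }

var errNoPath = errors.New("No path to datacenter")

// rpc simulates a query for an unknown datacenter: the call always fails,
// although the underlying connection would still be perfectly usable.
func rpc(c *conn) error { return errNoPath }

// callDiscardOnError only returns the connection to the pool on success,
// so a query that always errors never gives its connection back.
func callDiscardOnError(p *pool) error {
	c := p.acquire()
	err := rpc(c)
	if err == nil {
		p.release(c)
	}
	return err
}

// callReuse returns the connection to the pool regardless of the RPC result.
func callReuse(p *pool) error {
	c := p.acquire()
	defer p.release(c)
	return rpc(c)
}

// openedAfter runs n failing calls against a fresh pool and reports how many
// connections the pool had to open.
func openedAfter(n int, call func(*pool) error) int {
	p := &pool{}
	for i := 0; i < n; i++ {
		_ = call(p)
	}
	return p.opened
}

func main() {
	fmt.Println(openedAfter(100, callDiscardOnError)) // prints 100
	fmt.Println(openedAfter(100, callReuse))          // prints 1
}
```

In a real process each of those extra connections would hold a socket, which is consistent with the file descriptor exhaustion reported in this issue.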
I'm just closing for now, but please comment / re-open if you see this again!
Might this be related to #688?
@frankfarmer In this case they were both co-occurring. As @armon mentioned this was likely caused by the lack of connection reuse, which may have in turn triggered excessive file descriptors being used. We addressed the cause (fixed our dns lookups) so we haven't had the urge to replicate it again.
Interesting, because I'm seeing the same warnings coming in tens per second except in my case the DC has the correct name. This is on 0.9.3
If a consul agent receives DNS queries of the form someservice.service.falsedc.domain.consul, these queries will cause excessive load on the consul servers along with log lines of the form

[WARN] consul.rpc: RPC request for DC 'falsedc', no path found

At a glance it seems like the server should fail early when it cannot find the datacenter, but instead it recurses until the request's TTL is reached and the request is dropped.

Furthermore, the agent that received the query will show log lines of the form

[ERR] dns: rpc error: rpc error: No path to datacenter

In addition, if the agent receives these queries at a moderate rate it will eventually run out of file descriptors. I suspect that perhaps a new socket is opened for each pending query. This is not necessarily bad as responses should be fast, but the first part of this issue causes consul to open more and more sockets until it can't open any more. The errors from this scenario also cause the consul agent to write gigabytes of logs within minutes.

The issue can be replicated on a Linux system which has the consul agent set as its nameserver (e.g. via binding to port 53 or via dnsmasq) by adding domain.consul to the search domains in /etc/resolv.conf (e.g. search domain.consul) and running queries of the format someservice.service.domain.consul, which get expanded by the resolver to someservice.service.domain.consul.domain.consul. However, I'm fairly certain that this is just a special case, and that the issue should be reproducible with any nonexisting datacenter and any consul domain.
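To make the search-domain expansion in the reproduction steps concrete, here is a small Go sketch. It is a simplification and not how any particular resolver is implemented: `candidateNames` is a hypothetical helper that lists the name as given followed by the name with each search suffix appended, and it ignores details such as the ndots option, trailing dots and lookup ordering.

```go
package main

import "fmt"

// candidateNames returns the query name as given, followed by the name with
// each search domain appended. This is a simplified model of how a stub
// resolver applies the "search" list from /etc/resolv.conf.
func candidateNames(name string, search []string) []string {
	names := []string{name}
	for _, suffix := range search {
		names = append(names, name+"."+suffix)
	}
	return names
}

func main() {
	// Mirrors the setup from the issue: "search domain.consul" in resolv.conf.
	for _, n := range candidateNames("someservice.service.domain.consul", []string{"domain.consul"}) {
		fmt.Println(n)
	}
}
```

The second name printed is the expanded form from the report, someservice.service.domain.consul.domain.consul, which the agent then receives as a query containing extra labels in front of its domain.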