OSX Fabric binaries do not resolve DNS correctly #3372
Comments
Here is a small sample that illustrates the differences between the two DNS resolvers:
Golang DNS resolver (slow!)
Or running the peer binary (ouch!)
cgo (host) DNS resolver (fast!)
Or push the "turbo button" and call the peer:
|
Why does it take 5 seconds? Does it attempt some DNS queries which time out? Can you perhaps run tcpdump and see? By the way, I believe this was added due to issues such as these. |
I sat down with a debugger and Wireshark this afternoon, stepping through the golang DNS routines and packet captures. It's still not clear to me if this is a race condition / deadlock (or soft-lock) in the golang DNS, or a bug in how go is parsing the UDP responses from the DNS services.

The golang resolver issues an A and an AAAA request in parallel to the resolvers listed in /etc/resolv.conf. The 5s delay comes from the default DNS resolver timeout of 5s, which can be overridden by setting a timeout option in /etc/resolv.conf. The resolver blocks until both the A and AAAA responses are parsed from the UDP stream. It looks like one of the requests returns instantly, and the other request triggers a Context timeout while reading the response from the DNS service. It's weird - sometimes it's the A query, and other times the IPv6 / AAAA query that times out, depending on the setting for the resolver timeout. Even though the resolver gets a response for one of the queries, it waits for the second one to complete (time out) before responding with a DNS value.

So the reason for the 5s lookup is not that the DNS service is slow, but that ONE of the UDP A / AAAA queries is timing out, and the golang resolver is waiting for both requests to complete before returning a host record.

Another thing I noticed was that the IBM DNS resolvers 9.0.0.1 and 9.0.0.2 behave differently from the Google DNS 8.8.8.8 and 8.8.4.4 services. The 9.0.0.x resolvers work predictably and consistently, and the Google DNS resolvers are sporadic. The errors are not limited to Google DNS - we've seen this crop up with default DNS networking in the UK with BT, and with ATT in the US. The 9.0.0.x resolvers may behave more reliably because they are associated with a VPN and manage ARP entries differently than resolvers on the wild, wild Internet.

This is pretty much the exact behavior reported in golang Issue #43398. That bug description is 100% consistent with what I saw with the golang resolver while attached to a debugger - A and AAAA queries are sent in parallel via UDP, the resolver waits for a reply to both, but receives only one response packet and returns only after the 5s resolver deadline expires.

So I think the original issue was in resolving the fictitious / virtual "example.com" domains in Docker compose. There is some magic in how Docker / compose sets up the DNS resolvers on docker networks, and it's no surprise that this would lead to SIGSEGVs in the go binaries.
Gari addressed this by switching to the Go DNS resolver. We should be OK if we switch the OSX client binaries over to use the cgo resolver. @denyeart would we actually need to run a Mac pipeline builder to accomplish this? Or can we put a switch in the Makefiles to enable the cgo resolver when cross-compiling the Darwin binaries? |
After fiddling for aeons with the ARM builds, CGO, go DNS, ARP cache, DNS resolvers, cross-compilers, stepping through go DNS routines with a debugger, and sleeping on it for months... I finally "get" this bug. And the issues such as these links above...

What's sad is that it only manifests in very obscure environments that tickle the golang DNS implementation in a really weird edge condition (e.g. on some of my computers and on all of @denyeart 's dev systems.) The root issue has been papered over so many times that it's largely invisible.

But the core issue is that the Fabric binaries were being prepared as dynamically linked executables, or were statically linked but still included dependencies on the libc runtimes for the DNS and net packages. The original solution involved switching from the libc DNS resolver to Go's DNS re-implementation, which avoided the link problems but still had other issues that came up again when we started running the binaries on the arm64 architecture on Alpine, which does not technically include a stable libc runtime.

The correct thing to do may seem a bit outlandish and like overkill, but it will put all of these link/runtime/libc related issues to bed for good. (Fortunately the world of portable cross-compilers and tool chain construction has advanced significantly in the last decade, as well as CGO, which is really quite a spectacular piece of software.) This will also set us up well for alternative architectures with a consistent approach.
|
Looks like this is fixed with Fabric 2.5.0-beta, the native M1 cgo static builds, and possibly some magic in the newer revs of golang. 2.5.0-beta looks great on the M1. The peer CLI no longer has a 5-10s delay! Huzzah! |
The cross-platform compilation of the Fabric binaries for OSX causes DNS resolution on the Mac to have some extremely unpredictable behavior. This is due to our linkage of the golang DNS implementation (netdns=go) into the Fabric binaries at compile time.

Due to differences between the Golang DNS resolver (go) and the host DNS resolver (cgo), the peer binary will frequently time out and/or experience excessive lags when resolving DNS. This symptom is problematic when resolving dynamic, wildcard DNS entries such as those provided by the *.nip.io ("Dead simple wildcard DNS for any IP Address") or *.vcap.me domains.

This problem can be addressed (?) by running a build of the Fabric binaries and linking in the cgo DNS resolver at compile time. With this configuration, the OSX binaries will use the normal host DNS resolution, and should be spared from the frequent timeouts that occur when working with VLANs, DNS host overrides, and VPNs.

Validate the approach by compiling the peer binaries on a Mac, forcing the DNS resolver with cgo.
If this approach pans out, build the OSX binaries on GitHub Actions, which appear to include an OSX pipeline runner.