OSX Fabric binaries do not resolve DNS correctly #3372

Closed
jkneubuh opened this issue May 3, 2022 · 6 comments

@jkneubuh
Contributor

jkneubuh commented May 3, 2022

Cross-compiling the Fabric binaries for OSX makes DNS resolution on the Mac extremely unpredictable. The cause is that we link the golang DNS implementation (netdns=go) into the Fabric binaries at compile time.

Due to differences between the Golang DNS resolver (go) and host DNS (cgo), the peer binary will frequently time out or lag excessively when resolving DNS. The symptom is especially problematic when resolving dynamic, wildcard DNS entries such as those provided by the *.nip.io (dead simple wildcard DNS for any IP address) or *.vcap.me domains.

This problem can be addressed (?) by building the Fabric binaries with the cgo DNS resolver linked in at compile time. With this configuration, the OSX binaries will use the normal host DNS resolution, and should be spared the frequent timeouts that occur when working with VLANs, DNS host overrides, and VPNs.

Validate the approach by compiling the peer binaries on a Mac, forcing the DNS resolver with cgo.

If this approach pans out, build the OSX binaries on GitHub Actions, which appear to include an OSX pipeline runner.

@jkneubuh
Contributor Author

jkneubuh commented May 3, 2022

cc: @denyeart @mbwhite @jt-nti

@jkneubuh
Contributor Author

jkneubuh commented May 3, 2022

Here is a small sample that illustrates the differences in the cgo and go DNS resolvers when running on OSX.

The vcap.me DNS domain is a dynamic service that includes a wildcard A record, resolving *.vcap.me to the local loopback interface at 127.0.0.1. Resolution of these names should be instantaneous, but with the go DNS resolver it frequently takes > 5s per TCP connection. This causes tremendous chaos downstream for peer and osnadmin commands working with Fabric binaries on the Mac.

package main

import (
	"fmt"
	"net/http"
	"os"
)

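func main() {
	// Fail with a usage message instead of panicking when no URL is given.
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "usage: %s <url>\n", os.Args[0])
		os.Exit(2)
	}
	url := os.Args[1]
	fmt.Printf("connecting to %s\n", url)

	// http.Get resolves the host name before it opens the TCP connection.
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintf(os.Stderr, "error reading %s: %v\n", url, err)
		os.Exit(1)
	}
	resp.Body.Close()

	fmt.Printf("OK!\n")
}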

Golang DNS resolver (slow!)

$ time GODEBUG=netdns=go+2 time go run getit.go http://foobar.vcap.me
connecting to http://foobar.vcap.me
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(foobar.vcap.me) = files,dns
OK!
        7.18 real         0.73 user         1.54 sys

real	0m7.197s
user	0m0.738s
sys	0m1.550s

Or running the peer binary (ouch!)

$ time GODEBUG=netdns=go+2  peer lifecycle     chaincode approveformyorg     --channelID     mychannel     --name          asset-transfer-basic     --version       1     --package-id    basic_1.0:b58c4a27caabaa81b2bbe47c7a9f89ff100914da9e075269e8fb65c67387e44d     --sequence      1     --orderer       org0-orderer1.vcap.me:443     --connTimeout   10s     --tls --cafile  $PWD/build/channel-msp/ordererOrganizations/org0/orderers/org0-orderer1/tls/signcerts/tls-cert.pem
go package net: GODEBUG setting forcing use of Go's resolver
go package net: hostLookupOrder(org2-peer1.vcap.me) = files,dns
go package net: hostLookupOrder(org2-peer1.vcap.me) = files,dns
go package net: hostLookupOrder(org0-orderer1.vcap.me) = files,dns
Error: proposal failed with status: 500 - failed to invoke backing implementation of 'ApproveChaincodeDefinitionForMyOrg': attempted to redefine uncommitted sequence (1) for namespace asset-transfer-basic with unchanged content

real	0m15.393s
user	0m0.056s
sys	0m0.031s

cgo (host) DNS resolver (fast!)

$ time GODEBUG=netdns=cgo+2 time go run getit.go http://foobar.vcap.me
connecting to http://foobar.vcap.me
go package net: using cgo DNS resolver
go package net: hostLookupOrder(foobar.vcap.me) = cgo
OK!
        0.95 real         0.66 user         0.88 sys

real	0m0.960s
user	0m0.663s
sys	0m0.893s

Or push the "turbo button" and call the peer:

$ time GODEBUG=netdns=cgo+2  peer lifecycle     chaincode approveformyorg     --channelID     mychannel     --name          asset-transfer-basic     --version       1     --package-id    basic_1.0:b58c4a27caabaa81b2bbe47c7a9f89ff100914da9e075269e8fb65c67387e44d     --sequence      1     --orderer       org0-orderer1.vcap.me:443     --connTimeout   10s     --tls --cafile  $PWD/build/channel-msp/ordererOrganizations/org0/orderers/org0-orderer1/tls/signcerts/tls-cert.pem
go package net: using cgo DNS resolver
go package net: hostLookupOrder(org2-peer1.vcap.me) = cgo
go package net: hostLookupOrder(org2-peer1.vcap.me) = cgo
go package net: hostLookupOrder(org0-orderer1.vcap.me) = cgo
Error: proposal failed with status: 500 - failed to invoke backing implementation of 'ApproveChaincodeDefinitionForMyOrg': attempted to redefine uncommitted sequence (1) for namespace asset-transfer-basic with unchanged content

real	0m0.329s
user	0m0.054s
sys	0m0.039s

@yacovm
Contributor

yacovm commented May 3, 2022

but when using the go DNS resolver, frequently take > 5s per TCP connection.

Why does it take 5 seconds? Does it attempt some DNS queries which timeout? Can you perhaps tcpdump and see?
I can't test it locally because I use Linux, not MacOS.

By the way, I believe this was added due to issues such as these.

@jkneubuh
Contributor Author

jkneubuh commented May 4, 2022

I sat down with a debugger and wireshark this afternoon, stepping through the golang DNS routines and packet trace / captures. It's still not clear to me if this is a race condition / deadlock (or soft-lock) in the golang DNS, or a bug in how go is parsing the UDP responses from the DNS services.

The golang resolver is issuing both an A and AAAA request in parallel to resolvers listed in /etc/resolv.conf. The 5s timeout is based on the default DNS resolver timeout of 5s and can be overridden by setting a

options timeout:N 

in /etc/resolv.conf.

The resolver blocks until both the A and AAAA responses are parsed from the UDP stream. It looks like one of the requests returns instantly, while the other triggers a Context timeout in reading the response from the DNS service. It's weird: sometimes it's the A query, and other times the IPv6 / AAAA query that times out, depending on the setting for the resolver timeout. Even though the resolver gets a response for one of the queries, it waits for the second query to complete (time out) before responding with a DNS value. So the reason for the 5s lookup is not that the DNS service is slow, but that ONE of the UDP A / AAAA queries is timing out, and the golang resolver is waiting for both requests to complete before returning a host record.

Another thing I noticed was that the IBM DNS resolvers 9.0.0.1 and 9.0.0.2 behave differently from the Google DNS 8.8.8.8 and 8.8.4.4 services. The 9.0.0.1 resolvers work predictably and consistently, and the Google DNS resolvers are sporadic. The errors are not limited to Google DNS: we've seen this crop up with default DNS networking in the UK with BT, and with AT&T in the US.

This is pretty much the exact behavior reported in golang Issue #43398. That bug description is 100% consistent with the behavior I saw with the golang resolver while attached to a debugger: A and AAAA queries are sent in parallel via UDP, the resolver waits for a reply to both, but receives only one response packet before hitting the 5s resolver deadline. The 9.0.0.x resolvers may behave more reliably because they are associated with a VPN and manage ARP entries differently than resolvers on the wild, wild Internet.

So I think the original issue was in resolving the fictitious / virtual "example.com" domains in Docker compose. There is some magic in how Docker / compose sets up the DNS resolvers on docker networks, and it's no surprise that this would lead to SIGSEGVs in the go binaries. Gari addressed this by switching to the netdns=go resolver on the clients, but that is clearly causing some issues when running the peer commands natively on OSX. I think this papered over some issues in how Compose is resolving the fictitious domains. There's probably a "better" way to solve this with Compose... but ... ... onward!

We should be OK if we switch the OSX client binaries over to use the cgo resolver, and leave the Linux ones using the go DNS resolver. When running the binaries in Compose, all of the routines will be running Linux builds in containers. I ran a LOT of tests with cgo binaries on the Mac this afternoon (native builds of fabric source), and all look great with the host network resolvers.

@denyeart would we actually need to run a Mac pipeline builder to accomplish this? Or can we put in a switch in the Makefiles to enable the cgo resolvers when cross-compiling the Darwin binaries?

@jkneubuh
Contributor Author

After fiddling for aeons with the ARM builds, CGO, go DNS, ARP cache, DNS resolvers, cross-compilers, stepping through go DNS routines with a debugger, and sleeping on it for months... I finally "get" this bug, as well as the issues linked above.

What's sad is that it only manifests in very obscure environments that tickle the golang DNS implementation in a really weird edge condition (e.g. on some of my computers and on all of @denyeart 's dev systems.) The root issue has been papered over so many times that it's largely invisible.

But the core issue is that the Fabric binaries were being prepared as dynamically linked executables, or were statically linked but still included dependencies on the libc runtimes for DNS and net packages. The original solution involved switching from the libc DNS resolver to go's DNS re-implementation, which avoided the link problems but still had other issues that came up again when we started running the binaries on the arm64 architecture on alpine, which does not technically include a stable libc runtime.

The correct thing to do may seem a bit outlandish and like overkill, but it will put all of these link/runtime/libc related issues to bed for good. (Fortunately the world of portable cross-compilers and tool chain construction has advanced significantly in the last decade, as well as CGO, which is really quite a spectacular piece of software.) This also will set us up well for alternative architectures with a consistent approach.

  1. Local dev -> build whatever (go build; go install; just go code as normal, using CGO for DNS, net, etc.)

  2. When preparing release binaries: cross-compile a static executable, including DNS. This avoids the runtime dependency on libc, but bundles in the platform-specific runtime for DNS, net, etc. packages. (We used a similar approach for building the fabric-ca-server, and it seems to be working well for generating an embedded sqlite database.) E.g.:

CC=/usr/local/musl/bin/musl-gcc go build --ldflags '-linkmode external -extldflags "-static"'

  3. When preparing Docker images, copy the (statically linked) release artifacts into the container, rather than building / linking against the current libc.

This issue is tangentially related to PR #3856 and PR #3857

@jkneubuh
Contributor Author

jkneubuh commented Feb 9, 2023

Looks like this is fixed with fabric 2.5.0-beta, the native M1 cgo static builds, and possibly some magic in the newer revs of golang.

2.5.0-beta looks great on the M1. The peer CLI no longer has a 5-10s delay! Huzzah!

@jkneubuh jkneubuh closed this as completed Feb 9, 2023