Use a custom authority for each connection in a single ClientConn #3038

Closed
jpbetz opened this issue Sep 25, 2019 · 17 comments · Fixed by #3073
Labels
P2, Type: Feature

Comments


jpbetz commented Sep 25, 2019

This issue was originally reported in kubernetes/kubernetes#83028 and is also tracked in etcd via etcd-io/etcd#11180.

What version of gRPC are you using?

grpc v1.23.1

What version of Go are you using (go version)?

go1.12.2

What operating system (Linux, Windows, …) and version?

Debian Linux

What did you do?

Created 3 etcd servers, each with its own hostname as the Subject Alternative Name in its cert, e.g.:

Certificate:
    Data:
    ...
    X509v3 extensions:
        X509v3 Subject Alternative Name: 
                DNS:member1.etcd.local

Then attempted to load balance across them with an etcd client that uses the gRPC load balancer and has the correct CA cert.

The etcd issue is reproduced by: https://github.com/jpbetz/etcd/blob/etcd-lb-dnsname-failover/reproduction.md

But if we can come up with a gRPC example that solves the problem, we can apply that to etcd, so we created the load_balancing example with TLS enabled to highlight the gRPC issue in isolation.

What did you expect to see?

Client load balances across the servers.

What did you see instead?

For the Kubernetes/etcd case:

W0924 14:42:29.368784 248333 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://member3.etcd.local:32379 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for member3.etcd.local, not member1.etcd.local"

For the gRPC load balancer example with TLS enabled:

--- calling helloworld.Greeter/SayHello with pick_first ---
2019/09/24 17:58:49 could not greet: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for member2.etcd.local, not lb.example.grpc.io"
exit status 1

cc @gyuho @liggitt


jpbetz commented Sep 25, 2019

Looks like there might be a way to keep track of the hostnames from the originally registered addresses: use a DialContext to capture them and a custom TransportCredentials to replace the authority with them. It's ugly though.


jpbetz commented Sep 25, 2019

Here's the potential workaround using the DialContext and TransportCredentials approach: https://github.com/etcd-io/etcd/pull/11184/files#diff-a852e6c5f9d2efdb2e0d58912aa25cc4R657

Could this be done more cleanly?
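
For reference, the shape of that workaround is roughly the following. This is a minimal sketch, not the actual etcd patch; the type and function names (authorityOverrideCreds, dialWithPerEndpointAuthority, recordHost) are made up for illustration, and it inherits the ugly part noted above (concurrent connection attempts can race on the recorded hostname). The dialer records the hostname being dialed, and a wrapping TransportCredentials substitutes it for the authority in ClientHandshake, relying on the two being called in series for each connection:

package main

import (
    "context"
    "crypto/tls"
    "net"
    "sync"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

// authorityOverrideCreds wraps real TLS credentials; the authority passed to
// ClientHandshake is replaced by the hostname most recently seen by the dialer.
type authorityOverrideCreds struct {
    credentials.TransportCredentials // e.g. credentials.NewTLS(tlsConfig)
    mu       sync.Mutex
    lastHost string // hostname captured by the dialer for the connection being set up
}

func (c *authorityOverrideCreds) recordHost(addr string) {
    host, _, err := net.SplitHostPort(addr)
    if err != nil {
        host = addr
    }
    c.mu.Lock()
    c.lastHost = host
    c.mu.Unlock()
}

func (c *authorityOverrideCreds) ClientHandshake(ctx context.Context, authority string, conn net.Conn) (net.Conn, credentials.AuthInfo, error) {
    c.mu.Lock()
    if c.lastHost != "" {
        authority = c.lastHost // verify the server cert against the endpoint's own hostname
    }
    c.mu.Unlock()
    return c.TransportCredentials.ClientHandshake(ctx, authority, conn)
}

func dialWithPerEndpointAuthority(target string, tlsConfig *tls.Config) (*grpc.ClientConn, error) {
    creds := &authorityOverrideCreds{TransportCredentials: credentials.NewTLS(tlsConfig)}
    return grpc.Dial(target,
        grpc.WithTransportCredentials(creds),
        // The dialer runs before ClientHandshake for a given connection, so it
        // can record which endpoint is being connected to.
        grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
            creds.recordHost(addr)
            return (&net.Dialer{}).DialContext(ctx, "tcp", addr)
        }),
    )
}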


ejona86 commented Sep 25, 2019

How are the member*.etcd.local names discovered? Are they hard-coded on client-side? Where is address resolution done to look up the IP addresses?

So the problem is you need to load balance over multiple machines that share no common subject alternative name?


jpbetz commented Sep 25, 2019

How are the member*.etcd.local names discovered? Are they hard-coded on client-side? Where is address resolution done to look up the IP addresses?

etcd clients are configured with a list of endpoints, e.g.: --endpoints=member1.etcd.local:2379,member2.etcd.local:2379,member3.etcd.local:2379 (the go client is configured in the same way, just with a go slice of endpoint strings). The address resolution is done by the dialer. In the case of etcd it's here: https://github.com/etcd-io/etcd/blob/efd1fc634b58a629903990e605f2cb9d5633706d/clientv3/client.go#L234
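
For illustration, the Go client configuration looks roughly like this (a sketch using the clientv3 API; the endpoint names and the tlsConfig variable are placeholders):

    cli, err := clientv3.New(clientv3.Config{
        // One entry per member, each with its own DNS name.
        Endpoints: []string{
            "member1.etcd.local:2379",
            "member2.etcd.local:2379",
            "member3.etcd.local:2379",
        },
        TLS: tlsConfig, // *tls.Config built from the cluster's CA cert
    })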

So the problem is you need to load balance over multiple machines that share no common subject alternative name?

Exactly. And it can be safer to use distinct subject alternative names for each etcd server in a cluster. We just put together a workaround (etcd-io/etcd#11184). The summary there might help explain as well.


jpbetz commented Sep 25, 2019

@ejona86 Also, it would be great to have a cleaner approach where the authority could vary across the endpoints without having to capture data in the dialer and look it up in ClientHandshake, relying on the fact that they're called in series when the client is created:

func newHTTP2Client(connectCtx, ctx context.Context, addr TargetInfo, opts ConnectOptions, onPrefaceReceipt func(), onGoAway func(GoAwayReason), onClose func()) (_ *http2Client, err error) {


dfawley commented Sep 25, 2019

Exactly. And it can be safer to use distinct subject alternative names for each etcd server in a cluster.

Why is that? What do you mean by "safer"? It seems in this case, the most natural option would be to specify a single SAN for all of the servers which references the service name the gRPC client uses as its dial target. The service name should ideally resolve to multiple addresses in DNS (one for each host which has this certificate deployed).

I.e.:

/etc/hosts:

1.2.3.4   etcd.local # member1.etcd.local
1.2.3.5   etcd.local # member2.etcd.local
1.2.3.6   etcd.local # member3.etcd.local

gRPC client:

  grpc.Dial("dns:///etcd.local", WithTransportCredentials(tlsCreds))

Assuming the server certs are all signed with "etcd.local" as their SAN, then everything else should "just work".


jpbetz commented Sep 25, 2019

etcd's client configuration options predate etcd's adoption of the gRPC load balancer. There is a discovery option much like the "dns:///etcd.local" option above, which is very convenient. There is also an endpoint list option, e.g.:

--endpoints=member1.etcd.xyz:2379,member2.etcd.xyz:2379,member3.etcd.xyz:2379

etcd supports normal TLS usage. Different users have different environments and security policies they need to adhere to, so etcd doesn't attempt to be too opinionated about how TLS should be used. We document configurations that are secure, but allow other configurations that TLS supports. If the user wishes to use a shared SAN like "etcd.local", that's fine. If they wish to use a per-server SAN, that's fine too. etcd has supported this in the past, and currently supports per-server IP SANs with the load balancer, but not per-server DNS name SANs.

FWIW, using per-server SANs has prevented configuration mistakes: misconfigured networks where certs were misplaced or hostnames were mis-assigned were only noticed because the cert checks failed and warned about the misconfiguration. But more commonly, it's just a security requirement.


nerzhul commented Sep 26, 2019

Yes, I'm definitely in that use case of having a per-node DNS SAN, sorry :)

dfawley changed the title from "gPRCs load balancer fails when servers use DNS names in cert SAN" to "Use a custom authority for each connection in a single ClientConn" Sep 26, 2019
dfawley added the Type: Feature label and removed the Type: Bug label Sep 26, 2019

dfawley commented Sep 26, 2019

@markdroth what is C capable of here? Per @ejona86, Java resolvers have the ability to configure the authority for each address. But @ejona86 notes that this is a potential security hole if not used carefully. It is also a non-standard use-case for gRPC, so I'm not sure whether it's something we want to support. If so, we should be able to do something similar to the Java approach in Go.
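
Roughly, the Go analog would be the resolver attaching a per-address authority when it reports addresses, something like this (a sketch only, not a committed design; cc is the resolver.ClientConn passed to the resolver's Build, and the addresses and names are illustrative):

    addrs := []resolver.Address{
        {Addr: "10.0.0.1:2379", ServerName: "member1.etcd.local"},
        {Addr: "10.0.0.2:2379", ServerName: "member2.etcd.local"},
        {Addr: "10.0.0.3:2379", ServerName: "member3.etcd.local"},
    }
    cc.UpdateState(resolver.State{Addresses: addrs})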

markdroth commented

C-core has the plumbing to do this, but the only reason we have it is that we needed it for grpclb, where it was possible for each balancer to have its own name. Basically, the resolver can return an optional "target authority table", which maps addresses to the corresponding authority. Then, when the transport goes to create a subchannel, it looks to see if this table is present and the subchannel address is present in the table, and if so it uses the corresponding authority instead of the default one based on the channel's target URI.

I agree with Eric that this could be a security issue. Until now, we haven't worried about this very much in C-core, because our resolver API is still not public. But we will need to figure out how to address this when we open up the API.

We should probably decide whether we want to (a) try to change the resolver API to prevent this somehow or (b) loudly document the fact that using this functionality can cause a security problem and make it the responsibility of the resolver author to do the right thing.


ejona86 commented Sep 26, 2019

This was done very recently in grpc-java, in grpc/grpc-java#6126. But I'll note it had been discussed much earlier than that. The main reason for the delay is there were no name resolvers that could use it securely. gsd was the first case where we considered it seriously.

We were going to use the documentation approach. Although in our case you have to go out of your way to do this; it is a separate API from the normal authority setting.

markdroth commented

It's a different API in C-core as well, not something you would get by accident.

I'm fine with making this available but documenting that resolver implementations are responsible for the security implications.


dfawley commented Oct 3, 2019

Since Java & C allow this, we would be willing to accept a PR to add this, but it is a low priority for us given that it is a non-standard use case. I believe the implementation would be as simple as changing this line:

Authority: ac.cc.authority,

so that the authority is set from the address's ServerName field instead, if set.
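
In other words, roughly the following where that TargetInfo is built (a sketch of the suggested change, not the final PR):

    // Prefer the per-address authority from the resolver, if it set one;
    // otherwise fall back to the ClientConn-wide authority as today.
    authority := ac.cc.authority
    if addr.ServerName != "" {
        authority = addr.ServerName
    }
    target := transport.TargetInfo{
        Addr:      addr.Addr,
        Metadata:  addr.Metadata,
        Authority: authority,
    }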


jpbetz commented Oct 4, 2019

Thanks @dfawley I'll give it a shot.


nerzhul commented Oct 4, 2019

I don't know if you saw, but etcd-io/etcd#11180 (comment) has been patched with a fix on the SAN.


jpbetz commented Oct 4, 2019

I'd prefer to replace the current etcd patch (which is quite ugly) with an approach that uses direct gRPC support, if possible.


jpbetz commented Oct 4, 2019

PR opened (incl. a test): #3073

lock bot locked as resolved and limited conversation to collaborators Apr 15, 2020