
Can't deploy primary DC of federated cluster with managed system ACLs #1873

Closed · HartS opened this issue Feb 2, 2023 · 4 comments
Labels: type/bug (Something isn't working)

HartS commented Feb 2, 2023

Edit: see the last comment for the root cause. TL;DR: specifying the primaryDatacenter key on the primary datacenter causes the deployment to break.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

When Consul is deployed from the Helm charts on a primary datacenter with managed system ACLs (plus createReplicationToken), internal gRPC communication doesn't work, which causes initialization to fail.

Note: the hashicorp Helm repo is only showing the 1.0.2 charts, which is what I'm using.

Reproduction Steps

First, I'll show the working values.yaml:

connectInject:
  enabled: true
  k8sAllowNamespaces:
  - '*'
  k8sDenyNamespaces:
  - default
  - kube-system
  - kube-public
global:
  acls:
    createReplicationToken: false
    manageSystemACLs: false
  federation:
    createFederationSecret: true
    enabled: true
    primaryDatacenter: dc1
  gossipEncryption:
    autoGenerate: true
  tls:
    enableAutoEncrypt: true
    enabled: true
    httpsOnly: true
    verify: true
  domain: consul
  datacenter: dc1
meshGateway:
  enabled: true
  consulServiceName: mesh-gateway
server:
  connect: true
  storageClass: "gp2"

With the values above, Consul deploys successfully. However, for a production federated deployment, we want to set:

global:
  acls:
    createReplicationToken: true
    manageSystemACLs: true

With this change, I see the following pods:

NAME                                                  READY   STATUS      RESTARTS      AGE
consul-consul-connect-injector-68d694d94b-4j8f7       0/1     Running     1 (23s ago)   95s
consul-consul-create-federation-secret-h5vsq          1/1     Running     0             62s
consul-consul-mesh-gateway-98ffc8b79-xhkcl            0/1     Init:0/1    0             95s
consul-consul-server-0                                1/1     Running     0             94s
consul-consul-server-acl-init-cleanup-mggx2           0/1     Completed   0             90s
consul-consul-webhook-cert-manager-586dc85c8c-jwqlt   1/1     Running     0             95s

Logs from the mesh-gateway-init container of the mesh-gateway pod:

2023-02-02T18:52:18.849Z [INFO]  consul-server-connection-manager: trying to connect to a Consul server
2023-02-02T18:52:18.880Z [DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
2023-02-02T18:52:18.880Z [ERROR] consul-server-connection-manager: connection error: error="failed to discover Consul server addresses: failed to resolve DNS name: consul-consul-server.consul.svc: lookup consul-consul-server.consul.svc on 10.100.0.10:53: no such host"

(This seems to be due to the set of servers being empty: https://github.com/hashicorp/consul-server-connection-manager/blob/f9b5452b527e26e64d4606a9eeee334181ae3e4b/discovery/resolver.go#L55-L60. The "bad resolver state" message presumably comes from grpc-go, whose balancer package rejects a resolver update that carries an empty address list and surfaces balancer.ErrBadResolverState, whose text is exactly "bad resolver state".)

HartS added the type/bug label on Feb 2, 2023

HartS commented Feb 2, 2023

Setting the log level to trace:

2023-02-02T20:02:35.337Z [INFO]  consul-server-connection-manager: trying to connect to a Consul server
2023-02-02T20:02:35.337Z [TRACE] consul-server-connection-manager: Watcher.nextServer: addrs=[192.168.58.168:8502]
2023-02-02T20:02:35.342Z [DEBUG] consul-server-connection-manager: Resolved DNS name: name=consul-consul-server.consul.svc ip-addrs=["{192.168.58.168 }"]
2023-02-02T20:02:35.342Z [INFO]  consul-server-connection-manager: discovered Consul servers: addresses=[192.168.58.168:8502]
2023-02-02T20:02:35.342Z [INFO]  consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[192.168.58.168:8502]
2023-02-02T20:02:35.342Z [TRACE] consul-server-connection-manager: Watcher.connect: addr=192.168.58.168:8502
2023-02-02T20:02:35.342Z [DEBUG] consul-server-connection-manager: switching to Consul server: address=192.168.58.168:8502
2023-02-02T20:02:35.342Z [TRACE] consul-server-connection-manager: Watcher.switchServer: to=192.168.58.168:8502
2023-02-02T20:02:35.342Z [TRACE] consul-server-connection-manager: clientConnWrapper.NewSubConn: addrs=["{\n  \"Addr\": \"192.168.58.168:8502\",\n  \"ServerName\": \"\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}"]
2023-02-02T20:02:35.343Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0003f6b00}" state="{CONNECTING <nil>}"
2023-02-02T20:02:35.345Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0003f6b00}" state="{READY <nil>}"
2023-02-02T20:02:35.345Z [TRACE] consul-server-connection-manager: pickerBuilder.Build: sub-conns ready=1
2023-02-02T20:02:35.543Z [DEBUG] consul-server-connection-manager: switched to Consul server successfully: address=192.168.58.168:8502
2023-02-02T20:02:35.544Z [ERROR] consul-server-connection-manager: ACL auth method login failed: error="rpc error: code = InvalidArgument desc = auth method \"consul-consul-k8s-component-auth-method-dc1\" not found"
2023-02-02T20:02:35.544Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0003f6b00}" state="{SHUTDOWN <nil>}"
2023-02-02T20:02:35.545Z [DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
2023-02-02T20:02:35.545Z [ERROR] consul-server-connection-manager: connection error: error="rpc error: code = InvalidArgument desc = auth method \"consul-consul-k8s-component-auth-method-dc1\" not found"

HartS commented Feb 3, 2023

After several days of debugging this and trying different combinations of values, I finally figured out what was causing this to break.

  federation:
    createFederationSecret: true
    enabled: true
    primaryDatacenter: dc1
  datacenter: dc1

It turns out the primaryDatacenter value should only be specified on secondary (non-primary) datacenters; removing it makes the deploy succeed. That also fits the trace above: with primaryDatacenter set, the components presumably try to log in with the secondary-style auth method name (consul-consul-k8s-component-auth-method-dc1), which never gets created on the primary.
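
For reference, here's a sketch of how the relevant global block ends up looking on the primary datacenter with the fix applied (the remaining values are unchanged from the working values.yaml above):

global:
  acls:
    createReplicationToken: true
    manageSystemACLs: true
  federation:
    createFederationSecret: true
    enabled: true
    # no primaryDatacenter key here: omit it when deploying the primary DC
  datacenter: dc1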

Unless there's a valid configuration where global.datacenter and global.federation.primaryDatacenter have the same value, it might be good to error out early if they're equal, along the lines of the sketch below. This failure mode is currently undocumented.
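
For illustration, a guard like this could be added to one of the chart's templates (the placement and message text here are hypothetical, not from the chart; fail is a standard Helm template function that aborts rendering with an error):

{{- if and .Values.global.federation.enabled (eq .Values.global.datacenter .Values.global.federation.primaryDatacenter) }}
{{- fail "global.federation.primaryDatacenter must only be set on secondary datacenters; omit it when deploying the primary datacenter" }}
{{- end }}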

david-yu commented
@HartS It looks like this was addressed via #2652, which makes sure this does not happen.

david-yu commented
Will go ahead and close. Let us know if that is not the case, and we'll re-open.
