
403 (ACL not found) followed by successful deployment #862

Closed
mike-code opened this issue Nov 15, 2021 · 1 comment · Fixed by #887
Labels
type/bug Something isn't working

Comments

@mike-code

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

(This issue looks very similar to one I posted before, though the previous issue apparently involved two distinct problems: an incorrect service definition that prevented registration, and the ACL issue, which remained.)

I am running a private Kubernetes cluster on GKE. I deployed Consul (app version 1.10.3, chart version 0.36.0) and was able to access the Consul UI. I can see the servers and client agents (one on each node) running. I disregard errors during bootstrap, as these are most likely warnings rather than errors, stating that something is not ready yet.

Every time I try to add a new pod with connect-inject, I get a bunch of Unexpected response code: 403 (ACL not found) errors, followed by a successful(?) registration. I actually wonder whether this deployment should be considered successful, knowing that the ACL check probably timed out after 30 retries, yet all checks are green.

The application is responsive and I can reach it from other pods, but the pod deployment is slowed down by those ACL checks, and surely something is not right, so I'd rather not keep it that way :)

Reproduction Steps

After bootstrapping the Consul cluster (with manageSystemACLs set to true), I tried to deploy a hashicorp/http-echo:latest pod with the 'consul.hashicorp.com/connect-inject': 'true' annotation (see the YAML at the bottom of this issue).

Logs

[INFO]  Consul login complete
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
(..repeat 30 times..)
[INFO]  Registered service has been detected: service=static-server-xyz
[INFO]  Registered service has been detected: service=static-server-xyz-sidecar-proxy
[INFO]  Connect initialization completed
        Successfully applied traffic redirection rules

Expected behavior

Those errors should not be present.

Environment details

values.yaml

global:
  image: "consul:1.10.3"
  datacenter: gcp
  recursors:
    - '1.1.1.1'
    - '8.8.8.8'
  acls:
    manageSystemACLs: true
  tls:
    enabled: false

server:
  replicas: 3

connectInject:
  enabled: true
  default: true
  replicas: 1

ui:
  metrics:
    enabled: false

sample.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server-xyz
---
apiVersion: v1
kind: Service
metadata:
  name: static-server-xyz
spec:
  selector:
    app: static-server-xyz
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: static-server-xyz
  labels:
    app: static-server-xyz
  # annotations:
  #   'consul.hashicorp.com/connect-inject': 'true'
spec:
  containers:
    # This name will be the service name in Consul.
    - name: static-server-xyz
      image: hashicorp/http-echo:latest
      args:
        - -text="hello world"
        - -listen=:8080
      ports:
        - containerPort: 8080
          name: http
  # If ACLs are enabled, the serviceAccountName must match the Consul service name.
  serviceAccountName: static-server-xyz
mike-code added the type/bug label Nov 15, 2021
hamishforbes (Contributor) commented Nov 24, 2021

I'm getting this same error too, but intermittently.
Consul 1.10.4 and consul-k8s 0.37.0 on EKS.

I am wondering if this is related to ACL token caching.
The default value for that setting is 30s, and @mike-code gets 30 retries (1s apart) and then success.

My cluster has the ACL token TTL set to 60s, and I get ... 60 retries 1s apart.
If it's not related to token caching, that's an incredible coincidence.

edit: I reduced my token TTL to 10s and now it only retries for 10s. Maybe Consul needs a separate negative-cache TTL config.
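The TTL discussed here is the Consul agent's acl.token_ttl setting, which controls how long the agent caches token resolutions (including "not found" results) before re-checking with the servers. A minimal agent config fragment, assuming a standard HCL agent configuration file; the 10s value here is illustrative, matching the experiment above:

```hcl
# Consul agent configuration (HCL).
# token_ttl caps how long resolved tokens -- and cached "ACL not found"
# responses -- are kept before the agent asks the servers again.
# The default is 30s.
acl {
  enabled   = true
  token_ttl = "10s"
}
```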

ishustava added a commit that referenced this issue Dec 1, 2021
…onsul servers

Fixes #862

A consul client may reach out to a follower instead of a leader to resolve the token during the
call to get services below. This is because clients talk to servers in the stale consistency mode
to decrease the load on the servers (see https://www.consul.io/docs/architecture/consensus#stale).
In that case, it's possible that the token isn't replicated
to that server instance yet. The client will then get an "ACL not found" error
and subsequently cache this not found response. Then our call below
to get services from the agent will keep hitting the same "ACL not found" error
until the cache entry expires (determined by the `acl_token_ttl` which defaults to 30 seconds).
This is not great because it will delay app startup time by 30 seconds in most cases
(if you are running 3 servers, then the probability of ending up on a follower is close to 2/3).

To help with that, we try to first read the token in the stale consistency mode until we
get a successful response. This should not take more than 100ms because raft replication
should in most cases take less than that (see https://www.consul.io/docs/install/performance#read-write-tuning)
but we set the timeout to 2s to be sure.

Note though that this workaround does not eliminate this problem completely. It's still possible
for this call and the next call to reach different servers and those servers to have different
states from each other.
For example, this call can reach a leader and succeed, while the call below can go to a follower
that is still behind the leader and get an "ACL not found" error.
However, this is a pretty unlikely case because
clients have sticky connections to a server, and those connections get rebalanced only every 2-3min.
And so, this workaround should work in a vast majority of cases.
ishustava added a commit that referenced this issue Dec 1, 2021
…onsul servers (#887)
rrondeau pushed a commit to rrondeau/consul-k8s that referenced this issue Dec 21, 2021