
403 (ACL not found) followed by successful deployment #862

Closed
mike-code opened this issue Nov 15, 2021 · 1 comment · Fixed by #887
Labels
type/bug Something isn't working

Comments

@mike-code

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

(This issue looks very similar to one I posted before, though the previous issue apparently involved two distinct problems: an incorrect service definition that prevented registration, and the ACL issue, which remained.)

I am running a private Kubernetes cluster on GKE. I deployed Consul (app version 1.10.3, chart version 0.36.0) and was able to access the Consul UI. I can see the servers and client agents (one on each node) running. I disregard errors during bootstrap, as these are most likely warnings rather than errors, stating that something is not ready yet.

Every time I try to add a new pod with connect-inject, I get a bunch of Unexpected response code: 403 (ACL not found) errors, followed by a successful(?) registration. I actually wonder whether this deployment should be considered successful, knowing that the ACL check probably timed out after 30 retries, yet all checks are green.

The application is responsive and I can reach it from other pods, but the pod deployment is slowed down by those ACL checks, and surely something is not right, so I'd rather not keep it that way :)

Reproduction Steps

After bootstrapping the Consul cluster (with manageSystemACLs set to true), I tried to deploy a hashicorp/http-echo:latest pod with the 'consul.hashicorp.com/connect-inject': 'true' annotation (see the YAML at the bottom of this issue).

Logs

[INFO]  Consul login complete
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
[ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
(..repeat 30 times..)
[INFO]  Registered service has been detected: service=static-server-xyz
[INFO]  Registered service has been detected: service=static-server-xyz-sidecar-proxy
[INFO]  Connect initialization completed
        Successfully applied traffic redirection rules

Expected behavior

Those errors should not be present.

Environment details

values.yaml

global:
  image: "consul:1.10.3"
  datacenter: gcp
  recursors:
    - '1.1.1.1'
    - '8.8.8.8'
  acls:
    manageSystemACLs: true
  tls:
    enabled: false

server:
  replicas: 3

connectInject:
  enabled: true
  default: true
  replicas: 1

ui:
  metrics:
    enabled: false

sample.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server-xyz
---
apiVersion: v1
kind: Service
metadata:
  name: static-server-xyz
spec:
  selector:
    app: static-server-xyz
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: static-server-xyz
  labels:
    app: static-server-xyz
  # annotations:
  #   'consul.hashicorp.com/connect-inject': 'true'
spec:
  containers:
    # This name will be the service name in Consul.
    - name: static-server-xyz
      image: hashicorp/http-echo:latest
      args:
        - -text="hello world"
        - -listen=:8080
      ports:
        - containerPort: 8080
          name: http
  # If ACLs are enabled, the serviceAccountName must match the Consul service name.
  serviceAccountName: static-server-xyz
mike-code added the type/bug label Nov 15, 2021
hamishforbes (Contributor) commented Nov 24, 2021

I'm getting this same error too, but intermittently.
Consul 1.10.4 and consul-k8s 0.37.0 on EKS.

I am wondering if this is related to ACL token caching.
The default value for that setting is 30s, and @mike-code gets 30 retries (1s apart) and then success.

My cluster has the ACL token TTL set to 60s, and I get ... 60 retries 1s apart.
If it's not related to token caching, that's an incredible coincidence.

edit: I reduced my token TTL to 10s and now it only retries for 10s. Maybe Consul needs a separate negative-cache TTL config.
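The TTL discussed here is the Consul agent's acl.token_ttl setting, which controls how long the agent caches token resolutions (including "not found" results) before re-checking with the servers. A minimal agent config fragment, assuming a standard HCL agent configuration file; the 10s value here is illustrative, matching the experiment above:

```hcl
# Consul agent configuration (HCL).
# token_ttl caps how long resolved tokens -- and cached "ACL not found"
# responses -- are kept before the agent asks the servers again.
# The default is 30s.
acl {
  enabled   = true
  token_ttl = "10s"
}
```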

ishustava added a commit that referenced this issue Dec 1, 2021
…onsul servers

Fixes #862

A consul client may reach out to a follower instead of a leader to resolve the token during the
call to get services below. This is because clients talk to servers in the stale consistency mode
to decrease the load on the servers (see https://www.consul.io/docs/architecture/consensus#stale).
In that case, it's possible that the token isn't replicated
to that server instance yet. The client will then get an "ACL not found" error
and subsequently cache this not found response. Then our call below
to get services from the agent will keep hitting the same "ACL not found" error
until the cache entry expires (determined by the `acl_token_ttl` which defaults to 30 seconds).
This is not great because it will delay app startup time by 30 seconds in most cases
(if you are running 3 servers, then the probability of ending up on a follower is close to 2/3).

To help with that, we try to first read the token in the stale consistency mode until we
get a successful response. This should not take more than 100ms because raft replication
should in most cases take less than that (see https://www.consul.io/docs/install/performance#read-write-tuning)
but we set the timeout to 2s to be sure.

Note though that this workaround does not eliminate this problem completely. It's still possible
for this call and the next call to reach different servers and those servers to have different
states from each other.
For example, this call can reach a leader and succeed, while the call below can go to a follower
that is still behind the leader and get an "ACL not found" error.
However, this is a pretty unlikely case because
clients have sticky connections to a server, and those connections get rebalanced only every 2-3min.
And so, this workaround should work in a vast majority of cases.
ishustava added a commit that referenced this issue Dec 1, 2021
…onsul servers (#887)
rrondeau pushed a commit to rrondeau/consul-k8s that referenced this issue Dec 21, 2021