403 (ACL not found) followed by successful deployment #862
Comments
I'm getting this same error too, but intermittently. I am wondering if this is possibly related to ACL token caching: my cluster has ACL token TTL set to 60s and I get ... 60 retries 1s apart.

edit: I reduced my token TTL to 10s and now it only retries for 10s. Maybe Consul needs a separate negative-cache TTL config.
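For reference, the behavior described in this comment is governed by the agent-side ACL token TTL. A minimal agent config fragment illustrating the setting (this assumes the standard `acl` stanza; check the docs for your Consul version, as older releases used the legacy `acl_token_ttl` top-level key) might look like:

```hcl
acl {
  enabled = true
  # How long token lookup results -- including cached "ACL not found"
  # responses -- are held before the agent re-queries the servers.
  token_ttl = "10s"
}
```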
…onsul servers (#887)

Fixes #862

A Consul client may reach out to a follower instead of a leader to resolve the token during the call to get services. This is because clients talk to servers in the stale consistency mode to decrease the load on the servers (see https://www.consul.io/docs/architecture/consensus#stale). In that case, it's possible that the token isn't replicated to that server instance yet. The client will then get an "ACL not found" error and subsequently cache this not-found response. Then our call to get services from the agent will keep hitting the same "ACL not found" error until the cache entry expires (determined by the `acl_token_ttl`, which defaults to 30 seconds). This is not great because it will delay app start-up time by 30 seconds in most cases (if you are running 3 servers, the probability of ending up on a follower is close to 2/3).

To help with that, we first try to read the token in the stale consistency mode until we get a successful response. This should not take more than 100ms, because Raft replication should in most cases take less than that (see https://www.consul.io/docs/install/performance#read-write-tuning), but we set the timeout to 2s to be sure.

Note, though, that this workaround does not eliminate the problem completely. It's still possible for this call and the next call to reach different servers whose states differ from each other. For example, the get-token call can reach a leader and succeed, while the call to get services can go to a follower that is still behind the leader and get an "ACL not found" error. However, this is a pretty unlikely case because clients have sticky connections to a server, and those connections get rebalanced only every 2-3 min. And so, this workaround should work in a vast majority of cases.
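The workaround described above can be sketched as a small retry loop. This is an illustration of the pattern, not the project's actual code: `read_token_stale` stands in for a stale-consistency read of the token against a Consul server (e.g. `GET /v1/acl/token/self?stale`), and the fake reader in the demo simulates a lagging follower that only sees the token on the third read.

```python
import time


class ACLNotFoundError(Exception):
    """Raised when a server has not yet replicated the ACL token."""


def wait_for_token_replication(read_token_stale, timeout=2.0, interval=0.1):
    """Poll a stale-mode token read until it succeeds or the deadline passes.

    read_token_stale: a callable that performs a stale-consistency read of
    the token and raises ACLNotFoundError if the server doesn't know it yet.
    Returns the token on success; re-raises the last error on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return read_token_stale()
        except ACLNotFoundError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# Demo with a fake reader: the first two reads hit a lagging follower,
# the third succeeds once replication has caught up.
attempts = {"n": 0}

def fake_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ACLNotFoundError("ACL not found")
    return {"AccessorID": "demo-token"}

token = wait_for_token_replication(fake_read)
print(attempts["n"], token["AccessorID"])  # → 3 demo-token
```

Because the retry interval is small relative to the 2s deadline, a briefly lagging follower adds only a fraction of a second to start-up instead of a full token-TTL wait.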
Overview of the Issue
(this issue looks very similar to the one I posted before, though the previous issue apparently revolved around two different problems: one being an incorrect service, which prevented registration, whilst the ACL issue remained)
I am running a private Kubernetes cluster on GKE. I deployed Consul (app version 1.10.3, chart version 0.36.0) and I was able to access the Consul UI. I can see the servers and client agents (one on each node) running. I disregard errors during bootstrap, as these are most likely warnings rather than errors, stating that something is not ready yet.

Every time I try to add a new pod with `connect-inject`, I get a bunch of `Unexpected response code: 403 (ACL not found)` errors followed by successful(?) registration. I actually wonder whether this deployment should be considered successful, knowing that this ACL check probably timed out after 30 retries, yet all checks are green.

The application is responsive and I can reach it from other pods, yet the pod deployment is slowed down by those ACL checks, and surely something is not right, so I'd rather not keep it that way :)
Reproduction Steps
After bootstrapping the Consul cluster (with `manageSystemACLs` set to `true`) I tried to deploy a `hashicorp/http-echo:latest` pod with the `'consul.hashicorp.com/connect-inject': 'true'` annotation (see YAML at the bottom of this issue).

Logs
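The `sample.yaml` referenced above isn't reproduced in this scrape. A minimal pod carrying the connect-inject annotation would look roughly like the following; the pod name and `-text` argument are hypothetical placeholders, not the author's original manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: http-echo  # hypothetical name
  annotations:
    'consul.hashicorp.com/connect-inject': 'true'
spec:
  containers:
    - name: http-echo
      image: hashicorp/http-echo:latest
      args: ['-text=hello']  # hypothetical argument
```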
Expected behavior
Those errors should not be present.
Environment details
values.yaml
sample.yaml