Flapping consul servers emit empty data for keyprefix watches leading to KV data loss when used with consul-replicate #3975
Comments
This seems like an issue with consul-replicate specifically and not Consul, so I'm going to close this one; the discussion/fix can happen in the issue you opened in the consul-replicate repo (hashicorp/consul-replicate#82).
Hey @kyhavlov, sorry to ask for this issue to be re-opened, but I am almost certain that it is more related to Consul than to consul-replicate. I had a very similar issue with consul-template, and patterns are starting to form. Hopefully my report will help you reach the core of this problem.
I have a cluster of 5 Consul servers running version 1.2.1. I had to restart all of them for the upgrade. I have multiple agents that are clients of this cluster, and they use consul-template as well. I noticed that several of my machines that run consul-template with the stale options set successfully rendered templates to their outputs with blank data in the places where they usually render dynamic data from the KV store.
The ctmpl file looks something like this:
allow_from = {{range $index, $kv := ls "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}
From both of my reports (this one and hashicorp/consul-replicate#82), it seems to me that:
As far as I know, from my experience and per the documentation, consul-template does NOT render data if it receives an error from the Consul agent because the service is unavailable. In this case, however, it rendered blank data, which means that an empty result was returned with a success code. This is exactly the same pattern we experienced with consul-replicate, which we worked around by forcing consul-replicate to never use stale queries.
This issue is proven to exist in all versions up to 1.0.6, and I am now seeing it again in 1.2.1 with another tool, consul-template. Please let me know if I can provide you with more useful data.
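To make the failure mode concrete, here is a minimal Python sketch of what the template above produces. It is not consul-template code: `render_allow_from` is a hypothetical stand-in that mirrors the `range` / `if ne $index 0` logic, and the sample keys are invented for illustration.

```python
def render_allow_from(keys):
    """Mimic the ctmpl line: keys separated by single spaces after 'allow_from = '."""
    parts = []
    for index, key in enumerate(keys):
        if index != 0:
            parts.append(" ")  # separator is emitted before every key except the first
        parts.append(key)
    return "allow_from = " + "".join(parts)


# Hypothetical key names, only for illustration.
healthy = render_allow_from(["10.0.0.1", "10.0.0.2"])
blank = render_allow_from([])  # what a "successful" empty listing renders to
```

With an empty listing the output is still a well-formed line, just with no values after the equals sign, which is why nothing downstream flags it as an error.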
Not sure if this helps but both configs that rendered empty data used the |
safels and safetree behave exactly like the native ls and tree, with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data. This is especially useful for rendering mission-critical files that cannot tolerate ls/tree KV queries returning blank data. safels and safetree work in stale mode just like their ancestors, but we get extra safety on top. The safels and safetree commands were born as an attempt to mitigate the issues described here: hashicorp#1131 hashicorp/consul#3975 hashicorp/consul-replicate#82
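The behaviour described in that commit message can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual consul-template implementation: `safe_ls`, `EmptyPrefixError`, and the `fetch` callable are all hypothetical names.

```python
class EmptyPrefixError(Exception):
    """Raised when a prefix listing comes back empty (hypothetical error type)."""


def safe_ls(fetch, prefix):
    """Return the listing for `prefix`, but refuse to hand back an empty one.

    `fetch` is any callable that takes a prefix and returns a list of
    (key, value) pairs; it stands in for the real KV query.
    """
    pairs = fetch(prefix)
    if not pairs:
        raise EmptyPrefixError(f"prefix {prefix!r} returned no keys; not rendering")
    return pairs
```

The caller is expected to treat `EmptyPrefixError` like any other query failure, that is, keep the previously rendered file instead of overwriting it.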
Might avoid doing hashicorp/consul-template#1132
And might fix the following bugs:
* hashicorp/consul-replicate#82
* hashicorp#3975
* hashicorp/consul-template#1131
…#4554) Ensure that DB is properly initialized when performing stale queries
Addresses:
- hashicorp/consul-replicate#82
- #3975
- hashicorp/consul-template#1131
Description of the Issue (and unexpected/desired result)
When the Consul servers in the main DC are flapping, Consul may emit empty data for cross-DC keyprefix watches such as the one used by consul-replicate. As a result, consul-replicate syncs the empty data to the following data center, which erases all existing data from the KV store of the following DC. This effectively means data loss.
The issue appears to occur when the watched keyprefix holds more than ~150000 keys; at least, that is how many I have in my case.
A detailed description of the issue, with all required logs, is available at hashicorp/consul-replicate#82
I am not really sure whether this is a Consul or a consul-replicate issue, but it is quite a serious one, especially under certain conditions. In my case it led to data loss and cascading failures across all following data centers. That's why I decided to post it here, so that someone with more expertise can take a look at it.
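One way to picture the data loss described above, and a possible guard against it, is the following Python sketch. It is not how consul-replicate works internally; `plan_sync` and its `allow_empty_source` flag are hypothetical, and the KV stores are modelled as plain dictionaries.

```python
def plan_sync(source, destination, allow_empty_source=False):
    """Plan a one-way mirror of `source` into `destination`.

    Returns (to_write, to_delete). Refuses to plan anything when the source
    listing is empty while the destination still holds keys, because an empty
    listing may be a bad read rather than a real mass deletion.
    """
    if not source and destination and not allow_empty_source:
        raise RuntimeError(
            "source prefix returned no keys but destination holds "
            f"{len(destination)}; refusing to plan deletions"
        )
    to_write = {
        key: value
        for key, value in source.items()
        if key not in destination or destination[key] != value
    }
    to_delete = sorted(key for key in destination if key not in source)
    return to_write, to_delete
```

Without the guard, an empty `source` makes `to_delete` contain every key in the destination, which matches the wipe reported here. The explicit `allow_empty_source` override is a design assumption for cases where the prefix really was emptied on purpose.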
UPDATE: My latest tests are showing that this issue can NOT be recreated if I configure consul-replicate without max_stale.
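Since the problem only shows up with stale reads enabled, a client could also inspect the staleness metadata that Consul attaches to HTTP responses (the `X-Consul-KnownLeader` and `X-Consul-LastContact` headers) before trusting a result. The Python sketch below is an assumption-laden illustration: the function name and the `max_stale_ms` parameter are hypothetical, and nothing in this issue establishes that these headers would have flagged the empty responses reported here.

```python
def stale_response_is_acceptable(headers, max_stale_ms):
    """Decide whether a stale-mode response looks trustworthy.

    `headers` is a mapping of HTTP response headers. The response is accepted
    only if the answering server knows a leader and its last contact with the
    leader (in milliseconds) is within `max_stale_ms`.
    """
    known_leader = headers.get("X-Consul-KnownLeader", "false").lower() == "true"
    try:
        last_contact_ms = int(headers.get("X-Consul-LastContact", ""))
    except ValueError:
        return False  # header missing or malformed: treat as untrustworthy
    return known_leader and 0 <= last_contact_ms <= max_stale_ms
```

A caller that gets `False` back could retry the query in the default consistency mode instead of acting on the stale result.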
Reproduction steps
See hashicorp/consul-replicate#82
consul version
for both Client and Server
Client: tested and confirmed on both 0.7.5 and, later, on 1.0.6 - raft version 1
Server: tested and confirmed on both 0.7.5 and, later, on 1.0.6 - raft version 1
consul info
Parent DC servers:
Following DC servers:
Operating system and Environment details
centos 6
Log Fragments
See hashicorp/consul-replicate#82