
Flapping consul servers emit empty data for keyprefix watches leading to KV data loss when used with consul-replicate #3975

Closed
vaLski opened this issue Mar 21, 2018 · 3 comments

Comments


vaLski commented Mar 21, 2018

Description of the Issue (and unexpected/desired result)

When consul servers in the main DC are flapping, consul may emit empty data for cross-DC keyprefix watches such as the one used by consul-replicate. As a result, consul-replicate syncs the empty data to the following data center, causing all old data to be erased from the following DC's KV store, which effectively means data loss.

It appears to happen when the watched keyprefix holds a large number of keys; in my case it holds roughly 150,000.

A detailed description of the issue with all required logs is available at hashicorp/consul-replicate#82.

I am not really sure whether this is a consul or a consul-replicate issue, but it is quite a serious one, especially under certain conditions. In my case it led to data loss and cascading failures across all following data centers. That is why I decided to post it here, so someone with more expertise can take a look at it.

UPDATE: My latest tests show that this issue can NOT be reproduced if I configure consul-replicate without max_stale.
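For reference, a minimal sketch of the relevant consul-replicate configuration (HCL; the address, prefix, and DC name below are examples, not my real values):

    # consul-replicate configuration sketch -- example values only
    consul    = "127.0.0.1:8500"   # local Consul agent to talk to
    max_stale = "10m"              # permits stale reads; removing this line avoided the bug

    prefix {
      # replicate this keyprefix from the parent DC into the local (following) DC
      source = "pub/server@parentdc"
    }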

Reproduction steps

See hashicorp/consul-replicate#82

consul version for both Client and Server

Client: tested and confirmed on both 0.7.5 and later on 1.0.6 - raft version 1
Server: tested and confirmed on both 0.7.5 and later on 1.0.6 - raft version 1

consul info

Parent DC servers:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 2
build:
	prerelease = 
	revision = 9a494b5f
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr = 146.66.x.x:8300
	server = true
raft:
	applied_index = 6229519168
	commit_index = 6229519168
	fsm_pending = 0
	last_contact = 39.692447ms
	last_log_index = 6229519168
	last_log_term = 346
	last_snapshot_index = 6229518718
	last_snapshot_term = 346
	latest_configuration = [{Suffrage:Voter ID:146.66.x.x:8300 Address:146.66.x.x:8300} {Suffrage:Voter ID:146.66.x.x:8300 Address:146.66.x.x:8300} {Suffrage:Voter ID:146.66.x.x:8300 Address:146.66.x.x:8300}]
	latest_configuration_index = 6229505644
	num_peers = 2
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 346
runtime:
	arch = amd64
	cpu_count = 48
	goroutines = 82
	max_procs = 48
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 312
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 9
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 474
	members = 6
	query_queue = 0
	query_time = 1

Following DC servers:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 2
build:
	prerelease = 
	revision = 9a494b5f
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 2
	leader = true
	leader_addr = 77.104.x.x:8300
	server = true
raft:
	applied_index = 1542247
	commit_index = 1542247
	fsm_pending = 0
	last_contact = 0
	last_log_index = 1542247
	last_log_term = 10
	last_snapshot_index = 1534553
	last_snapshot_term = 10
	latest_configuration = [{Suffrage:Voter ID:77.104.x.x:8300 Address:77.104.x.x:8300} {Suffrage:Voter ID:77.104.x.x:8300 Address:77.104.x.x:8300} {Suffrage:Voter ID:77.104.x.x:8300 Address:77.104.x.x:8300}]
	latest_configuration_index = 1535032
	num_peers = 2
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 10
runtime:
	arch = amd64
	cpu_count = 56
	goroutines = 116
	max_procs = 56
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 7
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 33
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 474
	members = 6
	query_queue = 0
	query_time = 1

Operating system and Environment details

CentOS 6

Log Fragments

See hashicorp/consul-replicate#82

@kyhavlov (Contributor) commented

This seems like an issue with consul-replicate specifically and not Consul, so I'm going to close this one; the discussion/fix can happen in the issue you opened in the consul-replicate repo (hashicorp/consul-replicate#82).


vaLski commented Aug 1, 2018

Hey @kyhavlov

Sorry to re-open this issue, but I am almost certain that it is more related to the Consul servers themselves than to consul-replicate.

I hit a very similar issue using consul-template, and a pattern is starting to form. Hopefully my report will help you get to the core of this problem.

I have a cluster of 5 Consul servers running version 1.2.1. I had to restart all of them for the upgrade.

I have multiple agents that are clients of this cluster, and they are utilizing consul-template as well.

I noticed that several of my machines running consul-template with the stale option set successfully rendered their templates, but with blank data in the places where they usually render dynamic data from the KV store.

The .ctmpl file looks something like this:

allow_from = {{range $index, $kv := ls "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}
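To illustrate the failure (hypothetical key names): with keys 10.0.0.1 and 10.0.0.2 under that prefix, the template normally renders

    allow_from = 10.0.0.1 10.0.0.2

but in the failure case it rendered only

    allow_from =

even though the keys were still present in the KV store.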

From both of my reports (this and hashicorp/consul-replicate#82), it seems to me that:

  • in certain scenarios
  • when querying the KV store with the stale flag set, via ls (and possibly other functions)
  • during a Consul raft leader switch caused by a routine restart, flapping, or an outage
  • the Consul server handling the in-flight stale KV query returns "blank" data as if it were legitimate, where it should instead return a service-unavailable error.

As far as I know, from my experience and per the documentation, consul-template does NOT render data if it receives a service-unavailable error from the Consul agent.

However, in this case it rendered blank data, which means the blank result was returned with a success code.

This is exactly the same pattern we experienced with consul-replicate, which we worked around by forcing consul-replicate to never use stale queries. This issue is proven to exist on all versions up to 1.0.6, and I am now seeing it again in 1.2.1 with another tool, consul-template.
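The equivalent workaround for consul-template is presumably to pin max_stale to zero in its configuration. A sketch (the file paths are hypothetical, and that a zero max_stale forces all reads through the leader is my assumption from the docs):

    # consul-template configuration sketch -- hypothetical paths
    max_stale = "0s"   # assumption: zero disables stale queries entirely

    template {
      source      = "/etc/consul-template/allow_from.ctmpl"
      destination = "/etc/myapp/allow_from.conf"
    }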

Please let me know if I can provide you with more valuable data.


vaLski commented Aug 2, 2018

Not sure if this helps, but both configs that rendered empty data used range over ls.

vaLski added a commit to vaLski/consul-template that referenced this issue Aug 15, 2018
safels and safetree behave exactly like the native ls and tree with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data.

This is especially useful for rendering mission-critical files that cannot tolerate ls/tree KV queries returning blank data.

safels and safetree work in stale mode just like their ancestors, but we get extra safety on top.

safels and safetree commands were born as an attempt to mitigate issues described here:

  hashicorp#1131
  hashicorp/consul#3975
  hashicorp/consul-replicate#82
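Per the commit description, usage is a drop-in replacement for the ls call in the template above:

    allow_from = {{range $index, $kv := safels "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}

If the prefix query comes back empty, rendering is refused rather than writing out a blank allow_from list.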
vaLski added a commit to vaLski/consul-template that referenced this issue Aug 15, 2018 (same commit message as above)
freddygv pushed a commit that referenced this issue Aug 23, 2018
eikenb pushed a commit to hashicorp/consul-template that referenced this issue Sep 10, 2019 (same commit message as above)