consul reload - watch configs leak file descriptors #3018

Closed
dhv opened this issue May 9, 2017 · 8 comments
Labels
theme/internal-cleanup (tech debt, testing improvements, code refactoring, and non-impactful optimization); type/bug (feature does not function as expected)

Comments

dhv commented May 9, 2017

consul version for both Client and Server

Client: 0.7.1
Server: 0.7.1

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = unknown
	revision =
	version = unknown
consul:
	known_servers = 2
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 126
	max_procs = 4
	os = linux
	version = go1.7.3
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 53
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 263
	members = 4
	query_queue = 0
	query_time = 14

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease = unknown
	revision =
	version = unknown
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = <redacted>:8300
	server = true
raft:
	applied_index = 342855
	commit_index = 342855
	fsm_pending = 0
	last_contact = 56.079735ms
	last_log_index = 342855
	last_log_term = 13521
	last_snapshot_index = 339382
	last_snapshot_term = 13521
	latest_configuration = [{Suffrage:Voter ID:<redacted>:8300 Address:<redacted>:8300} {Suffrage:Voter ID:<redacted>:8300 Address:<redacted>:8300}]
	latest_configuration_index = 342687
	num_peers = 1
	protocol_version = 1
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 13521
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 216
	max_procs = 4
	os = linux
	version = go1.7.3
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 53
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 263
	members = 4
	query_queue = 0
	query_time = 14
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 8
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 12.04 (Precise)

Description of the Issue (and unexpected/desired result)

  • on consul reload, Consul opens a new set of connections for each watch without closing the old ones
  • if we reload enough times with enough watch configs, Consul eventually hits the max file descriptor limit
  • "too many open files" errors then flood the log

From what I can tell the connections remain open and are never cleaned up.

I looked briefly through the source and found that this is likely because the watch plan uses consulapi with cleanhttp.DefaultPooledTransport(), so each watch holds on to its own pooled connection (see the sketch after the links below). Is there a way to close these connections on consul reload when the watch plan gets replaced?

https://github.com/hashicorp/consul/blob/master/watch/plan.go
https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go#L19
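
For reference, a minimal, hedged Go sketch (not Consul's actual code) of the behaviour described above, assuming a local agent on 127.0.0.1:8500: a pooled transport from go-cleanhttp keeps its keep-alive sockets open, and simply dropping the client that owns it (as happens when a watch plan is replaced on reload) does not release them unless the old transport closes its idle connections.

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"

	cleanhttp "github.com/hashicorp/go-cleanhttp"
)

func main() {
	// One pooled transport, as each watch plan ends up with via consulapi.
	transport := cleanhttp.DefaultPooledTransport()
	client := &http.Client{Transport: transport}

	// Any request against the local agent leaves a keep-alive connection in the pool.
	resp, err := client.Get("http://127.0.0.1:8500/v1/agent/self")
	if err == nil {
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}

	// Dropping the client/plan on reload does not close the pooled connection;
	// the old transport has to release its idle sockets explicitly.
	transport.CloseIdleConnections()
	fmt.Println("old watch transport cleaned up")
}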

Reproduction steps

# This example has 27 watch configs in multiple files

$ grep -r handler /etc/consul.d | wc -l
27

# 55 file descriptors on port 8500
$ sudo lsof | grep consul | grep 8500 | wc -l
55
$ consul reload
$ sudo lsof | grep consul | grep 8500 | wc -l
109

# file descriptors increased from 55 to 109, a difference of 54,
# which is 2 * 27 watches, since lsof lists both ends of each local connection

Log Fragments
2017/04/20 23:34:38 [WARN] agent: http request failed 'http://localhost:1991/std': Get http://localhost:1991/std: dial tcp 127.0.0.1:1991: socket: too many open files
2017/04/20 23:34:48 [WARN] agent: http request failed 'http://localhost:1991/std': Get http://localhost:1991/std: dial tcp 127.0.0.1:1991: socket: too many open files
2017/04/20 23:34:49 [ERR] memberlist: Error accepting TCP connection: accept tcp <redacted>:8301: accept4: too many open files
2017/04/20 23:34:49 [ERR] memberlist: Error accepting TCP connection: accept tcp <redacted>:8301: accept4: too many open files
@magiconair
Contributor

@dhv Do you see the same behavior with a later version, 0.7.[2345] or 0.8.[01]?


dhv commented May 9, 2017

@magiconair I reproduced this with v0.8.2 and 3 watches; connections still build up. I can't tell whether they will eventually be cleaned up, but I don't think so, and these connections seem unnecessary anyway since a new watch plan replaces the old one.

$ consul version
Consul v0.8.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
$ sudo lsof -p 14999 | wc -l
27
$ consul reload
Configuration reload triggered
$ sudo lsof -p 14999 | wc -l
33
$ consul reload
Configuration reload triggered
$ sudo lsof -p 14999 | wc -l
39
$ consul reload
Configuration reload triggered
$ sudo lsof -p 14999 | wc -l
45
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
consul  14999 dhv  cwd    DIR    8,2     4096 11534338 /home/dhv
consul  14999 dhv  rtd    DIR    8,2     4096        2 /
consul  14999 dhv  txt    REG    8,2 36750462 11539023 /home/dhv/consul
consul  14999 dhv  mem-W  REG    8,2    65536 11928798 /home/dhv/tmp/raft/raft.db
consul  14999 dhv    0u   CHR  136,2      0t0        5 /dev/pts/2
consul  14999 dhv    1u   CHR  136,2      0t0        5 /dev/pts/2
consul  14999 dhv    2u   CHR  136,2      0t0        5 /dev/pts/2
consul  14999 dhv    3r   CHR    1,9      0t0     1034 /dev/urandom
consul  14999 dhv    4u  IPv4 742337      0t0      TCP dhv-ux:8300 (LISTEN)
consul  14999 dhv    5u  0000    0,9        0     7345 anon_inode
consul  14999 dhv    6uW  REG    8,2    65536 11928798 /home/dhv/tmp/raft/raft.db
consul  14999 dhv    7u   REG    8,2      735 11928793 /home/dhv/tmp/serf/local.snapshot
consul  14999 dhv    8u  IPv4 747830      0t0      TCP dhv-ux:8301 (LISTEN)
consul  14999 dhv    9u  IPv4 747831      0t0      UDP dhv-ux:8301
consul  14999 dhv   10u   REG    8,2      672 11928825 /home/dhv/tmp/serf/remote.snapshot
consul  14999 dhv   11u  IPv4 747832      0t0      TCP dhv-ux:8302 (LISTEN)
consul  14999 dhv   12u  IPv4 747833      0t0      UDP dhv-ux:8302
consul  14999 dhv   13u  IPv4 747834      0t0      TCP dhv-ux:8500 (LISTEN)
consul  14999 dhv   14u  IPv4 745835      0t0      UDP dhv-ux:8600
consul  14999 dhv   15u  IPv4 745870      0t0      TCP dhv-ux:38705->dhv-ux:8300 (ESTABLISHED)
consul  14999 dhv   16u  IPv4 747835      0t0      TCP dhv-ux:8600 (LISTEN)
consul  14999 dhv   17u  IPv4 745851      0t0      TCP dhv-ux:38408->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   18u  IPv4 742791      0t0      TCP dhv-ux:38403->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   19u  IPv4 745837      0t0      TCP dhv-ux:8500->dhv-ux:38403 (ESTABLISHED)
consul  14999 dhv   20u  IPv4 744131      0t0      TCP dhv-ux:38404->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   21u  IPv4 742795      0t0      TCP dhv-ux:8500->dhv-ux:38404 (ESTABLISHED)
consul  14999 dhv   22u  IPv4 742800      0t0      TCP dhv-ux:38405->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   23u  IPv4 745839      0t0      TCP dhv-ux:8500->dhv-ux:38405 (ESTABLISHED)
consul  14999 dhv   24u  IPv4 738154      0t0      TCP dhv-ux:38409->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   25u  IPv4 746729      0t0      TCP dhv-ux:8500->dhv-ux:38408 (ESTABLISHED)
consul  14999 dhv   26u  IPv4 747841      0t0      TCP dhv-ux:38410->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   27u  IPv4 746730      0t0      TCP dhv-ux:8500->dhv-ux:38409 (ESTABLISHED)
consul  14999 dhv   28u  IPv4 746731      0t0      TCP dhv-ux:8500->dhv-ux:38410 (ESTABLISHED)
consul  14999 dhv   29u  IPv4 738162      0t0      TCP dhv-ux:38412->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   30u  IPv4 744172      0t0      TCP dhv-ux:38414->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   31u  IPv4 742816      0t0      TCP dhv-ux:38413->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   32u  IPv4 747845      0t0      TCP dhv-ux:8500->dhv-ux:38412 (ESTABLISHED)
consul  14999 dhv   33u  IPv4 747846      0t0      TCP dhv-ux:8500->dhv-ux:38413 (ESTABLISHED)
consul  14999 dhv   34u  IPv4 747847      0t0      TCP dhv-ux:8500->dhv-ux:38414 (ESTABLISHED)
consul  14999 dhv   35u  IPv4 745859      0t0      TCP dhv-ux:38417->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   36u  IPv4 738168      0t0      TCP dhv-ux:38418->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   37u  IPv4 742821      0t0      TCP dhv-ux:38419->dhv-ux:8500 (ESTABLISHED)
consul  14999 dhv   38u  IPv4 742822      0t0      TCP dhv-ux:8500->dhv-ux:38418 (ESTABLISHED)
consul  14999 dhv   39u  IPv4 742823      0t0      TCP dhv-ux:8500->dhv-ux:38417 (ESTABLISHED)
consul  14999 dhv   40u  IPv4 742824      0t0      TCP dhv-ux:8500->dhv-ux:38419 (ESTABLISHED)
consul  14999 dhv   41u  IPv4 746758      0t0      TCP dhv-ux:8300->dhv-ux:38705 (ESTABLISHED)

slackpad added the type/bug label May 18, 2017
slackpad added the theme/internal-cleanup label May 25, 2017
@preetapan
Contributor

@dhv thanks for the report. I reproduced the above and also noticed that the old watch handlers stay alive, so in addition to the leaky file handles, the agent keeps executing stale watch handlers. I had one watch handler writing to stdout; after three reloads, updating the watched key produced the same output three times, indicating that out-of-scope old handlers were still alive and being triggered (a rough sketch of why is below). This is a good bug find!
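
To make that observation concrete, here is a hedged, self-contained Go sketch (the fakePlan type and its fields are hypothetical stand-ins, not Consul's watch.Plan): each plan runs a loop of blocking queries and invokes its handler until it is explicitly stopped, so a plan that is replaced on reload but never stopped keeps firing alongside its replacement.

package main

import (
	"fmt"
	"sync"
	"time"
)

// fakePlan is a hypothetical stand-in for a watch plan, used only to
// illustrate the loop structure; it is not Consul's watch.Plan.
type fakePlan struct {
	mu        sync.Mutex
	stop      bool
	lastIndex uint64
	handler   func(uint64, interface{})
}

func (p *fakePlan) Stop() {
	p.mu.Lock()
	p.stop = true
	p.mu.Unlock()
}

func (p *fakePlan) stopped() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.stop
}

// query simulates a blocking query that returns a new index when data changes.
func (p *fakePlan) query(last uint64) (uint64, interface{}, error) {
	time.Sleep(100 * time.Millisecond)
	return last + 1, "updated value", nil
}

func (p *fakePlan) run() {
	for !p.stopped() {
		index, result, err := p.query(p.lastIndex)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}
		if index != p.lastIndex {
			p.lastIndex = index
			p.handler(index, result) // a stale plan still invokes its handler here
		}
	}
}

func main() {
	// Two "reloads" without stopping the first plan: both handlers fire on every
	// change, matching the duplicated handler output seen after each reload.
	mkPlan := func(name string) *fakePlan {
		return &fakePlan{handler: func(idx uint64, _ interface{}) {
			fmt.Printf("%s fired at index %d\n", name, idx)
		}}
	}
	oldPlan, newPlan := mkPlan("old plan"), mkPlan("new plan")
	go oldPlan.run()
	go newPlan.run()
	time.Sleep(350 * time.Millisecond) // oldPlan.Stop() is what the reload should have called
}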

@slackpad
Contributor

This should be fixed by #3189 - the reload handler now properly stops old watches before registering new ones. @preetapan do you have any concerns before we close this out?
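
As a rough illustration of that stop-before-register pattern, here is a hedged sketch using Consul's watch package directly; it is not the agent's actual reload code, and reloadWatches, activePlans, and the watched key are illustrative names only.

package main

import (
	"log"

	"github.com/hashicorp/consul/watch"
)

var activePlans []*watch.Plan

func reloadWatches(configs []map[string]interface{}, agentAddr string) error {
	// Stop the old plans first: Stop unblocks the long poll and ends Run, so the
	// old goroutines and their connections go away instead of leaking across reloads.
	for _, p := range activePlans {
		p.Stop()
	}
	activePlans = nil

	// Then register plans from the freshly reloaded configuration.
	for _, params := range configs {
		plan, err := watch.Parse(params)
		if err != nil {
			return err
		}
		plan.Handler = func(idx uint64, data interface{}) {
			log.Printf("watch fired at index %d", idx)
		}
		activePlans = append(activePlans, plan)
		go func(p *watch.Plan) {
			if err := p.Run(agentAddr); err != nil {
				log.Printf("[ERR] watch plan failed: %v", err)
			}
		}(plan)
	}
	return nil
}

func main() {
	cfg := []map[string]interface{}{
		{"type": "key", "key": "foo/bar"}, // hypothetical watched key
	}
	if err := reloadWatches(cfg, "127.0.0.1:8500"); err != nil {
		log.Fatal(err)
	}
	// A real agent keeps running; this sketch exits right after starting the plans.
}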


preetapan commented Jun 24, 2017

Let's leave this one open for now; I want to reproduce the file handle leak again. On Friday I saw it even after the fix in #3189 (excluding your last two commits). I think there's more to do inside the watch handler func to make sure file descriptors are closed correctly. This is an older bug that was happening before the regressions fixed by #3189.
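
For what it's worth, a hedged sketch of the kind of fd hygiene a watch handler func needs (this is not the agent's makeWatchHandler; the handler command and helper names are hypothetical): the handler shells out to the configured command and always waits for it, so the pipes and process descriptors from each invocation are released rather than accumulating.

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"os/exec"
)

// makeHandler returns a watch handler that pipes the watch result, as JSON,
// into a shell command and waits for it to finish.
func makeHandler(script string) func(uint64, interface{}) {
	return func(idx uint64, data interface{}) {
		input, err := json.Marshal(data)
		if err != nil {
			log.Printf("[ERR] watch handler: %v", err)
			return
		}
		cmd := exec.Command("/bin/sh", "-c", script)
		cmd.Stdin = bytes.NewReader(input)
		var out bytes.Buffer
		cmd.Stdout = &out
		cmd.Stderr = &out
		// Run = Start + Wait; waiting reaps the child and releases its descriptors.
		if err := cmd.Run(); err != nil {
			log.Printf("[ERR] watch handler %q failed: %v", script, err)
		}
		log.Printf("watch handler output (index %d): %s", idx, out.String())
	}
}

func main() {
	h := makeHandler("cat") // hypothetical handler command
	h(1, map[string]string{"key": "value"})
}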

@preetapan
Contributor

Fixed by PR #3048. Note that established connections on the server take up to a minute to close after "consul reload", once the clients have dropped their previously established watch connections.


toomasp commented Mar 16, 2018

This problem is present again in Consul versions 1.0.1+. Event watch connections leak until the agent service runs out of file descriptors. There are no leaks after downgrading to 1.0.0.

$ consul version
Consul v1.0.6
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

$ consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 9
build:
	prerelease =
	revision = 9a494b5
	version = 1.0.6
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 59
	max_procs = 2
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 76
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 991
	members = 91
	query_queue = 0
	query_time = 16


banks commented Mar 19, 2018

@toomasp can you open a new issue with full details of what you are seeing and a link back to this one? It's hard to conduct a new investigation on an old PR; we didn't revert the change to my knowledge, so it's likely an unrelated or subtly different code path causing your issue.
