
too many open files - and consul fills up log #688

Closed
balexx opened this issue Feb 10, 2015 · 24 comments

@balexx

balexx commented Feb 10, 2015

Whenever a system reaches a situation with "too many open files" (which can happen very fast in some scenarios), consul starts throwing these errors into the logfile:

2015/02/09 21:25:42 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:8301: too many open files in system

The errors repeat tens (hundreds?) of times a second, filling up the filesystem before anyone can react. It would really be useful if repeated messages were aggregated, which would avoid this problem.
Alternatively, logging this at a much lower frequency would also help.
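
For illustration, here is a rough Go sketch of the kind of aggregation being asked for (entirely hypothetical, not Consul's actual logging code): emit a message the first time it appears, count identical repeats, and flush a "repeated N times" summary instead of writing every occurrence.

package main

import (
    "fmt"
    "log"
    "time"
)

// dedupLogger collapses runs of identical messages into one line plus a
// repeat count, instead of writing every occurrence to the log.
type dedupLogger struct {
    last     string
    repeats  int
    lastEmit time.Time
}

// Logf emits a message the first time it is seen, suppresses identical
// messages for the given interval, and prints a summary when the run ends.
func (d *dedupLogger) Logf(interval time.Duration, format string, args ...interface{}) {
    msg := fmt.Sprintf(format, args...)
    now := time.Now()
    if msg == d.last && now.Sub(d.lastEmit) < interval {
        d.repeats++
        return
    }
    if d.repeats > 0 {
        log.Printf("last message repeated %d times", d.repeats)
    }
    log.Print(msg)
    d.last, d.repeats, d.lastEmit = msg, 0, now
}

func main() {
    var d dedupLogger
    for i := 0; i < 1000; i++ {
        d.Logf(time.Minute, "[ERR] memberlist: Error accepting TCP connection: too many open files in system")
    }
    d.Logf(time.Minute, "shutting down") // flushes the "repeated 999 times" summary
}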

@dpkirchner

In addition, consul does not (cannot?) pick a new leader if the leader is the one that is out of file descriptors.

In our scenario we have some 50 agents, some of which are connected to the leader once, others 100 times. We haven't figured out why yet.

@armon
Member

armon commented Feb 17, 2015

@balexx @dpkirchner There was a regression that caused a lot of connections to be maintained, which is now fixed in master for the 0.5 release. Sorry about that!

@armon closed this as completed Feb 17, 2015
@tgwizard
Contributor

We just had the same issue. Great that it's fixed! @armon could you please link to the commit or PR that fixes it?

@armon
Member

armon commented Feb 18, 2015

@tgwizard Here it is: d92e5d3

It was actually a revert of the commit that caused the regression.

@tgwizard
Contributor

Ok, thanks!

@dpkirchner

We're running a master build now and the problem has gone away. Sweet.

@cglewis

cglewis commented Apr 16, 2015

I'm still experiencing this issue with 0.5.0; any other updates on a fix?

@armon
Member

armon commented Apr 17, 2015

@cglewis Can you provide more information? This should be a fixed issue as far as we know.

@cglewis

cglewis commented Apr 20, 2015

@armon I'm running on Ubuntu 14.04.1 with consul 0.5.0 and seeing the same error message as at the top of this issue. What other information would be helpful to you? For now I just increased the allowed open file count and rebooted the box.

@ryanuber
Member

To make sure I understand this: the standing issue is that Consul floods the logs when file handles are exhausted, not that Consul itself is leaking resources, correct? It sounds like @armon has fixed @dpkirchner's resource leak, but if any program exhausts file handles, Consul fills the disk with logs.

@cglewis are you just redirecting consul's log output to a file? If you use a syslog server, there is most often a coalescer built in (like rsyslog's message reduction).

@cglewis

cglewis commented Apr 20, 2015

@ryanuber I only have two things running on that machine; the other one has been running fine for months without file handle issues, and only after adding the consul agent did this issue start. I don't care as much about the logs filling up, since that can easily be filtered; I was more concerned that Consul seems to be the culprit for the resource leak, though I can't verify that specifically since I haven't isolated it yet.

@armon
Member

armon commented Apr 23, 2015

@cglewis How large is your cluster? There are a number of fixes in master now for excessive socket usage on large clusters; 0.5.1 should improve that situation. There is definitely an issue of excessive logging as well when file handles are exhausted, but the answer there for now is to crank ulimit way up.
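
For anyone curious what that limit looks like from inside a process, here is a small sketch (assuming a Linux host and only Go's standard syscall package, nothing Consul-specific) that reads the open-file limit and raises the soft limit to the hard limit. Operationally you would normally raise it with ulimit -n or in the service's init/systemd configuration before starting Consul.

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var lim syscall.Rlimit
    // Read the current open-file (RLIMIT_NOFILE) limits for this process.
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        panic(err)
    }
    fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)

    // Raise the soft limit up to the hard limit; going beyond the hard
    // limit requires privileges (root / CAP_SYS_RESOURCE).
    lim.Cur = lim.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        panic(err)
    }
}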

@ChrisMcKenzie

I am still seeing this on 0.5.2?

@ryanuber
Member

@ChrisMcKenzie are you still experiencing this? The issue of aggressive logging when file handles are exhausted still exists, and it's a difficult situation to detect from inside the application. The solution for now is to set high limits with ulimit, or to use a syslog server such as rsyslog and leverage its log-throttling abilities. The latest release should contain the fixes mentioned above by @armon, so this should be less of a problem now.

@samprakos

I am seeing this issue this morning. I have 3 consul servers running 0.6.0 and my agents are still on 0.5.2. My consul logs (see below) are full of "too many open files" errors and my disk space is exhausted. The consul process is the one with all the files open (if I ls /proc/<pid>/fd, I see 1024 entries, which is what my ulimit is).

Any ideas? Is it the agent/server version mismatch? This happened on 2 servers today.

Logs from consul.log:

==> Log data will now stream in as it occurs:

2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-143 10.2.4.143
2015/12/14 21:22:33 [ERR] agent: failed to sync remote state: No known Consul servers
2015/12/14 21:22:33 [INFO] agent: Joining cluster...
2015/12/14 21:22:33 [INFO] agent: (LAN) joining: [10.2.8.10 10.2.8.12 10.2.8.11]
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-11-113 10.2.11.113
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-24 10.2.4.24
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-10 10.2.8.10
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-149 10.2.3.149
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-189 10.2.4.189
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-11 10.2.8.11
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-79 10.2.3.79
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-12 10.2.8.12
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-2-22 10.2.2.22
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-5-21 10.2.5.21
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-43 10.2.3.43
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-5-154 10.2.5.154
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-10 (Addr: 10.2.8.10:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-11 (Addr: 10.2.8.11:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-12 (Addr: 10.2.8.12:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] agent: (LAN) joined: 3 Err: <nil>
2015/12/14 21:22:33 [INFO] agent: Join completed. Synced with 3 initial agents
2015/12/14 21:22:47 [INFO] serf: EventMemberJoin: ip-10-2-4-103 10.2.4.103

==> Newer Consul version available: 0.6.0
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-authorizations'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-clearing'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-authorizations'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-encompass'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-expire-auths'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-check-limits'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-clearing'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-encompass'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-expire-auths'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-check-limits'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:26 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:26 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:26 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:36 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:36 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:36 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:46 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:46 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:47 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 23:39:47 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:42:27 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:49:07 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:59:18 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/15 00:00:28 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:03:08 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:04:18 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:05:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/15 00:07:08 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:08:18 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:08:58 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:09:58 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:10:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:10:58 [ERR] memberlist: Push/Pull with ip-10-2-11-113 failed: dial tcp 10.2.11.113:8301: too many open files
2015/12/15 00:11:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:11:34 [ERR] memberlist: Error accepting TCP connection: accept tcp 10.2.4.143:8301: too many open files
2015/12/15 00:11:34 [ERR] memberlist: Error accepting TCP connection: accept tcp 10.2.4.143:8301: too many open files

@slackpad
Contributor

Hi @samprakos, the mixed Consul versions should be fine. Can you run lsof -p <consul agent pid> on one of the boxes that's out of descriptors so we can see what kinds of things are using up the file descriptors?

@samprakos

Sure...here is the output from that command (attached):

lsof.txt

@slackpad
Contributor

Hmm - these are very weird: TCP localhost:8500->localhost:38851 (ESTABLISHED) - is there an app on there that might be leaking connections to the Consul agent? If you run lsof -i tcp:38851 it might list the PID of the process on the other end of that connection, since it's to a process on the same host.

@samprakos

Ah, great suggestion. I have tied the large number of open files to a particular process: a Go program that monitors the health of all services on the consul agent, using the consul Go API. Unfortunately, I can't see where in the API we would call Close() or similar.

Here is our code:

...
    client, err := getClient(consulAddress)

    if err != nil {
        return checks, err
    }

    agentChecks, err := client.Agent().Checks()

    if err != nil {
        return checks, err
    }
...

func getClient(url string) (*api.Client, error) {
    config := api.DefaultConfig()
    config.Address = url

    return api.NewClient(config)
}

@slackpad
Contributor

Ideally, I think you'd make getClient() return an existing API client instance (or make a new one but re-use the same transport object for all of them) so that they make use of the connection that's kept alive.

Another solution would be to modify getClient to set config.HttpClient to an object you create in the same fashion as https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go, except that you'd add DisableKeepAlives: true onto the transport. That should prevent it from holding these connections open.

There's a CloseIdleConnections() method on the transport, but you can't get at that from the client object, so I don't think you'll be able to close it that way.
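
As a rough sketch of that second suggestion (assuming the Consul Go API of that era, where api.Config exposes an HttpClient field, and a getClient helper shaped like the one above): build one http.Client whose transport mirrors go-cleanhttp's defaults but with keep-alives disabled, and hand it to the API client so requests don't leave idle connections behind.

package main

import (
    "fmt"
    "net"
    "net/http"
    "time"

    "github.com/hashicorp/consul/api"
)

func getClient(addr string) (*api.Client, error) {
    config := api.DefaultConfig()
    config.Address = addr

    // Roughly the same shape as go-cleanhttp's transport, with keep-alives
    // disabled so each request's connection is closed instead of idling.
    config.HttpClient = &http.Client{
        Transport: &http.Transport{
            Proxy: http.ProxyFromEnvironment,
            Dial: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
            }).Dial,
            TLSHandshakeTimeout: 10 * time.Second,
            DisableKeepAlives:   true,
        },
    }

    return api.NewClient(config)
}

func main() {
    client, err := getClient("127.0.0.1:8500")
    if err != nil {
        panic(err)
    }
    checks, err := client.Agent().Checks()
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d checks registered\n", len(checks))
}

The first option (constructing a single *api.Client once and reusing it for every call) is usually simpler still, since it keeps one pooled connection alive instead of opening and closing a connection per request.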

@samprakos

Thanks, that gives us something to go on. Setting DisableKeepAlives: true didn't solve the issue for some reason, so we're still working through it.

@slackpad
Contributor

@samprakos I didn't realize this was already open, but this is probably a good fix: #1499. I'll do a little more digging and probably merge it tonight; it would let your existing code work as-is.

@samprakos

Good news...I'll follow #1499

@slackpad
Contributor

@samprakos Even better, this is being fixed in the upstream library in hashicorp/go-cleanhttp#2. Once that's done you should be able to just update go-cleanhttp and be good to go with no changes to your code.
