
too many open files - and consul fills up log #688

Closed
balexx opened this issue Feb 10, 2015 · 24 comments

@balexx

balexx commented Feb 10, 2015

Whenever a system reaches a situation with "too many open files" (which can happen very fast in some scenarios), consul starts throwing these errors into the logfile:

2015/02/09 21:25:42 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:8301: too many open files in system

The errors repeat tens (hundreds?) of times a second, filling up the filesystem before anyone can react. It would really be useful if repeated messages were aggregated, which would avoid this problem.
Alternatively, logging this at a much lower frequency would also help.
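
For illustration, here is a rough Go sketch of the kind of aggregation being asked for (entirely hypothetical, not Consul's actual logging code): emit a message the first time it appears, count identical repeats, and flush a "repeated N times" summary instead of writing every occurrence.

package main

import (
    "fmt"
    "log"
    "time"
)

// dedupLogger collapses runs of identical messages into one line plus a
// repeat count, instead of writing every occurrence to the log.
type dedupLogger struct {
    last     string
    repeats  int
    lastEmit time.Time
}

// Logf emits a message the first time it is seen, suppresses identical
// messages for the given interval, and prints a summary when the run ends.
func (d *dedupLogger) Logf(interval time.Duration, format string, args ...interface{}) {
    msg := fmt.Sprintf(format, args...)
    now := time.Now()
    if msg == d.last && now.Sub(d.lastEmit) < interval {
        d.repeats++
        return
    }
    if d.repeats > 0 {
        log.Printf("last message repeated %d times", d.repeats)
    }
    log.Print(msg)
    d.last, d.repeats, d.lastEmit = msg, 0, now
}

func main() {
    var d dedupLogger
    for i := 0; i < 1000; i++ {
        d.Logf(time.Minute, "[ERR] memberlist: Error accepting TCP connection: too many open files in system")
    }
    d.Logf(time.Minute, "shutting down") // flushes the "repeated 999 times" summary
}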

@dpkirchner

In addition, consul does not (cannot?) pick a new leader if the leader is the one that is out of file descriptors.

In our scenario we have some 50 agents, some of which are connected to the leader once, others 100 times. We haven't figured out why yet.

@armon
Member

armon commented Feb 17, 2015

@balexx @dpkirchner There was a regression that caused a lot of connections to be maintained, which is now fixed in master for the 0.5 release. Sorry about that!

@armon closed this as completed Feb 17, 2015
@tgwizard
Contributor

We just had the same issue. Great that it's fixed! @armon could you please link to the commit or PR that fixes it?

@armon
Member

armon commented Feb 18, 2015

@tgwizard Here it is: d92e5d3

It was actually a revert of the commit that caused the regression.

@tgwizard
Contributor

Ok, thanks!

@dpkirchner

We're running a master build now and the problem has gone away. Sweet.

@cglewis

cglewis commented Apr 16, 2015

I'm still experiencing this issue with 0.5.0; any other updates on a fix?

@armon
Member

armon commented Apr 17, 2015

@cglewis Can you provide more information? This should be a fixed issue as far as we know.

@cglewis

cglewis commented Apr 20, 2015

@armon I'm running on Ubuntu 14.04.1 with consul 0.5.0 and seeing the same error message as at the top of this issue. What other information would be helpful to you? For now I just increased the allowed open file count and rebooted the box.

@ryanuber
Member

To make sure I understand this: the standing issue is that Consul floods the logs when file handles are exhausted, not that Consul itself is leaking resources, correct? It sounds like @armon has fixed @dpkirchner's resource leak, but if any program exhausts file handles, Consul fills the disk with logs.

@cglewis are you just redirecting consul's log output to a file? If you use a syslog server, there is most often a coalescer built in (like rsyslog's message reduction).

@cglewis

cglewis commented Apr 20, 2015

@ryanuber I only have two things running on that machine; the other one has been running fine for months without file handle issues, and only after adding the consul agent did this issue start. I don't care as much about the logs filling up, since that can easily be filtered; I was more concerned that Consul seems to be the culprit for the resource leak, though I can't verify that specifically since I haven't isolated it yet.

@armon
Member

armon commented Apr 23, 2015

@cglewis How large is your cluster? There are a number of fixes in master now for excessive socket usage on large clusters; 0.5.1 should improve that situation. There is definitely an issue of excessive logging as well when file handles are exhausted, but the answer there for now is to crank ulimit way up.
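
For anyone curious what that limit looks like from inside a process, here is a small sketch (assuming a Linux host and only Go's standard syscall package, nothing Consul-specific) that reads the open-file limit and raises the soft limit to the hard limit. Operationally you would normally raise it with ulimit -n or in the service's init/systemd configuration before starting Consul.

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var lim syscall.Rlimit
    // Read the current open-file (RLIMIT_NOFILE) limits for this process.
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        panic(err)
    }
    fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)

    // Raise the soft limit up to the hard limit; going beyond the hard
    // limit requires privileges (root / CAP_SYS_RESOURCE).
    lim.Cur = lim.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        panic(err)
    }
}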

@ChrisMcKenzie

I am still seeing this on 0.5.2?

@ryanuber
Member

@ChrisMcKenzie are you still experiencing this? The issue of aggressive logging when file handles are exhausted still exists, and it's a difficult situation to detect from inside the application. The solution for now is to set high limits with ulimit, or to use a syslog server such as rsyslog and leverage its log-throttling abilities. The latest release should contain the fixes mentioned above by @armon, so this should be less of a problem now.

@samprakos

I am seeing this issue this morning. I have 3 consul servers running 0.6.0 and my agents are still on 0.5.2. My consul logs (see below) are full of "too many open files" errors and my disk space is exhausted. The consul process is the one with all the files open (if I ls /proc/<pid>/fd, I see 1024 entries, which is what my ulimit is).

Any ideas? Is it the agent/server version mismatch? This happened on 2 servers today.

Logs from consul.log:

==> Log data will now stream in as it occurs:

2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-143 10.2.4.143
2015/12/14 21:22:33 [ERR] agent: failed to sync remote state: No known Consul servers
2015/12/14 21:22:33 [INFO] agent: Joining cluster...
2015/12/14 21:22:33 [INFO] agent: (LAN) joining: [10.2.8.10 10.2.8.12 10.2.8.11]
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-11-113 10.2.11.113
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-24 10.2.4.24
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-10 10.2.8.10
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-149 10.2.3.149
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-4-189 10.2.4.189
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-11 10.2.8.11
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-79 10.2.3.79
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-8-12 10.2.8.12
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-2-22 10.2.2.22
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-5-21 10.2.5.21
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-3-43 10.2.3.43
2015/12/14 21:22:33 [INFO] serf: EventMemberJoin: ip-10-2-5-154 10.2.5.154
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-10 (Addr: 10.2.8.10:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-11 (Addr: 10.2.8.11:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] consul: adding server ip-10-2-8-12 (Addr: 10.2.8.12:8300) (DC: us-west-2)
2015/12/14 21:22:33 [INFO] agent: (LAN) joined: 3 Err: <nil>
2015/12/14 21:22:33 [INFO] agent: Join completed. Synced with 3 initial agents
2015/12/14 21:22:47 [INFO] serf: EventMemberJoin: ip-10-2-4-103 10.2.4.103

==> Newer Consul version available: 0.6.0
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-authorizations'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-job-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-clearing'
2015/12/14 21:24:14 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-authorizations'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-encompass'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-expire-auths'
2015/12/14 21:24:14 [INFO] agent: Synced check 'omaha-job-check-limits'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-clearing'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-encompass'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-expire-auths'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-job-check-limits'
2015/12/14 21:24:15 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:26 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:26 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:26 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:36 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:36 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:36 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:46 [INFO] agent: Synced service 'omaha-jobserver-monitor-ip-10-2-4-143'
2015/12/14 21:24:46 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 21:24:47 [INFO] agent: Synced check 'omaha-jobserver-monitor-jobserver-monitor'
2015/12/14 23:39:47 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:42:27 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:49:07 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/14 23:59:18 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/15 00:00:28 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:03:08 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:04:18 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:05:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: read tcp 10.2.4.24:8301: i/o timeout
2015/12/15 00:07:08 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:08:18 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:08:58 [ERR] memberlist: Push/Pull with ip-10-2-4-189 failed: read tcp 10.2.4.189:8301: i/o timeout
2015/12/15 00:09:58 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:10:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:10:58 [ERR] memberlist: Push/Pull with ip-10-2-11-113 failed: dial tcp 10.2.11.113:8301: too many open files
2015/12/15 00:11:28 [ERR] memberlist: Push/Pull with ip-10-2-4-24 failed: dial tcp 10.2.4.24:8301: too many open files
2015/12/15 00:11:34 [ERR] memberlist: Error accepting TCP connection: accept tcp 10.2.4.143:8301: too many open files
2015/12/15 00:11:34 [ERR] memberlist: Error accepting TCP connection: accept tcp 10.2.4.143:8301: too many open files

@slackpad
Contributor

Hi @samprakos, the mixed Consul versions should be fine. Can you run lsof -p <consul agent pid> on one of the boxes that's out of descriptors so we can see what kinds of things are using up the file descriptors?

@samprakos

Sure...here is the output from that command (attached):

lsof.txt

@slackpad
Contributor

Hmm - these are very weird: TCP localhost:8500->localhost:38851 (ESTABLISHED) - is there an app on there that might be leaking connections to the Consul agent? If you run lsof -i tcp:38851 it might list the PID of the process on the other end of that connection, since it's to a process on the same host.

@samprakos

Ah, great suggestion. I have tied the large number of open files to a particular process: a Go program that monitors the health of all services on the consul agent, using the consul Go API. Unfortunately, I can't see where in the API we would call Close() or similar.

Here is our code:

...
    client, err := getClient(consulAddress)

    if err != nil {
        return checks, err
    }

    agentChecks, err := client.Agent().Checks()

    if err != nil {
        return checks, err
    }
...

func getClient(url string) (*api.Client, error) {
    config := api.DefaultConfig()
    config.Address = url

    return api.NewClient(config)
}

@slackpad
Contributor

Ideally, I think you'd make getClient() return an existing API client instance (or make a new one but re-use the same transport object for all of them) so that they make use of the connection that's kept alive.

Another solution would be to modify getClient to set config.HttpClient to an object you create in the same fashion as https://github.com/hashicorp/go-cleanhttp/blob/master/cleanhttp.go, except that you'd add DisableKeepAlives: true onto the transport. That should prevent it from holding these connections open.

There's a CloseIdleConnections() method on the transport, but you can't get at that from the client object, so I don't think you'll be able to close it that way.
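
As a rough sketch of that second suggestion (assuming the Consul Go API of that era, where api.Config exposes an HttpClient field, and a getClient helper shaped like the one above): build one http.Client whose transport mirrors go-cleanhttp's defaults but with keep-alives disabled, and hand it to the API client so requests don't leave idle connections behind.

package main

import (
    "fmt"
    "net"
    "net/http"
    "time"

    "github.com/hashicorp/consul/api"
)

func getClient(addr string) (*api.Client, error) {
    config := api.DefaultConfig()
    config.Address = addr

    // Roughly the same shape as go-cleanhttp's transport, with keep-alives
    // disabled so each request's connection is closed instead of idling.
    config.HttpClient = &http.Client{
        Transport: &http.Transport{
            Proxy: http.ProxyFromEnvironment,
            Dial: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
            }).Dial,
            TLSHandshakeTimeout: 10 * time.Second,
            DisableKeepAlives:   true,
        },
    }

    return api.NewClient(config)
}

func main() {
    client, err := getClient("127.0.0.1:8500")
    if err != nil {
        panic(err)
    }
    checks, err := client.Agent().Checks()
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d checks registered\n", len(checks))
}

The first option (constructing a single *api.Client once and reusing it for every call) is usually simpler still, since it keeps one pooled connection alive instead of opening and closing a connection per request.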

@samprakos

Thanks, that gives us something to go on. Setting DisableKeepAlives: true didn't solve the issue for some reason, so we're still working through it.

@slackpad
Contributor

@samprakos I didn't realize this was already open, but this is probably a good fix: #1499. I'll do a little more digging and probably merge it tonight; it would let your existing code work as-is.

@samprakos

Good news...I'll follow #1499

@slackpad
Contributor

@samprakos Even better, this is being fixed in the upstream library in hashicorp/go-cleanhttp#2. Once that's done you should be able to just update go-cleanhttp and be good to go with no changes to your code.
