Get metrics regarding opened file handlers #853

Closed
ssbarnea opened this issue Mar 6, 2014 · 26 comments

@ssbarnea

ssbarnea commented Mar 6, 2014

Another common problem with systems is the number of open files.
Datadog should provide metrics on their use and, more importantly, a single metric that measures % usage, allowing us to add alerts when usage goes above, say, 80%.

The raw count is not very useful by itself, but measured against the maximum value (which is configurable) it becomes much more valuable.

http://www.cyberciti.biz/tips/linux-procfs-file-descriptors.html

@clutchski
Contributor

This is a good idea. Thanks very much.

@remh
Contributor

remh commented Mar 6, 2014

The name is confusing but it already exists in the process check:

https://github.com/DataDog/dd-agent/blob/master/checks.d/process.py#L13

@remh remh closed this as completed Mar 6, 2014
@remh
Contributor

remh commented Mar 7, 2014

Sorry, I misread your use case of getting it as a percentage.
Reopening it.

@remh remh reopened this Mar 7, 2014
@clutchski
Contributor

I think we also want the system limit, not just handles open per process.

@ssbarnea
Author

ssbarnea commented Mar 7, 2014

@clutchski you are right. Now we may have another problem: there is a system limit and a limit per process, and we will probably need both. From my experience you mostly want to monitor file handles for the monitored processes, as these are the ones that surprise you from time to time.

Also, I have no idea how to enable these metrics. I checked the process.yaml file and it only contains information on how to monitor different processes, not on how to enable these metrics (I did try to search for them in the web UI and they are not there).

And regarding documentation, the best way to improve it is to improve the yaml templates and include all supported parameters in them. If something is too hard or complex to explain in the yaml file, you can always put a URL to a knowledge base article :)

@ssbarnea
Author

ssbarnea commented Mar 7, 2014

Just discovered that psutil was not installed. Should I open another bug as "the installer does not try to install psutil by default"?

I installed psutil, but what do I need to do now? Do I need to restart dd-agent, or change something in the config? I wasn't able to see any error related to psutil in the dd-agent logs.

@remh
Contributor

remh commented Mar 7, 2014

@ssbarnea

We don't bundle check dependencies with the agent to avoid conflicts with existing versions on the user's system.

But we are working towards a self-contained agent that would install these dependencies, so there is no need to open another bug for that.

Regarding the process check: it currently doesn't collect the system limit, but it does collect the number of open file descriptors for your watched processes.

Could you get in touch with [email protected] for help configuring the check?
Thanks
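
For illustration, a minimal sketch (not the actual process check code) of how psutil can sum open file descriptors for a watched process; the process name "nginx" and the helper name open_fds_for are just examples:

import psutil

def open_fds_for(process_name):
    # Sum open file descriptors across all processes matching a name.
    # num_fds() is Unix-only and may raise AccessDenied for other users' processes.
    total = 0
    for proc in psutil.process_iter():
        try:
            if proc.name() == process_name:
                total += proc.num_fds()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return total

print(open_fds_for('nginx'))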

@remh remh added this to the 4.3.x milestone Mar 7, 2014
@ssbarnea
Author

I will contact support, they are really good and also quick :)

Now, just as a customer experience note: I find it annoying that by default only ⅓ of the functionality is available simply because the required libraries are not installed. I hope the next installer will try to install them one way or another; I don't care how. The second annoyance is that the default .yaml files are far from extensive enough. I think you should make a rule of updating them with all available options, so they serve as the primary source of documentation. Most Linux tools ship config files with commented-out options, and most of the time that is all you need to configure the product. That's what I call self-documented. Thanks.

Also, it would be great to build a list of metrics with a description for each one, so we would know which ones we want to track and exactly what each metric means; sometimes the name is not explicit enough and you may not know the range of values it can take, its unit of measure, and so on.

@remh remh modified the milestones: 5.1.0, 4.3.x May 8, 2014
@remh remh modified the milestones: Future, 5.1.0 Sep 26, 2014
@ssbarnea
Author

Please do try to install psutil when installing the agent, otherwise you are just providing a bad user experience. It is ok to ignore a failure, but an apt-get install of the psutil package would be a great UX improvement.

@remh
Contributor

remh commented Apr 13, 2015

Thanks for the feedback @ssbarnea

As of Agent 5.0.0, psutil is bundled in the deb, rpm and msi packages of the agent, and is installed on the fly with source installs.

$ /opt/datadog-agent/embedded/bin/python -c "import psutil; print psutil.__version__"
2.2.1

We will work on this issue to implement the count of open file handles, as it's an important metric, but feel free to open a pull request if you've already implemented it!

Thanks again for the feedback!

@remh remh modified the milestones: 5.4.0, Future Apr 13, 2015
@ssbarnea
Author

I am quite busy fixing other broken things at the moment, but rest assured that if I implement something in Datadog I will make pull requests. I prefer not to run my own patched versions.

I had an outage due to file handles being exhausted for one of the monitored processes (nginx), and it took me some time to find the cause.

So if Datadog could monitor the % of file handles it would be perfect, as we could have a single rule: if % open files (curr/max) is over 90%, raise an alarm.

I like having relative conditions, as they are much easier to manage and you do not have to update the monitors when you tune the configuration on the server side.

@remh remh modified the milestones: 5.5.0, 5.4.0 May 11, 2015
@remh remh modified the milestones: Contribution needed, 5.5.0 Jul 30, 2015
@remh
Contributor

remh commented Jan 4, 2016

Looks like we could get that from /proc/sys/fs/file-nr

@ssbarnea what do you think ?

@remh remh modified the milestones: 5.8.0, Contribution needed Jan 6, 2016
@ssbarnea
Author

This doesn't seem to fix the issue: we need to be able to read the number of file descriptors per user, and this returns the same result for every user.

@remh
Contributor

remh commented Jan 20, 2016

Thanks for the feedback @ssbarnea

It's not possible to get the number of open FDs per user without root access.
Getting the number of open FDs per user could also generate hundreds of different timeseries for a use case that's not entirely clear. It would also be much slower than just reading /proc/sys/fs/file-nr, as it would have to go through all running PIDs (that's basically what lsof does).

Reading the number of open FDs and the limit from /proc/sys/fs/file-nr, on the other hand, is pretty straightforward, fast to execute, doesn't require root access, and would give you enough visibility to detect FD leaks.

So that's likely the way we will go. What do you think?
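
For reference, a minimal sketch of what reading /proc/sys/fs/file-nr could look like, assuming a Linux host; the file holds the allocated handle count, the allocated-but-unused count, and the maximum:

def file_handle_usage_percent():
    # /proc/sys/fs/file-nr: "<allocated> <unused> <maximum>"
    with open('/proc/sys/fs/file-nr') as f:
        allocated, unused, maximum = (int(x) for x in f.read().split())
    in_use = allocated - unused
    return 100.0 * in_use / maximum

print('%.2f%% of system file handles in use' % file_handle_usage_percent())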

@ssbarnea
Author

We are running ~6 serious JVM applications on the same bare-metal machine, each of them under its own username, and they all have custom ulimits. We have never run out of file handles for the system itself, but every 3-4 months we have an issue related to them, caused either by a bug or by normal usage growth.

If we monitored only the global number of file handles, we would not be able to spot who is generating the problem.

As a workaround I could set up the same limits for all applications, at 90% of the total system limit for each, and monitor only the total values.

I do agree that under no circumstances should we count all FDs for each PID.

Needing root access is not a problem from my point of view; proper monitoring almost always requires root access. There are ways to secure this; allowing the datadog user to run a specific command as root could be one option.
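
To illustrate the per-process idea, a minimal sketch (assuming Linux, permission to read /proc/<pid>/, and that the limit is not "unlimited") comparing a process's open FD count to its own soft limit:

import os

def fd_usage_percent(pid):
    # The number of open FDs is the number of entries in /proc/<pid>/fd.
    open_fds = len(os.listdir('/proc/%d/fd' % pid))
    # The per-process soft limit is the fourth column of the "Max open files" line.
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                soft_limit = int(line.split()[3])
                return 100.0 * open_fds / soft_limit
    return None

print(fd_usage_percent(os.getpid()))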

@ssbarnea
Author

I hope someone from Datadog will merge DataDog/ansible-datadog#13, which is needed for this bug.

@remh
Contributor

remh commented Jan 29, 2016

@ssbarnea thanks for the feedback
We closed DataDog/ansible-datadog#13 as psutil is already included in the agent.

One way to do that would be for you to grant dd-agent access to lsof in the sudoers file.

Then we could have the process check call lsof on the PIDs it finds.
Would that work?
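
Roughly, as a sketch only (the sudoers rule and lsof path below are examples, not shipped agent configuration):

# Example sudoers entry (illustrative only):
#   dd-agent ALL=(ALL) NOPASSWD: /usr/bin/lsof
import subprocess

def open_file_count(pid):
    # Each non-header line of `lsof -p <pid>` is one open file entry;
    # this includes mapped files, so it's an approximation of open FDs.
    out = subprocess.check_output(['sudo', '/usr/bin/lsof', '-p', str(pid)])
    return max(len(out.splitlines()) - 1, 0)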

@ssbarnea
Author

Yes, in combination with the correct configuration of lsof via Ansible, this would work. Thanks!

@alexef

alexef commented Aug 24, 2016

Just for the record, and to save others time searching for it: currently Datadog only supports open file handles per process. At the system level, system.processes.open_file_descriptors is just the sum of the monitored processes' open FDs.

@lalarsson

@alexef Thanks for the information.

Is there any possibility that the total open file handles per system will be included in the future?

@tomstockton

+1 - @remh what happened to monitoring the relevant values in /proc/sys/fs/file-nr? This would be really useful. I could write a custom check but I feel like it should be a core metric.

@jippi
Contributor

jippi commented Dec 1, 2016

👍 the aggregate of /proc/sys/fs/file-nr would be super useful!

@remh remh reopened this Jan 25, 2017
@remh
Contributor

remh commented Jan 25, 2017

Reopening: we can indeed add the content of /proc/sys/fs/file-nr, although it's not as precise.

@abeluck

abeluck commented Nov 9, 2017

Any movement on this issue? This is quite an important metric for us.

@pdecat

pdecat commented Nov 9, 2017

@abeluck, I've got a PR at DataDog/integrations-core#715, but some changes were requested before it can be merged. I don't have time to implement them right now, though.

FWIW, we have been using this patch as-is since August.

@olivielpeau
Member

On Linux, the Agent now reports, by default, the total number of open file handles as a fraction of the system limit (the metric system.fs.file_handles.in_use). The value is collected from /proc/sys/fs/file-nr.
So I'll go ahead and close this issue.

Root permissions are needed to collect per-process values. @pdecat's PR at DataDog/integrations-core#1235 implements a way to grab these metrics without making the whole Agent run as root.
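
With this metric available, the relative alert requested earlier in the thread can be expressed directly on the fraction. An illustrative monitor query (the scope and time window are just examples, and 0.9 matches the 90% rule mentioned above) would be:

avg(last_5m):avg:system.fs.file_handles.in_use{*} > 0.9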
