
Cgroup plugin support for aggregate metrics across multiple child cgroups #2945

Closed
oplehto opened this issue Jun 21, 2017 · 9 comments

oplehto (Contributor) commented Jun 21, 2017

Feature Request

Proposal:

Add the capability for the cgroup input plugin to collect aggregate metrics based on raw values and derived metrics from child cgroups. This would be particularly useful in situations where large numbers of cgroups are created dynamically.

Current behavior:

At the moment the cgroup input plugin can glob child cgroups, but they are collected individually and sent as separate metrics.

Desired behavior:

It should also be possible to aggregate the data collected from the globbed child cgroups into a single metric.

The aggregation rules should support taking the sum/avg/min/max of a specific value, or of a value defined by a formula (for example: max(ValueA / (ValueB + ValueC + ValueD))), across all the child cgroups. See the use case below.

Use case:

A good example is a compute cluster running a batch job queuing system such as Univa Grid Engine or SLURM. The scheduler constantly creates and destroys cgroups, identified by their job ID. In a busy cluster with millions of jobs per month, this causes a cardinality explosion if everything is collected individually.

However, a lot of useful information can be obtained just by monitoring the proposed aggregates, for example to identify resource exhaustion or overcommitment:

  • What is the sum of the memory limits of the child cgroups on a server: sum(memory.memsw.limit_in_bytes)
  • What is the maximum amount of memory used by an individual child cgroup: max(inactive_anon + active_anon + unevictable)
  • What is the highest memory utilization of an individual child cgroup relative to its limit: max((inactive_anon + active_anon + unevictable) / memory.memsw.limit_in_bytes)

The final example demonstrates the need to be able to base the aggregations on a formula, as the cgroup data itself does not always contain the raw value that we'd like to obtain.
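
To make the idea concrete, here is a hypothetical configuration sketch. The aggregate_children and aggregate_formula options do not exist in the cgroup input plugin; they are shown purely to illustrate the proposal (paths and files are existing options):

[[inputs.cgroup]]
  ## Glob matching the dynamically created child cgroups (existing option)
  paths = ["/cgroup/memory/UGE/*"]
  files = ["memory.memsw.limit_in_bytes", "memory.stat"]
  ## Hypothetical: simple aggregates of each collected value across all matched cgroups
  aggregate_children = ["SUM", "MAX"]
  ## Hypothetical: a formula evaluated per child cgroup, then reduced across all of them
  aggregate_formula = "max((inactive_anon + active_anon + unevictable) / memory.memsw.limit_in_bytes)"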

danielnelson (Contributor):

Can you add an example of how the dynamic cgroups are named? Perhaps we can reduce the cardinality in some way and still allow the functions to be run at query time; if possible, this would be much more flexible for querying.

oplehto (Contributor, Author) commented Jun 22, 2017

In Univa Grid Engine (UGE), the cgroups are named /cgroup/<subsystem>/UGE/<jobid>.<taskid>, where the jobid and taskid are integers. One jobid may have one or several taskids. It should not be assumed that these IDs are ordered in any way, as the scheduler may opt to run more recent tasks (with larger jobids) before older tasks.

In UGE a typical pattern would be:

/cgroup/memory/UGE/326476.380/
/cgroup/memory/UGE/326472.132/
/cgroup/memory/UGE/323472.222/
...

In a typical HPC cluster the number of simultaneous cgroups on a server varies constantly as tasks stop and start. It's not uncommon to have tasks which only run for a few seconds.

Typically the number of simultaneous cgroups on a server maxes out at its core count. However, there are scenarios where servers are intentionally oversubscribed, so there could be hundreds of these cgroups.

Other products in this space tend to follow a similar pattern, with variations. For example SLURM, another popular batch scheduler, has a deeper hierarchy: https://slurm.schedmd.com/cgroups.html
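
For reference, this is roughly how the existing plugin would be pointed at these dynamically named cgroups today; each matched directory is collected as its own metric with its own path tag, which is exactly what causes the cardinality problem (the file list below is illustrative):

[[inputs.cgroup]]
  ## The glob matches every jobid.taskid directory created by UGE
  paths = ["/cgroup/memory/UGE/*"]
  files = ["memory.limit_in_bytes", "memory.stat"]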

vlasad (Contributor) commented Jun 26, 2017

@oplehto Could you provide your vision (examples) of the plugin's configuration for these cases:

What is the sum of the memory limits of the child cgroups on a server: sum(memory.memsw.limit_in_bytes)
What is the maximum amount of memory used by an individual child cgroup: max(inactive_anon + active_anon + unevictable)
What is the highest memory utilization of an individual child cgroup relative to its limit: max((inactive_anon + active_anon + unevictable) / memory.memsw.limit_in_bytes)

danielnelson (Contributor):

I recommend using Kapacitor for this. Telegraf aggregators, including some new ones in the works, are not flexible enough to handle this, because each field needs to be aggregated in a specific way.

oplehto (Contributor, Author) commented Jul 5, 2017

We have discussed this internally, and in this case we really want to do this on each node, not centrally and then write it back with Kapacitor; the computation per node isn't much, but across all nodes it becomes really significant. We are running this at a scale where it starts to cause us real problems (ignoring the challenge of computing these centrally, we are at the write limits of InfluxDB even without having to write metrics in twice!).

Other Telegraf plugins already, in effect, calculate rates or do other simple transforms to make their data useful to query at the source. The minimum viable version that would satisfy a lot of use cases would be just to have simple aggregations implemented.

In this case the syntax could simply be something like:

[[inputs.cgroup]]
  paths = ["/cgroup/UGE/*"]
  aggregate_children = ["SUM"]
  files = ["memory.limit_in_bytes"]

This would produce something like:

cgroup,sr=grid_compute,dc=cara,bu=jcs,env=grid,cls=server,host=cara-compute0698,path=/cgroup/memory/UGE/* sum_memory.limit_in_bytes=120504320i,aggregate_children=10i 1499262610000000000

Here aggregate_children is the number of subdirectories that were used, and sum_memory.limit_in_bytes is the sum of that value across all the subdirectories. Just having a palette of (SUM, MEAN, MIN, MAX) would go far.
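
As a purely hypothetical extension of the same example, requesting two functions, e.g. aggregate_children = ["SUM", "MAX"], might emit one field per function (the max value below is made up for illustration, and tags are shortened for readability):

cgroup,host=cara-compute0698,path=/cgroup/memory/UGE/* sum_memory.limit_in_bytes=120504320i,max_memory.limit_in_bytes=33554432i,aggregate_children=10i 1499262610000000000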

That wildcard in the path field is a bit dodgy; if anyone has a better idea, it would be greatly appreciated.

If we submit a patch that does this, will you consider merging it?

danielnelson (Contributor):

It should be possible to run both Telegraf and Kapacitor on the same host to avoid centralized collection.

Another way to do this might be with an aggregator like #2167; if the path tag was dropped or replaced with a shared tag, then I think it could work.
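
A rough sketch of how that could be wired up, assuming the aggregator proposed in #2167 behaves like other Telegraf plugins; the aggregator plugin name and options below are assumptions, not a confirmed interface:

[[inputs.cgroup]]
  paths = ["/cgroup/memory/UGE/*"]
  files = ["memory.limit_in_bytes"]
  ## Drop the per-job path tag so all child cgroups fall into one series
  tagexclude = ["path"]

## Hypothetical basic stats aggregator from #2167
[[aggregators.basicstats]]
  ## Aggregate everything gathered within each period
  period = "30s"
  ## Emit only the aggregates, not the individual per-cgroup metrics
  drop_original = true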

oplehto (Contributor, Author) commented Jul 6, 2017

Considering the scale of the estate and the performance-sensitive nature of the systems, I'd avoid adding another daemon to the mix just for the sake of this.

#2167 looks like a really good solution that I wasn't aware of. Thanks for pointing this out. I'm assuming it's possible to use the namepass and fieldpass filters in conjunction with this to limit the aggregation to just the measurement we are interested in?

danielnelson (Contributor):

Yes, you will need to use the regular set of filters. Also keep in mind that there is a potential deadlock with aggregators in the current version, but I hope to have this sorted out soon (#2914).
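
For illustration, the standard metric filters could be attached to the aggregator so that only the cgroup measurement and the fields of interest are aggregated (same hypothetical plugin name as in the sketch above):

[[aggregators.basicstats]]
  period = "30s"
  drop_original = true
  ## Only aggregate the cgroup measurement
  namepass = ["cgroup"]
  ## and only the fields we care about
  fieldpass = ["memory.limit_in_bytes"]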

danielnelson (Contributor):

@oplehto I'm closing this issue and will work on the basic stats aggregator in #2167. If that doesn't work out, we can reopen this later.
