Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented ceph storage metrics collector #1172

Merged
merged 1 commit into from
May 18, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions etc/telegraf.conf
Original file line number Diff line number Diff line change
Expand Up @@ -369,6 +369,24 @@
# INPUT PLUGINS #
###############################################################################

# # Collects performance metrics from the MON and OSD nodes in a Ceph storage cluster.
# [[inputs.ceph]]
# ## All configuration values are optional, defaults are shown below
#
# ## location of ceph binary
# ceph_binary = "/usr/bin/ceph"
#
# ## directory in which to look for socket files
# socket_dir = "/var/run/ceph"
#
# ## prefix of MON and OSD socket files, used to determine socket type
# mon_prefix = "ceph-mon"
# osd_prefix = "ceph-osd"
#
# ## suffix used to identify socket files
# socket_suffix = "asok"


# Read metrics about cpu usage
[[inputs.cpu]]
## Whether to report per-cpu stats or not
Expand Down
1 change: 1 addition & 0 deletions plugins/inputs/all/all.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ import (
_ "github.com/influxdata/telegraf/plugins/inputs/apache"
_ "github.com/influxdata/telegraf/plugins/inputs/bcache"
_ "github.com/influxdata/telegraf/plugins/inputs/cassandra"
_ "github.com/influxdata/telegraf/plugins/inputs/ceph"
_ "github.com/influxdata/telegraf/plugins/inputs/cloudwatch"
_ "github.com/influxdata/telegraf/plugins/inputs/couchbase"
_ "github.com/influxdata/telegraf/plugins/inputs/couchdb"
Expand Down
109 changes: 109 additions & 0 deletions plugins/inputs/ceph/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Ceph Storage Input Plugin

Collects performance metrics from the MON and OSD nodes in a Ceph storage cluster.

The plugin works by scanning the configured SocketDir for OSD and MON socket files. When it finds
a MON socket, it runs **ceph --admin-daemon $file perfcounters_dump**. For OSDs it runs **ceph --admin-daemon $file perf dump**

The resulting JSON is parsed and grouped into collections, based on top-level key. Top-level keys are
used as collection tags, and all sub-keys are flattened. For example:

```
{
"paxos": {
"refresh": 9363435,
"refresh_latency": {
"avgcount": 9363435,
"sum": 5378.794002000
}
}
}
```

Would be parsed into the following metrics, all of which would be tagged with collection=paxos:

- refresh = 9363435
- refresh_latency.avgcount: 9363435
- refresh_latency.sum: 5378.794002000


### Configuration:

```
# Collects performance metrics from the MON and OSD nodes in a Ceph storage cluster.
[[inputs.ceph]]
## All configuration values are optional, defaults are shown below

## location of ceph binary
ceph_binary = "/usr/bin/ceph"

## directory in which to look for socket files
socket_dir = "/var/run/ceph"

## prefix of MON and OSD socket files, used to determine socket type
mon_prefix = "ceph-mon"
osd_prefix = "ceph-osd"

## suffix used to identify socket files
socket_suffix = "asok"
```

### Measurements & Fields:

All fields are collected under the **ceph** measurement and stored as float64s. For a full list of fields, see the sample perf dumps in ceph_test.go.


### Tags:

All measurements will have the following tags:

- type: either 'osd' or 'mon' to indicate which type of node was queried
- id: a unique string identifier, parsed from the socket file name for the node
- collection: the top-level key under which these fields were reported. Possible values are:
- for MON nodes:
- cluster
- leveldb
- mon
- paxos
- throttle-mon_client_bytes
- throttle-mon_daemon_bytes
- throttle-msgr_dispatch_throttler-mon
- for OSD nodes:
- WBThrottle
- filestore
- leveldb
- mutex-FileJournal::completions_lock
- mutex-FileJournal::finisher_lock
- mutex-FileJournal::write_lock
- mutex-FileJournal::writeq_lock
- mutex-JOS::ApplyManager::apply_lock
- mutex-JOS::ApplyManager::com_lock
- mutex-JOS::SubmitManager::lock
- mutex-WBThrottle::lock
- objecter
- osd
- recoverystate_perf
- throttle-filestore_bytes
- throttle-filestore_ops
- throttle-msgr_dispatch_throttler-client
- throttle-msgr_dispatch_throttler-cluster
- throttle-msgr_dispatch_throttler-hb_back_server
- throttle-msgr_dispatch_throttler-hb_front_serve
- throttle-msgr_dispatch_throttler-hbclient
- throttle-msgr_dispatch_throttler-ms_objecter
- throttle-objecter_bytes
- throttle-objecter_ops
- throttle-osd_client_bytes
- throttle-osd_client_messages


### Example Output:

<pre>
telegraf -test -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -input-filter ceph
* Plugin: ceph, Collection 1
> ceph,collection=paxos, id=node-2,role=openstack,type=mon accept_timeout=0,begin=14931264,begin_bytes.avgcount=14931264,begin_bytes.sum=180309683362,begin_keys.avgcount=0,begin_keys.sum=0,begin_latency.avgcount=14931264,begin_latency.sum=9293.29589,collect=1,collect_bytes.avgcount=1,collect_bytes.sum=24,collect_keys.avgcount=1,collect_keys.sum=1,collect_latency.avgcount=1,collect_latency.sum=0.00028,collect_timeout=0,collect_uncommitted=0,commit=14931264,commit_bytes.avgcount=0,commit_bytes.sum=0,commit_keys.avgcount=0,commit_keys.sum=0,commit_latency.avgcount=0,commit_latency.sum=0,lease_ack_timeout=0,lease_timeout=0,new_pn=0,new_pn_latency.avgcount=0,new_pn_latency.sum=0,refresh=14931264,refresh_latency.avgcount=14931264,refresh_latency.sum=8706.98498,restart=4,share_state=0,share_state_bytes.avgcount=0,share_state_bytes.sum=0,share_state_keys.avgcount=0,share_state_keys.sum=0,start_leader=0,start_peon=1,store_state=14931264,store_state_bytes.avgcount=14931264,store_state_bytes.sum=353119959211,store_state_keys.avgcount=14931264,store_state_keys.sum=289807523,store_state_latency.avgcount=14931264,store_state_latency.sum=10952.835724 1462821234814535148
> ceph,collection=throttle-mon_client_bytes,id=node-2,type=mon get=1413017,get_or_fail_fail=0,get_or_fail_success=0,get_sum=71211705,max=104857600,put=1413013,put_sum=71211459,take=0,take_sum=0,val=246,wait.avgcount=0,wait.sum=0 1462821234814737219
> ceph,collection=throttle-mon_daemon_bytes,id=node-2,type=mon get=4058121,get_or_fail_fail=0,get_or_fail_success=0,get_sum=6027348117,max=419430400,put=4058121,put_sum=6027348117,take=0,take_sum=0,val=0,wait.avgcount=0,wait.sum=0 1462821234814815661
> ceph,collection=throttle-msgr_dispatch_throttler-mon,id=node-2,type=mon get=54276277,get_or_fail_fail=0,get_or_fail_success=0,get_sum=370232877040,max=104857600,put=54276277,put_sum=370232877040,take=0,take_sum=0,val=0,wait.avgcount=0,wait.sum=0 1462821234814872064
</pre>
249 changes: 249 additions & 0 deletions plugins/inputs/ceph/ceph.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,249 @@
package ceph

import (
"bytes"
"encoding/json"
"fmt"
"github.com/influxdata/telegraf"
"github.com/influxdata/telegraf/plugins/inputs"
"io/ioutil"
"log"
"os/exec"
"path/filepath"
"strings"
)

const (
measurement = "ceph"
typeMon = "monitor"
typeOsd = "osd"
osdPrefix = "ceph-osd"
monPrefix = "ceph-mon"
sockSuffix = "asok"
)

type Ceph struct {
CephBinary string
OsdPrefix string
MonPrefix string
SocketDir string
SocketSuffix string
}

func (c *Ceph) setDefaults() {
if c.CephBinary == "" {
c.CephBinary = "/usr/bin/ceph"
}

if c.OsdPrefix == "" {
c.OsdPrefix = osdPrefix
}

if c.MonPrefix == "" {
c.MonPrefix = monPrefix
}

if c.SocketDir == "" {
c.SocketDir = "/var/run/ceph"
}

if c.SocketSuffix == "" {
c.SocketSuffix = sockSuffix
}
}

func (c *Ceph) Description() string {
return "Collects performance metrics from the MON and OSD nodes in a Ceph storage cluster."
}

var sampleConfig = `
## All configuration values are optional, defaults are shown below

## location of ceph binary
ceph_binary = "/usr/bin/ceph"

## directory in which to look for socket files
socket_dir = "/var/run/ceph"

## prefix of MON and OSD socket files, used to determine socket type
mon_prefix = "ceph-mon"
osd_prefix = "ceph-osd"

## suffix used to identify socket files
socket_suffix = "asok"
`

func (c *Ceph) SampleConfig() string {
return sampleConfig
}

func (c *Ceph) Gather(acc telegraf.Accumulator) error {
c.setDefaults()
sockets, err := findSockets(c)
if err != nil {
return fmt.Errorf("failed to find sockets at path '%s': %v", c.SocketDir, err)
}

for _, s := range sockets {
dump, err := perfDump(c.CephBinary, s)
if err != nil {
log.Printf("error reading from socket '%s': %v", s.socket, err)
continue
}
data, err := parseDump(dump)
if err != nil {
log.Printf("error parsing dump from socket '%s': %v", s.socket, err)
continue
}
for tag, metrics := range *data {
acc.AddFields(measurement,
map[string]interface{}(metrics),
map[string]string{"type": s.sockType, "id": s.sockId, "collection": tag})
}
}
return nil
}

func init() {
inputs.Add(measurement, func() telegraf.Input { return &Ceph{} })
}

var perfDump = func(binary string, socket *socket) (string, error) {
cmdArgs := []string{"--admin-daemon", socket.socket}
if socket.sockType == typeOsd {
cmdArgs = append(cmdArgs, "perf", "dump")
} else if socket.sockType == typeMon {
cmdArgs = append(cmdArgs, "perfcounters_dump")
} else {
return "", fmt.Errorf("ignoring unknown socket type: %s", socket.sockType)
}

cmd := exec.Command(binary, cmdArgs...)
var out bytes.Buffer
cmd.Stdout = &out
err := cmd.Run()
if err != nil {
return "", fmt.Errorf("error running ceph dump: %s", err)
}

return out.String(), nil
}

var findSockets = func(c *Ceph) ([]*socket, error) {
listing, err := ioutil.ReadDir(c.SocketDir)
if err != nil {
return []*socket{}, fmt.Errorf("Failed to read socket directory '%s': %v", c.SocketDir, err)
}
sockets := make([]*socket, 0, len(listing))
for _, info := range listing {
f := info.Name()
var sockType string
var sockPrefix string
if strings.HasPrefix(f, c.MonPrefix) {
sockType = typeMon
sockPrefix = monPrefix
}
if strings.HasPrefix(f, c.OsdPrefix) {
sockType = typeOsd
sockPrefix = osdPrefix

}
if sockType == typeOsd || sockType == typeMon {
path := filepath.Join(c.SocketDir, f)
sockets = append(sockets, &socket{parseSockId(f, sockPrefix, c.SocketSuffix), sockType, path})
}
}
return sockets, nil
}

func parseSockId(fname, prefix, suffix string) string {
s := fname
s = strings.TrimPrefix(s, prefix)
s = strings.TrimSuffix(s, suffix)
s = strings.Trim(s, ".-_")
return s
}

type socket struct {
sockId string
sockType string
socket string
}

type metric struct {
pathStack []string // lifo stack of name components
value float64
}

// Pops names of pathStack to build the flattened name for a metric
func (m *metric) name() string {
buf := bytes.Buffer{}
for i := len(m.pathStack) - 1; i >= 0; i-- {
if buf.Len() > 0 {
buf.WriteString(".")
}
buf.WriteString(m.pathStack[i])
}
return buf.String()
}

type metricMap map[string]interface{}

type taggedMetricMap map[string]metricMap

// Parses a raw JSON string into a taggedMetricMap
// Delegates the actual parsing to newTaggedMetricMap(..)
func parseDump(dump string) (*taggedMetricMap, error) {
data := make(map[string]interface{})
err := json.Unmarshal([]byte(dump), &data)
if err != nil {
return nil, fmt.Errorf("failed to parse json: '%s': %v", dump, err)
}

tmm := newTaggedMetricMap(data)

if err != nil {
return nil, fmt.Errorf("failed to tag dataset: '%v': %v", tmm, err)
}

return tmm, nil
}

// Builds a TaggedMetricMap out of a generic string map.
// The top-level key is used as a tag and all sub-keys are flattened into metrics
func newTaggedMetricMap(data map[string]interface{}) *taggedMetricMap {
tmm := make(taggedMetricMap)
for tag, datapoints := range data {
mm := make(metricMap)
for _, m := range flatten(datapoints) {
mm[m.name()] = m.value
}
tmm[tag] = mm
}
return &tmm
}

// Recursively flattens any k-v hierarchy present in data.
// Nested keys are flattened into ordered slices associated with a metric value.
// The key slices are treated as stacks, and are expected to be reversed and concatenated
// when passed as metrics to the accumulator. (see (*metric).name())
func flatten(data interface{}) []*metric {
var metrics []*metric

switch val := data.(type) {
case float64:
metrics = []*metric{&metric{make([]string, 0, 1), val}}
case map[string]interface{}:
metrics = make([]*metric, 0, len(val))
for k, v := range val {
for _, m := range flatten(v) {
m.pathStack = append(m.pathStack, k)
metrics = append(metrics, m)
}
}
default:
log.Printf("Ignoring unexpected type '%T' for value %v", val, val)
}

return metrics
}
Loading