TritonDataCenter/kang

kang: distributed system observability

Kang is a facility for debugging networked software services by exposing internal state via a simple HTTP API. Service state is organized into a two-level hierarchy: "objects" grouped by "type". For example, a simple HTTP server may have two types of objects, "requests" and "connections", with many objects of each type. Each server defines its own types and the structure of its objects.

Demo

First, install the command-line tool:

# npm install -g kang

Then run the example server in this repo:

# node examples/server.js
server listening at http://0.0.0.0:8080

Now run the kang debugger:

# kang -hlocalhost:8080

Run "help" for some suggested examples and try them out.

kang tool

Usage: kang [-h host1[,host2...]]

Remote servers are specified using the following format:

[http[s]://]host[:port][/uri]

All fields other than the host are optional. Nearly any combination may be specified, as in:

  REMOTE HOST              MEANS
  localhost                http://localhost:80/kang/snapshot
  localhost:8080           http://localhost:8080/kang/snapshot
  localhost:8080/kang      http://localhost:8080/kang
  https://localhost/kang   https://localhost:443/kang

Multiple servers may be specified in a comma-separated list. Servers are specified using the -h option or (if none is present) the KANG_SOURCES environment variable.
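
As a rough illustration of the format above, the sketch below expands a server spec into a full URL and splits a comma-separated list. It is not kang's own parser: parseSource and listSources are hypothetical names, and the default path is an assumption taken from the /kang/snapshot entry point named in the API section.

```javascript
// Hypothetical helpers, not kang's own parser. They expand a remote-server
// spec of the form [http[s]://]host[:port][/uri] into a full URL.
const DEFAULT_PATH = '/kang/snapshot'; // assumed: entry point from the API section

function parseSource(spec) {
    const match = /^(?:(https?):\/\/)?([^:\/]+)(?::(\d+))?(\/.*)?$/.exec(spec.trim());
    if (!match) {
        throw new Error('invalid source: ' + spec);
    }
    const protocol = match[1] || 'http';
    const host = match[2];
    // The default port follows the protocol: 80 for http, 443 for https.
    const port = match[3] ? Number(match[3]) : (protocol === 'https' ? 443 : 80);
    const path = match[4] || DEFAULT_PATH;
    return protocol + '://' + host + ':' + port + path;
}

// Sources come from the -h value if one is given, else from KANG_SOURCES.
function listSources(hostOption, env) {
    const raw = hostOption || env.KANG_SOURCES || '';
    return raw.split(',')
        .filter(function (s) { return s.trim() !== ''; })
        .map(parseSource);
}

console.log(listSources('localhost:8080,https://localhost/kang', process.env));
```

Passing the environment as an argument keeps the helper easy to exercise without touching the real process environment.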

When you run kang, it creates a snapshot of the distributed system's state by querying each of the servers. You can browse the state interactively. Type "help" for more information.

Background

While interactive program execution is a useful feature during development, the most important feature for debuggers in both development and production environments is the presentation of current program state. Program state is often examined on an ad-hoc basis by engineers debugging a particular problem, but it's often useful to build tools to automatically analyze this state as well, either to summarize it for humans or to automatically look for certain classes of problems. In this regard, kang is a debugger for distributed systems: it fetches, aggregates, and presents program state for consumption by both humans and automated tools. The goal is to allow each component of the distributed system to describe the objects it knows about (and potentially a small amount of metadata suggesting what to do with this information) so that the kang system can fetch, aggregate, and present this information usefully.

When debugging a distributed system of heterogeneous components, it's critical to quickly understand the internal state of each component. Bunyan (https://github.com/trentm/node-bunyan) and node-panic (https://github.com/joyent/node-panic) help explain explicit errors and fatal failures, but understanding why a service is simply misbehaving takes more.

Most of the time, the internal state takes the form of just a few important types of objects. It would be really useful if each service provided a standard way of extracting this state for the purpose of debugging.

API

kang defines a single HTTP entry point, /kang/snapshot, that returns a snapshot of the service's internal state in the form of a JSON object that looks like this:

{
        /* service identification information */
        "service": {
                "name": "ca",
                "component": "configsvc",
                "ident": "us-sw-1.headnode",
                "version": "0.1.0vmaster-20120126-2-g92bf718"
        },

        /* arbitrary service stats */
        "stats": {
                "started": "2012-03-20T17:03:59.221Z",
                "uptime": 86403217,
                "memory": {
                        "rss": 10850304,
                        "heaptotal": 2665280,
                        "heapused": 1700788
                },
                "http": {
                    "nrequests": 1709,
                    "nrequestsbycode": {
                      "200": 1705,
                      "201": 1,
                      "204": 1,
                      "503": 1
                    }
                }
        },

        /* extra service-specific information */
        "types": [ "instrumentation", "instrumenter" ],

        "instrumentation": {
                "cust:12345;1": {
                        "creation_time": "2012-01-26T19:20:30.450Z",
                        "label": "12345/1",
                        "module": "node",
                        "stat": "httpd_ops",
                        "decomposition": "latency",
                        "granularity": 1,
                        "instrumenters": {
                                "instrumenter:instr1": "enabled",
                                "instrumenter:instr2": "enabled",
                                "instrumenter:instr3": "disabled"
                        }
                }
        },

        "instrumenter": {
                "instr1": {
                        "creation_time": "2012-01-26T19:20:30.450Z",
                        "instrumentations": [ "instrumentation:cust:12345;1" ],
                        "last_contact": "2012-01-26T19:20:30.450Z"
                },
                "instr2": {
                        "creation_time": "2012-01-26T19:20:30.450Z",
                        "instrumentations": [ "instrumentation:cust:12345;1" ],
                        "last_contact": "2012-01-26T19:20:30.450Z"
                },
                "instr3": {
                        "creation_time": "2012-01-26T19:20:30.450Z",
                        "instrumentations": [ ],
                        "last_contact": "2012-01-10T19:20:30.450Z"
                }
        }
}

Note that many of the above field names match the corresponding fields used in Bunyan for logging. Clients can link objects reported by multiple components (or even services) by assuming that any given (type, id) tuple is unique. Clients can also link any string of the form "type:id" (for a known object type and id) to the corresponding object. For example, the "instrumenter:instr1" key in the instrumentation above can be linked directly to the "instr1" object of type "instrumenter".
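
The "type:id" linking rule can be sketched as follows. This is a minimal illustration, not code from kang: resolveRef is a hypothetical helper, and it assumes type names never contain a colon (ids may, as in "cust:12345;1").

```javascript
// A trimmed-down snapshot with the same shape as the example above.
const snapshot = {
    types: ['instrumentation', 'instrumenter'],
    instrumentation: {
        'cust:12345;1': {
            label: '12345/1',
            instrumenters: { 'instrumenter:instr1': 'enabled' }
        }
    },
    instrumenter: {
        instr1: { instrumentations: ['instrumentation:cust:12345;1'] }
    }
};

// Hypothetical helper: resolve a "type:id" string to the object it names,
// or return null when the string does not name a known object.
function resolveRef(snap, ref) {
    // Ids may themselves contain ':' (e.g. "cust:12345;1"), so split on
    // the first colon only and check the prefix against the known types.
    const idx = ref.indexOf(':');
    if (idx === -1) {
        return null;
    }
    const type = ref.slice(0, idx);
    const id = ref.slice(idx + 1);
    if (!snap.types.includes(type)) {
        return null;
    }
    const objects = snap[type] || {};
    return Object.prototype.hasOwnProperty.call(objects, id) ? objects[id] : null;
}

// Follow the link from the instrumentation to one of its instrumenters.
const keys = Object.keys(snapshot.instrumentation['cust:12345;1'].instrumenters);
console.log(resolveRef(snapshot, keys[0]));
```

Returning null for unknown types means ordinary strings that happen to contain a colon are left alone rather than treated as links.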

In the future we may define semantics for some fields, like "label" and "creation_time", so that tools can present this information more usefully.

Server library

kang includes a server library for implementing the above API. A project that wants to use it need only implement a few entry points:

  • report service identification information
  • report stats
  • list object types
  • list objects for a given type
  • serialize one object

Services can add information incrementally as desired. The library takes care of formatting this data appropriately.
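
To show how the five entry points fit together, here is a sketch that assembles a snapshot document from them. It deliberately does not use the kang server library: the callback names (service, stats, listTypes, listObjects, getObject) and buildSnapshot are hypothetical, and the toy data reuses the "connections" and "requests" types from the introduction.

```javascript
// Hypothetical entry points for a toy HTTP server. The names are
// illustrative and are not the kang server library's actual API.
const connections = { conn1: { remote: '10.0.0.5:51234', nrequests: 3 } };
const requests = { req7: { method: 'GET', url: '/', connection: 'connections:conn1' } };

function tableFor(type) {
    return type === 'connections' ? connections : requests;
}

const impl = {
    service: function () {
        return { name: 'demo', component: 'httpd', ident: 'host1', version: '0.0.1' };
    },
    stats: function () { return { uptime: 1000 }; },
    listTypes: function () { return ['connections', 'requests']; },
    listObjects: function (type) { return Object.keys(tableFor(type)); },
    getObject: function (type, id) { return tableFor(type)[id]; }
};

// Assemble the snapshot document from the five entry points: service
// identification, stats, the type list, then every object of each type.
function buildSnapshot(entry) {
    const snap = {
        service: entry.service(),
        stats: entry.stats(),
        types: entry.listTypes()
    };
    snap.types.forEach(function (type) {
        snap[type] = {};
        entry.listObjects(type).forEach(function (id) {
            snap[type][id] = entry.getObject(type, id);
        });
    });
    return snap;
}

console.log(JSON.stringify(buildSnapshot(impl), null, 4));
```

A service could return this document from an HTTP handler mounted at /kang/snapshot; the real library handles that formatting and serving step itself.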

Client library

kang includes a client library for listing and browsing objects from a set of services. See cmd/kang.js for example usage.
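
For a sense of what aggregating a set of services involves, the sketch below merges already-fetched snapshot documents into one index keyed by type and id. It is an assumption-laden illustration, not kang's client library: aggregate is a hypothetical function, and it relies only on the snapshot shape described in the API section.

```javascript
// Hypothetical aggregation step; this does not use kang's client library.
// Each input is a parsed snapshot document as described in the API section.
function aggregate(snapshots) {
    // type -> id -> [{ source, object }]; null prototypes avoid clashes
    // between object ids and built-in property names.
    const index = Object.create(null);
    snapshots.forEach(function (snap) {
        const source = snap.service.name + '/' + snap.service.ident;
        snap.types.forEach(function (type) {
            index[type] = index[type] || Object.create(null);
            Object.keys(snap[type] || {}).forEach(function (id) {
                // The same (type, id) reported by two services is treated
                // as one object seen from two places.
                index[type][id] = index[type][id] || [];
                index[type][id].push({ source: source, object: snap[type][id] });
            });
        });
    });
    return index;
}

const merged = aggregate([
    { service: { name: 'ca', ident: 'host1' }, types: ['instrumenter'],
      instrumenter: { instr1: { status: 'ok' } } },
    { service: { name: 'ca', ident: 'host2' }, types: ['instrumenter'],
      instrumenter: { instr1: { status: 'ok' }, instr2: { status: 'ok' } } }
]);

// List the ids seen across both services, and who reported instr1.
console.log(Object.keys(merged.instrumenter));
console.log(merged.instrumenter.instr1.map(function (e) { return e.source; }));
```

Keeping every reporter for a given (type, id) rather than overwriting makes it possible to spot components that disagree about the same object.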

CLI

See the "kang tool" section above for details.

Future work

  • Remove prefixes on library function names