[WIP] Add initial OpenTracing instrumentation #5425
Conversation
```diff
@@ -108,6 +113,27 @@ func startEtcdOrProxyV2() {
 		plog.Warningf("no data-dir provided, using default data-dir ./%s", cfg.Dir)
 	}
 
+	// The global tracer is noop by default.
+	if cfg.TracingLightstepToken != "" {
```
This is, in my opinion, one of the most important parts of OpenTracing: you don't have to modify the entire codebase to use a different tracing backend. It's a quick change in the configuration and everything else stays the same.
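As a minimal sketch of what that configuration hook might look like (the `TracingLightstepToken` field mirrors the flag in the diff above; the `setupTracing` helper and the lightstep-tracer-go calls are my assumptions, not code from this PR):

```go
import (
	lightstep "github.com/lightstep/lightstep-tracer-go"
	opentracing "github.com/opentracing/opentracing-go"
)

// setupTracing is a hypothetical helper; the token corresponds to the
// --tracing-lightstep-token style flag added in this PR's diff.
func setupTracing(tracingLightstepToken string) {
	// opentracing's global tracer is a no-op until it is replaced, so the
	// instrumentation elsewhere in the codebase costs nothing by default.
	if tracingLightstepToken != "" {
		// Swapping tracing backends means changing only this block; every
		// opentracing.StartSpan call in the codebase picks up the new tracer.
		opentracing.SetGlobalTracer(lightstep.NewTracer(lightstep.Options{
			AccessToken: tracingLightstepToken,
		}))
	}
}
```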
@bg451 I have difficulty understanding the trace on Google Drive. Is there a node identifier? It seems the events happened on different nodes?
If this is going in, I think all the clientv3 and api/v3rpc additions should be interface wrappers similar to …
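A rough sketch of that interface-wrapper approach (the `tracedKV` type and the span name are hypothetical; `clientv3.KV` is etcd's real client interface):

```go
import (
	"context"

	"github.com/coreos/etcd/clientv3"
	opentracing "github.com/opentracing/opentracing-go"
)

// tracedKV is a hypothetical wrapper that adds a span around each KV call
// without modifying clientv3 itself.
type tracedKV struct {
	clientv3.KV
}

func (t tracedKV) Put(ctx context.Context, key, val string, opts ...clientv3.OpOption) (*clientv3.PutResponse, error) {
	// StartSpanFromContext parents the new span on any span already in ctx.
	span, ctx := opentracing.StartSpanFromContext(ctx, "etcd.KV.Put")
	defer span.Finish()
	return t.KV.Put(ctx, key, val, opts...)
}
```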
@xiang90 Hopefully this should be better: the different nodes are in different colors; light blue is the leader in this case and dark blue is a client program. example.pdf Node identifiers would be added as a span tag. In general, span names should be of low cardinality.
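For instance, a sketch of that convention (the operation and tag names here are illustrative, not from the PR): the span name stays fixed and low-cardinality while the node identity goes in a tag.

```go
import opentracing "github.com/opentracing/opentracing-go"

func applyWithSpan(nodeID string) {
	// "raft.apply" is a fixed, low-cardinality operation name; per-node
	// details go into tags rather than the span name.
	span := opentracing.StartSpan("raft.apply")
	span.SetTag("node.id", nodeID)
	defer span.Finish()
	// ... apply the entry here ...
}
```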
@bg451 I like the idea of distributed tracing. But I am also wondering if we really need this fine-grained tracing inside the etcd boundary. We already have a pretty good idea of what is happening inside etcd from a bunch of metrics. How about just tracing when a request comes into etcd and when a response goes out of etcd?
@xiang90 the value of tracing like this is probably as much for etcd users as it is for etcd core contributors; those who aren't familiar with the codebase will be able to gain an intuitive understanding of why operations are slow (or fast) by examining the path that requests take through the etcd cluster. We saw this pattern over and over again at Google (using Dapper) with users of Bigtable, Chubby, etc.
One other thing: "metrics" in the sense of timeseries counters/gauges do not tell the story of how individual requests propagate through a distributed system. They are valuable but complementary to tracing (which is probably more comparable to logging than it is to timeseries monitoring).
@xiang90 I've looked at some of these traces live and they are quite useful... timing diagrams, of course, but also wire payloads showing how specific transactions proceed through the etcd/raft workflows. IMO it would be great to unblock this PR; it adds a lot of value at very little cost.
@bensigelman Sorry for the delay. We are trying to release etcd 3.0 at the end of this month. Then we will have time to look into this in more depth. I would LOVE to play with distributed tracing for sure. Just need to figure out the best way to integrate it.
Got it, thanks. Good luck with the release!
Tracing will certainly be useful, but we don't have the bandwidth to look into this right now. If we do, other tracing libraries should be evaluated as well.
@gyuho do you have libraries in mind other than OpenTracing? Or did you mean the LightStep integration?
@aybabtme Haven't heard of LightStep :) There are OpenTracing, Jaeger, and OpenCensus (https://github.com/census-instrumentation/opencensus-go). We may evaluate these in the next release cycle (3.5).
@gyuho Jaeger's instrumentation library is OpenTracing. (OpenTracing is a project specifically designed to provide a single instrumentation API that couples to a variety of tracing systems: Zipkin, Jaeger, various commercial vendors, etc.)
This PR adds some basic OpenTracing instrumentation to etcd.
Background
The idea behind OpenTracing is to provide a standardized interface across languages to various distributed tracing systems (such as Zipkin). Once you instrument your system, you can swap the underlying tracer with an O(1) switch. In case you aren't familiar with distributed tracing, the basic idea is that you can trace a request as it branches out to multiple machines and services, making it very easy to diagnose problems such as latency, limping hardware, and the events that lead up to an error deep in the stack.
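To make that concrete, here is a minimal sketch using opentracing-go (the operation names are illustrative): a span is started per request, and child spans are parented on it through the context, which is what lets the tracing backend reassemble the full request tree.

```go
import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

func handleRequest(ctx context.Context) {
	// Root span for the incoming request.
	span, ctx := opentracing.StartSpanFromContext(ctx, "etcd.request")
	defer span.Finish()
	applyEntry(ctx)
}

func applyEntry(ctx context.Context) {
	// Child span; it is automatically parented on the span stored in ctx.
	span, _ := opentracing.StartSpanFromContext(ctx, "etcd.apply")
	defer span.Finish()
	// ... actual work ...
}
```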
Goal
The goal behind this PR is to (a) show what a request trace looks like in an etcd cluster and (b) start a discussion about which components of etcd need to be traced. I'm not extremely familiar with how etcd works, so there's a lot left to be desired when it comes to the number of possible endpoints that can be traced and noise reduction. I'm probably creating spans for parts that don't really matter and not creating them for key things that I don't know about, etc.
What's really neat is that gRPC allows for metadata, which makes sending traces across process boundaries a breeze. This can be seen in the v3 KV client and server. Here's a gist of what a simple client using opentracing and the etcd client looks like.
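A sketch of how that cross-process propagation can work (the `metadataCarrier` adapter and helper functions are hypothetical; `opentracing.TextMap`, `ext.RPCServerOption`, and the grpc metadata calls are the libraries' real APIs, though the metadata context helpers shown are from newer grpc-go):

```go
import (
	"context"
	"strings"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	"google.golang.org/grpc/metadata"
)

// metadataCarrier is a hypothetical adapter exposing gRPC metadata as an
// opentracing TextMapReader/TextMapWriter.
type metadataCarrier struct{ metadata.MD }

func (c metadataCarrier) Set(key, val string) {
	key = strings.ToLower(key) // gRPC metadata keys are lowercase
	c.MD[key] = append(c.MD[key], val)
}

func (c metadataCarrier) ForeachKey(handler func(key, val string) error) error {
	for k, vals := range c.MD {
		for _, v := range vals {
			if err := handler(k, v); err != nil {
				return err
			}
		}
	}
	return nil
}

// Client side: inject the span context into the outgoing request metadata.
func injectSpan(ctx context.Context, span opentracing.Span) context.Context {
	md := metadata.New(nil)
	// A failed injection just means the request goes out untraced.
	_ = opentracing.GlobalTracer().Inject(span.Context(), opentracing.TextMap, metadataCarrier{md})
	return metadata.NewOutgoingContext(ctx, md)
}

// Server side: extract the caller's span context and parent a server span on it.
func serverSpan(ctx context.Context, op string) opentracing.Span {
	md, _ := metadata.FromIncomingContext(ctx)
	spanCtx, _ := opentracing.GlobalTracer().Extract(opentracing.TextMap, metadataCarrier{md})
	// If spanCtx is nil, RPCServerOption simply starts a fresh trace.
	return opentracing.StartSpan(op, ext.RPCServerOption(spanCtx))
}
```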
Screenshot
Appdash
There is currently an open issue because Appdash can't render spans shorter than 1ms. However, if you look at the table, you can see all of the spans from a specific trace.
Lightstep
Since a trace in a 3-node cluster is pretty big, I couldn't fit it all in a screenshot, so I saved the page as a PDF, which you can view here: https://drive.google.com/file/d/0BxWuGMMFc4K0RDZVcGYzUWhkNkk/view?usp=sharing
Colors aren't printing correctly, but any time there's an `encode` -> `decode` span set, the message is being decoded on a different node.

cc @philips