[WIP] Add initial OpenTracing instrumentation #5425
Conversation
```diff
@@ -108,6 +113,27 @@ func startEtcdOrProxyV2() {
 		plog.Warningf("no data-dir provided, using default data-dir ./%s", cfg.Dir)
 	}
 
+	// The global tracer is noop by default.
+	if cfg.TracingLightstepToken != "" {
```
This is, in my opinion, one of the most important parts of OpenTracing: you don't have to modify the entire codebase to use a different tracing backend. It's a quick change in the configuration and everything else stays the same.
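As a minimal sketch of what that configuration hook might look like (the `TracingLightstepToken` field mirrors the flag in the diff above; the `setupTracing` helper and the lightstep-tracer-go calls are my assumptions, not code from this PR):

```go
import (
	lightstep "github.com/lightstep/lightstep-tracer-go"
	opentracing "github.com/opentracing/opentracing-go"
)

// setupTracing is a hypothetical helper; the token corresponds to the
// --tracing-lightstep-token style flag added in this PR's diff.
func setupTracing(tracingLightstepToken string) {
	// opentracing's global tracer is a no-op until it is replaced, so the
	// instrumentation elsewhere in the codebase costs nothing by default.
	if tracingLightstepToken != "" {
		// Swapping tracing backends means changing only this block; every
		// opentracing.StartSpan call in the codebase picks up the new tracer.
		opentracing.SetGlobalTracer(lightstep.NewTracer(lightstep.Options{
			AccessToken: tracingLightstepToken,
		}))
	}
}
```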
@bg451 I have difficulty understanding the trace on Google Drive. Is there a node identifier? It seems the events happened on different nodes?
If this is going in, I think all the clientv3 and api/v3rpc additions should be interface wrappers similar to …
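A rough sketch of that interface-wrapper approach (the `tracedKV` type and the span name are hypothetical; `clientv3.KV` is etcd's real client interface):

```go
import (
	"context"

	"github.com/coreos/etcd/clientv3"
	opentracing "github.com/opentracing/opentracing-go"
)

// tracedKV is a hypothetical wrapper that adds a span around each KV call
// without modifying clientv3 itself.
type tracedKV struct {
	clientv3.KV
}

func (t tracedKV) Put(ctx context.Context, key, val string, opts ...clientv3.OpOption) (*clientv3.PutResponse, error) {
	// StartSpanFromContext parents the new span on any span already in ctx.
	span, ctx := opentracing.StartSpanFromContext(ctx, "etcd.KV.Put")
	defer span.Finish()
	return t.KV.Put(ctx, key, val, opts...)
}
```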
@xiang90 Hopefully this should be better: the different nodes are in different colors; light blue is the leader in this case and dark blue is a client program. example.pdf Node identifiers would be added as a span tag. In general, span names should be of low cardinality.
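For instance, a sketch of that convention (the operation and tag names here are illustrative, not from the PR): the span name stays fixed and low-cardinality while the node identity goes in a tag.

```go
import opentracing "github.com/opentracing/opentracing-go"

func applyWithSpan(nodeID string) {
	// "raft.apply" is a fixed, low-cardinality operation name; per-node
	// details go into tags rather than the span name.
	span := opentracing.StartSpan("raft.apply")
	span.SetTag("node.id", nodeID)
	defer span.Finish()
	// ... apply the entry here ...
}
```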
@bg451 I like the idea of distributed tracing. But I am also wondering if we really need this fine-grained tracing inside the etcd boundary. We already have a pretty good idea of what is happening inside etcd from a bunch of metrics. How about just tracing when a request comes into etcd and when a response goes out of etcd?
@xiang90 the value of tracing like this is probably as much for etcd users as it is for etcd core contributors; those who aren't familiar with the codebase will be able to gain an intuitive understanding of why operations are slow (or fast) by examining the path that requests take through the etcd cluster. We saw this pattern over and over again at Google (using Dapper) with users of Bigtable, Chubby, etc.
One other thing: "metrics" in the sense of timeseries counters/gauges do not tell the story of how individual requests propagate through a distributed system. They are valuable but complementary to tracing (which is probably more comparable to logging than it is to timeseries monitoring).
@xiang90 I've looked at some of these traces live and they are quite useful... timing diagrams, of course, but also wire payloads showing how specific transactions proceed through the etcd/raft workflows. IMO it would be great to unblock this PR; it adds a lot of value at very little cost.
@bensigelman Sorry for the delay. We are trying to release etcd 3.0 at the end of this month. Then we will have time to look into this in more depth. I would LOVE to play with distributed tracing for sure. Just need to figure out the best way to integrate it.
Got it, thanks. Good luck with the release!
Tracing will certainly be useful, but we don't have the bandwidth to look into this right now. If we do, other tracing libraries should be evaluated as well.
@gyuho do you have libraries in mind other than OpenTracing? Or did you mean the LightStep integration?
@aybabtme Haven't heard of LightStep :) There are OpenTracing, Jaeger, and OpenCensus (https://github.com/census-instrumentation/opencensus-go). We may evaluate these in the next release cycle (3.5).
@gyuho Jaeger's instrumentation library is OpenTracing. (OpenTracing is a project specifically designed to provide a single instrumentation API that couples to a variety of tracing systems: Zipkin, Jaeger, various commercial vendors, etc.)
This PR adds some basic OpenTracing instrumentation to etcd.
Background
The idea behind OpenTracing is to provide a standardized interface across languages to various distributed tracing systems (such as Zipkin). Once you instrument your system, you can swap the underlying tracer with an O(1) switch. In case you aren't familiar with distributed tracing, the basic idea is that you can trace a request as it branches out to multiple machines and services, making it very easy to diagnose problems such as latency, limping hardware, and the events that lead up to an error deep in the stack.
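To make that concrete, here is a minimal sketch using opentracing-go (the operation names are illustrative): a span is started per request, and child spans are parented on it through the context, which is what lets the tracing backend reassemble the full request tree.

```go
import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

func handleRequest(ctx context.Context) {
	// Root span for the incoming request.
	span, ctx := opentracing.StartSpanFromContext(ctx, "etcd.request")
	defer span.Finish()
	applyEntry(ctx)
}

func applyEntry(ctx context.Context) {
	// Child span; it is automatically parented on the span stored in ctx.
	span, _ := opentracing.StartSpanFromContext(ctx, "etcd.apply")
	defer span.Finish()
	// ... actual work ...
}
```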
Goal
The goal behind this PR is to (a) show what a request trace looks like in an etcd cluster and (b) start a discussion about which components of etcd need to be traced. I'm not extremely familiar with how etcd works, so there's a lot left to be desired when it comes to the number of possible endpoints that can be traced and noise reduction. I'm probably creating spans for parts that don't really matter and not creating them for key things that I don't know about, etc.
What's really neat is that gRPC allows for metadata, which makes sending traces across process boundaries a breeze. This can be seen in the v3 KV client and server. Here's a gist of what a simple client using opentracing and the etcd client looks like.
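A sketch of how that cross-process propagation can work (the `metadataCarrier` adapter and helper functions are hypothetical; `opentracing.TextMap`, `ext.RPCServerOption`, and the grpc metadata calls are the libraries' real APIs, though the metadata context helpers shown are from newer grpc-go):

```go
import (
	"context"
	"strings"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	"google.golang.org/grpc/metadata"
)

// metadataCarrier is a hypothetical adapter exposing gRPC metadata as an
// opentracing TextMapReader/TextMapWriter.
type metadataCarrier struct{ metadata.MD }

func (c metadataCarrier) Set(key, val string) {
	key = strings.ToLower(key) // gRPC metadata keys are lowercase
	c.MD[key] = append(c.MD[key], val)
}

func (c metadataCarrier) ForeachKey(handler func(key, val string) error) error {
	for k, vals := range c.MD {
		for _, v := range vals {
			if err := handler(k, v); err != nil {
				return err
			}
		}
	}
	return nil
}

// Client side: inject the span context into the outgoing request metadata.
func injectSpan(ctx context.Context, span opentracing.Span) context.Context {
	md := metadata.New(nil)
	// A failed injection just means the request goes out untraced.
	_ = opentracing.GlobalTracer().Inject(span.Context(), opentracing.TextMap, metadataCarrier{md})
	return metadata.NewOutgoingContext(ctx, md)
}

// Server side: extract the caller's span context and parent a server span on it.
func serverSpan(ctx context.Context, op string) opentracing.Span {
	md, _ := metadata.FromIncomingContext(ctx)
	spanCtx, _ := opentracing.GlobalTracer().Extract(opentracing.TextMap, metadataCarrier{md})
	// If spanCtx is nil, RPCServerOption simply starts a fresh trace.
	return opentracing.StartSpan(op, ext.RPCServerOption(spanCtx))
}
```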
Screenshot
Appdash
There is currently an open issue because Appdash can't render spans shorter than 1ms. However, if you look at the table, you can see all of the spans from a specific trace.
Lightstep
Since a trace in a 3-node cluster is pretty big, I couldn't fit it all in a screenshot, so I saved the page as a PDF, which you can view here: https://drive.google.com/file/d/0BxWuGMMFc4K0RDZVcGYzUWhkNkk/view?usp=sharing
Colors aren't printing correctly, but any time there's an `encode` -> `decode` span set, the message is being decoded on a different node.

cc @philips