-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Distributed Tracing semantics #1066
base: main
Are you sure you want to change the base?
Conversation
- ISemanticContext - for all events emitted by ILogger - ability to use env_dt aliases on individual events - semantic context projection for C++/CL/CX apps on Windows - add test that emits two events with DT semantics
/// Set the W3C TraceContext TraceFlags information of telemetry event. | ||
/// </summary> | ||
/// <param name="TraceFlags">TraceContext TraceFlags</param> | ||
virtual void SetTraceFlags(uint8_t traceFlags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the current spec only supports one flag (sampled
) so I wonder if we should rather:
- add a
SetSampled(bool)
method, or - change this to take a bitfield, or
- provide an enum or a TRACEFLAGS_SAMPLED constant to use here instead of '1'
(the last is probably best since it doesn't constrain the caller)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W3C spec and Common Schema support traceFlags
:
An integer representation of the W3C TraceContext trace-flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I am saying that we could have
struct TraceFlags {
unsigned char sampled: 1,
unsigned char reserved: 7
};
...
virtual void SetTraceFlags(TraceFlags traceFlags)
...
TraceFlags flags;
flags.sampled = 1;
context.SetTraceFlags(flags);
or rather
enum TraceFlags {
Sampled: 1,
};
...
context.SetTraceFlags(TraceFlags::Sampled);
rather than passing a raw uint8_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could later on add any additional method overloads we want. Indeed, we could add yet another method what you're proposing. I would rather go with an additional method that just does SetTraceContext
. My goal was to expose the bare min set of flags, as per spec:
- in Common Schema spec this is
int
- in W3C TraceContext spec this described as
byte
/uint8_t
Thus, I followed the W3C TraceContext guidelines in this regard. You can see that other implementations, e.g. Elastic Node Traceparent package - did not elaborate on the meaning of the flags
field. It's just flags
, number eventually encoded as hex. No more no less than what the 1DS (being the transport layer) is expecting to get.
We can add more decoration later. It's add-ons, not the necessity. I'd like to avoid the semantics that could possibly be later on borrowed from elsewhere, e.g. these semantics all exists in finer detail in OpenTelemetry API. The goal of this PR is to cover the bare min set of transport level detail, in order to enable the POC work that we're doing.. Incremental changes could come later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sounds good not to over-specify.
#define COMMONFIELDS_DT_TRACEID "env_dt_traceId" | ||
#define COMMONFIELDS_DT_SPANID "env_dt_spanId" | ||
#define COMMONFIELDS_DT_TRACEFLAGS "env_dt_traceFlags" | ||
#define COMMONFIELDS_DT_PARENTID "parentId" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that parentId
is meant to be used in events with Span semantics. Here we are adding it to every single event instead, which looks a bit weird. Are we doing this because supporting explicit Span semantics is more complicated? Would it cause problems if we decide to support explicit Span semantics later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two separate logical destinations here:
-
first three element: traceId, spanId, traceFlags - is our current TraceContext, what we would've sent in
traceparent
header. That's what spec is describing. -
the last one: parentId - is what we receive as our parent, describes current-to-parent relationship, i.e. what initiated us to run. Our span's parent. Developer focusing on client telemetry debugging would see this on one event. This is not part of
traceparent
sent to the remote end, but part of client telemetry that the our code emits. That field allows us to stitch together the caller (our parent) and our current span into tree view.. Without this information attached on event - the Trace View explorer UX won't be able to stitch the two operations on time ladder chart..
So:
spanid
-- is our current span id. Part of W3C TraceContext spec.parentId
-- is our parent, our remote span id.
parentId
(prev) -> spanId
(current) -> then remote service would generate its own $spanId(of service)
(next)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How you would use it:
-
Obtain the
ILogger
associated with logging some service information, i.e.
auto apiLogger = GetLogger("CatalogServiceAPILogger")
This is similar to howILogger
gets allocated for a component in .NET. -
Then you
SetContext(...)
on that logger's semantic context. All events coming from that logger are associated with the service interaction. This is all that's necessary for the Distributed Trace Explorer view to show your events coming from that logger on a time ladder chart. -
When your client starts interacting with catalog service via REST API call, you'd initiate your transaction with some initial seed for the
traceparent
header: yourtraceId
,spanId
,traceFlags
- the ones you set on context... And you pass it to remote service. -
Your client emits events using
apiLogger->LogEvent
-- all of these events get indexed by correlation platform and contain the necessary semantics to render them in a trace view. -
If you need to re-initialize the session, i.e. user logged out / logged in, you'd have the new transaction -
traceId
, and you'd update the context on that logger. All further events now have a new differenttraceId
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are adding it to every single event instead
Not every single event, not at LogManager level. Only for events on specific component logger, that are relevant to specific service interaction. However, if a client controls the initial seed of traceId
, then the same traceId
could be stamped on different logger contexts, for example:
auto loggerA = GetLogger("serviceA-REST-Logger")
auto loggerB = GetLogger("serviceB-REST-Logger")
auto loggerC = GetLogger("serviceC-REST-Logger")
.
Then you use SetContext
at logger-level to stamp the same traceId
, and their client-side interaction events would show up in one distributed tracing call tree / time ladder chart. Correspondingly, the far end, e.g. C# service - always stamps the same set of fields on all of Log
and Span
events coming from C# land. That's how it works, and it does stamp it on most (all?) service events too.
Second example in APITest
actually shows how you can selectively append the context directly on just one event without using SetContext
.. So both ways are possible, logger context - all events for specific named logger (this would be the preferred model). And local (event property) example - stamping on just one event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed more offline. Having a parentId
for an event is a valid model even if the system doesn't follow the span model (see also comment in TraceContext standard). Supporting explicit Span semantics is sort of overlapping with what OpenTelemetry does and out of scope for now.
Most of the attributes/fields defined in the Should we instead have an overloaded |
@maxgolov - Do you plan to address the concerns raised in this PR, and make it ready for merge. Thanks. |
@lalitb
Apologies for keeping it open for so long. It's coming soon. |
Add Distributed Tracing semantics to the following APIs:
ISemanticContext
- allows to set (optional) Distributed Tracing context on all events emitted byILogger
env_dt
aliases to customize TraceContext of individual events running in a given scope. That allows manual nesting of a child span to a parent span.PR adds "common" fields:
env_dt_traceId
env_dt_spanId
env_dt_traceFlags
parentId
And the set of corresponding high-level API setters at context-level:
SetTraceId
SetSpanId
SetTraceFlags
SetParentId
which are styled in the same way as other common context setters are presently styled, in uniform fashion.
How to use it:
ext.dt
extension semantics, just enough for the Trace Explorer view to work.. Less duration semantics: since 1DS in principle does not require client events to have "duration", start / end / scope. Events are timepoints that carrytimestamp
andname
. How exactly to remap these, i.e. what would/could/should be considered duration - is out-of-scope of this PR. A separate routing example would be provided elsewhere, out-of-scope for Client Telemetry repo.How this could be improved further:
Example showing how events look when traced from client to service in one Kusto table:
APITest
verifies the basic functionality. More complex usage examples are not presently a priority for our team, since in our case we use the higher-level .NETStartActivity
API while interacting with remote service instrumented with OpenTelemetry, and supplement the existing client telemetry events with correlation identifiers that follow Common Schema naming, specifically Common Schema -to- Geneva mapping.Separate additional work could be done later - to add
Span
andLog
semantics, allowing to use 1DS SDK as a lower-level transport channel / exporter for OpenTelemetry SDK. Presently this additional work is out-of-scope.