Trace Context Extensibility #573

Open
dyladan opened this issue May 7, 2024 · 11 comments

Comments

@dyladan
Member

dyladan commented May 7, 2024

Currently (level 1, version specifier 00), if a trace participant encounters trace flags it does not recognize, it is required to set those flags to 0 before forwarding the trace header. For example, a traceparent with flags 02 will result in flags set to 00 before propagation. This was done to prevent a trace participant from making claims that it does not know to be true. For example, if a new flag is defined that means the agent is a browser, a server-side trace participant forwarding the flag blindly would also incorrectly and unknowingly claim that it is itself a browser agent.

This causes a problem because, in order to make use of any new flag which affects the whole trace, all participants in the trace must be updated to properly forward the flag. An example is the newly introduced random flag, which claims that at least the right-most 7 bytes of the Trace ID are random. Any participant in the trace which encounters this flag and has not yet been updated will set the flag to 0 even though it would still be correct for it to be 1. This causes a long delay before a newly introduced flag is likely to be useful in complex orgs.
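To make the current behavior concrete, here is a minimal Python sketch of a participant that only understands the sampled flag and zeroes everything else, as described above. The names `KNOWN_FLAGS` and `sanitize_flags` are illustrative and not taken from any library.

```python
# Flags this hypothetical participant understands: only "sampled" (bit 0).
KNOWN_FLAGS = 0x01


def sanitize_flags(flags_hex: str, known: int = KNOWN_FLAGS) -> str:
    """Zero every flag bit the participant does not recognize."""
    flags = int(flags_hex, 16)
    # Bitwise AND keeps known bits and clears all others.
    return f"{flags & known:02x}"


# Flags 02 (random only) become 00, matching the example in the text.
print(sanitize_flags("02"))
# Flags 03 (random + sampled) lose the random bit and keep sampled.
print(sanitize_flags("03"))
```

The second call shows the problem: the random claim is still true for the trace, but an out-of-date participant drops it anyway.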

Solution Proposal 1: propagation bit-mask

  1. Split the trace-flags byte into 2 sets of 4 bits, left and right
  2. Use the left bit-set as a mask to determine if a bit in the right bit-set should be propagated blindly or set to 0

Example 00-0123456789012345-01234567-27:

Flags : 2    7
Binary: 0010 0111

Mask  : 0010
Flags : 0111

Output: 001x # set x to 1 or 0 to represent sampled or not

This example assumes a trace participant only "understands" the least-significant bit (sampled flag). In this example, a trace participant would blindly propagate the second-least-significant bit (Flag 2, random bit), but not the third-least-significant bit (Flag 3) because the bit mask only identifies Flag 2 as a bit that should be blindly propagated.
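A minimal Python sketch of the bit-mask rule, for a hypothetical participant that understands only the sampled flag. The function name `forward_flags` is illustrative, and forwarding the mask nibble unchanged is an assumption, since the proposal above does not spell that out.

```python
SAMPLED = 0x01  # the only flag this hypothetical participant understands


def forward_flags(incoming: int, sampled: bool, known: int = SAMPLED) -> int:
    """Apply the proposed mask to an incoming trace-flags byte."""
    mask = (incoming >> 4) & 0x0F   # left 4 bits: which flags propagate blindly
    flags = incoming & 0x0F         # right 4 bits: the flags themselves
    # Unknown flags survive only where the mask bit is set.
    passthrough = flags & mask & ~known
    # Known flags are set by the participant itself.
    own = SAMPLED if sampled else 0
    # Assumption: the mask nibble is forwarded unchanged.
    return (mask << 4) | passthrough | own


# 0x27 = mask 0010, flags 0111: Flag 2 is kept, Flag 3 is cleared.
print(f"{forward_flags(0x27, sampled=True):02x}")
print(f"{forward_flags(0x27, sampled=False):02x}")
```

With the example byte 27, the low nibble comes out as 001x, where x is whatever this participant decides for the sampled flag.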

Solution Proposal 2: New Field Identification

In the current specification, any new fields should be dropped if they are unknown to the trace participant. This solution proposes to amend that recommendation to the following:

  1. If the most significant bit of the field is 1, propagate the field unchanged
  2. Else, set all bits in field to 0

Example: 00-0123456789012345-01234567-00-0123-FF00

Field #: 0  1                2        3  4    5
Header:  00-0123456789012345-01234567-00-0123-FF00
Output:  00-0123456789012345-89012345-00-0000-FF00

In this example, fields 4 and 5 are both unknown to the trace participant. The participant set all bits in field 4 to 0 because the most significant bit of the field is 0, but the participant forwarded field 5 unchanged because the most significant bit of the field was 1. Field 2 differs in the output only because the participant replaced the parent ID with its own.
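The rule above can be sketched in Python as follows. This assumes the shortened example IDs from this issue rather than real 32- and 16-character IDs, and the helper names `forward_unknown_field` and `forward_header` are hypothetical.

```python
def forward_unknown_field(field: str) -> str:
    """Keep an unknown hex field if its most significant bit is 1, else zero it."""
    msb_set = (int(field[0], 16) & 0x8) != 0
    return field if msb_set else "0" * len(field)


def forward_header(header: str, new_parent_id: str) -> str:
    """Forward a traceparent-like header, treating fields after index 3 as unknown."""
    version, trace_id, _parent_id, flags, *extra = header.split("-")
    forwarded = [forward_unknown_field(f) for f in extra]
    # The participant replaces the parent ID with its own span ID.
    return "-".join([version, trace_id, new_parent_id, flags, *forwarded])


print(forward_header("00-0123456789012345-01234567-00-0123-FF00", "89012345"))
```

Field 4 (`0123`) starts with hex digit 0, so its top bit is 0 and it is zeroed. Field 5 (`FF00`) starts with F, so it is forwarded unchanged.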

Proposal 3: Do both

I propose we adopt both of these new semantics for the next release of Trace Context. The bit-mask option cuts the number of possible flags to 4 (with 2 already taken, 2 remain available), but lets new trace-wide flags pass through participants that do not understand them. The new field identification proposal does the same for new fields.

@dyladan
Member Author

dyladan commented May 22, 2024

Proposal 4: Trace State for new persistent fields

tracestate format is a set of key-value pairs. It looks like this: k1=v1,k2=v2. tracestate entries unknown to a trace participant are typically propagated unchanged unless the list becomes too long, or a participant decides not to forward the header for performance reasons. Propagating tracestate is optional, but in practice it is done by most tracing participants.

This opens up the possibility to use tracestate to propagate fields that should survive the whole trace. One example of such a field might be the sampling selectivity of the head of the trace. Any trace participant which is not updated to the new standard should still propagate the value unchanged. It would be possible to either use a single field for multiple cases, multiple fields, or reserve a whole vendor key namespace for future use.
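As a rough illustration of why this works with existing implementations, here is a Python sketch of a participant that writes its own tracestate entry and forwards every other entry untouched. The key name `myvendor` and the function `forward_tracestate` are hypothetical, and the sketch ignores the length limits and optional truncation mentioned above.

```python
def forward_tracestate(header: str, own_key: str, own_value: str) -> str:
    """Forward tracestate, updating only this participant's own entry."""
    entries = [e.strip() for e in header.split(",") if e.strip()]
    # Drop any previous entry for our own key; every other entry is kept as-is.
    kept = [e for e in entries if e.split("=", 1)[0] != own_key]
    # The updated entry goes to the front (left) of the list.
    return ",".join([f"{own_key}={own_value}", *kept])


print(forward_tracestate("k1=v1,k2=v2", "myvendor", "abc"))
```

The entries `k1=v1` and `k2=v2` come out unchanged even though this participant knows nothing about them, which is the property Proposal 4 relies on.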

Advantages:

  • This is already the behavior of existing implementations, meaning new fields could be used immediately without fear of being dropped (usually).

Concerns:

  • Currently there are no reserved tracestate keys. This would require reserving keys, which is technically a breaking change. Work would need to be done to ensure any reserved keys are not already in use.
  • This would require any trace participant looking to introduce one of these fields to use 2 headers. In many cases it is possible and even likely that there is no existing tracestate header, so this would require adding a full new header for these cases.

Proposal 4 part 2: combine proposal 4 and 1

This is similar to (3); however, instead of new fields in the traceparent header, it uses new fields in tracestate if they are required to persist for the whole trace. It has similar advantages to 3 and 4. See those proposals for details.

Proposal 5: do nothing

This is here for completeness, however I do not feel we should take this option. We do have the option to do nothing, and to rely on the semantics we already have. This would mean new flags or fields that should be persisted for the whole trace (like the random flag) require all trace participants to be updated before it can be assured that the flag or field is persisted. Flags and fields which only describe a single trace participant would work fine in this scenario.

@yurishkuro
Member

How about Proposal 0: roll back the rule "it is required to set those flags to 0". It was clearly an oversight, and while rolling it back does not solve the immediate problem, it future-proofs the spec.

@dyladan
Member Author

dyladan commented May 23, 2024

Proposal 0: Roll back set-0 requirement

This proposal is to roll back the rule requiring unknown flags to be set to 0. It would require that all unknown flags are propagated unchanged. The primary disadvantage of this proposal is that some flags may describe only a single participant in the trace, not the whole trace. In this case, it is possible to propagate a false claim. For example, consider a flag which indicates that the parent participant is a browser client. Take a trace with 3 participants A -> B -> C where A and C are updated but B is not. If A sets the browser flag, B will propagate it to C, causing C to incorrectly interpret its parent as a browser.
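The A -> B -> C scenario can be simulated with a short Python sketch. The `BROWSER` bit value is purely hypothetical (no such flag is defined), and `forward` is an illustrative helper that models an out-of-date participant under either rule.

```python
SAMPLED = 0x01
BROWSER = 0x04  # hypothetical single-hop flag, not defined by any spec


def forward(incoming: int, known: int, own_claims: int, zero_unknown: bool) -> int:
    """Forward flags: known bits are re-set by the participant, unknown bits
    are either zeroed (current rule) or passed through (Proposal 0)."""
    unknown = incoming & ~known & 0xFF
    passthrough = 0 if zero_unknown else unknown
    return passthrough | own_claims


# A is a browser and sets both flags.
a_out = SAMPLED | BROWSER

# B only knows SAMPLED. Under Proposal 0 it passes BROWSER along unchanged.
b_out_proposal_0 = forward(a_out, known=SAMPLED, own_claims=SAMPLED, zero_unknown=False)
# Under the current rule B clears the flag it does not understand.
b_out_current = forward(a_out, known=SAMPLED, own_claims=SAMPLED, zero_unknown=True)

# C checks whether its parent (B) claims to be a browser.
print(bool(b_out_proposal_0 & BROWSER))  # C is misled under Proposal 0
print(bool(b_out_current & BROWSER))     # C is not misled under the current rule
```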

@yurishkuro
Member

The primary disadvantage of this proposal is that some flags may describe only a single participant in the trace, not the whole trace.

My argument would be that this is a false use case. This refers to information that only makes sense in a single hop, which can be sent in a multitude of other ways, not via trace context that is meant to make sense across the whole distributed workflow (and therefore is optimized for repeated transmission, whereas one-hop data can afford to spend a few more bytes). You would have the same problem if we did not have a bitmap in the context at all and you tried to send this "I am a browser" signal via tracestate: it would also be incorrectly propagated throughout, because again the mechanism is not designed for one-hop data.

@dyladan
Member Author

dyladan commented May 24, 2024

The sampling bit is an example of this. It is propagated, but it is meant to describe only the parent. It could be flipped to 0 if one participant decided to prune its branch of the trace for some reason.

@dyladan
Member Author

dyladan commented May 24, 2024

Question: can anyone think of a bit that would be harmful if propagated through the full trace by out-of-date participants?

@kalyanaj
Contributor

kalyanaj commented Jun 4, 2024

Thanks @dyladan for putting together this summary! Here are my current thoughts/questions:

  1. Regarding the discussion on Proposal 0, I see two potential concerns:
     • Wouldn't that option imply a breaking change to an official W3C Recommendation?
     • The sampling flag is a good example of a flag that fits in the traceparent header even though it is only valid for a single hop. I am not sure we want to close the door on such situations in the future.
  2. Regarding Proposal 4, I like the aspect that it is already a solution that can take effect right away (rather than waiting ~5 years for all current implementations to update to a newer version of tracecontext). A couple of questions here:
     • Is it a must to register any prefix here, or can we do this on demand depending on any use case that emerges? Since we have this registry https://w3c.github.io/tracestate-ids-registry/, I feel there's a good handle on who is using tracestate.
     • Regarding the overhead of the new header, in the fullness of time, as more vendors (including new OTel samplers) start using it, wouldn't the cost of this header be amortized?

@jmacd

jmacd commented Jul 31, 2024

I see the value of Proposal 1 and 4. Not sure about 2 -- I'm not aware of what, if any, other flags or fields may be under consideration. There is an argument that the Sampled flag is appropriately characterized as a claim about the single hop, and I agree the Random flag doesn't qualify.

Proposal 6

Introduce a new 8-bit field for flags that are designed as claims about the trace and defined to propagate, even the unknown flags. That would add 3 bytes to the traceparent. A new propagating Random flag could be introduced in the new field with appropriate semantics, and this leaves 6 bits in trace flags. This would leave two random flags, an admittedly confusing situation, but as it stands today the Random flag can still be useful.

In OpenTelemetry we have spoken about "Presumption of TraceID Randomness" which is to assert that when a TraceState value is not present, it is safe to assume the TraceID was random. This can be verified after the fact by checking whether, in fact, the root span has the 0x2 original random flag set.

@jmacd

jmacd commented Jul 31, 2024

On a related note, open-telemetry/oteps#247 was a proposal in OpenTelemetry. If you are inserting details about yourself and mean them to propagate only once, then the TraceState field should not be automatically propagated. I wonder if there is an opportunity to define single-hop tracestate fields.

@dyladan
Member Author

dyladan commented Aug 7, 2024

On a related note, open-telemetry/oteps#247 was a proposal in OpenTelemetry. If you are inserting details about yourself and mean them to propagate only once, then the TraceState field should not be automatically propagated. I wonder if there is an opportunity to define single-hop tracestate fields.

There is no specified behavior for tracestate metadata but TTL has been discussed several times and been received favorably. It hasn't been implemented simply because there's not been anyone really pushing for it.

@kalyanaj
Contributor

kalyanaj commented Oct 3, 2024

We (@dyladan, @SergeyKanzhelev, and I) discussed this in the last DT working group meeting. Each of us felt that Proposal 4 (using tracestate) is the best way forward here.

Here's our reasoning:

  • It is already available and can be used right away to propagate any such information. In the alternative proposals, one has to wait multiple years for adoption (with no guarantee) of a new version of the tracecontext specification to happen before they could rely on that mechanism.
  • There's no proposal in the backlog for any new capabilities (flags or otherwise). In the last 6 years, there have been only two flags.
  • There's no need to even register a custom prefix now - when it comes to the point where we need to use tracestate we could define a new prefix for w3c at that point.
  • There's other usage of tracestate (e.g., OpenTelemetry's consistent probability samplers), so any cost of the second header will get amortized in many cases.

We want to invite further feedback on the above and if there are any counter-arguments for going with a different proposal than proposal 4. If so, please feel free to reply here, and/or join a W3C DT working group meeting (https://www.w3.org/groups/wg/distributed-tracing/calendar/).


4 participants