Inter-operability of tracing systems operating with shorter trace-id #349

Closed
SergeyKanzhelev opened this issue Oct 28, 2019 · 16 comments · Fixed by #356

Comments

@SergeyKanzhelev
Member

SergeyKanzhelev commented Oct 28, 2019

This issue is tracking an overall resolution of inter-operability of tracing systems operating with shorter trace-id.


@bogdandrutu @adriancole @tylerbenson @jcchavezs @yurishkuro @nicmunroe @tedsuo @reyang @danielkhan (people who were involved in the PR discussion) and others, please review this issue. It is written based on my understanding of the problems identified in the spec. Let's agree that this is the list of problems we have to address and then split this single issue into smaller issues to discuss how to resolve them.

Please comment/react on this issue to indicate that it reflects your understanding of the problems identified in the spec.


There are existing tracing systems that currently operate with a 64bit trace-id but still want to use the Trace Context headers for cross-process communication. The spec attempts to give advice on how 128bit, spec-compliant tracing systems may cooperate with the 64bit systems.

Let's call the logic by which a 64bit system generates a 128bit trace-id the backfill logic. While addressing the concern of adding more clarity to the spec regarding trace-id backfill logic (see #337), and whether this backfill logic must apply to new traces or to traces "in transition", more problems with the current spec were highlighted. See the discussion in PR #344.

Here are the issues that were identified.

1. Split trace problem

If app A calls two apps, B and C, with a 64-bit trace, and B and C each backfill using a different algorithm, then the trace becomes split. It now has two separate trace IDs.

This is the original problem raised in #337. The spec must be very clear on how to backfill the trace-id when a new trace-id is generated, as compared to the backfill logic on trace-id propagation.
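
To make the split concrete, here is a minimal Python sketch. It assumes trace IDs are handled as lowercase hex strings; the function names and the sample ID are made up for illustration, and the two algorithms are just two plausible backfill choices, not something the spec text prescribes.

```python
import secrets


def backfill_left_zeros(trace_id_64: str) -> str:
    """Pad a 16-hex-char (64bit) id to 32 hex chars with zeros on the left."""
    return trace_id_64.rjust(32, "0")


def backfill_left_random(trace_id_64: str) -> str:
    """Pad a 16-hex-char (64bit) id with 64 random bits on the left."""
    return secrets.token_hex(8) + trace_id_64


# Hypothetical 64bit trace-id that app A sends to both B and C.
incoming = "53ce929d0e0e4736"

id_used_by_b = backfill_left_zeros(incoming)   # B backfills with zeros
id_used_by_c = backfill_left_random(incoming)  # C backfills with randomness

print(id_used_by_b)
print(id_used_by_c)
# Both ids end with the original 64 bits, but as full 128bit ids they differ,
# so a 128bit backend sees two separate traces.
print(id_used_by_b == id_used_by_c)
```

Both results still carry the original 64 bits on the right, so only a backend that compares the rightmost 64 bits would be able to join them again.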

2. Backfilling on left vs. right

Many existing systems are already dealing with the 64bit to 128bit transition. These systems typically have logic to backfill the trace-id and to look up traces given a subset of the trace-id bytes. It is common for these systems to implement trace-id lookups based on the rightmost characters, as the typical backfill logic puts zeroes in the first bytes of the trace-id.

This backfill logic contradicts the other paragraph of the spec, which asks to put the randomness on the left side of the trace-id. So the backfill logic, in order to follow this requirement, must backfill the rightmost bytes.

This creates confusion and breaks inter-operability with the existing 64bit tracing systems.
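
A small sketch of why the side matters, assuming a legacy backend that indexes traces by the rightmost 16 hex characters. The `legacy_lookup_key` helper, the stored data, and the sample ID are hypothetical.

```python
ZEROS_64 = "0" * 16  # 64 zero bits as 16 hex chars


def backfill_left(trace_id_64: str) -> str:
    """Zeros on the left, original 64bit id on the right."""
    return ZEROS_64 + trace_id_64


def backfill_right(trace_id_64: str) -> str:
    """Original 64bit id on the left, zeros on the right."""
    return trace_id_64 + ZEROS_64


def legacy_lookup_key(trace_id_128: str) -> str:
    """Hypothetical legacy lookup: use the rightmost 16 hex chars as the key."""
    return trace_id_128[-16:]


original = "53ce929d0e0e4736"
stored = {original: "trace recorded by a 64bit system"}

found_after_left = stored.get(legacy_lookup_key(backfill_left(original)))
found_after_right = stored.get(legacy_lookup_key(backfill_right(original)))

print(found_after_left)   # the stored trace is found
print(found_after_right)  # None: the key is all zeros, so the lookup misses
```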

3. Left padding of randomness

The left-padding of randomness requirement creates another challenge for existing 64bit tracing systems. Keeping the rightmost bytes constant will break compatibility with the tracing systems that operate on these rightmost characters.

This requirement must be either rewritten to make sure it works with 64bit systems or removed from the specification.

4. Zero vs. random backfill

We can differentiate tracing systems into three types:

  • The 128bit system: operating with 128bit trace-id (fully compliant).
  • The 64bit system: operating with 64bit trace-id. Not capable of propagating tracestate or the extra 64 bits of the trace-id.
  • The 128on64bit system: records a 64bit trace-id, but is capable of propagating the extra 64 bits of the trace-id from the incoming request to the outgoing request.

The backfill logic that is currently described in the spec may work for 128on64bit systems, but it will not work for 64bit systems. Even on the first trace-id generation, a 64bit system has no way to preserve the extra bytes and reuse them for subsequent outgoing calls made from a single incoming call with the newly generated trace-id. Thus the requirements for 64bit systems may need to be different, or the current requirement can be changed to a simple zero backfill.
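
As a rough illustration of the three types, the sketch below models what each one records and what it sends downstream for the same incoming 128bit trace-id. It assumes hex-string IDs, that a 64bit system keeps the rightmost 64 bits, and that it uses zero backfill on the way out; the `Hop` type, the function names, and the sample ID are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    recorded_id: str  # what the system stores for the trace
    outgoing_id: str  # what it puts into the outgoing request


def hop_128bit(incoming: str) -> Hop:
    """128bit system: records and propagates the full id."""
    return Hop(recorded_id=incoming, outgoing_id=incoming)


def hop_128on64bit(incoming: str) -> Hop:
    """128on64bit system: records 64 bits, propagates all 128 bits."""
    return Hop(recorded_id=incoming[-16:], outgoing_id=incoming)


def hop_64bit(incoming: str) -> Hop:
    """64bit system: keeps only 64 bits, so it has to backfill on the way out.

    Assumption: rightmost 64 bits are kept and zeros are used as backfill.
    """
    low = incoming[-16:]
    return Hop(recorded_id=low, outgoing_id="0" * 16 + low)


incoming = "4bf92f3577b34da6a3ce929d0e0e4736"
for hop in (hop_128bit, hop_128on64bit, hop_64bit):
    print(hop.__name__, hop(incoming))
```

Only the 64bit hop changes the trace-id seen downstream: the upper 64 bits of the incoming id are lost and cannot be restored, whatever backfill is chosen.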

5. Spec should prescribe the behavior of a 64bit-only system

64bit systems (as opposed to 128on64) are quite harmful for the inter-operability of tracing systems. Not only do these systems fail to propagate the entire trace-id, they also fail to propagate tracestate.

The specification currently declares these systems non-compliant: it uses "MUST" language for trace-id propagation. However, inter-operability with these tracing systems is quite important, so a note in the specification on how 64bit systems should operate will go a long way toward ensuring better interoperability between tracing systems.

The spec will benefit from being more explicit on how 64bit tracing systems must operate.

@tedsuo

tedsuo commented Oct 28, 2019

Thank you @SergeyKanzhelev, this looks good. I think the thing we've learned from recent experience is that providing clear guidance on interoperability with 64bit legacy systems is important. Changing the spec to focus on this scenario is very helpful.

One recommendation: for clarity, I suggest we focus solely on the legacy 64bit scenario, as there has been some confusion when the 128on64 scenario is introduced in the same context. Once we're done adding 64bit compatibility, there may be little need to address 128on64 directly in the spec, except to suggest that systems which are capable of correctly propagating a 128-bit ID should do so, even if they only use 64-bits internally.

@SergeyKanzhelev
Member Author

@tedsuo your comment is that problem 5 in the list is very important. Not a new problem, correct? Thank you!

@mtwo
Contributor

mtwo commented Oct 30, 2019

Are we planning on addressing this prior to the in-person meeting, or should we save this as a topic for discussion then?

@danielkhan
Contributor

@mtwo if we do, people that have a stake in that should be present so that we can reach a broad consensus. Alternatively, I would hope that @adriancole @tylerbenson @jcchavezs @yurishkuro @nicmunroe @tedsuo @reyang join the weekly call next week.

As much as I like the process and others weighing in, I find it a bit unfortunate that this normative change will now delay the standard. So having everyone at the table and sorting it out for good would be the goal.

@codefromthecrypt

fyi I'm not planning on joining any call. I don't think there is new information to present that I've not presented over the last years and also recently.

@danielkhan
Contributor

@adriancole can I ask you to still do a final read-through after the change to make sure that we reach a broad consensus?

@codefromthecrypt

please have an end user who's piped up read through. nic did a great job last time. you don't need me if you have more end users involved.

@SergeyKanzhelev
Member Author

I think we should attempt fixing it in small targeted PRs before the call. Unfortunately, so far only @tedsuo has confirmed that this list seems to be a complete list of problems. @tedsuo also suggests that this is in fact an important problem to address in the spec and (as I read it) that we shouldn't just remove this paragraph.

@nicmunroe @tylerbenson can you please comment on the completeness of the list?

@SergeyKanzhelev
Member Author

@danielkhan one option for approaching the problem of normative change vs. completeness of the doc is to move the entire paragraph into a non-normative section. This way the recommendation will still be in the spec, but we can get into more detail on how systems operate without contradicting the "MUST" language about propagating the entire trace-id.

@nicmunroe

@SergeyKanzhelev the list items and general descriptions here look pretty good. I largely agree with the list.

I am confused on a few sentences though.


• In item 2 ("Backfilling on left vs. right"):

This backfill logic contradicts the other paragraph of the spec, which asks to put the randomness on the left side of the trace-id. So the backfill logic, in order to follow this requirement, must backfill the rightmost bytes.

I'm not sure how it follows that the backfill logic must do right-backfill, ever. On a 64bit-only system that needs to propagate, it could still backfill left with randomness to match the spec requirement and keep its normal 64 bit trace ID on the right. I don't think this is a good idea at all when compared with backfilling left with zeros (split traces, etc), but it would still be way better than moving the 64 bit trace ID left and backfilling right with zeros (like the example from #344). As someone else mentioned in the other PR, we're dealing with numbers here. And legacy systems always look at the rightmost bytes, from what I've heard, when dealing with 64bit<-->128bit transitions. Backfilling right with anything is incredibly surprising and confusing.

It's possible I'm missing something here, so assuming there are reasons that the conflict in item 2 forces backfilling right, then ultimately I think the proper way to solve it is to avoid the contradiction by deterministically backfilling left with zeros, not randomness. If we have to do surprising, confusing, and even more broken things to meet the spec requirement of left-side randomness, then let's change that spec requirement.

• In item 3 ("Left padding of randomness"):

Keeping the rightmost bytes constant will break compatibility with the tracing systems that operate on these rightmost characters.

I don't understand what this means. When would having constant rightmost bytes ever be a problem, even if the leftmost bytes are random instead of zeros? I may need a concrete example to understand this sentence.


But other than those two things ☝️ , I think I'm on board with how this is framed. And I agree with @tedsuo and item 5 that there should be a focus on the 64bit-only scenario, and the spec should provide solid guidance on those cases. I also agree that the 128on64bit scenario probably doesn't need too much attention, at least initially. If a system can propagate the original incoming 128 bit trace ID then it should do so. The 64bit-only scenario is the one that really needs strong guidance to ensure maximum possible interoperability and reduce confusion.

I'm not going to be able to be heavily involved in this unfortunately, but I don't think I really need to be. IMO the best way to resolve the major sticking points is to have the spec recommend that 64bit-only systems SHOULD backfill left with zeros, for both ID generation and propagation.

@tylerbenson

I think @nicmunroe summarized things very well and concur with his response above.

(Nice job @nicmunroe!)

@justinfoote
Member

Agreed. The solution of left-padding zeros will work well for us at New Relic.
I fully agree with this comment from @nicmunroe:

the best way to resolve the major sticking points is to have the spec recommend that 64bit-only systems SHOULD backfill left with zeros, for both ID generation and propagation.

@codefromthecrypt

codefromthecrypt commented Nov 1, 2019 via email

@tedsuo

tedsuo commented Nov 5, 2019

Sorry for the radio silence, I'll join the call today to discuss.

I don't believe item 5 (prescribe the behavior of a 64bit-only system) is complex; we should simply specify how 64-bit IDs should be padded and truncated; and avoid confusion about random-backfill.

@morrisonlevi

morrisonlevi commented Nov 7, 2019

I'm just finding this discussion now. To make sure I understand the current status, I'll try to summarize it; please let me know where I've misunderstood:

  • Some existing systems use 64 bit trace ids. As a transition step towards full 128 bit ids in OpenTelemetry, left padding zeroes is technically allowed in the W3C Trace Context spec, but is not recommended. Padding zeroes is probably the easiest thing to add to existing codebases and infrastructure -- an important adoption detail.
  • On the reverse side, if a system receives a propagation header with a full 128 bit trace id and currently only supports 64 bits, it should act as if it has received an invalid trace id. If I understood the W3C Trace Context spec correctly, the system would then act as if it never received the header.
  • As more systems support the full 128 bits the connectedness increases. The situation is not ideal, but allows migration.

Edit: However, this is not what the latest OpenTelemetry PR proposes. It instead recommends truncation of the trace id. Truncation is fine if the left-most bits are all 0, but otherwise I think the system should instead proceed as if it has encountered an invalid trace id, which would mean dropping it rather than truncating it. This seems like a viable adoption step; was there a discussion thread I missed on truncating? What's the rationale for that?
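
The difference between the two behaviors can be sketched as follows. This is an illustration under assumptions (hex-string IDs, the 64bit part kept on the right, hypothetical function names and sample IDs), not the text of either spec or PR. Input validation such as length, hex characters, and the all-zero id is left out.

```python
from typing import Optional

ZEROS_64 = "0" * 16  # 64 zero bits as 16 hex chars


def truncate(trace_id_128: str) -> str:
    """Always keep the rightmost 64 bits, whatever the upper bits are."""
    return trace_id_128[-16:]


def truncate_or_drop(trace_id_128: str) -> Optional[str]:
    """Keep the rightmost 64 bits only if the upper 64 bits are all zero.

    Returns None otherwise, meaning "treat the incoming id as invalid".
    """
    high, low = trace_id_128[:16], trace_id_128[16:]
    if high != ZEROS_64:
        return None
    return low


padded = "000000000000000053ce929d0e0e4736"  # 64bit id, left-padded with zeros
full = "4bf92f3577b34da6a3ce929d0e0e4736"    # id with non-zero upper 64 bits

print(truncate(padded), truncate_or_drop(padded))  # same result for both
print(truncate(full), truncate_or_drop(full))      # second one is None
```

With truncation, the 64bit system stays attached to the rightmost half of the incoming trace-id; with dropping, it would start a new trace whenever the upper half is non-zero.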

@SergeyKanzhelev
Member Author

@morrisonlevi this comment is important for the understanding and getting on the same page: #344 (comment)

Also I realized we may need names to differentiate systems:

  • 128bit system - system operating with a 128bit trace-id (fully compliant).
  • 64bit system - system operating with a 64bit trace-id. Not capable of propagating tracestate or the extra 64 bits of the trace-id.
  • 128on64bit system - system that records a 64bit trace-id but is capable of propagating the extra 64 bits of the trace-id from the incoming request to the outgoing request.

If you are using OpenTelemetry, propagation of the full trace-id will be implemented. However, you may be using only the rightmost part as an ID. Basically, you operate as a 128on64 system.

Truncation is for truly 64bit systems. The rest is a correct understanding.

Please check out the PR addressing these concerns: #356
