Automate population of peer.service via tracestate #439

Open
tylerbenson opened this issue Oct 23, 2023 · 8 comments

@tylerbenson
Member

tylerbenson commented Oct 23, 2023

Problem statement

Knowing the service on the other side of a remote call is valuable troubleshooting information. The semantic conventions represent this via peer.service. Unfortunately this currently requires some degree of manual effort to populate. For example, the Java Agent has a config to map hostname to service name. If a new service is deployed, this map must be updated in the config for all peer services.

We end up with a circular problem where users don't know of or don't populate this field because it's error prone, and vendors aren't able to rely on it being present to build more advanced features for the same reason.

This is generally not a big deal because the parent can be determined when the trace is reconstructed. However, this doesn't help with other signals like metrics or logs where it would be beneficial to know the peer service. Even with spans, it is often desirable to know the peer service, as it could be used for sampling or other processing before the trace is reconstructed. (For example, when generating a metric based on peer service for a span with a low occurrence frequency and a low sampling rate, where few traces are kept.)
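To make the metrics case concrete, here is a minimal sketch of counting calls per peer service from individual spans, with no trace reconstruction. It assumes spans are available as plain dicts with an `attributes` map, which is an illustrative shape and not any SDK's data model; `count_calls_by_peer` is a hypothetical helper.

```python
from collections import Counter

def count_calls_by_peer(spans):
    """Count spans per peer.service using only each span's own attributes."""
    counts = Counter()
    for span in spans:
        # Spans without the attribute fall into an "unknown" bucket.
        peer = span.get("attributes", {}).get("peer.service", "unknown")
        counts[peer] += 1
    return counts

spans = [
    {"name": "GET /cart", "attributes": {"peer.service": "frontend"}},
    {"name": "GET /cart", "attributes": {"peer.service": "frontend"}},
    {"name": "GET /cart", "attributes": {}},
]
calls_by_peer = count_calls_by_peer(spans)
```

Each span is processed on its own, so this works even when the parent span was sampled out or has not arrived yet.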

Proposed Solution

If we propagate the current service's name via tracestate, we can then read that on the other end and populate peer.service accordingly.

Based on OTel's tracestate format, we should use something like sn:<service name> as a sub-entry under the ot key.
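As a rough sketch of what this could look like on the wire, the following writes and reads an `sn` sub-entry inside the `ot` tracestate member. The `sn` sub-key is only the name suggested above and is not part of any spec; the helper names are hypothetical, and the parsing is deliberately simplified (no validation of allowed characters or length limits).

```python
from typing import Dict, List, Optional, Tuple

OT_KEY = "ot"      # tracestate key used by OpenTelemetry
SN_SUBKEY = "sn"   # proposed sub-key for the service name (assumption)

def parse_tracestate(header: str) -> List[Tuple[str, str]]:
    """Split a tracestate header into ordered (key, value) members."""
    members = []
    for item in header.split(","):
        item = item.strip()
        if "=" in item:
            key, value = item.split("=", 1)
            members.append((key, value))
    return members

def parse_ot_value(value: str) -> Dict[str, str]:
    """Split an ot value such as 'p:8;sn:checkout' into sub-entries."""
    fields = {}
    for part in value.split(";"):
        if ":" in part:
            sub_key, sub_value = part.split(":", 1)
            fields[sub_key] = sub_value
    return fields

def inject_service_name(header: str, service_name: str) -> str:
    """Client side: set ot=...;sn:<service name>, keeping other members."""
    ot_fields: Dict[str, str] = {}
    rest = []
    for key, value in parse_tracestate(header):
        if key == OT_KEY:
            ot_fields = parse_ot_value(value)
        else:
            rest.append((key, value))
    ot_fields[SN_SUBKEY] = service_name
    ot_value = ";".join(f"{k}:{v}" for k, v in ot_fields.items())
    # A modified member is moved to the front of the list.
    return ",".join(f"{k}={v}" for k, v in [(OT_KEY, ot_value)] + rest)

def extract_peer_service(header: str) -> Optional[str]:
    """Server side: read the caller's service name, if present."""
    for key, value in parse_tracestate(header):
        if key == OT_KEY:
            return parse_ot_value(value).get(SN_SUBKEY)
    return None

outgoing = inject_service_name("vendor=abc,ot=p:8", "checkout")
peer = extract_peer_service(outgoing)
```

A server-side instrumentation could then set `peer.service` on its span from the extracted value. Service names containing characters that are not valid in a tracestate value would need encoding or rejection, which this sketch leaves out.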

Note: This solution only works on the request (client -> server) side since tracestate is only defined for request headers. If we decide to populate in response headers, we can populate the other side.

Considerations

  • To avoid overloading the headers, we might consider truncating the service name to a fixed length before setting in tracestate.
  • In some environments, the service name might be considered sensitive. We might consider a setting to hash the service name (instead of truncating it) before sending, while still allowing a vendor with spans from both sides to associate them correctly.
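The two options above could be sketched as follows. The 32-character limit, the use of SHA-256, and the 16-character digest prefix are all assumptions chosen for illustration; nothing in this proposal fixes those values, and both function names are hypothetical.

```python
import hashlib

MAX_SERVICE_NAME_LEN = 32  # assumed limit, not defined anywhere

def truncate_service_name(name: str, max_len: int = MAX_SERVICE_NAME_LEN) -> str:
    """Keep only the first max_len characters of the service name."""
    return name[:max_len]

def hash_service_name(name: str, hex_chars: int = 16) -> str:
    """Return a fixed-length hex digest prefix in place of the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest[:hex_chars]

short_name = truncate_service_name("a" * 40)
hashed = hash_service_name("checkout")
```

A backend that receives spans from both sides could apply the same hash to the caller's `service.name` and compare it with the received value. Note that an unsalted hash of a guessable name can be recovered with a dictionary of candidate names, so hashing obscures the name more than it protects it.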
@AlexanderWert
Member

AlexanderWert commented Oct 24, 2023

@tylerbenson Thanks for the proposal! I have some questions and comments:

If we propagate the current service's name via tracestate, we can then read that on the other end and populate peer.service accordingly.

Why use tracestate for this instead of baggage?

This solution only works on the request (client -> server) side since tracestate is only defined for request headers.

That's right, this approach (and also with baggage) would only work with down-propagating the service name. So, only the downstream service will be able to set the peer.service to the service name of the upstream service, but not vice versa.

Unfortunately this currently requires some degree of manual effort to populate. For example, the Java Agent has a config to map hostname to service name. If a new service is deployed, this map must be updated in the config for all peer services.

The config in the Java Agent aims to accomplish the opposite (i.e. setting the service name of the called service as the peer.service attribute on a client span). So this proposal wouldn't replace that part, right?

If we decide to populate in response headers, we can populate the other side.

This would only work with synchronous requests. With async requests or communication through messaging, this is not possible, or at least challenging. Also, what about failed requests that do not even reach the server side (e.g. because of network issues)? In that case the attribute would be missing, right? That would result in fragmented data.

In general, I personally like the idea of defining a standard baggage key for this use case, that would be read by instrumentations automatically (e.g. by server-side instrumentations) to populate peer.service by default on the server side.
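For comparison, a minimal sketch of the baggage variant on the server side. The baggage key `upstream.service.name` is purely hypothetical (no standard key exists for this today), as are the function names, and the parsing is simplified relative to the W3C Baggage format.

```python
from typing import Dict, Optional
from urllib.parse import unquote

# Hypothetical key; a real one would have to be standardized.
PEER_SERVICE_BAGGAGE_KEY = "upstream.service.name"

def peer_service_from_baggage(header: str) -> Optional[str]:
    """Return the upstream service name from a baggage header, if present."""
    for member in header.split(","):
        # Drop optional properties that follow the first ';'.
        entry = member.split(";", 1)[0]
        if "=" not in entry:
            continue
        key, value = entry.split("=", 1)
        if key.strip() == PEER_SERVICE_BAGGAGE_KEY:
            # Baggage values are percent-encoded.
            return unquote(value.strip())
    return None

def server_span_attributes(headers: Dict[str, str]) -> Dict[str, str]:
    """Attributes a server-side instrumentation might add by default."""
    attributes = {}
    peer = peer_service_from_baggage(headers.get("baggage", ""))
    if peer:
        attributes["peer.service"] = peer
    return attributes

attrs = server_span_attributes(
    {"baggage": "userId=alice,upstream.service.name=checkout%20api;ttl=1"}
)
```

One difference from the tracestate variant: baggage is normally forwarded to every downstream hop, so each service would have to overwrite the key with its own name before making outgoing calls, otherwise a server two hops away would see a stale value.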

The other way round (i.e. setting peer.service on the client side) would be even more valuable, but because of the above I fear there's no simple, general solution to it.

@tylerbenson
Member Author

Why use tracestate for this instead of baggage?

My understanding is that tracestate is intended for telemetry-internal purposes, whereas baggage is more for user/application-driven usage. Any reason you think baggage should be preferred?

RE: java agent config...

I forgot that the config applies to just the client side, so until we get response propagation working, yes, this proposal would not replace it.

This would only work with synchronous requests.

Agreed... the response side will always be more difficult and can't be guaranteed. I think it's still worth doing for the request side though.

It sounds like you mostly agree with the idea. You just have a preference for using baggage instead of tracestate. Please elaborate on that topic.

@tedsuo
Contributor

tedsuo commented Oct 26, 2023

If it is tracer-specific information set by the tracer, use tracestate. If it is set, read, and managed outside of the tracer, use baggage.

@tigrannajaryan
Member

Peer service name is also accessible as parent Span's Resource's service.name attribute, right?

@tylerbenson
Member Author

@tigrannajaryan that is an implicit relationship that is only available when the trace is reconstructed. Adding a peer.service enables that explicit relationship to be determined with just the single span.

@tigrannajaryan
Member

@tigrannajaryan that is an implicit relationship that is only available when the trace is reconstructed. Adding a peer.service enables that explicit relationship to be determined with just the single span.

It would be useful to explain in the issue description why it is important to be able to determine this from a single span.

@tylerbenson
Member Author

tylerbenson commented Dec 4, 2023

@tigrannajaryan I updated the description with a better explanation. Specifically, I added the following:

This is generally not a big deal because the parent can be determined when the trace is reconstructed. However, this doesn't help with other signals like metrics or logs where it would be beneficial to know the peer service. Even with spans, it is often desirable to know the peer service, as it could be used for sampling or other processing before the trace is reconstructed. (For example, when generating a metric based on peer service for a span with a low occurrence frequency and a low sampling rate, where few traces are kept.)

Does this help give sufficient justification?

@tigrannajaryan
Member

logs where it would be beneficial to know the peer service.

If someone finds themselves in this situation, I think we should advise them to use tracing instead of logging.

I am not sure that adding information about peers that is already reconstructable from a trace is the right approach. It seems to me this is a slippery slope that can lead to gradually adding more and more attributes about the peer to the current span because they may be useful. This leads to data duplication, increased payload sizes, and more complicated instrumentation code.

I would want to see a strong user/community need for this capability before we add it.
