Automate population of `peer.service` via `tracestate` #439

tylerbenson · 2023-10-23T22:02:22Z

Problem statement

Knowing the service on the other side of a remote call is valuable troubleshooting information. The semantic conventions represent this via `peer.service`. Unfortunately this currently requires some degree of manual effort to populate. For example, the Java Agent has a config to map hostname to service name. If a new service is deployed, this map must be updated in the config for all peer services.

We end up with a circular problem: users don't know about this field or don't populate it because doing so is error-prone, and vendors can't rely on it being present to build more advanced features for the same reason.

This is generally not a big deal because the parent can be determined when the trace is reconstructed. However, this doesn't help with other signals like metrics or logs where it would be beneficial to know the peer service. Even with spans, it is often desirable to know the peer service, as it could be used for sampling or other processing before the trace is reconstructed. (For example, when generating a metric based on peer service for a span with a low occurrence frequency and a low sampling rate, where few traces are kept.)

Proposed Solution

If we propagate the current service's name via tracestate, we can then read that on the other end and populate peer.service accordingly.

Based on OTel's tracestate format, we should use something like `sn:<service name>` as our sub-map entry under the `ot` key.

Note: This solution only works on the request (client -> server) side since tracestate is only defined for request headers. If we decide to populate in response headers, we can populate the other side.
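
To make the idea concrete, here is a minimal sketch of the request-side round trip in plain Python, without any SDK. The `sn` sub-key is the one proposed above and is not part of any released spec; the function names are made up for illustration, and the sketch assumes the service name only contains characters that are valid inside an `ot` tracestate value (no `,`, `=`, `;`, or `:`).

```python
OT_KEY = "ot"     # OpenTelemetry's tracestate key
SN_SUBKEY = "sn"  # hypothetical sub-key proposed in this issue


def inject_service_name(tracestate: str, service_name: str) -> str:
    """Return a tracestate value whose `ot` entry carries sn:<service_name>."""
    members = [m.strip() for m in tracestate.split(",") if m.strip()]
    ot_fields = []
    others = []
    for member in members:
        key, _, value = member.partition("=")
        if key == OT_KEY:
            # Keep existing sub-fields, dropping any sn set by an upstream hop.
            ot_fields = [
                f for f in value.split(";")
                if f and not f.startswith(SN_SUBKEY + ":")
            ]
        else:
            others.append(member)
    ot_fields.append(f"{SN_SUBKEY}:{service_name}")
    # W3C Trace Context: a modified entry is moved to the front of the list.
    return ",".join([f"{OT_KEY}={';'.join(ot_fields)}"] + others)


def extract_peer_service(tracestate: str):
    """Return the caller's service name from tracestate, or None if absent."""
    for member in tracestate.split(","):
        key, _, value = member.strip().partition("=")
        if key != OT_KEY:
            continue
        for field in value.split(";"):
            subkey, _, subvalue = field.partition(":")
            if subkey == SN_SUBKEY and subvalue:
                return subvalue
    return None
```

The client would call `inject_service_name` with its own `service.name` before sending the request, and the server would call `extract_peer_service` on the incoming header and set the result as `peer.service` on its server span. Each hop overwrites `sn`, so the value always names the immediate caller.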

Considerations

To avoid overloading the headers, we might consider truncating the service name to a fixed length before setting in tracestate.
In some environments, the service name might be considered sensitive. We might consider a setting to hash the service name (instead of truncating it) before sending, while still allowing a vendor with spans from both sides to associate them correctly.
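
Both considerations could be handled by one small encoding step before the value is written to tracestate. The sketch below is only an illustration: the 32-character limit, the `hash_names` flag, and the choice of SHA-256 are assumptions, not part of this proposal.

```python
import hashlib

MAX_LEN = 32  # hypothetical limit; the proposal does not fix a length


def encode_service_name(service_name: str, hash_names: bool = False,
                        max_len: int = MAX_LEN) -> str:
    """Return the value to propagate: a truncated name or a truncated hash."""
    if hash_names:
        # Hex digest only uses [0-9a-f], so it is safe in a tracestate value.
        digest = hashlib.sha256(service_name.encode("utf-8")).hexdigest()
        return digest[:max_len]
    return service_name[:max_len]
```

A vendor that receives spans from both sides could apply the same function to each span's `service.name` resource attribute and match the result against the propagated value. Note that an unsalted hash of a guessable name can still be recovered with a dictionary of candidate names, so this hides the name from casual inspection only.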

AlexanderWert · 2023-10-24T06:20:03Z

@tylerbenson Thanks for the proposal! I have some questions and comments on that:

> If we propagate the current service's name via tracestate, we can then read that on the other end and populate peer.service accordingly.

Why use tracestate for this instead of baggage?

> This solution only works on the request (client -> server) side since tracestate is only defined for request headers.

That's right, this approach (with tracestate or with baggage) only works for propagating the service name downstream. So only the downstream service will be able to set peer.service to the service name of the upstream service, but not vice versa.

> Unfortunately this currently requires some degree of manual effort to populate. For example, the Java Agent has a config to map hostname to service name. If a new service is deployed, this map must be updated in the config for all peer services.

The config in the Java Agent aims to accomplish the opposite (i.e. setting the service name of the called service as the peer.service attribute on a client span). So this proposal wouldn't replace that part, right?

> If we decide to populate in response headers, we can populate the other side.

This would only work with synchronous requests. With async requests or communication through messaging this is not possible, or at least challenging. Also, what about failed requests that do not even reach the server side (e.g. network issues)? In that case the attribute would be missing, right? That would result in fragmented data.

In general, I personally like the idea of defining a standard baggage key for this use case, that would be read by instrumentations automatically (e.g. by server-side instrumentations) to populate peer.service by default on the server side.

The other way round (i.e. setting peer.service on the client side) would be even more valuable, but because of the above I fear there's no simple, general solution to it.
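
As a rough sketch of the baggage-based variant, server-side instrumentation could read the caller's name from the incoming `baggage` header like this. The key name `upstream.service.name` is purely hypothetical (no standard key exists yet), and the parser only covers the basic W3C Baggage shape: comma-separated `key=value` members, optional `;`-separated properties, percent-encoded values.

```python
from urllib.parse import unquote

BAGGAGE_KEY = "upstream.service.name"  # hypothetical key, not standardized


def peer_service_from_baggage(header: str):
    """Return the upstream service name from a baggage header, or None."""
    for member in header.split(","):
        entry = member.split(";", 1)[0]  # drop optional properties
        key, sep, value = entry.partition("=")
        if sep and key.strip() == BAGGAGE_KEY:
            # Baggage values are percent-encoded on the wire.
            return unquote(value.strip()) or None
    return None
```

One difference from the tracestate variant: baggage is forwarded unchanged across hops unless each service overwrites the entry, so every instrumented client would have to set the key to its own name before each outgoing call.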

tylerbenson · 2023-10-24T15:56:38Z

> Why use tracestate for this instead of baggage?

My understanding is that tracestate is intended for telemetry-internal purposes, whereas baggage is more for user/application-driven usage. Any reason you think baggage should be preferred?

Re: Java Agent config...

I forgot that config applies to just the client side, so until we get response propagation working, yes, this proposal would not replace that.

> This would only work with synchronous requests.

Agreed... the response side will always be more difficult and can't be guaranteed. I think it's still worth doing for the request side though.

It sounds like you mostly agree with the idea. You just have a preference for using baggage instead of tracestate. Please elaborate on that topic.

tedsuo · 2023-10-26T20:21:13Z

If it is tracer-specific information set by the tracer, use tracestate. If it is set, read, and managed outside of the tracer, use baggage.

tigrannajaryan · 2023-11-14T16:48:54Z

Peer service name is also accessible as parent Span's Resource's service.name attribute, right?

tylerbenson · 2023-11-14T17:18:31Z

@tigrannajaryan that is an implicit relationship that is only available when the trace is reconstructed. Adding a peer.service enables that explicit relationship to be determined with just the single span.

tigrannajaryan · 2023-11-14T20:01:32Z

> @tigrannajaryan that is an implicit relationship that is only available when the trace is reconstructed. Adding a peer.service enables that explicit relationship to be determined with just the single span.

It would be useful to explain in the issue description why it is important to be able to determine this from a single span.

tylerbenson · 2023-12-04T19:16:43Z

@tigrannajaryan I updated the description with a better explanation. Specifically, I added the following:

> This is generally not a big deal because the parent can be determined when the trace is reconstructed. However, this doesn't help with other signals like metrics or logs where it would be beneficial to know the peer service. Even with spans, it is often desirable to know the peer service, as it could be used for sampling or other processing before the trace is reconstructed. (For example, when generating a metric based on peer service for a span with a low occurrence frequency and a low sampling rate, where few traces are kept.)

Does this help give sufficient justification?

tigrannajaryan · 2023-12-05T16:33:12Z

> logs where it would be beneficial to know the peer service.

If someone finds themselves in this situation, I think we should advise them to use tracing instead of logging.

I am not sure adding information about peers that is already reconstructable in a trace is the right approach. It seems to me this is a slippery slope that can lead to gradually adding more and more attributes about the peer to the current span because they may be useful. This leads to data duplication, increased payload sizes, and more complicated instrumentation code.

I would want to see a strong user/community need for this capability before we add it.

github-actions bot assigned arminru Oct 23, 2023

carlosalberto mentioned this issue Jan 8, 2024: Automatic propagation of peer.service open-telemetry/oteps#247 (Closed)

github-actions bot added the Stale label Feb 11, 2024

joaopgrassi removed the Stale label Feb 14, 2024
