Introduce Mandatory Unique Identifier For Telemetry Sources #194
# Mandatory unique identifier for telemetry sources

Provide an explicit, mandatory unique identifier for telemetry sources.
## Motivation

Having a way to uniquely identify a telemetry source is helpful in many ways: processing and storing data from that source, visualizing it in a backend UI, or debugging issues with that source and its data.

As of now, `service.name` (together with the related attributes `service.namespace` and `service.instance.id`) is the implicit standard for this, because `service.name` is enforced as mandatory by the [Resource SDK specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#sdk-provided-resource-attributes) and the [Resource Semantic Conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#semantic-attributes-with-sdk-provided-default-value).

Because those attributes are not **explicitly** designated for uniquely identifying a telemetry source, multiple approaches have been suggested:
1. [open-telemetry/opentelemetry-specification#1034](https://github.com/open-telemetry/opentelemetry-specification/issues/1034) suggests that `service.instance.id` is poorly defined and should be removed and replaced by something different, like `telemetry.sdk.instance_id`. An attribute like `telemetry.sdk.instance_id` could serve as the sole unique identifier.

2. [open-telemetry/opentelemetry-specification#2111](https://github.com/open-telemetry/opentelemetry-specification/pull/2111) proposes a broad definition of the term _Service_, which would mean that (almost) every telemetry source is a service, and `service.name` (together with `service.namespace` and `service.instance.id`) could be used as the unique identifier.

3. [open-telemetry/opentelemetry-specification#2115](https://github.com/open-telemetry/opentelemetry-specification/pull/2115) proposes introducing `app.name` as a mandatory attribute for client-side telemetry sources such as browser or mobile apps, which would then not be treated as services (and with that would not have a `service.name`). `(app|service).name` (together with the corresponding namespace and instance ID attributes) could be used as the unique identifier.

4. [open-telemetry/opentelemetry-specification#2192](https://github.com/open-telemetry/opentelemetry-specification/pull/2192) proposes introducing `telemetry.source.*` attributes as a superset of `service.*` and `app.*`.

This OTEP proposes choosing one of those approaches for uniquely identifying a telemetry source, or finding a unifying approach, since not all proposals are mutually exclusive.
## Explanation

As stated in the Motivation, once that unique identifier is in place, it can be used in several places:

* Backend developers will know with certainty which attributes they can use as the unique identifier of the source when storing telemetry data.
* A UI can use it for visualization, especially as a fallback if no other attribute is provided for that purpose.
* The collector (and other processors) can use that identifier while processing traces, metrics, and logs.
* An end user can use that identifier for error handling and debugging; e.g., when a telemetry source is misconfigured, it is easier to identify it among others.
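To make the backend and UI use cases above concrete, here is a minimal sketch of how a backend or processor might group telemetry by source once the identifying attributes are fixed. It assumes, purely for illustration, that the identifier is the `service.namespace`/`service.name`/`service.instance.id` triple; which attributes actually form the identifier is the open question of this OTEP. The names `IDENTIFYING_KEYS`, `source_id`, `display_name`, and `group_by_source` are hypothetical, and telemetry is represented only by its resource attributes.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Optional, Sequence, Tuple

# Hypothetical choice of identifying attributes. Which keys belong here is
# exactly what this OTEP leaves open; the service.* triple is one candidate.
IDENTIFYING_KEYS: Tuple[str, ...] = (
    "service.namespace",
    "service.name",
    "service.instance.id",
)

# A source identifier is the tuple of identifying values (None if absent).
SourceId = Tuple[Optional[str], ...]


def source_id(resource: Mapping[str, str]) -> SourceId:
    """Build a hashable identifier from a resource's identifying attributes."""
    return tuple(resource.get(key) for key in IDENTIFYING_KEYS)


def display_name(sid: SourceId) -> str:
    """Fallback UI label: join the present parts, or flag an unidentified source."""
    return "/".join(part for part in sid if part) or "<unidentified source>"


def group_by_source(
    resources: Sequence[Mapping[str, str]],
) -> Dict[SourceId, List[Mapping[str, str]]]:
    """Group telemetry (represented here only by its resource attributes) per source."""
    groups: Dict[SourceId, List[Mapping[str, str]]] = defaultdict(list)
    for resource in resources:
        groups[source_id(resource)].append(resource)
    return dict(groups)


# Usage: two checkout instances, one of which reported twice.
resources = [
    {"service.namespace": "shop", "service.name": "checkout", "service.instance.id": "a1"},
    {"service.namespace": "shop", "service.name": "checkout", "service.instance.id": "b2"},
    {"service.namespace": "shop", "service.name": "checkout", "service.instance.id": "a1"},
]
for sid, items in group_by_source(resources).items():
    print(display_name(sid), len(items))
```

Using a tuple of values rather than a concatenated string keeps the identifier unambiguous even if an attribute value itself contains the separator; the joined form is only used as a display label.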
## Internal details

As stated above, there are multiple approaches to obtaining that common unique identifier. Depending on the approach, it can be accomplished in different ways:

1. Introduce `telemetry.sdk.instance_id` (or similar) and make it mandatory. Make `service.name` mandatory only for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`. Optionally, remove `service.instance.id` from `service.*`.
> One goal we should have here is that this is not some machine-generated ID, but a human-readable name that allows users to simply filter telemetry generated for their "idea" of an observable unit. E.g., if I'm running a checkout service, this name should be used across ALL instances of components I'm using related to that checkout service. Similarly, if I'm running a "Coffee Rewards Mobile Application", this ID should be the same across all rollouts and instances of that application I'm observing. I want to make sure we don't lose that, and having a name

> I am also not a big fan of only having the machine-generated ID; in the end you want a combination of both. E.g., if you have 10 instances of your "Checkout Service" and one of them is in an error state, you want to identify it uniquely.

> I think this deviates from Otel's current philosophy of identification, which is:
>
> For example:
>
> From what I see, this tries to introduce the concept of a universal, globally unique ID for all telemetry sources and mandates one ID per source. I fail to see how this is possible at all. A couple of problems I see:
>
> While I generally agree that it is a good goal to make telemetry sources identifiable, I fail to see how the premise of a single globally unique ID per telemetry source can work. I think the best we have been able to do so far was to allow individual source types to solve the identification problem within their scope of operation, and to decide which sets of attributes they want to define in the form of semantic conventions and designate as their identifiers. I would welcome a solution that is more uniform than the current approach, but I do not see it in any of the proposed variations in this OTEP.

> I might need to change my wording here: the main argument is around unique identification for SDK-based telemetry sources (backend services, frontend services, or anything in the future that uses an OTel SDK to emit telemetry).
>
> In open-telemetry/opentelemetry-specification#1034 @Oberon00 was arguing that
>
> Applying this would make `service.namespace`, `service.name`, `telemetry.sdk.instance_id` the unique identifier.
>
> From my point of view, the telemetry source is the emitter of the metric (e.g., the OTel Go SDK). This does not stop you from additionally associating the metric with the process, the container, or the pod. But if things go wrong and you get -5,000,000% CPU usage reported, you want to figure out who is emitting that metric and fix it.
>
> I think approach (1) here is not explained correctly (see above): this `telemetry.sdk.id` is for telemetry coming from an OTel SDK.
>
> You're right, this is not proposed in this OTEP. However, I am wondering if this is possible: you wrote that for every source the semantic conventions specifically define which attributes are used for identification purposes in a particular scope. If I understand this correctly, this would mean that there is always a group of attributes that could be merged (in the SDK, in the collector, in the backend) into a unique identifier (like ecommerce-checkout-)? (I am not suggesting that this should be mandated.)

> Thanks. Many of my objections come from the fact that the OTEP appears to be talking about all telemetry sources. If this is about Otel SDKs, then that's a different story. I think it is in fact very useful for each Otel SDK to have a unique runtime instance ID and emit it. We have

> Yes. We may not have been fully diligent in this, so some of the sources may lack this identification ingredient, but I believe this was the general sentiment for semantic conventions that describe telemetry sources.

2. Introduce a broad definition of the term _Service_ in the glossary. Unique identification could be achieved by (1) or by making `service.name`, `service.namespace`, and `service.instance.id` mandatory for all telemetry sources.

> I would like to better understand why this doesn't work. Is it merely a presentation issue in the backends/UIs? What prevents frontend or client-side applications from emitting an additional attribute, and backends from looking for this attribute and presenting that particular Service in a different way?

> The argument against this option is that frontend developers (and others) do not think of their applications as a "service", and so the Client Telemetry SIG was proposing

> @tigrannajaryan In a comment above, you mentioned that OTel's philosophy of identification is to specify a list of attributes that describe the source. One example of this is a Service, which is identified by the set of `service.*` attributes. We argue that client-side telemetry is different enough that it should be identified separately from backend services. Therefore, we proposed introducing a different set of attributes (`app.*`) to identify client-side telemetry. I think that aligns with that principle, while using
>
> The core issue is perhaps the definition of service. I would interpret it as a backend service, or a service within a private infrastructure, as opposed to something running on client devices. I think there is an argument that it could have a bigger scope and include client apps. However, I think that this would be counterintuitive and possibly confusing to client-side developers. Also, there will be additional attributes coming from client-side resources that would not make sense in the service namespace (e.g., `service.bundle`).

> This sounds reasonable to me. From what I understand, the problem that prevents this from happening is that we mandated that "service.name" always be present (I missed the moment when that change was made in the spec, and I think it was not the right decision). The rationale for this requirement appears to be that some backends require it. I think the solution shouldn't be that the SDKs also require "service.name". Perhaps instead the solution should be that backend-specific exporters set some default value for "service.name" if it is missing, purely as a means to satisfy the particular backends. Backend-specific exporters can also make more complicated decisions, like using either "service.name" or "app.name" depending on which one is set. This would make it possible again to put other sources, like client-side apps, on equal footing with the Service in the SDKs.
>
> I don't object to this, provided that we can clearly explain why client-side apps need to be specified differently from Services. I would prefer that we make a reasonable effort to fit client-side apps into the definition of the Service, but if we find that this creates too much semantic mismatch in the naming of attributes and in the definitions of the concepts, then I think client-side apps should be allowed to use their own set of attributes.

3. Narrow down the definition of the term _Service_ to backend services. Make `service.name` mandatory only for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`, and provide a glossary definition for their own term, like _App_. Unique identification could be achieved by (1) or by also making `(service|app).instance.id` and `(service|app).namespace` mandatory.

4. Introduce `telemetry.source.name`, `telemetry.source.namespace`, and `telemetry.source.instance_id`. Make some or all of them mandatory for all telemetry sources. Different telemetry sources can add further attributes in their own namespaces, like `service.*` and `app.*`.
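One way to compare the four approaches is to write down, for each, which attributes must be present before a source counts as identified. The sketch below is a hypothetical encoding, not part of any specification: `APPROACHES` and `identify` are illustrative names, the `app.*` keys follow the naming proposed in #2115, approach (3) is modeled with the "`namespace` and instance ID also mandatory" variant rather than the "(1)" variant, and approach (4) assumes all three `telemetry.source.*` attributes are mandatory.

```python
from typing import Dict, List, Mapping, Optional, Tuple

# Hypothetical encoding of the four approaches: each approach lists one or more
# alternative sets of attribute keys; a source is identified if every key of at
# least one set is present and non-empty.
APPROACHES: Dict[int, List[Tuple[str, ...]]] = {
    # (1) A per-SDK-instance ID as the sole identifier.
    1: [("telemetry.sdk.instance_id",)],
    # (2) Every telemetry source is a Service.
    2: [("service.namespace", "service.name", "service.instance.id")],
    # (3) Backend services use service.*, other sources such as apps use app.*.
    3: [
        ("service.namespace", "service.name", "service.instance.id"),
        ("app.namespace", "app.name", "app.instance.id"),
    ],
    # (4) A dedicated telemetry.source.* namespace (all three assumed mandatory).
    4: [("telemetry.source.namespace", "telemetry.source.name", "telemetry.source.instance_id")],
}


def identify(resource: Mapping[str, str], approach: int) -> Optional[Tuple[str, ...]]:
    """Return the identifying values under `approach`, or None if a mandatory key is missing."""
    for keys in APPROACHES[approach]:
        if all(resource.get(key) for key in keys):
            return tuple(resource[key] for key in keys)
    return None


# Usage: a mobile app that only sets app.* attributes is identified only under (3).
mobile = {"app.namespace": "coffee", "app.name": "rewards", "app.instance.id": "device-42"}
for approach in sorted(APPROACHES):
    print(approach, identify(mobile, approach))
```

Under this encoding, approaches (2) and (4) use a single attribute set for every source, while (3) needs one set per kind of source, which is the growth in attribute sets described under trade-off (3).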
## Trade-offs and mitigations

All potential approaches come with different trade-offs:

1. This will not introduce any breaking changes.

2. This will not introduce any breaking changes, but end users might be confused when their telemetry source is called a service while they think of it as an app or something else (see Future possibilities).

3. This may introduce a breaking change, since `service.name` would no longer be mandatory for all telemetry sources. This needs further investigation. Also, this approach might lead to additional sets of attributes that different telemetry sources use for unique identification (devices, cron jobs, bots, ...).

4. This will introduce a breaking change because `service.name` will be replaced by `telemetry.source.name`. This could be mitigated by a fallback mechanism, e.g., if `telemetry.source.name` is not provided, check `service.name`.

> Related proposal for solving this: #161

This list is not exhaustive; there are potentially more trade-offs per approach.
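The fallback mechanism suggested for approach (4) can be sketched as a lookup that prefers the new attribute and falls back to the legacy one. This is a minimal illustration under stated assumptions: `FALLBACKS` and `resolve` are hypothetical names, and only the `telemetry.source.name` to `service.name` pair comes from the text; the namespace and instance ID pairs are assumed by analogy.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical fallback table: for each proposed telemetry.source.* key, the
# legacy keys to check, in order, when it is absent. The text only suggests the
# telemetry.source.name -> service.name pair; the other two are assumptions.
FALLBACKS: Mapping[str, Tuple[str, ...]] = {
    "telemetry.source.name": ("service.name",),
    "telemetry.source.namespace": ("service.namespace",),
    "telemetry.source.instance_id": ("service.instance.id",),
}


def resolve(resource: Mapping[str, str], key: str) -> Optional[str]:
    """Return `key` if set; otherwise the first non-empty legacy fallback, else None."""
    if resource.get(key):
        return resource[key]
    for legacy_key in FALLBACKS.get(key, ()):
        if resource.get(legacy_key):
            return resource[legacy_key]
    return None


# Usage: an older SDK that only sets service.name still resolves a source name,
# while a newer one that sets telemetry.source.name takes precedence.
old_sdk = {"service.name": "checkout"}
new_sdk = {"telemetry.source.name": "coffee-rewards", "service.name": "checkout"}
print(resolve(old_sdk, "telemetry.source.name"))  # checkout
print(resolve(new_sdk, "telemetry.source.name"))  # coffee-rewards
```

Such a fallback could live in a backend, the collector, or an exporter, so that telemetry from SDKs emitting only `service.*` remains identifiable during a migration.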
## Open questions

* Which approach provides the most benefit with the fewest breaking changes to the current specification?

> Can you propose one approach within this OTEP and list the other approaches as "Alternatives considered"? It'll be hard for folks to "approve" this without an approach chosen. This is a great rundown of options, trade-offs, and issues. I think if you pick the option you find best, you'll see people comment on pros/cons and find consensus in comments anyway. If you don't take a position, you're unlikely to see that feedback.

> Good point, let me take the one that most people hate, so they will bring their arguments. Seriously: I'll find some time to rewrite the proposal in such a way! Thanks!

* Are there further approaches the author has missed?

## Future possibilities

While the discussion right now is between backend and frontend services, in the future additional telemetry sources, like different kinds of devices, could be introduced and run into a similar situation in which `service` is not the appropriate term.

> I think this fails to take into account already existing types of sources that are neither backend services nor frontend services, for example K8s nodes, K8s pods, OS processes, and FaaS (Lambda). These do not necessarily fall clearly into the frontend or backend bucket (e.g., I can have an OS process both on the frontend and on the backend).

> Generally speaking, this is not true if we consider OpenTelemetry as a whole. It is only true for telemetry emitted by Otel SDKs. There are other sources of telemetry that are not Otel SDKs. A good example is the Otel Collector. It emits telemetry on behalf of many interesting sources that are not services, for example processes or K8s pods.

> You're right. I have to update this.