Introduce Mandatory Unique Identifier For Telemetry Sources #194

Closed
66 changes: 66 additions & 0 deletions text/0000-mandatory-unique-identifier-for-telemetry-sources.md
@@ -0,0 +1,66 @@
# Mandatory unique identifier for telemetry sources

Provide an explicit mandatory unique identifier for telemetry sources.

## Motivation

Having a way to uniquely identify a telemetry source is helpful in many ways: processing and storing data from that source, visualizing it in a backend UI, or debugging issues with that source and its data.

As of now, `service.name` (and the related attributes `service.namespace` and `service.instance.id`) is the implicit standard for that, because `service.name` is enforced as mandatory by the [Resource SDK specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#sdk-provided-resource-attributes) and the [Resource Semantic Conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#semantic-attributes-with-sdk-provided-default-value).
Member

Generally speaking, this is not true for OpenTelemetry as a whole. It is only true for telemetry emitted by Otel SDKs. There are other sources of telemetry that are not Otel SDKs. A good example is the Otel Collector. It emits telemetry on behalf of many interesting sources that are not services, for example Processes or K8s pods.

Member Author

You're right. I have to update this.


Because those attributes are not **explicitly** designated to uniquely identify a telemetry source, multiple approaches have been suggested:

1. [opentelemetry-specification/issues#1034](https://github.com/open-telemetry/opentelemetry-specification/issues/1034) suggests that `service.instance.id` is poorly defined and should be removed and replaced by something different, like a `telemetry.sdk.instance_id`. An attribute like `telemetry.sdk.instance_id` could serve as the sole unique identifier.

2. [open-telemetry/opentelemetry-specification#2111](https://github.com/open-telemetry/opentelemetry-specification/pull/2111) proposes a broad definition for the term _Service_, which would mean that (almost) every telemetry source is a service and `service.name` (and `namespace` and `instance_id`) could be used as the unique identifier.

3. [open-telemetry/opentelemetry-specification#2115](https://github.com/open-telemetry/opentelemetry-specification/pull/2115) proposes introducing `app.name` as a mandatory attribute for client-side telemetry sources like browser apps or mobile apps, which would then not be treated as services (and so would not have a `service.name`). `(app|service).name` (and `namespace` and `instance_id`) could be used as the unique identifier.

4. [open-telemetry/opentelemetry-specification#2192](https://github.com/open-telemetry/opentelemetry-specification/pull/2192) proposes introducing `telemetry.source.*` attributes as a superset of `service.*` and `app.*`.

This OTEP proposes choosing one of those approaches to uniquely identify a telemetry source, or finding a unifying approach, since not all proposals are mutually exclusive.

## Explanation

As stated in the Motivation, once that unique identifier is in place, it can be used in different places:

* Backend developers will know for certain which attributes they can use as the unique identifier for the source when storing telemetry data.
* A UI can use it for visualization, especially as a fallback if no other attribute is provided for that.
* The collector (and other processors) can use that identifier while processing traces, metrics, and logs.
* An end-user could use that identifier for error handling and debugging, e.g. when a telemetry source is misconfigured, it is easier to identify it among others.

## Internal details

As stated above, there are multiple approaches to obtain that common unique identifier. Depending on the approach, there are different ways to accomplish it:

1. Introduce `telemetry.sdk.instance_id` (or similar) and make it mandatory. Make `service.name` mandatory only for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`. Optionally, remove `service.instance.id` from `service.*`.
Contributor

One goal we should have here is that this is not some machine-generated ID, but a human-readable name that allows users to simply filter telemetry generated for their "idea" of an observable unit. E.g. if I'm running a checkout service, this name should be used across ALL instances of components I'm using related to that checkout service. Similarly, if I'm running a "Coffee Rewards Mobile Application", this ID should be the same across all rollouts and instances of that application I'm observing.

I want to make sure we don't lose that, and I think a name like instance_id doesn't convey what we are really asking users to provide.

Member Author

I am also not a big fan of having only the machine-generated ID. In the end you want a combination of both: e.g. if you have 10 instances of your "Checkout Service" and one of them is in an error state, you want to identify it uniquely.

Member

@tigrannajaryan, Dec 9, 2021

I think this tries to deviate from Otel's current philosophy of identification, which is:

  • Sources of telemetry are defined in semantic conventions by specifying a list of attributes that describe them.
  • For every source the semantic conventions are specifically defined to say which attributes are used for identification purposes in a particular scope.

For example:

  • we have Service which is identified by (service.namespace,service.name,service.instance.id) tuple globally.
  • we have Kubernetes Node which is identified by (k8s.node.uid) within its cluster.
  • we have Kubernetes Namespace which is identified by (k8s.namespace.name) within its cluster.
  • we have OS Process which is identified by its (process.pid) within its host.

From what I see this tries to introduce the concept of universal and globally unique ID for all telemetry sources and mandates one ID per source. I fail to see how this is possible at all. A couple problems I see:

  • How do you guarantee global uniqueness? Are IDs randomly generated? Do we rely on lack of collisions of IDs because the generators are good and ID is wide enough to make collision probability negligible? If not randomly generated how do you ensure global uniqueness?
  • Which of the associated entities is the source of a particular piece of telemetry when you have a stack of technologies? For example, if I emit CPU usage of an application using Otel GO SDK, running as an OS Process inside a Container on a Kubernetes Pod, what is my source? Is it the Application? Go SDK? OS Process? Container? Pod? I can attribute CPU usage to any of these equally well, and even if I choose one I still likely want to record the fact that these 5 different kinds of sources are associated with that metric. Do we allow telemetry.sdk.instance_id to be an array of values?

--

While I generally agree that it is a good goal to make telemetry sources identifiable, I fail to see how the premise of a single globally unique ID per telemetry source can work.

I think the best we were able to do so far was to allow individual source types to solve the identification problem within their scope of operation and decide what sets of attributes they want to define in the form of semantic conventions and designate as their identifiers.

I would welcome a solution that is more uniform than the current approach but I do not see it in any of the proposed variations in this OTEP.

Member Author

> From what I see this tries to introduce the concept of universal and globally unique ID for all telemetry sources and mandates one ID per source. I fail to see how this is possible at all. A couple problems I see:

I might need to change my wording here: the main argument is around unique identification for SDK-based telemetry sources (backend services, frontend services or anything coming in the future using an OTel SDK to emit telemetry).

> we have Service which is identified by (service.namespace,service.name,service.instance.id) tuple globally.

In open-telemetry/opentelemetry-specification#1034, @Oberon00 argued that service.instance.id is not well-defined and should be replaced with telemetry.sdk.instance_id:

> IMHO this attribute is poorly defined right now as it may or may not be the same across service restarts, which IMHO can make quite a difference. It would be easiest if it MUST be the different for each restart, that way it could be used as primary key for all resources (not only service.*) sent by the same telemetry instance. On the other hand, maybe such an attribute would better be named telemetry.sdk.intance.id.

Applying this would make (service.namespace, service.name, telemetry.sdk.instance_id) the unique identifier.

> Which of the associated entities is the source of a particular telemetry when you have a stack of technologies? For example if I emit CPU usage of an application using Otel GO SDK, running as an OS Process inside a Container on a Kubernetes Pod, what is my source? Is it the Application? Go SDK? OS Process? Container? Pod?

From my point of view, the telemetry source is the emitter of the metric (e.g. the OTel GO SDK). This does not stop you from additionally associating the metric with the process, the container, or the pod.

But if things go wrong and you get -5,000,000% CPU usage reported, you want to figure out who is emitting that metric and fix it.

> While I generally agree that it is a good goal to make telemetry sources identifiable I fail to see how the premise of a single globally unique id per telemetry source can work.

I think approach (1) is not explained correctly (see above): this telemetry.sdk.id is for telemetry coming from an OTel SDK.

> I would welcome a solution that is more uniform than the current approach but I do not see it in any of the proposed variations in this OTEP.

You're right, this is not proposed in this OTEP. However, I am wondering if this is possible: you wrote that for every source the semantic conventions are specifically defined to say which attributes are used for identification purposes in a particular scope. If I understand this correctly, this would mean that there is always a group of attributes that could be merged (in the SDK, in the collector, in the backend) into a unique identifier (like ecommerce-checkout-)? (I am not suggesting that this should be mandated.)

Member

Thanks. Many of my objections come from the fact that the OTEP appears to be talking about all telemetry sources. If this is about Otel SDKs then that's a different story. I think in fact it is very useful for each Otel SDK to have a unique runtime instance ID and emit it. We have telemetry.sdk.name, telemetry.sdk.version, etc. I think in addition to that we also need a globally unique telemetry.sdk.instance.id which can be autogenerated or supplied via an env var to the SDK. This will be a necessity if we want to add a remote configuration capability to the Otel SDKs, such that it uses the same management protocol as the Otel Collector.
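
A minimal sketch of the "autogenerated or supplied via an env var" idea, in Python. The environment variable name `OTEL_SDK_INSTANCE_ID` and both function names are hypothetical; nothing here is defined by the OpenTelemetry specification, and `telemetry.sdk.instance.id` is only the attribute name suggested in this thread.

```python
import os
import uuid

# Hypothetical environment variable name, assumed for illustration only.
INSTANCE_ID_ENV_VAR = "OTEL_SDK_INSTANCE_ID"


def resolve_sdk_instance_id(environ=None):
    """Return the operator-supplied instance ID, or a freshly generated one."""
    environ = os.environ if environ is None else environ
    supplied = environ.get(INSTANCE_ID_ENV_VAR, "").strip()
    if supplied:
        return supplied
    # A random (version 4) UUID differs on every SDK start and makes
    # collisions between instances negligible in practice.
    return str(uuid.uuid4())


def sdk_resource_attributes(environ=None):
    """Build resource attributes that carry the suggested instance ID."""
    return {
        "telemetry.sdk.name": "opentelemetry",
        # Attribute name as suggested in this thread, not an existing convention.
        "telemetry.sdk.instance.id": resolve_sdk_instance_id(environ),
    }
```

Generating a new value on every start matches the "different for each restart" reading quoted earlier in the thread; a supplied value lets an operator keep the ID stable where that is preferred.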

Member

@tigrannajaryan, Dec 9, 2021

> If I understand this correctly, this would mean that there is always a group of attributes that could be merged (in the SDK, in the collector, in the backend) into a unique identifier (like ecommerce-checkout-)

Yes.
Important: it is not always global uniqueness; for some sources it is merely uniqueness within a particular scope. I would love to have stronger global uniqueness, but that is more difficult to achieve and so far we have refrained from making it a requirement. For example, for Kubernetes Pods we describe which attributes uniquely identify a Pod within a Kubernetes cluster.

We may not have been fully diligent in this, so some of the sources may lack this identification ingredient, but I believe this was the general sentiment for semantic conventions that describe telemetry sources.
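
The per-source, per-scope identification described here could be sketched roughly as follows. The attribute tuples are the examples given earlier in this thread; the `IDENTIFYING_ATTRIBUTES` table, the informal scope labels, and the `identity_key` function are assumptions made for illustration, not part of any semantic convention.

```python
# Identifying attributes per source type, taken from the examples in this thread.
# The scope strings ("global", "cluster", "host") are informal labels.
IDENTIFYING_ATTRIBUTES = {
    "service": (("service.namespace", "service.name", "service.instance.id"), "global"),
    "k8s.node": (("k8s.node.uid",), "cluster"),
    "k8s.namespace": (("k8s.namespace.name",), "cluster"),
    "process": (("process.pid",), "host"),
}


def identity_key(source_type, attributes):
    """Return (scope, values) identifying one source within that scope."""
    keys, scope = IDENTIFYING_ATTRIBUTES[source_type]
    missing = [key for key in keys if key not in attributes]
    if missing:
        raise KeyError(f"missing identifying attributes: {missing}")
    # The values are only unique inside the returned scope, so a consumer
    # has to combine them with an identifier for the cluster or host.
    return scope, tuple(str(attributes[key]) for key in keys)
```

Returning the scope next to the values keeps the caveat visible: a `process.pid` only identifies a process on its host, so it cannot be used as a global key on its own.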


2. Introduce a broad definition of the term _Service_ in the glossary. Unique identification could be achieved by (1) or by making `service.name`, `service.namespace`, and `service.instance.id` mandatory for all telemetry sources.
Member

I would like to better understand why this doesn't work. Is it merely a presentation issue in the backends/UIs? What prevents the frontend or client-side applications from emitting an additional attribute, and backends from looking for this attribute and presenting that particular Service in a different way?

Member Author

The argument against this option is that frontend developers (and others) do not think of their applications as a "service", so the Client Telemetry SIG proposed app.name as an alternative to avoid confusing the end-user of the SDK. This means an SDK for frontend applications (in Java, WebJS, Swift) would send app.name instead of service.name, as in option (3).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan In a comment above, you mentioned that OTel's philosophy of identification is to specify a list of attributes that describe the source. One example of this is a Service that is identified by the set of service.* attributes.

We argue that client-side telemetry is different enough that it should be identified as separate from backend services. Therefore, we proposed introducing a different set of attributes (app.*) to identify client-side telemetry. I think that aligns with that principle, while using service attributes for both backend and client telemetry would not allow identifying one from the other.

The core issue perhaps is the definition of service. I would interpret it as a backend service, or a service within a private infrastructure, as opposed to running on client devices. I think there is an argument that it could have a bigger scope and include client apps. I think that this would be counterintuitive and possibly confusing to client-side developers. Also, there will be additional attributes coming from client-side resources that would not make sense in the service namespace (e.g. service.bundle).

Member

> We argue that client-side telemetry is different enough that it should be identified as separate from backend services. Therefore, we proposed introducing a different set of attributes (app.*) to identify client-side telemetry. I think that aligns with that principle, while using service attributes for both backend and client telemetry would not allow identifying one from the other.

This sounds reasonable to me. From what I understand, the problem that prevents this from happening is that we mandated "service.name" to be always present (I missed the moment when that change was made to the spec, and I think it was not the right decision). The rationale for this requirement appears to be that some backends require it. I think the solution shouldn't be that the SDKs also require "service.name". Perhaps instead the solution should be that backend-specific exporters set some default value for "service.name" if it is missing, purely as a means to satisfy the particular backends. Backend-specific exporters can also make more complicated decisions, like using one of "service.name" or "app.name" depending on which one is set. This would make it possible again to put other sources, like client-side apps, on an equal footing with the Service in the SDKs.

> The core issue perhaps is the definition of service. I would interpret it as a backend service, or a service within a private infrastructure, as opposed to running on client devices. I think there is an argument that it could have a bigger scope and include client apps. I think that this would be counterintuitive and possibly confusing to client-side developers. Also, there will be additional attributes coming from client-side resources that would not make sense in the service namespace (e.g. service.bundle).

I don't mind this, provided that we can clearly explain why the client-side apps need to be specified differently from Services. I would prefer that we make a reasonable effort to fit client-side apps into the definition of the Service, but if we find that it creates too much semantic mismatch in the naming of attributes and in the definitions of the concepts, then I think client-side apps should be allowed to use their own set of attributes.
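
A minimal sketch of the exporter-side defaulting suggested in this comment, assuming a backend that insists on `service.name`. The function name `with_backend_service_name` is hypothetical, and the `"unknown_service"` default is chosen here for illustration.

```python
def with_backend_service_name(resource_attributes, default="unknown_service"):
    """Return a copy of the attributes that always carries service.name.

    Intended for a backend-specific exporter only; the SDK resource itself
    is left unchanged, so app.name-only sources remain valid upstream.
    """
    attributes = dict(resource_attributes)
    if not attributes.get("service.name"):
        # Prefer the client-side name if present, else a placeholder value.
        attributes["service.name"] = attributes.get("app.name") or default
    return attributes
```

Because the defaulting lives in the exporter, a client-side app that only sets `app.name` can still be sent to a backend that keys on `service.name`, without the SDK having to mandate that attribute.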


3. Narrow down the definition of the term _Service_ to backend services. Make `service.name` mandatory only for backend services. Other telemetry sources can make different attributes mandatory, like `app.name`, and provide a definition for their term, like `App`, in the glossary. Unique identification could be achieved by (1) or by making `(service|app).instance_id` and `(service|app).namespace` mandatory as well.

4. Introduce `telemetry.source.name`, `telemetry.source.namespace` and `telemetry.source.instance_id`. Make some or all of them mandatory for all telemetry sources. Different telemetry sources can add additional attributes in namespaces like `service.*` and `app.*`.

## Trade-offs and mitigations

All potential approaches provide different trade-offs:

1. This will not introduce any breaking changes.

2. This will not introduce any breaking changes, but end-users might be confused by having to call their telemetry source a service while they think of it as an app or something else (see Future possibilities).

3. This may introduce a breaking change, because `service.name` would no longer be mandatory in that broad sense. This would need further investigation. Also, this approach might lead to additional sets of attributes that different telemetry sources use for unique identification (devices, cronjobs, bots, ...).

4. This will introduce a breaking change because `service.name` will be replaced with `telemetry.source.name`. This could be mitigated by a fallback mechanism, e.g. if `telemetry.source.name` is not provided, check `service.name`.
Contributor

Related proposal for solving this: #161


This list is not exhaustive; there are potentially more trade-offs per approach.
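
The fallback mechanism mentioned for approach 4 could look roughly like the following sketch. The `FALLBACKS` mapping and the `read_source_attribute` function are assumptions made for illustration; in particular, pairing `telemetry.source.instance_id` with `service.instance.id` is one plausible choice, not something this OTEP specifies.

```python
# Proposed telemetry.source.* keys paired with the existing service.* keys.
# This pairing is an assumption, not defined by the OTEP.
FALLBACKS = {
    "telemetry.source.name": "service.name",
    "telemetry.source.namespace": "service.namespace",
    "telemetry.source.instance_id": "service.instance.id",
}


def read_source_attribute(attributes, key):
    """Read a telemetry.source.* attribute, falling back to service.*."""
    value = attributes.get(key)
    if value:
        return value
    legacy_key = FALLBACKS.get(key)
    if legacy_key is None:
        return None
    # Sources that have not migrated yet still only set the service.* key.
    return attributes.get(legacy_key) or None
```

A consumer would call `read_source_attribute(attrs, "telemetry.source.name")` wherever it currently reads `service.name`, so existing sources keep working during a migration.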

## Open questions

* Which approach provides the most benefit with the fewest breaking changes to the current specification?
Contributor

Can you propose one approach within this OTEP and list the other approaches as Alternatives considered?

It'll be hard for folks to "approve" this without an approach chosen.

This is a great rundown of options, trade-offs, and issues. I think if you pick the option you find best, you'll see people comment on pros and cons and find consensus in the comments anyway. If you don't take a position, you're unlikely to see that feedback.

Member Author

Good point, let me take the one that most people hate, so they will bring their arguments.

Seriously: I'll find some time to rewrite the proposal that way. Thanks!

* Are there further approaches missed by the author?

## Future possibilities

While the discussion right now is about backend and frontend services, in the future additional telemetry sources, like different kinds of devices, could be introduced and run into a similar situation where `service` is not the appropriate term.
Member

I think this fails to take into account other types of sources that already exist and are neither backend services nor frontend services. For example: K8s nodes, K8s pods, OS processes, FaaS (Lambda). These do not necessarily fall clearly into the frontend or backend bucket (e.g. I can have an OS process both on the frontend and on the backend).