Ephemeral Resource Attributes #208

Closed
wants to merge 5 commits into from

tedsuo
Contributor

@tedsuo tedsuo commented Jun 22, 2022

This OTEP is part of the RUM/Client initiative.

Currently, we are missing a place to put important client information which applies to all telemetry emitted by an SDK. This information includes attributes such as session ID, language preference, locality/timezone, and other types of user data.

Normally, these attributes would be recorded as resources. However, on client processes, there are times when this information changes without the SDK re-initializing. For example:

  • The browser is idle for 15 minutes, ending a session.
  • The process is put to sleep, and awakened at a later date.
  • The user logs in or out, or other state changes which affect both the behavior and the reporting needs for the application.

In all of these cases, the application/SDK is not restarted. Currently, the resource associated with the SDK cannot be changed after it is started. This makes it very difficult to record these needed attributes.

This OTEP proposes a mechanism for updating the SDK with a new resource, which will be applied to all future telemetry created by the SDK. The proposal attempts to do this while preserving important characteristics already defined for resources:

  • The resource object itself remains immutable, and accessing the resource object when creating telemetry does not introduce a lock. The proposed ResourceProvider concept preserves these characteristics.
  • Existing resource attributes have a requirement for being present at SDK start time. They are not allowed to be updated or added to the resource once the SDK has started. This OTEP proposes that resource attributes be labeled as either "permanent" or "ephemeral" in the semantic conventions. Permanent attributes may not be updated after the SDK freezes the ResourceProvider.

If there are other backwards compatibility requirements for resources that I have missed, please let me know.

Cheers,
-Ted

@tedsuo tedsuo requested a review from a team June 22, 2022 15:02

There are two types of resource attributes, **permanent** and **ephemeral**. Attributes which are labeled as permanent in the semantic conventions must be present when the SDK is initialized. They cannot be added or updated at a later date.

Resources are managed via a ResourceProvider. Setting an attribute on a ResourceProvider will cause that attribute value to be included in the resource attached to any signal generated in the future. Spans which have already been started, along with any telemetry which has already been passed to the export pipeline, will not have the new attribute value. Optionally, a check can be added to ensure that permanent resources are not modified after the SDK has started.

Contributor

Optionally, a check can be added to ensure that permanent resources are not modified after the SDK has started

If nested attributes proposal is accepted, then one way to simplify ephemeral resources validation is to have just one attribute called ephemeral - the ResourceProvider then allows any modification to the value of this attribute and does not need to look up for which attributes are permanent. This also avoids the need to mark the resource attributes permanent in the semantic conventions yaml files.

Member

I don't see how this would simplify things? You then still have an attribute that needs special handling. Whether it is by name or with an explicit label would not make things more/less simple, would it?

The nested attributes proposal also does not require SDKs to implement them. If we want ephemeral attributes to depend on that, it would mean that SDKs could also not implement ephemeral attributes.

Contributor

If I understand the proposal correctly, it requires that the permanent attributes be marked so in the semantic conventions. This is the part that will not be required if we limit the special handling to only one attribute with a known name.

Consider the following resource. The ResourceProvider can allow anytime modifications to the key-value pairs within the ephemeral attribute.

{
  "service.name": "foo",
  "service.instance.id": 123,
  "browser.user_agent": "bar",
  "ephemeral": {
    "session.id": 456
  }
}

Anyway, this is an optimization step. Let's ignore this initially until the larger proposal gets acceptance.
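
A minimal sketch of the single-key idea, assuming nested attribute values are available. The names `NestedResourceProvider`, `setEphemeral`, and `getResource` are hypothetical and not part of any OpenTelemetry API; the point is only that no lookup of permanent keys is needed when updates can reach nothing but the `ephemeral` value.

```javascript
// Hypothetical sketch: only the value under the "ephemeral" key may change
// after start; every other attribute is fixed at construction time.
class NestedResourceProvider {
  constructor(permanentAttributes, ephemeralAttributes = {}) {
    this.permanent = Object.freeze({ ...permanentAttributes });
    this.current = this.build(ephemeralAttributes);
  }

  // Build a new frozen resource; the previous one is never mutated.
  build(ephemeralAttributes) {
    return Object.freeze({
      ...this.permanent,
      ephemeral: Object.freeze({ ...ephemeralAttributes }),
    });
  }

  // Callers can only reach the "ephemeral" value, so no permanent-key lookup is needed.
  setEphemeral(key, value) {
    this.current = this.build({ ...this.current.ephemeral, [key]: value });
  }

  getResource() {
    return this.current;
  }
}

const nestedProvider = new NestedResourceProvider(
  { "service.name": "foo", "service.instance.id": "123", "browser.user_agent": "bar" },
  { "session.id": "456" }
);
const beforeUpdate = nestedProvider.getResource();
nestedProvider.setEphemeral("session.id", "789");
const afterUpdate = nestedProvider.getResource();
```

Telemetry that captured `beforeUpdate` keeps the old session, while anything created afterwards sees `afterUpdate`.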

Member

Marking something in the semantic conventions is just that: A convention. If we want something to be conventionally ephemeral, we still need to have a note about that in the semantic conventions one way or another.

Contributor Author

I agree that it would simplify things if ephemeral resources were kept separate from other resources.

Validator is also something which can be run in development, but disabled in production, which would work as an optimization.

Contributor Author

Also, one aside on nested attributes: my assumption is that attribute values wouldn't be merged, they would be replaced.

In other words, there is still only a single string key per attribute, but with the option of storing an object, map, or array as the value for that attribute. If you set a new value for the key, it would throw the old value away.
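
A small sketch of the replace-not-merge assumption, using plain objects. The helper `setAttribute` and the `user.locale` key are hypothetical and only serve the illustration.

```javascript
// Hypothetical sketch of "replace, don't merge" for a nested attribute value.
function setAttribute(attributes, key, value) {
  // Returns a new attribute map; the old value stored under `key` is discarded whole.
  return { ...attributes, [key]: value };
}

const original = { ephemeral: { "session.id": "456", "user.locale": "en-US" } };
const replaced = setAttribute(original, "ephemeral", { "session.id": "789" });
// replaced.ephemeral has no "user.locale": the previous object was thrown away.
```

Under merge semantics `user.locale` would have survived; under the replace semantics assumed here, the caller must pass the complete new value.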


An alternative to ephemeral resources would be to create span, metrics, and log processors which attach these ephemeral attributes to every instance of every signal. This would not require a modification to the specification.

There are two problems with this approach. One is that the duplication of attributes is very inefficient. This is a problem on clients, which have limited network bandwidth. This problem is compounded by a lack of support for gzip and other compression algorithms on the browser.

Member

It would be great to quantify this. How inefficient is it? A benchmark demonstrating this would be a strong argument in favour of the proposed approach.

Member

If processors can change scope attributes, they might be a good candidate to solve this as well.

lack of support for gzip and other compression algorithms on the browser

I'm not an expert on browser stuff but can you expand on this? On its surface it seems wrong since gzipped static resources show up everywhere on the internet and there are js implementations of gzip (like this). This stackoverflow post suggests that a part of it is because a browser client can't know if the server can accept gzipped data, but OTLP requires gzip support.

Yes, it's uncommon because clients may not know if the server can accept compressed data. It is not clear to me if the gzip support in the OTLP spec refers to responses only (common for web services to provide) or requests as well (uncommon).

I think there is also a danger of an attack on the server - compressed data could be expanded to a very large content. And lastly gzip compression is not native to browsers, so there is CPU overhead, which is important to consider for impact on user experience, especially when sending data while the page is unloading.

Aside from that, I think that session ID specifically does not belong on individual signals. The session is a context for many signals in a given time period; it does not vary from signal to signal.

Never mind on the OTLP gzip support, I see it says that clients MAY gzip the content.

@tigrannajaryan Regarding the limited network bandwidth, the sendBeacon() API has a payload limit of 64KB. Assuming session.id attribute that looks like this when sent over the wire

{"key":"session.id","value":{"stringValue":"8fded6726f630a327ee3be41174a8a91"}}

It adds 79 bytes to each signal. The number of spans/events per export will depend on the type of application and which instrumentations are present. But assuming that 100 is plausible, this adds almost 8 kB to the payload.

This will further increase if we add additional context attributes (user attributes, URL etc.).
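
The arithmetic above can be checked with a few lines. The 100 signals per export and the 64 KB `sendBeacon()` limit are the assumptions stated in the comment; the variable names are illustrative.

```javascript
// Size of one session.id attribute as serialized in the JSON shown above.
const sessionAttribute = JSON.stringify({
  key: "session.id",
  value: { stringValue: "8fded6726f630a327ee3be41174a8a91" },
});
const bytesPerSignal = new TextEncoder().encode(sessionAttribute).length;

// Assumptions taken from the comment: 100 signals per export and a
// 64 KB sendBeacon() payload limit.
const signalsPerExport = 100;
const beaconLimitBytes = 64 * 1024;

const addedBytes = bytesPerSignal * signalsPerExport;
const shareOfLimit = addedBytes / beaconLimitBytes;

console.log(bytesPerSignal); // 79
console.log(addedBytes); // 7900
console.log((shareOfLimit * 100).toFixed(1) + "%"); // 12.1%
```

So under these assumptions a single duplicated attribute uses roughly an eighth of the beacon budget before any other context attributes are added.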

Member

While decompressing is common in browsers (be it gzip or brotli), none of the current request APIs expose a way to have browser compress the request (MDN: XHR, fetch)

This is incorrect. The CompressionStreams API provides a native solution for this and is supported in Chromium-based browsers already.

This does mean that indeed you have to bring your own compression methods. More code = larger bundle that the browser needs to download, parse and execute. First phase is network bound (but does benefit from compression itself), while second and third are CPU bound.

Also benefits from caching. In a network-constrained situation the cost of retrieving the additional code is paid once and the result cached. Conditional requests and etags are your friend.

Also in most cases instrumentation is required to be loaded ASAP (sometimes even before rest of the content on the page), causing site loading to be blocked until code is downloaded (should it be in the <head> of the page)

The additional code for compression is only needed to export telemetry and does not need to be loaded at the same time as the code enabling instrumentation. Deferring until an export is required can increase the time-to-export but would not impact time-to-interaction or any other user-focused timing.
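
A hedged sketch of what native compression with feature detection could look like. `CompressionStream`, `Blob`, and `Response` are standard web APIs; the helper names `gzipText` and `encodePayload` are hypothetical, and the uncompressed fallback is an assumption about how an exporter might handle browsers without the API.

```javascript
// Gzip a string with the native CompressionStream API (no bundled library).
async function gzipText(text) {
  const compressed = new Blob([text])
    .stream()
    .pipeThrough(new CompressionStream("gzip"));
  // Response is used here only as a convenient way to drain the stream.
  return new Uint8Array(await new Response(compressed).arrayBuffer());
}

// Feature-detect so that runtimes without the API fall back to sending
// the payload uncompressed.
async function encodePayload(json) {
  if (typeof CompressionStream === "undefined") {
    return { body: json, headers: {} };
  }
  return { body: await gzipText(json), headers: { "Content-Encoding": "gzip" } };
}
```

Because both helpers are only needed at export time, they could live in a lazily loaded chunk rather than in the code that enables instrumentation.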

I propose adding an entirely new field called ephemeral_resource as a sibling to resource in ScopeSpans and ScopeLogs - this way, the original resource remains immutable and the new field can be used for the ephemeral attributes of the resource.

This is inverted. ResourceSpans contain ScopeSpans, not the other way around. Ephemeral resource attributes could be added as Scope* attributes on each Scope* produced during the time when the ephemeral resource attributes are active, but I'm not sure I see how changing the OTLP data structure advances the conversation in a safe way. That is the most invasive way of going about this that I could think of.

Contributor

@Aneurysm9 sorry my bad, I meant ResourceSpans and ResourceLogs and not ScopeSpans and ScopeLogs. I corrected this in my previous comment, can you check if it makes sense this time?

Member

The cost to generate, serialize, and compress that many spans is also not a synchronous process that takes x milliseconds, but many small processes which each take a small fraction of X. It is most important to ensure that each individual step doesn't impact user experience. With the example of 100 spans otlp -> protobuf -> pako on the pixel 4a given, the whole process is 4.393ms but you have 2 chances to yield to the event loop to ensure user experience is not affected.

This is incorrect. The CompressionStreams API provides a native solution for this and is supported in Chromium-based browsers already.

Have missed it but I generally don't consider new browser features as a solution unless usage% is >90% (and well, safari has a monopoly on ios so....) (also 90% is probably low considering how much RUM products are asked for IE11 support but they already have a miserable experience due to using IE in current year so making it optional is worth consideration)

but you have 2 chances to yield to the event loop to ensure user experience is not affected.

There is one but - not when user is leaving the page, tho generally you don't have 100 spans then, making it a question of how much do you want to maintain 2 different code paths (a sync one and an "async" one)

Contributor Author

@tigrannajaryan I just want to emphasize what @scheler said, that the purpose of this OTEP is not to avoid compression or gain efficiency, but to extend our data model in a way that correctly represents these attributes.

If we don't want to extend the current Resource concept, we could add a new concept, call it ProcessScope or something similar, and have it work in effectively the same manner.

Personally, I'd prefer we extend resources over adding a new scope. But I prefer both over an approach that makes it impossible to cleanly implement RUM using OpenTelemetry.

In other words, I'm against "just tack on the process scope as span/event attributes" the same way I'd be opposed to "just tack on the instrumentation scope as span/event attributes." In both cases, yes it would "work." But it would create a headache for implementers and confusion for users.

We should strive for a clean data model, where everything is explained just by looking at the data structure.



## Trade-offs and mitigations

This change should be fully backwards compatible, with one potential exception: fingerprinting. It is possible that an analysis tool which accepts OTLP may identify individual services by creating an identifier by hashing all of the resource attributes.

Member

There is another issue: Exporters right now may be implemented to assume they only ever deal with spans with the same resource. With this proposal, they could receive a batch of mixed spans.
Such an exporter may then misbehave and e.g. use the resource from the first/last span for everything.
Implementing sorting of spans by resource can be a bit costly.

Also there may be exporters for protocols that only support a single resource per connected agent. They would then probably need to stamp the ephemeral attributes on every single telemetry item.

Similar issues may apply to span processors.

(And possibly samplers that receive a resource in their constructor, but I don't think that will be a problem in practice open-telemetry/opentelemetry-specification#1658)
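
A minimal sketch of how an exporter could handle a mixed batch, assuming each span holds a reference to an immutable resource object so that spans created under the same resource share that reference. `groupByResource` and the span shape are hypothetical; the output only loosely follows the OTLP nesting of ResourceSpans and ScopeSpans.

```javascript
// Group a batch of finished spans by resource identity in a single pass.
function groupByResource(spans) {
  const groups = new Map();
  for (const span of spans) {
    let group = groups.get(span.resource);
    if (group === undefined) {
      group = [];
      groups.set(span.resource, group);
    }
    group.push(span);
  }
  // One ResourceSpans-like entry per distinct resource, in first-seen order.
  return Array.from(groups, ([resource, members]) => ({
    resource,
    scopeSpans: [{ spans: members }],
  }));
}

const resourceA = Object.freeze({ "session.id": "1" });
const resourceB = Object.freeze({ "session.id": "2" });
const batch = [
  { name: "click", resource: resourceA },
  { name: "fetch", resource: resourceB },
  { name: "route", resource: resourceA },
];
const resourceSpans = groupByResource(batch);
```

Grouping by reference avoids comparing attribute sets, but whether this is cheap enough for a given exporter or protocol is a separate question.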

Contributor Author

Actually, exporters must deal with more than one resource already, which is what made this change so simple!

Member

There is an issue open for that: open-telemetry/opentelemetry-specification#1690
Right now, I don't think it's clear, and Dynatrace exporters take a shortcut here and always use the resource of the first item, assuming it will be the same for every item in the batch (everything else is an absolute edge case today)

Contributor Author

Ok, I agree this should be clarified. My understanding is that a BatchSpanProcessor may be shared across multiple SDKs within the same process, and that is done in order to have different sets of resources for different sub-processes. So there is no guarantee that all spans in a batch have the same resource. I know that @MSNev has examples of this pattern.

But, I think that this pattern is extremely rare, so it doesn't surprise me that Dynatrace and other exporters could take a shortcut without anyone noticing.

Our examples (currently) use our internal (not OpenTelemetry) SDKs on clients where multiple teams provide different components to the same "view" (page etc) and need / want to report telemetry to their own backends.

And in some runtimes we have a single batching system which is shared, rather than having each component on the view creating its own SDK instance with all of the overhead and batching mechanisms. Thus reducing the runtime impact on resources for the client (CPU, Memory, etc)


This change should be fully backwards compatible, with one potential exception: fingerprinting. It is possible that an analysis tool which accepts OTLP may identify individual services by creating an identifier by hashing all of the resource attributes.

In this case, it is recommended that these systems modify their behavior, and choose a subset of permanent resources to use as a hash identifier.
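
A sketch of what hashing a subset could look like. The key list is a hypothetical choice (the OTEP does not define which attributes identify a service), and 32-bit FNV-1a is used only because it fits in a few lines; a real backend would pick its own keys and hash function.

```javascript
// Hypothetical choice of identifying keys; the OTEP does not define this list.
const IDENTIFYING_KEYS = ["service.name", "service.instance.id"];

// Canonical string built only from the chosen keys, in a fixed order.
function canonicalKey(attributes, keys = IDENTIFYING_KEYS) {
  return keys.map((key) => `${key}=${String(attributes[key])}`).join("\n");
}

// 32-bit FNV-1a over the UTF-8 bytes of the canonical string.
function fingerprint(attributes, keys = IDENTIFYING_KEYS) {
  let hash = 0x811c9dc5;
  for (const byte of new TextEncoder().encode(canonicalKey(attributes, keys))) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

const sessionOne = { "service.name": "shop", "service.instance.id": "123", "session.id": "1" };
const sessionTwo = { ...sessionOne, "session.id": "2" };
// fingerprint(sessionOne) === fingerprint(sessionTwo): session.id is ignored.
```

Because `session.id` is not among the hashed keys, a session change leaves the identifier stable.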

Member

That might be a pretty big deal for some, if they only allow storing one set of resource attributes per hash.

Contributor

@open-telemetry/specs-approvers Please take a look - I suspect we may need a lot of eyes, in case somebody relies on this right now.

Contributor Author

It seems crazy to me to use a resource hash as an identifier, given that there is no requirement that the items within it would uniquely identify a service...

But I'm throwing it out there as a possibility, just to cover all the bases.

Member

@Oberon00 Oberon00 Jun 28, 2022

You should be using something that doesn't exist yet, instead of hashing the whole resources: open-telemetry/opentelemetry-specification#1034 (EDIT: To clarify: We don't do/need this at Dynatrace, I don't know anybody who does. Just a side note)

Contributor Author

Yes I agree! There are various attributes which could count as a unique identifier. We could clarify in the spec which ones are currently defined.

One possibility: by default, the SDK could generate a unique ID every time it starts, which would be a reliable identifier because we generate it ourselves. However, this identifier would not be stable across restarts. So there are limits to what can be provided without user input.


There are two problems with this approach. One is that the duplication of attributes is very inefficient. This is a problem on clients, which have limited network bandwidth. This problem is compounded by a lack of support for gzip and other compression algorithms on the browser.

The second problem is that it becomes difficult to distinguish between ephemeral resources and other types of attributes.

Member

Is it needed to distinguish them by type? Usually the attribute keys should be all you need. E.g. if you have a session.id attribute, would you care whether it is an ephemeral resource or a span/event attribute?

Contributor Author

In the browser, the overhead of applying the session.id as an attribute on every span and event would be untenable.

As far as the need to differentiate, putting data in the proper envelope helps backend systems use it more effectively.

You might ask, why have resources at all in OTLP? Why not simply apply resources as attributes on every span and event? Besides the inefficiency, it would make life very difficult for backend systems which want to apply different analysis to resources and span attributes.

Member

In the browser, the overhead of applying the session.id as an attribute on every span and event would be untenable.

Citation needed 😃
Would these arguments also apply against #207?

Contributor Author

Please see this thread (#208 (comment)) for a lengthy discussion on data limitations in the browser.

I don't think these arguments apply to #207, that proposal would be helpful imho. Just not a solution for ephemeral resources, since many of the events which need these resources happen when there is no trace present.


An alternative to ephemeral resources would be to create span, metrics, and log processors which attach these ephemeral attributes to every instance of every signal. This would not require a modification to the specification.

There are two problems with this approach. One is that the duplication of attributes is very inefficient. This is a problem on clients, which have limited network bandwidth. This problem is compounded by a lack of support for gzip and other compression algorithms on the browser.

Member

In situations where at least one of the ephemeral attributes changes very often, telemetry items are created between the changes, and there are lots of permanent attributes, attaching them to the telemetry items ("signal instance") could even be more efficient.

Generally, I wonder how many ephemeral attributes we expect relative to permanent ones.

Contributor Author

We are not expecting large numbers of ephemeral attributes, nor are we expecting them to change with great frequency.

The expectation is that there would be between 1 and 10 ephemeral attributes set on a client, which may update after 15 minutes of inactivity, after the application reawakens, or in response to a change in user or user settings.
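
A sketch of the inactivity case, assuming a 15-minute idle window as in the description. `SessionTracker`, `touch`, and the injected `setAttribute` hook are hypothetical names; the hook stands in for whatever update mechanism the SDK ends up exposing.

```javascript
// Hypothetical sketch: rotate session.id after a period of inactivity.
class SessionTracker {
  constructor({ setAttribute, newId, now = Date.now, idleMs = 15 * 60 * 1000 }) {
    this.setAttribute = setAttribute;
    this.newId = newId;
    this.now = now;
    this.idleMs = idleMs;
    this.lastActivity = now();
    this.setAttribute("session.id", newId());
  }

  // Call on user activity; starts a new session if the idle window elapsed.
  touch() {
    const current = this.now();
    if (current - this.lastActivity >= this.idleMs) {
      this.setAttribute("session.id", this.newId());
    }
    this.lastActivity = current;
  }
}

// Usage with a fake clock and a counter instead of random IDs.
const updates = [];
let fakeTime = 0;
let counter = 0;
const tracker = new SessionTracker({
  setAttribute: (key, value) => updates.push([key, value]),
  newId: () => String(++counter),
  now: () => fakeTime,
});
fakeTime = 5 * 60 * 1000; // 5 minutes later: same session
tracker.touch();
fakeTime = 25 * 60 * 1000; // 20 idle minutes later: new session
tracker.touch();
```

In this run the hook fires twice: once at startup and once after the 20-minute gap, which matches the expected low update frequency.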

@@ -0,0 +1,78 @@
# Ephemeral Resource Attributes

Define a new type of resource attribute, ephemeral resources, which are allowed to change over the lifetime of the process. Existing resources are redefined as permanent resources, which must be present at SDK initialization and cannot be changed.

Member

I have proposed a somewhat similar OTEP #207

If #207 was implemented you could store your ephemeral resource attributes on the Context, and replace the active context when they change. Please check if #207 would also cover your use case.

Contributor Author

Thanks! That looks like a good proposal, but the context scope still presumes a transactional scope within a server handling many independent transactions.

For clients, all telemetry emitted, including logs which are not bounded by a span, are related. Which is why the resource scope appears to be the correct one for things like this.

Member

@Oberon00 Oberon00 Jun 28, 2022

I think there is a continuum of use cases here, where some are better addressed by this OTEP and others better by #207. If one added the possibility to set a new context as root context (where the default is the empty context), we could have something that applies to everything.
Though the browser usually only has one thread of execution of which everything is a child context (I believe), so you probably would only need to set the attributes you want as active before starting your root spans, and it would stick.

Contributor Author

That might work... but it might be better to keep the concept of a "process scope" and a "context scope" separate. I see these attributes as more similar to resources and instrumentation scopes - they represent the environment the transaction is occurring within.

Because contexts are immutable, and there are no rules as to when child contexts may be created, there would be synchronization issues between when ephemeral resources are updated and when they would be applied, if they only change the root context and thus only affect transactions which start from a new root context.

@carlosalberto
Contributor

@tedsuo Thanks - I feel like some examples would be great, as it seems it's the Validator the one separating Resources/Attributes between permanent and ephemeral?

@tedsuo
Contributor Author

tedsuo commented Jun 30, 2022

Sure, no problem @carlosalberto. Would you want an example implementation? Or an example use case?

@tedsuo
Contributor Author

tedsuo commented Jul 7, 2022

Added an example implementation and example use case.

@tedsuo
Contributor Author

tedsuo commented Jul 14, 2022

Yes? No? What should we do here? Based on these requirements, it would be good to understand how the TC would like to move forward.

@tigrannajaryan
Member

tigrannajaryan commented Jul 14, 2022

@tedsuo the spec defines Resource like this:

A Resource is an immutable representation of the entity producing telemetry as Attributes.

This text is in a Stable spec document. How do we reconcile this OTEP with the spec's stance on immutability of the Resource? Are you suggesting that we break a Stable spec document? Or you do not think this is a breaking change?

@t2t2

t2t2 commented Jul 14, 2022

How do we reconcile this OTEP with the spec's stance on immutability of the Resource?

This doesn't change anything about current resource immutability - an update on the resource provider would end up in a new resource instance. To speak in code:

// Sketch against the proposed API: ResourceProvider does not exist in the SDK today.
const resourceProvider = new ResourceProvider({
    // Initial set of attributes, internally does a new Resource(attrs) and stores it as current value
    'session.id': '1',
});
const tracerProvider = new TracerProvider({ resourceProvider });
const tracer = tracerProvider.getTracer(/* irrelevant */);

const span1 = tracer.startSpan(/* ... */);
// internally span.resource = tracer.tracerProvider.resourceProvider.getResource()

// Some time later, user logs in and their identity is known
resourceProvider.setAttribute('enduser.id', 'superadmin');
// internally currentResource = currentResource.merge(new Resource(newAttrs)), which as per the current spec
// returns a new Resource with merged attrs
// That new Resource is set as the current value in ResourceProvider

// Or session expires and a new one is set
resourceProvider.setAttribute('session.id', '2');

const span2 = tracer.startSpan(/* ... */);

// The two spans hold different Resource instances
assert.notStrictEqual(span1.resource, span2.resource);

assert.deepStrictEqual(span1.resource.attributes, { 'session.id': '1' });
assert.deepStrictEqual(span2.resource.attributes, { 'session.id': '2', 'enduser.id': 'superadmin' });
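
For readers who want to run the flow end to end, here is a self-contained sketch with minimal stand-in classes. None of these are the real OpenTelemetry JS SDK classes: `Resource`, `ResourceProvider`, and `Tracer` are simplified, hypothetical versions, and the `permanentKeys` check is one possible reading of the optional validation mentioned in the OTEP text.

```javascript
// Minimal stand-ins; not the real OpenTelemetry JS SDK.
class Resource {
  constructor(attributes) {
    this.attributes = Object.freeze({ ...attributes });
    Object.freeze(this);
  }
  // Returns a new Resource; on key conflicts the updating resource wins.
  merge(other) {
    return new Resource({ ...this.attributes, ...other.attributes });
  }
}

class ResourceProvider {
  constructor(attributes, permanentKeys = []) {
    this.current = new Resource(attributes);
    this.permanentKeys = new Set(permanentKeys);
  }
  getResource() {
    return this.current;
  }
  setAttribute(key, value) {
    // Optional check: permanent keys cannot change after start.
    if (this.permanentKeys.has(key)) {
      throw new Error(`"${key}" is permanent and cannot be updated`);
    }
    this.current = this.current.merge(new Resource({ [key]: value }));
  }
}

class Tracer {
  constructor(resourceProvider) {
    this.resourceProvider = resourceProvider;
  }
  // A span captures whichever resource is current when it starts.
  startSpan(name) {
    return { name, resource: this.resourceProvider.getResource() };
  }
}

const provider = new ResourceProvider(
  { "service.name": "foo", "session.id": "1" },
  ["service.name"]
);
const sketchTracer = new Tracer(provider);
const first = sketchTracer.startSpan("before");
provider.setAttribute("session.id", "2");
const second = sketchTracer.startSpan("after");
```

Each `Resource` instance stays frozen; only the provider's pointer to the current instance changes, which is why reading it when a span starts needs no lock in this single-threaded sketch.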

@tigrannajaryan
Member

This doesn't change anything about current resource immutability - an update on the resource provider would end up in a new resource instance.

I disagree. This is not just about a Resource instance in memory. It is about the Resource that is emitted by the instrumented application. The recipients of telemetry expect that the resource is immutable, i.e. its attributes do not change over time.

The OTEP talks about this in the "Trade-offs and mitigations" section. I think this is a breaking change. It breaks the contract between Otel sources and telemetry destinations. The OTEP text even recommends this:

In this case, it is recommended that these systems modify their behavior

I don't think this is acceptable. We are saying that "yes, we broke the contract, deal with it". IMO, we cannot do that.

@tigrannajaryan
Member

I thought a bit more about this, I want to find a solution.

I don't think we can delete the requirement which says the Resource is immutable. I think this needs to stay otherwise we are breaking the contract. Additionally, unfortunately the spec says we are not allowed to change the association of the Resource and TracerProvider once that association is established:

a resource can be associated with the TracerProvider when the TracerProvider is created.

However, let's step back for a moment. I don't think recipients of telemetry care about the association inside the SDK. The recipients care about the data model, and the data model certainly allows the SDK to emit telemetry associated with different Resources. A new TracerProvider can be created with a new Resource and used to emit telemetry that was previously emitted using a different TracerProvider, and this is completely legal.

Given the above, I do not see any clause in the spec that directly prohibits us from introducing a new way for a TracerProvider to be associated with some proxy object, which is itself associated with a Resource, and allowing that association to change over time. Yes, this is in a sense cheating, but it allows us to introduce this new way such that it is not a breaking change for the SDK. That's what the proxy ResourceProvider here does.
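The proxy arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Resource`, `ResourceProvider`, `TracerProvider`, and `Span` types here are simplified, hypothetical stand-ins and not the real OpenTelemetry SDK API, and the `update` and `startSpan` method names are invented for the example.

```typescript
// Hypothetical, simplified types: not the real OpenTelemetry SDK API.
type Attributes = Readonly<Record<string, string>>;

class Resource {
  readonly attributes: Attributes;

  constructor(attributes: Record<string, string>) {
    // Copy and freeze so a Resource can never change after creation.
    this.attributes = Object.freeze({ ...attributes });
  }
}

class ResourceProvider {
  private current: Resource;

  constructor(initial: Resource) {
    this.current = initial;
  }

  getResource(): Resource {
    return this.current;
  }

  // Swaps in a brand-new Resource; earlier instances are left untouched.
  update(changes: Record<string, string>): void {
    this.current = new Resource({ ...this.current.attributes, ...changes });
  }
}

interface Span {
  readonly spanName: string;
  readonly resource: Resource;
}

class TracerProvider {
  private readonly resources: ResourceProvider;

  constructor(resources: ResourceProvider) {
    this.resources = resources;
  }

  startSpan(spanName: string): Span {
    // Each span captures whichever Resource is current at creation time.
    return { spanName, resource: this.resources.getResource() };
  }
}

const resources = new ResourceProvider(new Resource({ "session.id": "1" }));
const provider = new TracerProvider(resources);

const first = provider.startSpan("before-login");
resources.update({ "session.id": "2", "enduser.id": "superadmin" });
const second = provider.startSpan("after-login");
```

In this sketch the TracerProvider is constructed once and never re-associated; only the proxy's pointer moves, so `first` keeps its original frozen Resource while `second` sees the new one.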

To me the following questions remain:

  1. Is it right that session id is part of the Resource? It doesn't feel right but I can't put my finger on it, so I will refrain from objecting to this for now.
  2. Why do we need to introduce anything called "ephemeral attributes"? I think this is not needed. They are regular attributes just like any other. Nothing ephemeral here. We only introduce a new way to specify the Resource that must be associated with the produced telemetry. That's all it is. It is a regular Resource, an immutable one. Attributes are all regular.
  3. Is it really possible to introduce ResourceProvider with the ability to attach it to TracerProvider in a way that does not break any existing code? We need to see prototypes that demonstrate this.

@martinkuba

> Is it right that session id is part of the Resource?

It is an attribute that applies to all telemetry coming out of the application. It does not change from signal to signal, nor is it scoped to a specific instrumentation. I don't think there is any other place it could go than the resource level (given the current data model).

> Why do we need to introduce anything called "ephemeral attributes"?

I think this is an attempt to alleviate concerns about the contract between OTel sources and destinations. If there is a real reason that backends need to have an immutable set of resource attributes per application instance, then this would make it possible by defining in the semantic conventions which attributes are permanent and which can change.

We assumed that the only reason backends would be relying on this contract is if they were doing something like hashing all the resource attributes (e.g. to identify the instance). Yes, this would force these backends to be updated, but it would provide them with a way to continue using the hashing. Also, since the TracerProvider can be recreated within the same application instance, defining which attributes are permanent or ephemeral is just making it explicit.
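To make the hashing argument concrete, here is a rough sketch of how a backend might keep a stable instance identity by hashing only the permanent attributes. Everything in it is an assumption for illustration: the `EPHEMERAL_KEYS` set is a hypothetical classification (the real split would come from the semantic conventions), and the small FNV-1a-style hash over UTF-16 code units is used only to keep the example self-contained.

```typescript
// Hypothetical classification; the real split would come from the semantic conventions.
const EPHEMERAL_KEYS: ReadonlySet<string> = new Set(["session.id", "enduser.id"]);

// Small non-cryptographic FNV-1a-style hash over UTF-16 code units,
// used here only to keep the sketch free of dependencies.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hashes only the permanent attributes, in sorted key order, so the result
// does not change when ephemeral attributes are added or updated.
function instanceIdentity(attributes: Record<string, string>): number {
  const stable = Object.keys(attributes)
    .filter((key) => !EPHEMERAL_KEYS.has(key))
    .sort()
    .map((key) => `${key}=${attributes[key]}`)
    .join("\n");
  return fnv1a(stable);
}

const identityBefore = instanceIdentity({
  "service.name": "food-app",
  "session.id": "1",
});
const identityAfter = instanceIdentity({
  "session.id": "2",
  "enduser.id": "superadmin",
  "service.name": "food-app",
});
```

Under these assumptions `identityBefore` and `identityAfter` are equal, because the session and user changes never reach the hash input.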

@Aneurysm9
Member

> > Is it right that session id is part of the Resource?
>
> It is an attribute that applies to all telemetry coming out of the application. It does not change from signal to signal, nor is it scoped to a specific instrumentation. I don't think there is any other place it could go than the resource level (given the current data model).

I'm not sure I see it the same way. Does it truly apply to all telemetry coming out of the application? Is it not possible for the same application instance to have two sessions active? Doesn't the fact that it can change while the application is running necessarily mean that it does not apply to all telemetry? Yes, the "session ID" attribute as a concept does, but not any given value. That is different from all other resource attributes.

As for not being scoped to a specific instrumentation, it is akin to the trace ID in that it can be used for correlation of signals. How would it be useful with distribution metrics? Do I really care to have a timeseries for every user session to track load times, or do I want to have a more general metric that has exemplars pointing at potentially interesting sessions?

As for where else it could go, it could certainly be added as a scope attribute. This would require a bit more bookkeeping on the part of the instrumentor to keep a map of sessions to tracers, etc., or to store them in session-scoped storage, but it is feasible. More appropriate, perhaps, would be the context, where it would be available to all signals. Propagation across process boundaries to allow for correlation (I assume a session can be serviced by application elements that are outside of the immediately user-facing process) is still an issue. I think, though, that this all reinforces my belief that session ID and trace ID are synonymous and that sessions are simply long traces. Do we really need a new concept, and to contort ourselves to find ways to claim that we're not breaking compatibility with a stable specification, to handle something that the existing concepts can already handle?

@t2t2

t2t2 commented Sep 16, 2022

Note: I originally started this as part of a response to open-telemetry/opentelemetry-specification#2500 (comment), but the first section ended up being more related to this OTEP being stuck, so here it is!


Let's set aside the confusion over what a session means for a bit. There are some other attributes that:

  1. are good candidates for the resource level
  2. have values that can change over time
  3. are, as a concept, more familiar from the backend service / APM kind of usage that current OTel contributors know best

Let's bring in enduser.id

Currently it's defined as a span-level identifying attribute, which makes total sense in a server-side environment: a single server-side service can serve all of the users of the application. If you want the enduser that caused a request to be set on all of its child spans, then yes, context makes a lot of sense, since the entire server isn't dedicated to just one enduser. Anyway, that covers my 3rd condition.

Let's jump to the client side. I open up a local food delivery app, and it's instrumented to generate telemetry. Alright, what are the resource attributes? Well, you've got:

  • the ones describing the app (service.name = app name, service.version = app version, probably also installation source, build type, ...)
  • but also the runtime (so os, os version, ...; and when running in a browser, also browser user agent related info)
  • but also I'm logged in and interacting with the app, so most of the telemetry comes out of my interactions with the app or from server-side updates pushed because I'm logged in. I'd consider this a reason to have enduser.id on the resource, fulfilling condition 1

I try to order something, but suddenly the app runs into a bug and crashes. Smash cut: a support person is messaging the devops team, "hey, got this guy going crazy over not being able to order, can you figure out what's going on there, why his app keeps crashing". Devops looks up data based on my name and sees an attempt to order 100 kebabs in the tracing spans, which caused a KebabStackOverflow in the logs. Really, this paragraph is only here to be referenced later while still keeping a linear timeline for the domain knowledge story.

Somehow you're next to me and mention that you uninstalled the app a while ago due to constant crashing and now have a "please come back" discount on your account. I hand my phone to you, you log my account out and log in with your account, and you manage to order successfully with a more reasonable order size.

Now the logged-in account has changed, so if the logged-in account is in the resource, the above telemetry should be spread over 3 resources: data from me, data from an anonymous user, and data from you. (So now we've fulfilled condition 2.)


Other than a local food delivery app, some other examples:

  • The user of a self-service kiosk at an airport
  • The logged in cashier at a POS system
  • A rental scooter user

But other potential attributes:

So I think something that people who haven't built a RUM need to consider is that a major difference between backend services and client-side apps is that apps have (a lot more) state. A lot of this state is global (not scoped to parts of the app, like state within one request that is forgotten once the request is fulfilled), it changes over time (due to time, user interactions, or completely external actions), and in a lot of cases it's useful or needed to assist in debugging using gathered telemetry (who, what device, what screen/url, what isp, geolocation).


A lot of these attributes are also what you'd want to query data by. I already mentioned looking up data based on app user info, but let's consider some of the RUM use cases:

  • Viewing the flow of a user during the session (querying spans/logs based on session.id)
  • Finding which page URL an error occurred on the most (querying logs based on document.location/whatever it will get spec-ed to)
  • Checking the webvitals score for people in a specific country (metrics(?) with geodata)
  • Comparing request spans based on ISP

These add considerations for efficient data ingest and storage. Now every vendor will probably have different opinions on this based on how they use and store data. In July I got some knowledge from @mdubbyap on splunk/signalfx ingest side about our use cases (and probably should have used this knowledge earlier so I don't accidentally misremember it but oh well Ted's been on vacation anyway so it wouldn't have helped move this forward):

For ingest, it is best to minimise the number of bytes that need to be read in order to determine where to pipe the data to (be it partitioning, buckets, or whatever optimises your infra). The worst case is having to read deep enough to get into each span/log/metric and check its attributes for the value. If it's a value on the resource, ingest only needs to read the resource's bytes before determining where to send the data, without needing to parse the rest of the payload. (Since we focus on showing session experience, then obviously for us the session.id attribute is 👀👀)
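The difference between the two routing strategies can be sketched roughly like this. It is a toy model under stated assumptions: `ResourceBatch` and `SpanData` are hypothetical, simplified shapes rather than a real wire format, the sketch works on already-parsed objects instead of bytes, and the hash is deliberately trivial.

```typescript
// Hypothetical, simplified batch shape: one resource shared by many spans.
interface SpanData {
  spanName: string;
  attributes: Record<string, string>;
}

interface ResourceBatch {
  resource: Record<string, string>;
  spans: SpanData[];
}

// Toy hash: sums UTF-16 code units. A real pipeline would use a stronger hash.
function bucketFor(value: string, partitions: number): number {
  let sum = 0;
  for (let i = 0; i < value.length; i++) {
    sum += value.charCodeAt(i);
  }
  return sum % partitions;
}

// Resource-level routing: one lookup per batch, spans are never inspected.
function routeByResource(batch: ResourceBatch, partitions: number): number {
  return bucketFor(batch.resource["session.id"] ?? "", partitions);
}

// Record-level routing: every span must be visited, and one batch may fan out
// to several partitions.
function routeBySpan(batch: ResourceBatch, partitions: number): Set<number> {
  const targets = new Set<number>();
  for (const span of batch.spans) {
    targets.add(bucketFor(span.attributes["session.id"] ?? "", partitions));
  }
  return targets;
}

const sampleBatch: ResourceBatch = {
  resource: { "service.name": "food-app", "session.id": "abc" },
  spans: [
    { spanName: "click", attributes: { "session.id": "abc" } },
    { spanName: "fetch", attributes: { "session.id": "abc" } },
  ],
};

const resourceTarget = routeByResource(sampleBatch, 8);
const spanTargets = routeBySpan(sampleBatch, 8);
```

Both strategies pick the same partition here, but `routeByResource` does a constant amount of work per batch, while `routeBySpan` grows with the number of records.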


Fulfilling legal requirements can also be easier if these attributes are more easily readable, e.g. indexing data based on enduser info makes deleting data on user request (such as under GDPR) easier.

Also linking open-telemetry/opentelemetry-specification#2775, as it has gone into the topic of descriptive versus identifying attributes, which has been one of the arguments against this OTEP so far.

@scheler
Contributor

scheler commented May 31, 2023

Hi, I wanted to give an update on this topic, since some of us from the client-side-telemetry SIG have asked a few TC members to help us move it forward. Copying the message that @jack-berg posted on Slack:

@jsuereth and I discussed the issue of session id and the other attributes you want to attach to all telemetry in this week’s TC meeting. Here are some of the takeaways:
  • We don’t think it’s appropriate to include these as resource attributes. However, we recognize the specification needs more clarity about what represents an entity in the client instrumentation space where different architectures (web application, native application, SPA) can result in SDK lifecycles that are quite different.
  • If not resource attributes, then the other option is to include them on individual records. We think you should pursue a strategy where these attributes are set in context, and lifted out of context onto the individual records in a custom SpanProcessor / LogRecordProcessor. It should be possible to do this today given the APIs that are available, but ideally it would be easier. We can use this use case to help steer an OTEP where shared attributes propagated via context are included on all signals.
  • Naturally, this is going to cause bigger payloads over the network in environments where compression isn’t available. We think this problem needs solving, but is orthogonal to where the attributes live from a data modeling perspective. We’ll separately pursue adjusting the OTLP specification to try to optimize for these types of scenarios. One potential solution is to extend the protocol with the notion of dictionaries of shared attributes, which individual records could reference instead of duplicating.
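The context-based strategy in the takeaways above could look roughly like the following sketch. It is self-contained and deliberately does not use the real OpenTelemetry packages: `TelemetryContext`, `SpanRecord`, `SpanProcessor`, and `SharedAttributeProcessor` are hypothetical, simplified stand-ins, and the `withSharedAttributes` / `getSharedAttributes` method names are assumptions made for the example.

```typescript
// Hypothetical, simplified types: not the real OpenTelemetry API or SDK.
type Attributes = Record<string, string>;

// Immutable context that carries attributes shared by all telemetry.
class TelemetryContext {
  private readonly shared: Readonly<Attributes>;

  constructor(shared: Attributes = {}) {
    this.shared = Object.freeze({ ...shared });
  }

  // Returns a new context; the existing one is never modified.
  withSharedAttributes(extra: Attributes): TelemetryContext {
    return new TelemetryContext({ ...this.shared, ...extra });
  }

  getSharedAttributes(): Readonly<Attributes> {
    return this.shared;
  }
}

interface SpanRecord {
  spanName: string;
  attributes: Attributes;
}

interface SpanProcessor {
  onStart(span: SpanRecord, context: TelemetryContext): void;
}

// Lifts shared attributes out of the context onto each individual record.
class SharedAttributeProcessor implements SpanProcessor {
  onStart(span: SpanRecord, context: TelemetryContext): void {
    const shared = context.getSharedAttributes();
    for (const key of Object.keys(shared)) {
      const value = shared[key];
      // Attributes set directly on the span win over shared ones.
      if (value !== undefined && !(key in span.attributes)) {
        span.attributes[key] = value;
      }
    }
  }
}

const processor = new SharedAttributeProcessor();
const sessionContext = new TelemetryContext().withSharedAttributes({
  "session.id": "abc",
});
const clickSpan: SpanRecord = {
  spanName: "click",
  attributes: { "ui.target": "order-button" },
};
processor.onStart(clickSpan, sessionContext);
```

A log record processor would follow the same pattern. When the session changes, the application derives a new context with `withSharedAttributes`, and records started under it pick up the new value without any change to the resource.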

@Oberon00
Member

@scheler

> We think you should pursue a strategy where these attributes are set in context, and lifted out of context onto the individual records in a custom SpanProcessor / LogRecordProcessor

This is what I proposed in OTEP #207 to be a blessed concept with its own API, by the way. But as you said, it is in principle implementable today.

@tedsuo
Contributor Author

tedsuo commented Jul 31, 2023

Closing this in favor of a new proposal coming from the RUM/Client group.

@tedsuo tedsuo closed this Jul 31, 2023
tigrannajaryan added a commit that referenced this pull request Sep 26, 2024
This is a proposal to address Resource and Entity data model
interactions, including a path forward to address immediate friction and
issues in the current resource specification.


The proposal includes all links and context needed to justify it, but
duplicating a snapshot here:

## Motivation

This proposal attempts to focus on the following problems within
OpenTelemetry to unblock multiple working groups:

- Allowing mutating attributes to participate in Resource ([OTEP
208](#208)).
- Allow Resource to handle entities whose lifetimes don't match the
SDK's lifetime ([OTEP
208](#208)).
- Provide support for async resource lookup
([spec#952](open-telemetry/opentelemetry-specification#952)).
- Fix current Resource merge rules in the specification, which most
implementations violate
([oteps#208](#208),
[spec#3382](open-telemetry/opentelemetry-specification#3382),
[spec#3710](open-telemetry/opentelemetry-specification#3710)).
- Allow semantic convention resource modeling to progress
([spec#605](open-telemetry/opentelemetry-specification#605),
[spec#559](open-telemetry/opentelemetry-specification#559),
etc).

---------

Co-authored-by: Tigran Najaryan
Co-authored-by: jack-berg
Co-authored-by: Arve Knudsen
Co-authored-by: David Ashpole
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 21, 2024
carlosalberto pushed a commit to carlosalberto/oteps that referenced this pull request Oct 23, 2024
carlosalberto pushed a commit to carlosalberto/oteps that referenced this pull request Oct 23, 2024
carlosalberto pushed a commit to carlosalberto/oteps that referenced this pull request Oct 30, 2024
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024
carlosalberto pushed a commit to carlosalberto/oteps that referenced this pull request Oct 31, 2024
carlosalberto pushed a commit to carlosalberto/oteps that referenced this pull request Nov 1, 2024
carlosalberto pushed a commit to open-telemetry/opentelemetry-specification that referenced this pull request Nov 8, 2024