-
Notifications
You must be signed in to change notification settings - Fork 890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding resource attributes post-creation (e.g. via auto-discovery) #1298
Comments
I think there are several possible routes we can go here:
|
Note, this was also briefly discussed in #515 (comment) |
I've always wished |
Let's try to not get too much into implementation details of how to allow changing resources (interface / struct / sync / async / auto-detected / explicit) at first and maybe discuss change in semantics and impact on backends/collectors/maybe exporters. I think this has the potential to break assumptions and optimizations. The resource is supposed to capture identification information about the entity (runtime/process/container/...) producing telemetry. It would be very annoying if the backends identifies the resource as belonging to different entities over the lifetime of a telemetry producer just because some identification information is missing. For example, a backend might have an entity describing a particular WildFly cluster, which may consist of multiple instances. If the cluster name is only detected later, any spans reported earlier would be mapped to the wrong entity, or otherwise the backend would need to retroactively move previously reported spans to a different entity. |
Hmm - I've always considered I never considered Resource semantically and didn't really expect to. There's no unique ID on it (well maybe this was the reason for the service ID I didn't understand well? 😅). So I figured it's just a bag of attributes and that keeps things simple IMO. Yes if for some reason some spans got sent before the resource got resolved the bag would have less attributes, but I don't know if that could really have any worse effect on the backend than the loss of data, which was inevitable since the data didn't exist at the time. |
As an example, we already have |
I assume you mean the actual value of the hostname may change. The resource attribute will always be what was initially set, won't it? |
The loss of data can in theory be prevented at the client side by delaying the sending of spans/metrics (buffering them) until all (expected) resources are available. Whether this is more desirable than data loss is questionable though (What if a problem prevents the resource from ever being detected? Aren't e.g. memory metrics especially interesting also during startup?). For Dynatrace the effect of the missing attributes is definitely worse. We have a quite entity-centric UI so users will primarily interact with entities like "MyLambdaFunction". If the ARN is missing from a few spans, we will detect a different entity like " I guess this is more of a problem of the particular attribute's detectability than of the Resource being mutable though. Even if the resource stays mutable, it may be sensible to require some attributes to be set already at SDK startup. |
I definitely agree with this - it's semantics, e.g., service.name is so important for most backends that we really need it to be static. But at the same time there are not 100% static, but mostly static data or 100% static after the first request, not startup attributes. I think they should be able to take advantage of our protocol's compression strategy. |
I agree. But I think we should distinguish at least the "immutable and known from the beginning" part from the others. I think we should also distinguish "immutable" from "might change". Another advantage of treating the mutable/delayed attributes as a separate concept is that we stay 100% compatible with backends that e.g. receive OTLP but do not expect the data in the existing resource protobuf field to change. |
I propose to restrict the scope of the initial request with only delayed, but immutable attributes, which may appear during runtime, but will stay immutable. To make this feature really useful, I see that #1330 is also required – If a resource is exposed as an API with |
After carefully reading the spec I think that the immutability of First, Second, spec is extremely vague on how Third, OTLP exporter already assume that there can be several different Fourth, OTLP->Jaeger conversion also demands that "Multiple resources can exist for a single process and exporters need to handle this case accordingly." Fifth, OTLP->Zipkin conversion ignores All in all, as per current specification all Otel compatible backends already have to deal with several resources coming from the same process without any guarantee about their immutability. Including the situation that spans from the same function in the app can come with different resource attributes at different times. Thus I don't see how current proposal makes situation any worse. |
@vovencij Learning from Metrics systemsA similar conversation has taken place in the metrics group, about whether in-process metrics processing pipelines should be able to presume a fixed set of resource attributes. Whether or not the OTLP Exporter is required to support multiple resource attribute sets has perhaps not been stated; I believe @bogdandrutu has in the past said that one SDK should emit one resource, as the resource concept is meant to identify an entity. Comments by @Oberon00 in this thread indicate that Dynatrace is looking to translate whole resources into entities as well. @iNikem's point 4 and 5,
are echoed by @anuraaga:
I was trying to answer related questions a year ago in this OTEP issue proposing a "Resource Scope" concept, describing the idea of sub-resource: open-telemetry/oteps#78. My impression is that users would like the flexibility to configure multiple "sub-entities" per process (across OTel signals), first as a matter of programming convenience and second as a matter of performance. There is an open proposal that OTel SDK should require a In metrics, there a requirement that each timeseries identified by unique attributes has a single writer, #1297. It is unsafe to remove unique-identifying attributes without re-aggregating those timeseries. Should we distinguish Obligatory vs. Optional resource attributes?In a Prometheus ecosystem, the corresponding attributes to In Prometheus, a number of what OTel calls "resources" are available from service discovery, where these attributes are given provisional names (starting with Metrics vendors also use cloud provider APIs directly to late-bind resource attributes, for example the AWS Resource Groups Tagging API. In order to late-bind resource attributes obtained from cloud providers, there are likely to be other Obligatory resource attributes needed. For example, to use the AWS API mentioned here you must have an "ARN", an Obligatory resource name, in order to late-bind its Optional attributes. Brief proposalIf every resource attribute included the Obligatory vs. Optional distinction, we might then:
If we can formalize a semantic convention here, something like |
The SpanProcessor has no say, here, the resource is passed from the TracerProvider to the Tracer to the Span which then gets passed to the exporter, as spec'd here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#additional-span-interfaces
and the Resource spec says it maybe even clearer https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#resource-sdk:
Note that both of these are already stable. |
I agree that multiple resources can exist via multiple tracer providers, but I don't think that associating the same exporter object instance to multiple tracer providers is necessarily the best way to solve this. In fact, I think we should specify that in general each SpanProcessor must be added to at most one TracerProvider and each Exporter must be added to add most one SpanProcessor. There may be cases where you'd want to deviate from that, but unless you have a particular use case you want to cover I would say keep it simple.
Fully agree, there are long-standing issues that call for something like that like #335 (I just linked that issue previously to a new issue that was talking about something similar for logs).
I would be careful with calling it "process" entity. In practice it will more likely be an opentelemetry entity (telemetry.sdk.instance_id?), if it is auto-generated at least. If e.g. a process hosts multiple runtimes (or isolated java classloaders) it might be difficult to keep the service.instance.id equal across them. But I think that would still suffice for the use case you line out.
I would be more happy with the terminology "Indentifier" vs "Descriptive" resource attributes. For example, while the service.name might really be considered obligatory and identifying by some, the ARN is definititely optional: Most processes won't have an ARN. But if they have an ARN, it is important for identifying the process. The AWS Lambda ARN in particular is a difficult case since it is not available from the beginning but only once the first request to the Lambda comes in (which is potentially never in the case of provisioned concurrency or initialization failures, etc), or you request it per API. Issue #1261 is about that. |
@jmacd I want to call on one point in your description that I'd like to discuss:
From what I understand, Resource is the central concept that unifies telemetry. Resource represents the way we navigate from one item of telemetry to another in our observability system. It's how we can link from Logs => Metrics or Trace => Metrics. As such, I've always viewed it as a *Query against an APMs understanding of service/application structure". When it comes to questions like "Can we allow attributes to be added to a resource later", I think it's a balancing act. Yes, strictly speaking, it will "hurt" the ability to go from a Trace to associated metrics purely by resource information. However, if you think of Resource as the "query" or "lookup" against that other piece of telemetry, then I think it actually is "Ok" for a lot of use cases. I want to make sure we're all in Sync about this meaning for resource in the spec, or adjust my understanding if it's incorrect. I think the proposal for Optional / Obligatory resources fits within my notion of "can we cross reference between telemetry from the same source". So, I think I'd add an addendum to your proposal:
I think metrics, by nature of solutions, suffer more from incomplete labels than other sources. |
Hence why I suggest we use "identifying" and "descriptive"/"non-identifying". Required (obligatory) is IMHO, if not orthogonal, not the same. |
I like @Oberon00's suggested terms, "Identifying" and "Descriptive". |
Is it only resource though? There are many attributes which I think also are meant for linking telemetry. For example, rpc method seems like one that will be used to jump between signals - I think this is what's considered identifying in this proposal. On the flip side, Resource has attributes which seem to be descriptive, such as the command line. So I guess I just can't imagine a backend handling Resource in a special way compared to attributes - finding all identifiers means looking at both, as is finding all the descriptors. That being said, it probably doesn't affect the proposal that much which seems to be going in the direction of having delayed resource attributes be possible if I understand correctly. I'm probably missing something big, but does it matter if an attribute is identifying or descriptive? |
I think it's easy to imagine. E.g. from a performance perspective, for each batch of spans, you only have to look at the resources once. From a semantic perspective I think it's cleaner to have separate places describing entity and operation/metric/log. From a coding perspective, it would probably be harder to keep e.g. metric labels and span attributes describing the entity emitting telemetry in sync, than just ensuring that the Meter and the TracerProvider have the same resource associated. Current spec:
|
Thank everybody for this lovely discussion! As far as the original ask to have delayed attributes is satisfied, I am happy to have only a subset of Resource attributes ("Descriptive" ones) delay-able.
But I am suspicious about this proposal by @jsuereth . First, I don't like this "you can delay some attributes but only by this short period of time". Second, IMO it creates unnecessary dependency on metrics side of the SDK. But I may be wrong here :) |
I don't think we should place any requirements on descriptive attributes being collected before anything else, but only identifying attributes. For observers in particular, I think we should leave them out of the first version of the metric spec altogether, #1433. And if we have them, maybe the first metric export could happen before any observers are queried, either because there are none or because observer collection and metric export is decoupled, #1432). I can definitely see the use case for "freezing" the identifying resources at some point in time though. There could be some way to signal the Resource SDK "I will deliver some resources later" ( |
I just had a realization that current resources can be thought of as a special kind of "never-changing" metrics, and "mutable resource attributes" may be better represented as metrics: As far as I understand, this issue goes into the direction of relaxing resources from "write once at startup" to being "append-only". But there are some attributes that can change an arbitrary number of times. For example, #1647 brings up mobile networks (signal strenght, 2G/3G/4G, current cell), but we also have IP addresses, maybe current location, POD labels, CPU temperature. These have in common that the can change any time and may affect, i.e. be interesting for interpreting, all spans/metrics that are emitted during the time they apply. Especially when we look at the last example (CPU temperature), it appears to me that some attributes you might initially think of as resource attributes would indeed be better represented as metrics. I don't know if we have string-valued metrics, but I'm sure one could work around that with using some dummy value of zero/one and and a label with the actual value. CC @jmacd |
Something to think about. If we're manually instrumenting, this is not much of an issue. We are able to read env vars or AWS tags or container ids etc or whatever before creating the Resource object and then associate with a TracerProvider. This is more of a problem with auto-instrumentation where we may not have all the attributes at start up time. Perhaps we need a little leeway here? One thought that comes to mind is creating APIs as per the semantic convention for each resource domain (cloud, containers, k8s, os etc) and then having implementations for them (cloud:gcp, cloud:aws, os:windows, os:linux etc). I see that AWS is doing something along these lines by creating an extension in Java to get and populate resource attributes for almost all their run times. Perhaps the auto-instrumentation libraries can take in params at start up indicating the runtime and invoking the necessary implementation? Example in Java of how this could work : -Dotel.resource.attributes=service.name=myjavaservice -Dotel.cloud.provider=aws -Dotel.cloud.platform=ECS |
What are you trying to achieve?
I would like to have a possibility to add resource attributes via some auto-discovery mechanism, where attribute values are not known at the Resource creation time, but rather become known during application initialization. In case of auto-instrumentation Otel SDK is created very early in process runtime. Some useful information about the process may not be accessible yet.
An example may be a process of discovering an app server name and its version (#1143) by instrumentation, rather than setting these values via environment. Even without instrumentation, discovering and setting a version of an application (or whatever component that makes sense at the
Resource
level) at runtime is much easier than obtaining it before an application starts (as is needed for setting attributes via environment).The text was updated successfully, but these errors were encountered: