
feat(*): Add tracing to some more host components #2398

Merged 6 commits into fermyon:main on Apr 3, 2024

Conversation

@calebschoepp (Collaborator)

#2348 introduced basic observability support to Spin via the OpenTelemetry standard. It covered only the tracing signal, and only minimally, so the logical next step was to extend tracing to more of Spin's surface area (host components, triggers, etc.). This PR is my attempt to "trace all the things". The complexity of choosing good span names and deciding what to trace has waylaid my efforts, though. This PR is no longer a comprehensive attempt to trace all of Spin; rather, it has become a first crack in the glass: an exploratory attempt to build consensus around how we want to trace Spin.

In my digging I've identified the following crates that we want to trace. A * after a crate means it was initially traced in #2348. I have added tracing to the bolded crates in this PR.

  • core * (lots of questions here about whether we should trace all the WASI stuff or not)
  • key-value-{azure, redis, sqlite}
  • llm-{local, remote} *
  • outbound-{http, mqtt, mysql, networking, pg}
  • outbound-redis
  • sqlite-{inproc, libsql}
  • trigger-http *
  • trigger-redis
  • variables

As I was going through the exercise of tracing a subset of these crates, I ran into the issue of how to name the spans. As @rylev suggested, I wanted a consistent scheme for all span names across Spin. I ended up opting for a naming scheme along the lines of verb_{optional adjective}_noun, e.g. execute_query_in_process_sqlite_db.

Problem 1: Very quickly this naming scheme started to fray around the edges. For example, with set_add_values_redis, should set actually be part of the noun, i.e. add_values_redis_set? Sure, but then that conflicts with the clean redis noun. Not only is this naming scheme inconsistent, it also isn't particularly user-friendly to read in a trace.

Problem 2: Lots of our host components have a base implementation that calls out to an underlying provider, e.g. llm and llm-local. It wasn't clear to me whether I should trace both of these function calls or not. On the one hand, it is nice to be able to see the stack trace and know which implementation you're using. On the other hand, it isn't very beginner-friendly and makes the trace look cluttered for little extra information.

Enter more complexity, stage left: OTEL semantic conventions. It turns out that OTEL has pretty detailed conventions for how to trace certain things like database calls or messaging systems. For example, let's look at the outbound-redis crate. OTEL has conventions for exactly this use case of interacting with Redis. It turns out the span names they recommend are low-cardinality versions of the Redis query, e.g. GET key. This is very different from the pattern of get_value_redis that I created above.
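
For illustration, here's a minimal sketch of what following that convention could look like with the tracing crate; the function signature and error type are stand-ins, not Spin's actual API:

```rust
use tracing::instrument;

type RedisError = anyhow::Error; // stand-in for the real error type

// Per the OTEL database conventions the span name is the low-cardinality
// command ("GET"); details ride along as attributes. tracing-opentelemetry
// maps the otel.name field onto the exported span name.
#[instrument(skip_all, fields(
    otel.name = "GET",
    otel.kind = "client",
    db.system = "redis",
))]
async fn get_value(key: &str) -> Result<Vec<u8>, RedisError> {
    let _ = key; // the key stays out of the span name (high cardinality)
    todo!("issue GET against the connection")
}
```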

Problem 3: I'm not sure what to do with these OTEL conventions. It seems pretty obvious that we should use them in a lot of places. But I also see a lot of problems with following the conventions:

  • Some of our host components don't map cleanly to a convention e.g. llm. What do we do there?
  • By following the OTEL conventions we lose the kind of consistency in span names from Spin that we might want, e.g. one span is going to be SELECT * FROM users while another is going to be shop.orders process. Another example: if we were to follow the convention to the letter, the redis-trigger would become something like mychannel deliver instead of something like handle_redis_message. Likewise handle_http_request becomes GET /my/route/*. I can imagine it becoming confusing to a user of Spin who is just trying to understand when their Spin app is triggered.
  • If we follow the conventions, at what level of a host component do we trace? For example, with KV there are kv, cached_kv, and sqlite_kv. Do we still emit a span at each level of that stack? Do we just do it at the root underlying implementation? I don't know.

Well that was a whole lot of confusion and uncertainty that I just spewed out. My hope is that the discussion in this PR can help me wade through this uncertainty. If you're looking for concrete points to address I would recommend responding to problems 1 through 3.

As a final note, here is a strawman that I would advocate for and that we can all argue and diff against: we don't try to have any internal sense of consistent Spin spans. Instead we completely follow the upstream semantic conventions where appropriate. Where conventions don't exist we just wing it.

@itowlson (Contributor)

Your strawman seems like a good starting point. We have to assume that Spin lives in a heterogeneous environment, and that Spin telemetry is going into the same handwave database as other telemetry sources: therefore, consistency with the other things in that environment would seem to trump personal aesthetics.

That said, this is based entirely on my personal aesthetic of "what would make sense" rather than any experience of using telemetry tooling. So, ya know.

@itowlson (Contributor)

Something that might help with the first half of problem 2 is specific guidelines on the granularity of telemetry. You mention it being nice to know the stack trace: is this implying you envisage putting telemetry on every function at every layer? I'm guessing not, but then what counts as a significant "boundary" for telemetry? I agree knowing which implementation you're using is helpful, so is a record emitted only when entering a provider? I'm not trying to pre-empt your thought process here, just illustrating the kind of questions I'd want to know how to answer when writing code within this framework.

That also makes me wonder how much we can encapsulate. We don't want the code base to become a big pile of #[instrument] macros with islands of code dotted amongst them, and we don't want everyone who touches the Spin codebase to have to internalise the OpenTelemetry Semantic Conventions. So once we figure out what we want to do and have some motivating examples to work from, it would be great to think about ways we can hide that from the Spin "business logic" developer. We can't be the first people to run into this - maybe there's prior art we can draw on. But anyway that's for the future...

@calebschoepp (Collaborator, Author)

Is this implying you envisage putting telemetry on every function at every layer?

As you suspected, definitely not. It would be insanity to trace every function.

What counts as a significant "boundary" for telemetry?

Great question. I think the crates I identified above are a good starting point for a list of significant "boundaries" that need to be traced. Within each of these crates I would propose that we only emit a span from a single layer, i.e. only emit get_value_kv rather than get_value_kv -> get_value_cached_kv -> get_value_sqlite_kv.

Side note: one could argue that we should still emit a span at every layer and we can mark some of those spans at the TRACE level so they are normally filtered out unless the user sets SPIN_OTEL_TRACING_LEVEL=trace. But, I don't think we should do this b/c this gets to the issue of cluttering the Spin codebase with lots of #[instrument] macros. I also think it adds little extra value. Whatever extra information is conveyed in those spans can be added as attributes to the single span.

If we choose to only emit a span from one layer within a crate, the immediate question is: do we do it at the interface level (e.g. the sqlite crate) or at the provider level (e.g. the sqlite-inproc crate)? I would suggest the provider level, because we have access to more relevant information there: the actual implementation being used, what the underlying query is, etc. One disadvantage of the provider layer is that technically the span's duration will be a little shorter than it should be, since we won't account for the overhead of the interface layer. This is a tradeoff I'm personally okay with. Another disadvantage is that you have to write the same code over and over for key-value-azure, key-value-redis, etc., instead of just once for key-value.
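
To make the provider-level option concrete, here's a rough sketch with an illustrative signature and a hypothetical spin.implementation attribute, not Spin's actual API:

```rust
use tracing::instrument;

struct Row; // stand-in for whatever the provider returns

// Only the provider layer emits a span, so a single span carries both the
// implementation in use and the underlying query, instead of a
// get_value_kv -> get_value_sqlite_kv chain of near-identical spans.
#[instrument(skip_all, fields(
    otel.kind = "client",
    db.system = "sqlite",
    db.statement = %sql,            // the provider layer knows the real query
    spin.implementation = "inproc", // hypothetical attribute name
))]
async fn execute_query(sql: &str) -> anyhow::Result<Vec<Row>> {
    let _ = sql;
    todo!("run the query against the in-process database")
}
```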

@itowlson thoughts?

That also makes me wonder how much we can encapsulate. We don't want the code base to become a big pile of #[instrument] macros with islands of code dotted amongst them, and we don't want everyone who touches the Spin codebase to have to internalise the OpenTelemetry Semantic Conventions.

I'm not sure what to do about this. At a certain point I expect we'll just have to accept the #[instrument]s littered everywhere. We could potentially write a macro to extract the common behaviour and reduce how big they appear 🤷. But I'm not sure how much that really helps.

More concerning than #[instrument]s littered everywhere are the modifications we'll have to make to the code to set attributes and properly report errors to the span. Here is an example of where I've had to modify the code to properly report the inner error to the span. The err portion of the macro doesn't know how to mark the span as failed if there is an inner error. (Side note: as I write this I realize maybe I just shouldn't worry about reporting these inner errors; that is the problem of more upstream code.)
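
For concreteness, here's a minimal sketch of what manually surfacing the inner error could look like; the types and the provider call are hypothetical stand-ins:

```rust
use tracing::{instrument, Span};

type LlmError = String; // stand-in for the real inner error type

// Hypothetical provider call; only the Result<Result<_, _>> shape matters.
async fn call_provider(_prompt: &str) -> anyhow::Result<Result<String, LlmError>> {
    Ok(Err("model not loaded".to_string()))
}

// instrument(err) only marks the span failed for the *outer* Result, so the
// inner error has to be surfaced by hand.
#[instrument(skip_all, err, fields(otel.status_code = tracing::field::Empty))]
async fn infer(prompt: &str) -> anyhow::Result<Result<String, LlmError>> {
    let inner = call_provider(prompt).await?;
    if let Err(e) = &inner {
        // Mark the span as failed and attach the inner error as an event.
        Span::current().record("otel.status_code", "ERROR");
        tracing::error!(error = %e, "inner error from llm provider");
    }
    Ok(inner)
}
```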

If we follow OTEL conventions we should expect that there will be a lot more code for dynamically setting attributes littered throughout our host components.

@lann (Collaborator) left a comment

Choosing what to trace is always going to be a bit subjective but I'd start relatively small and grow based on feedback. Covering most network operations plus a few "known possibly slow/failure-prone" operations (app loading, sqlite, LLM) should get us close enough.

Left a few comments on specific operations, but in general I think we should mostly avoid instrumenting trivial (always fast, infallible) operations, even if it means inconsistency between e.g. different KV implementations. Additionally, avoid most "tightly-nested" spans, where a parent span is going to look essentially identical to its single child span.
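
For illustration, a sketch of the tightly-nested case with made-up names; the fix is a single #[instrument] at whichever layer has the useful context, with anything the other layers know recorded as attributes:

```rust
use tracing::instrument;

// Anti-pattern: the parent span has the same duration and meaning as its
// only child, so it adds noise without adding information.
#[instrument(skip_all)]
async fn get_value(key: &str) -> Option<Vec<u8>> {
    get_value_inner(key).await
}

#[instrument(skip_all)]
async fn get_value_inner(key: &str) -> Option<Vec<u8>> {
    let _ = key;
    None // stand-in: a real implementation would hit the store
}
```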

Review comments on: crates/key-value-azure/src/lib.rs, crates/key-value-azure/Cargo.toml, crates/key-value-redis/src/lib.rs, crates/key-value/src/lib.rs (outdated/resolved).
@lann (Collaborator) commented Mar 28, 2024

On naming: I think it could be nice to choose a different convention specifically for host components, where span names better reflect the guest's perspective. We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

Concretely, what about something like <host component>.<operation> e.g. spin-llm.infer?

@lann (Collaborator) commented Mar 28, 2024

OTEL conventions: I do think it's worth some effort to stick with these conventions, as it will help with familiarity and integration with other tooling. That said, I think everyone would prefer non-conventional names soon over conventional ones later, if that is the tradeoff.

@calebschoepp (Collaborator, Author)

Concretely, what about something like <host component>.<operation> e.g. spin-llm.infer?

I like this convention, but I don't know how to square it with the OTEL conventions. For example, spin-redis-outbound.get according to the semantic conventions should be something like GET key. Are you suggesting that we just stick to our own convention for all host components and ignore the semantic conventions? Would it make sense to combine them, e.g. spin-redis-outbound.get GET key, or to use a tightly nested span where the parent is spin-redis-outbound.get and the child is GET key?

We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

I'm not sure what you're driving at here. I know otel.name lets us set the name dynamically but we still have to choose one convention.

@calebschoepp (Collaborator, Author)

Also @lann I'm curious what your thoughts are on #[instrument(err)]. More specifically, there seems to be a pattern in host components of returning Result<Result<T>>, and err is only able to report on the outer result. Is it worth the work of manually emitting an event! on the inner result so that the span is marked as an error if the inner result contains an error?

@lann (Collaborator) commented Mar 28, 2024

don't know how to square it with the OTEL conventions

I would suggest preferring OTEL conventions where they are clear and come up with a nice convention that we can be internally consistent about where there is no OTEL convention.

@lann (Collaborator) commented Mar 28, 2024

We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

I'm not sure what you're driving at here. I know otel.name lets us set the name dynamically but we still have to choose one convention.

The tracing span name is usually just a function name and is always contextualized by the "target" (module name by default), so there isn't really a concern about disambiguating between modules. Using something like spin-llm.infer for the tracing span name would look weird to most tracing subscribers, so I'm suggesting (mostly) leaving those alone and just overriding the otel.name.
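
Concretely, that suggestion might look something like this sketch (the module and signature are illustrative, not Spin's actual API):

```rust
use tracing::instrument;

// In a hypothetical spin_llm module: the tracing span keeps its natural name
// ("infer", contextualized by the spin_llm target), while the name exported
// to OTEL is overridden via the otel.name field.
#[instrument(skip_all, fields(otel.name = "spin-llm.infer"))]
async fn infer(prompt: &str) -> anyhow::Result<String> {
    let _ = prompt;
    todo!("call the model")
}
```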

@lann (Collaborator) commented Mar 28, 2024

Is it worth doing the work of manually emitting an event! on the inner result so that the span is marked as an error if the inner result contains an error?

It depends... How hard is it going to be to debug that error if you don't emit it?

@calebschoepp (Collaborator, Author)

I would suggest preferring OTEL conventions where they are clear and come up with a nice convention that we can be internally consistent about where there is no OTEL convention.

I hear this, but I still feel a bit hung up on the inconsistency. It just feels off to me to have spin-local-llm.infer in one place and GET key in another when they represent the same level of abstraction. A tightly nested span of GET key seems like the fix to me, but in your opinion @lann am I just getting too hung up here on consistency?

The tracing span name is usually just a function name and is always contextualized by the "target" (module name by default), so there isn't really a concern about disambiguating between modules. Using something like spin-llm.infer for the tracing span name would look weird to most tracing subscribers, so I'm suggesting (mostly) leaving those alone and just overriding the otel.name.

Ack. Part of me wants to just do it statically in the instrument, though, if we don't need to do it dynamically, so that we don't have to clutter the code with otel.name = "different name", which would be confusing to people dropping into the codebase. It's not like we have or expect other tracing subscribers.

@lann (Collaborator) commented Mar 28, 2024

in your opinion @lann am I just getting too hung up here on consistency?

Yep! 🙂

@lann (Collaborator) commented Mar 28, 2024

Ack. Part of me wants to just do it statically in the instrument, though, if we don't need to do it dynamically, so that we don't have to clutter the code with otel.name = "different name", which would be confusing to people dropping into the codebase. It's not like we have or expect other tracing subscribers.

I don't feel strongly about this part. 🤷

@calebschoepp (Collaborator, Author)

Okay so the new strawman argument is:

  • Where there are existing OTEL semantic conventions we follow those.
  • Where there are not existing OTEL semantic conventions we will follow our own convention which is approximately <host component>.<operation>.
  • We will avoid tightly nested spans.
  • We will not trace things that are infallible and/or fast just for the sake of consistency.
  • We will prefer just merging something over having it perfectly consistent.

@itowlson any further feedback? I'm going to start implementing based on this next week.

@itowlson (Contributor) commented Apr 1, 2024

@calebschoepp The plan sounds great - thank you for restating it clearly, it really helps. But I would like to non-blockingly explore some bits from up thread.

At a certain point I expect that we'll have to just accept the #[instrument]s littered everywhere.

We'll certainly have to have something littered everywhere. But my long-term goals for these would be:

  • Make the things small - or, at least, make them non-obtrusive. When we start to see the common patterns in our instrumentation, it would be great to pull out at least the more verbose ones. This is most important for intra-function instrumentation: instrumentation on a function is less obtrusive because the reader is already at a clear and prominent "understanding boundary", if that makes sense.

  • Limit the knowledge a contributor needs to have. Contributors (and reviewers) shouldn't need to know the OpenTelemetry Semantic Conventions, or ideally to have too much mastery of our own conventions (or when to use one or the other). As we see common patterns, it would be great to figure out how we can pull them up into "pit of success" helpers that contributors can use.

These are definitely things we can't do without more experience! So we should push ahead with the plan as is, but come back in a few weeks or months and see what patterns and practices have emerged, and then what can be automated and what can go in the contributor guide. But definitely great to "prefer just merging" and learn as we go!

@calebschoepp (Collaborator, Author)

@itowlson @lann I've modified the code to align with the big discussion here. One last sanity check might be nice. I'll give you both about a day and a half to look and if I don't see any comments I'll just assume it is good to go.

@itowlson (Contributor) left a comment

Comments are nits (or even less than nits, vague frets). This looks good - thank you so much for the thoughtful discussion and careful evolution on this.

Review comments on: crates/key-value-redis/src/lib.rs, crates/outbound-redis/src/lib.rs (resolved).
@calebschoepp merged commit d377a61 into fermyon:main on Apr 3, 2024 (13 of 14 checks passed), and deleted the trace-all-the-things branch on May 29, 2024.