
feat(*): Add tracing to some more host components #2398

Merged 6 commits into fermyon:main on Apr 3, 2024

Conversation

@calebschoepp (Collaborator)

#2348 introduced basic observability support to Spin via the OpenTelemetry standard. It covered only the tracing signal, and only minimally, so the logical next step was to extend tracing to more of Spin's surface area (host components, triggers, etc.). This PR is my attempt to "trace all the things". The complexity of choosing good span names and deciding what to trace has waylaid my efforts, though. This PR is no longer a comprehensive attempt to trace all of Spin; rather, it has become a first crack in the glass: an exploratory attempt to build consensus around how we want to trace Spin.

In my digging I've identified the following crates that we want to trace. A * after a crate means it was initially traced in #2348. I have added tracing to the bolded crates in this PR.

  • core * (lots of questions here about whether we should trace all the WASI stuff or not)
  • key-value-{azure, redis, sqlite}
  • llm-{local, remote} *
  • outbound-{http, mqtt, mysql, networking, pg}
  • outbound-redis
  • sqlite-{inproc, libsql}
  • trigger-http *
  • trigger-redis
  • variables

As I was going through the exercise of tracing a subset of these crates, I ran into the issue of how to name the spans. As @rylev suggested, I wanted a consistent scheme for all span names across Spin. I ended up opting for a naming scheme along the lines of verb_{optional adjective}_noun, e.g. execute_query_in_process_sqlite_db.

Problem 1: Very quickly this naming scheme started to fray around the edges. For example, with set_add_values_redis, should set actually be part of the noun, i.e. add_values_redis_set? Sure, but then that conflicts with the clean redis noun. Not only is this naming scheme inconsistent, it also isn't particularly user-friendly to read in a trace.

Problem 2: Lots of our host components have a base implementation that calls out to an underlying provider, e.g. llm and llm-local. It wasn't clear to me whether I should trace both of these function calls or not. On the one hand, it is nice to be able to see the stack trace and know which implementation you're using. On the other hand, it isn't very beginner-friendly and makes the trace look cluttered for little extra information.

Enter more complexity, stage left: OTEL semantic conventions. It turns out that OTEL has pretty detailed conventions for how to trace certain things like database calls or messaging systems. For example, let's look at the outbound-redis crate. OTEL has conventions for exactly this use case of interacting with Redis. It turns out the span names they recommend are low-cardinality versions of the Redis query, e.g. GET key. This is very different from the pattern of get_value_redis that I created above.
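
For illustration, here's a minimal sketch of what following that convention could look like with the tracing crate; the function signature and error type are stand-ins, not Spin's actual API:

```rust
use tracing::instrument;

type RedisError = anyhow::Error; // stand-in for the real error type

// Per the OTEL database conventions the span name is the low-cardinality
// command ("GET"); details ride along as attributes. tracing-opentelemetry
// maps the otel.name field onto the exported span name.
#[instrument(skip_all, fields(
    otel.name = "GET",
    otel.kind = "client",
    db.system = "redis",
))]
async fn get_value(key: &str) -> Result<Vec<u8>, RedisError> {
    let _ = key; // the key stays out of the span name (high cardinality)
    todo!("issue GET against the connection")
}
```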

Problem 3: I'm not sure what to do with these OTEL conventions. It seems pretty obvious that we should use them in a lot of places. But I also see a lot of problems with following the conventions:

  • Some of our host components don't map cleanly to a convention e.g. llm. What do we do there?
  • By following the OTEL conventions we lose the kind of consistency in span names from Spin that we might want, e.g. one span is going to be SELECT * FROM users while another is going to be shop.orders process. Another example: if we were to follow the convention to the letter, the redis-trigger would become something like mychannel deliver instead of something like handle_redis_message. Likewise handle_http_request becomes GET /my/route/*. I can imagine it becoming confusing to a user of Spin who is just trying to understand when their Spin app is triggered.
  • If we follow the conventions, at what level of a host component do we trace? For example, with KV there are kv, cached_kv, and sqlite_kv. Do we still emit a span at each level of that stack? Do we just do it at the root underlying implementation? I don't know.

Well that was a whole lot of confusion and uncertainty that I just spewed out. My hope is that the discussion in this PR can help me wade through this uncertainty. If you're looking for concrete points to address I would recommend responding to problems 1 through 3.

As a final note, here is a strawman that I would advocate for and that we can all argue and diff against: we don't try to have any internal sense of consistent Spin spans. Instead we completely follow the upstream semantic conventions where appropriate. Where conventions don't exist we just wing it.

@itowlson (Contributor)

Your strawman seems like a good starting point. We have to assume that Spin lives in a heterogeneous environment, and that Spin telemetry is going into the same handwave database as other telemetry sources: therefore, consistency with the other things in that environment would seem to trump personal aesthetics.

That said, this is based entirely on my personal aesthetic of "what would make sense" rather than any experience of using telemetry tooling. So, ya know.

@itowlson (Contributor)

Something that might help with the first half of problem 2 is specific guidelines on the granularity of telemetry. You mention it being nice to know the stack trace: is this implying you envisage putting telemetry on every function at every layer? I'm guessing not, but then what counts as a significant "boundary" for telemetry? I agree knowing which implementation you're using is helpful, so is a record emitted only when entering a provider? I'm not trying to pre-empt your thought process here, just illustrating the kind of questions I'd want to know how to answer when writing code within this framework.

That also makes me wonder how much we can encapsulate. We don't want the code base to become a big pile of #[instrument] macros with islands of code dotted amongst them, and we don't want everyone who touches the Spin codebase to have to internalise the OpenTelemetry Semantic Conventions. So once we figure out what we want to do and have some motivating examples to work from, it would be great to think about ways we can hide that from the Spin "business logic" developer. We can't be the first people to run into this - maybe there's prior art we can draw on. But anyway that's for the future...

@calebschoepp (Collaborator, Author)

Is this implying you envisage putting telemetry on every function at every layer?

As you suspected, definitely not. It would be insanity to trace every function.

What counts as a significant "boundary" for telemetry?

Great question. I think the crates I identified above are a good starting point for a list of significant "boundaries" that need to be traced. Within each of these crates I would propose that we only emit a span from a single layer, i.e. only emit get_value_kv rather than get_value_kv -> get_value_cached_kv -> get_value_sqlite_kv.

Side note: one could argue that we should still emit a span at every layer and we can mark some of those spans at the TRACE level so they are normally filtered out unless the user sets SPIN_OTEL_TRACING_LEVEL=trace. But, I don't think we should do this b/c this gets to the issue of cluttering the Spin codebase with lots of #[instrument] macros. I also think it adds little extra value. Whatever extra information is conveyed in those spans can be added as attributes to the single span.

If we choose to only emit a span from one layer within a crate, the immediate question is: do we do it at the interface level (e.g. the sqlite crate) or at the provider level (e.g. the sqlite-inproc crate)? I would suggest the provider level, because we have access to more relevant information there: the actual implementation being used, what the underlying query is, etc. One disadvantage of the provider layer is that technically the span's duration will be a little shorter than it should be, since we won't account for the overhead of the interface layer. This is a tradeoff I'm personally okay with. Another disadvantage is that you have to write the same code over and over for key-value-azure, key-value-redis, etc., instead of just once for key-value.
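
To make the provider-level option concrete, here's a rough sketch with an illustrative signature and a hypothetical spin.implementation attribute, not Spin's actual API:

```rust
use tracing::instrument;

struct Row; // stand-in for whatever the provider returns

// Only the provider layer emits a span, so a single span carries both the
// implementation in use and the underlying query, instead of a
// get_value_kv -> get_value_sqlite_kv chain of near-identical spans.
#[instrument(skip_all, fields(
    otel.kind = "client",
    db.system = "sqlite",
    db.statement = %sql,            // the provider layer knows the real query
    spin.implementation = "inproc", // hypothetical attribute name
))]
async fn execute_query(sql: &str) -> anyhow::Result<Vec<Row>> {
    let _ = sql;
    todo!("run the query against the in-process database")
}
```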

@itowlson thoughts?

That also makes me wonder how much we can encapsulate. We don't want the code base to become a big pile of #[instrument] macros with islands of code dotted amongst them, and we don't want everyone who touches the Spin codebase to have to internalise the OpenTelemetry Semantic Conventions.

I'm not sure what to do about this. At a certain point I expect we'll just have to accept the #[instrument]s littered everywhere. We could potentially write a macro to extract the common behaviour and reduce how big they appear 🤷. But I'm not sure how much that really helps.

More concerning than #[instrument]s littered everywhere are the modifications we'll have to make to the code to set attributes and properly report errors to the span. Here is an example of where I've had to modify the code to properly report the inner error to the span. The err portion of the macro doesn't know how to mark the span as failed if there is an inner error. (Side note: as I write this I realize maybe I just shouldn't worry about reporting these inner errors; that is the problem of more upstream code.)
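
For concreteness, here's a minimal sketch of what manually surfacing the inner error could look like; the types and the provider call are hypothetical stand-ins:

```rust
use tracing::{instrument, Span};

type LlmError = String; // stand-in for the real inner error type

// Hypothetical provider call; only the Result<Result<_, _>> shape matters.
async fn call_provider(_prompt: &str) -> anyhow::Result<Result<String, LlmError>> {
    Ok(Err("model not loaded".to_string()))
}

// instrument(err) only marks the span failed for the *outer* Result, so the
// inner error has to be surfaced by hand.
#[instrument(skip_all, err, fields(otel.status_code = tracing::field::Empty))]
async fn infer(prompt: &str) -> anyhow::Result<Result<String, LlmError>> {
    let inner = call_provider(prompt).await?;
    if let Err(e) = &inner {
        // Mark the span as failed and attach the inner error as an event.
        Span::current().record("otel.status_code", "ERROR");
        tracing::error!(error = %e, "inner error from llm provider");
    }
    Ok(inner)
}
```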

If we follow OTEL conventions we should expect that there will be a lot more code for dynamically setting attributes littered throughout our host components.

@lann (Collaborator) left a comment

Choosing what to trace is always going to be a bit subjective but I'd start relatively small and grow based on feedback. Covering most network operations plus a few "known possibly slow/failure-prone" operations (app loading, sqlite, LLM) should get us close enough.

Left a few comments on specific operations, but in general I think we should mostly avoid instrumenting trivial (always fast, infallible) operations, even if it means inconsistency between e.g. different KV implementations. Additionally, avoid most "tightly-nested" spans, where a parent span is going to look essentially identical to its single child span.
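
For illustration, a sketch of the tightly-nested case with made-up names; the fix is a single #[instrument] at whichever layer has the useful context, with anything the other layers know recorded as attributes:

```rust
use tracing::instrument;

// Anti-pattern: the parent span has the same duration and meaning as its
// only child, so it adds noise without adding information.
#[instrument(skip_all)]
async fn get_value(key: &str) -> Option<Vec<u8>> {
    get_value_inner(key).await
}

#[instrument(skip_all)]
async fn get_value_inner(key: &str) -> Option<Vec<u8>> {
    let _ = key;
    None // stand-in: a real implementation would hit the store
}
```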

Review comments on: crates/key-value-azure/src/lib.rs, crates/key-value-azure/Cargo.toml, crates/key-value-redis/src/lib.rs, crates/key-value/src/lib.rs (outdated/resolved).
@lann (Collaborator) commented Mar 28, 2024

On naming: I think it could be nice to choose a different convention specifically for host components, where span names better reflect the guest's perspective. We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

Concretely, what about something like <host component>.<operation> e.g. spin-llm.infer?

@lann (Collaborator) commented Mar 28, 2024

OTEL conventions: I do think it's worth some effort to stick with these conventions, as it will help with familiarity and integration with other tooling. That said, I think everyone would prefer non-conventional names soon over conventional ones later, if that is the tradeoff.

@calebschoepp (Collaborator, Author)

Concretely, what about something like <host component>.<operation> e.g. spin-llm.infer?

I like this convention, but I don't know how to square it with the OTEL conventions. For example, spin-redis-outbound.get according to the semantic conventions should be something like GET key. Are you suggesting that we just stick to our own convention for all host components and ignore the semantic conventions? Would it make sense to combine them, e.g. spin-redis-outbound.get GET key, or to use a tightly nested span where the parent is spin-redis-outbound.get and the child is GET key?

We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

I'm not sure what you're driving at here. I know otel.name lets us set the name dynamically but we still have to choose one convention.

@calebschoepp (Collaborator, Author)

Also @lann I'm curious what your thoughts are on #[instrument(err)]. More specifically, there seems to be a pattern in host components of returning Result<Result<T>>, and err is only able to report on the outer result. Is it worth the work of manually emitting an event! on the inner result so that the span is marked as an error if the inner result contains an error?

@lann (Collaborator) commented Mar 28, 2024

don't know how to square it with the OTEL conventions

I would suggest preferring OTEL conventions where they are clear and come up with a nice convention that we can be internally consistent about where there is no OTEL convention.

@lann (Collaborator) commented Mar 28, 2024

We could also (mostly) set the otel.name, giving us more flexibility in naming without clashing with tracing span name conventions.

I'm not sure what you're driving at here. I know otel.name lets us set the name dynamically but we still have to choose one convention.

The tracing span name is usually just a function name and is always contextualized by the "target" (module name by default), so there isn't really a concern about disambiguating between modules. Using something like spin-llm.infer for the tracing span name would look weird to most tracing subscribers, so I'm suggesting (mostly) leaving those alone and just overriding the otel.name.
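
Concretely, that suggestion might look something like this sketch (the module and signature are illustrative, not Spin's actual API):

```rust
use tracing::instrument;

// In a hypothetical spin_llm module: the tracing span keeps its natural name
// ("infer", contextualized by the spin_llm target), while the name exported
// to OTEL is overridden via the otel.name field.
#[instrument(skip_all, fields(otel.name = "spin-llm.infer"))]
async fn infer(prompt: &str) -> anyhow::Result<String> {
    let _ = prompt;
    todo!("call the model")
}
```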

@lann (Collaborator) commented Mar 28, 2024

Is it worth doing the work of manually emitting an event! on the inner result so that the span is marked as an error if the inner result contains an error?

It depends... How hard is it going to be to debug that error if you don't emit it?

@calebschoepp (Collaborator, Author)

I would suggest preferring OTEL conventions where they are clear and come up with a nice convention that we can be internally consistent about where there is no OTEL convention.

I hear this, but I still feel a bit hung up on the inconsistency. It just feels off to me to have spin-local-llm.infer in one place and GET key in another when they represent the same level of abstraction. A tightly nested span of GET key seems like the fix to me, but in your opinion @lann am I just getting too hung up here on consistency?

The tracing span name is usually just a function name and is always contextualized by the "target" (module name by default), so there isn't really a concern about disambiguating between modules. Using something like spin-llm.infer for the tracing span name would look weird to most tracing subscribers, so I'm suggesting (mostly) leaving those alone and just overriding the otel.name.

Ack. Part of me wants to just do it statically in the instrument, though, if we don't need to do it dynamically, so that we don't have to clutter the code with otel.name = "different name", which would be confusing to people dropping into the codebase. It's not like we have or expect other tracing subscribers.

@lann (Collaborator) commented Mar 28, 2024

in your opinion @lann am I just getting too hung up here on consistency?

Yep! 🙂

@lann (Collaborator) commented Mar 28, 2024

Ack. Part of me wants to just do it statically in the instrument, though, if we don't need to do it dynamically, so that we don't have to clutter the code with otel.name = "different name", which would be confusing to people dropping into the codebase. It's not like we have or expect other tracing subscribers.

I don't feel strongly about this part. 🤷

@calebschoepp (Collaborator, Author)

Okay so the new strawman argument is:

  • Where there are existing OTEL semantic conventions we follow those.
  • Where there are not existing OTEL semantic conventions we will follow our own convention which is approximately <host component>.<operation>.
  • We will avoid tightly nested spans.
  • We will not trace things that are infallible and/or fast just for the sake of consistency.
  • We will prefer just merging something over having it perfectly consistent.

@itowlson any further feedback? I'm going to start implementing based on this next week.

@itowlson (Contributor) commented Apr 1, 2024

@calebschoepp The plan sounds great - thank you for restating it clearly, it really helps. But I would like to non-blockingly explore some bits from up thread.

At a certain point I expect that we'll have to just accept the #[instrument]s littered everywhere.

We'll certainly have to have something littered everywhere. But my long-term goals for these would be:

  • Make the things small - or, at least, make them non-obtrusive. When we start to see the common patterns in our instrumentation, it would be great to pull out at least the more verbose ones. This is most important for intra-function instrumentation: instrumentation on a function is less obtrusive because the reader is already at a clear and prominent "understanding boundary", if that makes sense.

  • Limit the knowledge a contributor needs to have. Contributors (and reviewers) shouldn't need to know the OpenTelemetry Semantic Conventions, or ideally to have too much mastery of our own conventions (or when to use one or the other). As we see common patterns, it would be great to figure out how we can pull them up into "pit of success" helpers that contributors can use.

These are definitely things we can't do without more experience! So we should push ahead with the plan as is, but come back in a few weeks or months and see what patterns and practices have emerged, and then what can be automated and what can go in the contributor guide. But definitely great to "prefer just merging" and learn as we go!

@calebschoepp (Collaborator, Author)

@itowlson @lann I've modified the code to align with the big discussion here. One last sanity check might be nice. I'll give you both about a day and a half to look and if I don't see any comments I'll just assume it is good to go.

@itowlson (Contributor) left a comment

Comments are nits (or even less than nits, vague frets). This looks good - thank you so much for the thoughtful discussion and careful evolution on this.

Review comments on: crates/key-value-redis/src/lib.rs, crates/outbound-redis/src/lib.rs (resolved).
@calebschoepp merged commit d377a61 into fermyon:main on Apr 3, 2024 (13 of 14 checks passed), and deleted the trace-all-the-things branch on May 29, 2024.