feat(*): Add tracing to some more host components #2398
Your strawman seems like a good starting point. We have to assume that Spin lives in a heterogeneous environment, and that Spin telemetry is going into the same handwave database as other telemetry sources: therefore, consistency with the other things in that environment would seem to trump personal aesthetics. That said, this is based entirely on my personal aesthetic of "what would make sense" rather than any experience of using telemetry tooling. So, ya know.
Something that might help with the first half of problem 2 is specific guidelines on the granularity of telemetry. You mention it being nice to know the stack trace: is this implying you envisage putting telemetry on every function at every layer? I'm guessing not, but then what counts as a significant "boundary" for telemetry? I agree knowing which implementation you're using is helpful, so is a record emitted only when entering a provider? I'm not trying to pre-empt your thought process here, just illustrating the kind of questions I'd want to know how to answer when writing code within this framework. That also makes me wonder how much we can encapsulate. We don't want the code base to become a big pile of |
As you suspected, definitely not. It would be insanity to trace every function.
Great question. For starters I think that the crates I identified above are a good starting point for a list of significant "boundaries" that need to be traced. Within each of these crates I would propose that we should only emit a span from a single layer in the crate i.e. only emit [...]
Side note: one could argue that we should still emit a span at every layer and we can mark some of those spans at the TRACE level so they are normally filtered out unless the user sets [...]
If we choose to only emit a span from one layer within a crate then the immediate question is do we do it at the interface level (e.g. [...]
@itowlson thoughts?
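The side note about marking inner-layer spans at the TRACE level can be illustrated with a small, std-only sketch. Everything here is hypothetical: the `Level` enum, `SpanSite`, and the span names `llm_infer` and `llm_local_infer` are made up for illustration and are not Spin's code or the `tracing` crate's API. The idea is only that a span declared at a more verbose level than the configured maximum is dropped.

```rust
// Hypothetical sketch: these types and names are illustrative assumptions,
// not taken from Spin or from the `tracing` crate.

/// Verbosity levels, declared from least to most verbose so that the derived
/// ordering gives Error < Warn < Info < Debug < Trace.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// A place in the code that would open a span, tagged with its level.
struct SpanSite {
    name: &'static str,
    level: Level,
}

/// Two layers of one host component: the interface layer is marked TRACE,
/// the provider layer is marked INFO.
fn example_sites() -> Vec<SpanSite> {
    vec![
        SpanSite { name: "llm_infer", level: Level::Trace },
        SpanSite { name: "llm_local_infer", level: Level::Info },
    ]
}

/// Names of the spans that survive filtering at `max_level`.
fn emitted_spans(sites: &[SpanSite], max_level: Level) -> Vec<&'static str> {
    sites
        .iter()
        .filter(|site| site.level <= max_level)
        .map(|site| site.name)
        .collect()
}
```

With a maximum level of `Level::Info` only the provider-layer span is kept; raising it to `Level::Trace` keeps both layers, which is the "every layer, but normally filtered" option described in the side note.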
I'm not sure what to do about this. At a certain point I expect that we'll have to just accept the [...]
More concerning than [...]
If we follow OTEL conventions we should expect that there will be a lot more code for dynamically setting attributes littered throughout our host components.
Choosing what to trace is always going to be a bit subjective but I'd start relatively small and grow based on feedback. Covering most network operations plus a few "known possibly slow/failure-prone" operations (app loading, sqlite, LLM) should get us close enough.
Left a few comments on specific operations, but in general I think we should mostly avoid instrumenting trivial (always fast, infallible) operations even if it means inconsistency between different e.g. KV implementations. Additionally, avoid most "tightly-nested" spans where a parent span is going to look essentially identical to its single child span.
On naming: I think it could be nice to choose a different convention for host components specifically where span names better reflect the guest's perspective. We could also (mostly) set the Concretely, what about something like |
OTEL conventions: I do think it's worth some effort to stick with these conventions as it will help with familiarity and integration with other tooling. That said, I think everyone would prefer non-conventional tracing soon over conventional tracing later, if that is the tradeoff.
I like this convention, but don't know how to square it with the OTEL conventions. For example
I'm not sure what you're driving at here. I know |
Also @lann I'm curious what your thoughts are on |
I would suggest preferring OTEL conventions where they are clear, and coming up with a nice convention that we can be internally consistent about where there is no OTEL convention.
The |
It depends... How hard is it going to be to debug that error if you don't emit it?
I hear this, but I still feel a bit hung up on the inconsistency. It just feels off to me to have
Ack. Part of me wants to just do it statically in the instrument though if we don't need to do it dynamically so that we don't have to clutter the code with |
Yep! 🙂
I don't feel strongly about this part. 🤷
Okay so the new strawman argument is:
@itowlson any further feedback? I'm going to start implementing based off this next week.
@calebschoepp The plan sounds great - thank you for restating it clearly, it really helps. But I would like to non-blockingly explore some bits from up thread.
We'll certainly have to have something littered everywhere. But my long-term goals for these would be:
These are definitely things we can't do without more experience! So we should push ahead with the plan as is, but come back in a few weeks or months and see what patterns and practices have emerged, and then what can be automated and what can go in the contributor guide. But definitely great to "prefer just merging" and learn as we go!
Signed-off-by: Caleb Schoepp <[email protected]>
Force-pushed from ef1d276 to b403607
Comments are nits (or even less than nits, vague frets). This looks good - thank you so much for the thoughtful discussion and careful evolution on this.
#2348 introduced basic observability support to Spin via the OpenTelemetry standard. It had only basic support for the tracing signal and so the logical next step was to extend the tracing to cover more Spin surface area (host components, triggers, etc.). This PR is my attempt to "trace all the things". The complexity of choosing good span names and deciding what to trace has waylaid my efforts though. This PR is no longer a comprehensive attempt to trace all of Spin. Rather it has become a first crack in the glass — an exploratory attempt to try and build consensus around how we want to trace Spin.
In my digging I've identified the following crates that we want to trace. A * after a crate means it was initially traced in #2348. Bolded crates I have added tracing to in this PR.
core * (lots of questions here about whether we should trace all the WASI stuff or not)
key-value-{azure, redis, sqlite}
llm-{local, remote} *
outbound-{http, mqtt, mysql, networking, pg}
outbound-redis
sqlite-{inproc, libsql}
trigger-http *
trigger-redis
variables
As I was going through the exercise of tracing a subset of these crates I ran into the issue of how to name the spans. As @rylev suggested, I wanted a consistent scheme for all the span names across Spin. I ended up opting for a naming scheme along the lines of verb_{optional adjective}_noun, e.g. execute_query_in_process_sqlite_db.
Problem 1: Very quickly this naming scheme started to fray around the edges. For example, with set_add_values_redis, should set actually be part of the noun, i.e. add_values_redis_set? Sure, but then that conflicts with the clean redis noun. Not only is this naming scheme not very consistent, it also isn't particularly user-friendly to read in a trace.
Problem 2: Lots of our host components have a base implementation that calls out to an underlying provider, e.g. llm and llm-local. It wasn't clear to me whether I should trace both of these function calls or not. On the one hand it is nice to be able to see the stack trace and know which implementation you're using. On the other hand it isn't very beginner friendly and makes the trace look cluttered for little extra information.
Enter more complexity stage left: OTEL semantic conventions. It turns out that OTEL has pretty detailed conventions for how to trace certain things like database calls or messaging systems. For example, let's look at the outbound-redis crate. OTEL has conventions for exactly this use case of interacting with Redis. It turns out the span names they recommend are low-cardinality versions of the Redis query, e.g. GET key. This is very different from the pattern of get_value_redis that I created above.
Problem 3: I'm not sure what to do with these OTEL conventions. It seems pretty obvious that we should use them in a lot of places. But I also see a lot of problems with following the conventions:
[...] llm. What do we do there?
[...] SELECT * FROM users while another is going to be shop.orders process. Another example: if we were to follow the convention to the letter the redis-trigger would become something like mychannel deliver instead of something like handle_redis_message. Likewise handle_http_request becomes GET /my/route/*. I can imagine it becoming confusing to a user of Spin who is just trying to understand when their Spin app is triggered.
[...] kv, cached_kv, and sqlite_kv. Do we still emit a span at each level of that stack? Do we just do it at the root underlying implementation? I don't know.
Well that was a whole lot of confusion and uncertainty that I just spewed out. My hope is that the discussion in this PR can help me wade through this uncertainty. If you're looking for concrete points to address I would recommend responding to problems 1 through 3.
As a final note here is a strawman that I would advocate and that we can all argue and diff against: We don't try to have any internal sense of consistent Spin spans. Instead we completely follow the upstream semantic conventions where appropriate. Where conventions don't exist we just wing it.
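To make the difference between the two naming styles concrete, here is a minimal std-only sketch. The helper functions are hypothetical and exist only for illustration; they simply reproduce the shapes of the example names quoted in this description (handle_http_request versus GET /my/route/*, and handle_redis_message versus mychannel deliver). They are not Spin code and not an OTEL library API.

```rust
// Hypothetical helpers: the function names and signatures are assumptions
// made for illustration only.

/// Name in the custom `verb_{optional adjective}_noun` scheme.
fn custom_span_name(verb: &str, adjective: Option<&str>, noun: &str) -> String {
    match adjective {
        Some(adj) => format!("{}_{}_{}", verb, adj, noun),
        None => format!("{}_{}", verb, noun),
    }
}

/// Name in the HTTP shape quoted above: the method followed by the
/// route pattern (not the concrete path, to keep cardinality low).
fn http_span_name(method: &str, route: &str) -> String {
    format!("{} {}", method, route)
}

/// Name in the messaging shape quoted above: the destination followed
/// by the operation.
fn messaging_span_name(destination: &str, operation: &str) -> String {
    format!("{} {}", destination, operation)
}
```

For the same HTTP trigger, `custom_span_name("handle", Some("http"), "request")` yields handle_http_request while `http_span_name("GET", "/my/route/*")` yields GET /my/route/*. The second form carries more request detail but no longer says "trigger" anywhere, which is the readability concern raised in Problem 3.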