Easy Ergonomic Telemetry in Elixir with Sibyl #45

vereis · 2023-02-02T20:06:37Z

Easy Ergonomic Telemetry in Elixir with Sibyl

One of the things I wanted to do was to start a series of posts which re-implement popular libraries in the BEAM ecosystem from first principles to help explain and teach how they work and perhaps why they work the way they do.

The only one I've managed to see through to date is my "Build you a Telemetry for such learn!" post which walks users through the BEAM's de-factor telemetry library, aptly named :telemetry.

When I was writing that post, one of the things I tried to do was explain what :telemetry did and why, but one question I kept getting after publishing was how we could actually use :telemetry in real Elixir projects. This is something I wanted to try and answer.

In the Elixir community, we have a lot of examples how :telemetry is used in the context of a library (i.e. how to consume telemetry events), but not neccessarily anything more than that.

As a learning from instrumenting several such systems while I was a consultant and during my time here at Vetspire, we built a library to make it as easy as possible to start using :telemetry right away! I'll start off by explaining high-level how we'd do things manually, then tie it all together with Sibyl.

Telemetry on the BEAM

As stated, the de-facto telemetry library on the BEAM (i.e. across Erlang, Elixir, and other BEAM-hosted languages) is :telemetry, which is made up of two high level concepts: "event-emission" and "event-handling".

Typically, a lot of code in the wild will show off "event-emission". This is when code is annotated with calls to the :telemetry library to denote something happened, for example:

defmodule MyApp.Users do
  def register_user(attrs) do
    :telemetry.execute([:my_app, :users, :register], %{email: attrs.email}, %{})

    %User{}
    |> User.changeset(attrs)
    |> Repo.insert()
    |> case do
      {:ok, %User{} = user} ->
        :telemetry.execute([:my_app, :users, :register, :success], %{email: attrs.email}, %{})
        {:ok, user}

      {:error, %Ecto.Changeset{} = error} ->
        :telemetry.execute([:my_app, :users, :register, :error], %{email: attrs.email}, %{})
        {:error, changeset}
    end
  end
end

The calls to :telemetry.execute/3 don't really do very much out-of-the-box! Only if an application running in the BEAM opts into receiving (handling) any listed events will anything fancy happen. That's done elsewhere via the following:

for event <- @my_app_events do
  :telemetry.attach("my-handler", event, &MyApp.Telemetry.handle_event/4, %{})
end

Once the above code is run anywhere on the same BEAM VM as the code emitting events, the :telemetry.execute/3 calls now automatically dynamically dispatch to MyApp.Telemetry.handle_event/4 which is intended to be provided by the user.

Its important to note that telemetry handlers are executed synchronously (and that a single event might be consumed by any number of handlers) so care must be taken to not execute very intensive business logic inside of handlers, but the intended use-case is to allow you to react to any telemetry event being emitted and do something like log to an ETS table, external analytics service, send messages in Slack, etc.

For more details in how this all is wired up internally, do take a look at the official :telemetry repo whose source code is very easy to follow (though its written in Erlang) or my Elixir re-implementation walkthrough, but that detail isn't particularly important for this post.

Something I lightly hinted at above is that telemetry events and handlers are global and are a form of dynamic dispatch! I've witnessed, written, and repented for various eldritch hacks as a result of this, but the main thing we use these facts for are to ship telemetry metadata somewhere else, be it Phoenix performance metadata, Ecto query metadata, Oban worker metadata, or your own apps internal business events!

Ultimately, that's all there is to do. I've seen and worked on many projects where we're just emitting events inline and spawning up a Supervisor and GenServers to listen to telemetry events and offload any intensive work to background processes. These background processes might batch stuff up, but ultimately they just POST to an external service and you've got telemetry set up end-to-end.

Since the main work is annotating your code and writing code to ship your metrics elsewhere, one problem I wanted to look into solving was to maximize my bang for buck: what was the least amount of code I had to write to be able to send my metrics anywhere?

Thankfully the Open Source community at large has a solution in place!

Telemetry off the BEAM

While not quite as de-facto as :telemetry on the BEAM, there is an open-source telemetry "protocol" which allows you to ship your metrics to multiple different external consumers called (again, aptly) OpenTelemetry.

Unlike :telemetry however, OpenTelemetry aims to be a more generic way to implement observability rather than just, well, telemetry for your applications and infrastructure. I won't go into very much detail here, but OpenTelemetry aims to handle three main "signals": Traces, Metrics, and Logs.

Since the BEAM already has an official Logger, and many integrations and solutions exist for shipping logs out to places like Google Cloud or DataDog, neither myself nor Sibyl will cover logging. Rather, I'm more focused and excited about Traces and Metrics!

In short, a metric is a single measurement about something in your system. Metrics are the bread-and-butter of :telemetry, anything you :telemetry.execute/3 is arguably a metric. On the flipside, traces are higher level: what happens when a user clicks a button?

The simple user registration example above can be seen as a collection of metrics, which in tandem denote a trace! The metric would be how many users have registered, which can be derived by looking at the count of [:my_app, :users, :register, :success] events, and the trace give us details like function runtime by comparing either [:my_app, :users, :register] with either [:my_app, :users, :register, :success] or [:my_app, :users, :register, :error].

OpenTelemetry formalizes the above concepts and is supported by various products such as DataDog, LogScale, New Relic, and more. You can check the supported vendors on this list though not all of these places are consumers of OpenTelemetry but rather are producers.

Assuming your organization already uses one of the supported vendors, nothing really needs to be done to hook up to Sibyl, however, if you're not already using one of those vendors, you can still use OpenTelemetry locally for debugging & development with Jaeger which is something I highly recommend.

To start doing OpenTelemetry stuff in Elixir, you can use the OpenTelemetry libraries and instrument your code a second time, or if Sibyl is a fit for you, then we handle this for you!

Introducing Sibyl

Sibyl is a relatively small library which aims to do two things: abstract away implementation details like :telemetry event handling, emission, OpenTelemetry instrumentation, and provide a better DX while doing it.

Auto-instrumentation

When you bring Sibyl into your project, all you have to do to start using it is to use Sibyl in any of your Elixir modules. Upon doing so, your modules get extended with a few telemetry-related superpowers thusly:

defmodule MyApp.Users do
  use Sibyl

  @decorate trace
  def register_user(attrs) do
    %User{}
    |> User.changeset(attrs)
    |> Repo.insert()
  end
end

defmodule MyApp.Orders do
  use Sibyl

  @decorate_all trace

  def add_to_basket(%User{} = user, %Product{} = product) do
    ...
  end

  def empty_basket(%User{} = user) do
    ...
  end
end

One minor, but real issue when using the low level :telemetry library is that as developers add events; code gets copied around, typoed, deleted, and more. Unfortunately, since :telemetry works by opted your application into handling said events, unless you're explicitly testing all events are raised as expected, it is very easy to events to silently break.

By using the @decorate trace directive, Sibyl will automatically wrap the function being traced with :start, :stop, and :exception events as well as provide timings, function parameters, and more to any event handlers which have opted into said events! Additionally, @decorate_all trace() can be used to auto-instrument every function in your module for you.

This reduces the amount of boilerplate code you need to write to emit events, as well as make sure that telemetry events are named consistently and aren't typoed.

Internally, Sibyl will compile all events that it emits into your Elixir modules themselves; meaning that the event handling example from before can be simplified to:

:telemetry.attach_many("my-handler", Sibyl.Events.reflect(), &MyApp.Telemetry.handle_event/4, %{})

While it completely makes sense for :telemetry to be opt-in when consuming events from third party libraries, for application code, if I'm defining an event, I'm expecting to consume it.

This is where Sibyl's event reflection comes in handy. Sibyl allows you to reflect on events your application defines both on a module-by-module basis, or application wide. Doing so reduces the number of moving parts needed to be kept up-to-date when adding/listening to new telemetry events being added.

Compile Time Safety

Sibyl also provides its own Sibyl.emit/1 function to emit telemetry events. The benefit of this function, in tandem with the fact that Sibyl compiles a list of events it knows are going to be emitted, is the fact that Sibyl can validate event emissions are legitimate at compile time!

By default, when using the @decorate trace and @decorate_all trace annotations, Sibyl will automatically define events based on the current module anme and function names plus :start | :stop | :exception, but you can also define custom events with define_event/1 in any module that uses Sibyl.

This means that we can expand the above example to look as follows:

defmodule MyApp.Users do
  use Sibyl

  define_event :registered
  define_event :email_validation_error
  define_event :password_length_error

  @decorate trace
  def register_user(attrs) do
    %User{}
    |> User.changeset(attrs)
    |> Repo.insert()
    |> case do
      {:ok, %User{} = user} ->
        emit :registered
        {:ok, user}

      {:error, %Ecto.Changeset{} = changeset} ->
        case hd(changeset.errors) do
          ... -> emit :email_validation_error
          ... -> emit :password_length_error
        end

        {:error, changeset}
    end
  end
end

Events are validated by default to belong to the current module, but there are cases where metrics maybe defined in a generic MyApp.Events module. If this is the case, Sibyl can still validate event emission at compile time via emit MyApp.Events, :something, provided :something was defined correctly.

Bridging Telemetry & Open Telemetry

Sibyl not only abstracts away the instrumentation of emitting telemetry events, but coupled with reflection, it also allows us to automatically attach events.

Since writing code to ship events elsewhere ended up being one of the main things I wanted to do, and I didn't want to instrument my code with multiple different libraries, Sibyl also has the ability to tie :telemetry events together with OpenTelemetry traces.

As I touched on above, any function which has been decorated will emit :start | :stop | :exception events. Sibyl ships with the concept of "handlers", the most important one of which is Sibyl.Handlers.OpenTelemetry.

Sibyl handlers are just pre-packaged telemetry handlers. Attaching the Sibyl.Handlers.OpenTelemetry handler will automatically build OpenTelemetry traces as your application emits standard :telemetry events, removing the need for manual instrumentation entirely. Sibyl is also aware of how to stitch traces together across different Tasks, and more generic multi-process trace stitching can be built with helper functions in that module.

This means that by simply invoking Sibyl.Handlers.attach_all_events/1 in your application's top-level supervisor, you can automatically inherit all of the power that OpenTelemetry tracing gives you.

Due to this simplicity, I personally always use Jaeger when developing locally, and in production, I'll have all the same tracing/metric stuff being exported to New Relic.

It is also easy to build custom handlers if needed, Sibyl provides an example Sibyl.Handlers.Logger handler which attaches to Elixir's standard logger, and Sibyl.Handlers.FlameGraph which can export traces to a file format compatible with Google Chrome's built in flamegraph tooling, Speedscope, and others.

Definitely do see Sibyl's Documentation for more details, but with that, we've covered the core usage of Sibyl!

Future Work

While Sibyl's core functionality is more or less complete, we're still trying to build out some super cool features at Vetspire to help us debug and make Sibyl the best tool it can be!

Additional Handlers

As demonstrated with Sibyl.Handlers.FlameGraph, Sibyl is powerful enough and flexible enough to be able to output traces to various formats such as function call graphs, flame graphs, OpenTelemetry and more!

As we find usecases to building out new handlers, they will be added to the repo on an experimental basis, but hopefully will become solid enough to be part of Sibyl core in the future.

One of the main things I want to do with this is implement a function call graph exporter, to produce output similar to mix xref.

Plugins

Another piece of low-hanging fruit is the ability to support plugins for all handlers which correctly implement the Sibyl.Handler behaviour. We're currently working on some prototype plugins, but the idea would be to allow Sibyl to automatically configure itself to hook into telemetry events emitted by third party libraries and potentially extend the amount of metadata they emit.

Currently, the API being worked on is as follows:

Sibyl.Handlers.attach_all_events("my-handler", plugins: [Sibyl.Plugins.Absinthe, Sibyl.Plugins.Phoenix, Sibyl.Plugins.Ecto])

This is very near ready for release, so stay tuned!

Dynamic

Lastly, possibly the most powerful feature of Sibyl already exists (though its not quite production-ready at the time of writing!). Sibyl is designed to take :telemetry events as input, but is flexible enough to be able to take any sort of input.

As a proof of concept, Sibyl.Dynamic leverages the BEAM's built in tracing functionality to allow you to build OpenTelemtry traces, FlameGraph traces, or traces for any other handler automatically during runtime rather than compile time on live production systems!

Currently you can try this out via running the following in a live remote shell:

iex(my_app@10.84.9.3)3> Sibyl.Dynamic.enable()
{:ok, #PID<0.16925.0>}
iex(my_app@10.84.9.3)4> Sibyl.Dynamic.trace(Enum, :map, 2)
{:ok, #PID<0.16925.0>}
iex(my_app@10.84.9.3)5> Enum.map([1, 2, 3], & &1 * 2) # emits traces to any attached handlers
[2, 4, 6]

Conclusion

Hopefully, I've been able to outline why I think Sibyl is super cool and ergonomic. Please feel free to take a look at Sibyl's official documentation, package link, and Git Repo.

Please do feel free to take Sibyl for a spin, thanks for reading!

The text was updated successfully, but these errors were encountered:

vereis changed the title ~~Draft: Easy Ergonomic Telemetry in Elixir with Sibyl~~ WIP: Easy Ergonomic Telemetry in Elixir with Sibyl Feb 2, 2023

vereis added Elixir Project labels Feb 2, 2023

vereis changed the title ~~WIP: Easy Ergonomic Telemetry in Elixir with Sibyl~~ Easy Ergonomic Telemetry in Elixir with Sibyl Feb 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Easy Ergonomic Telemetry in Elixir with Sibyl #45

Easy Ergonomic Telemetry in Elixir with Sibyl #45

vereis commented Feb 2, 2023 •

edited

Loading

Easy Ergonomic Telemetry in Elixir with Sibyl #45

Easy Ergonomic Telemetry in Elixir with Sibyl #45

Comments

vereis commented Feb 2, 2023 • edited Loading

Easy Ergonomic Telemetry in Elixir with Sibyl

Telemetry on the BEAM

Telemetry off the BEAM

Introducing Sibyl

Auto-instrumentation

Compile Time Safety

Bridging Telemetry & Open Telemetry

Future Work

Additional Handlers

Plugins

Dynamic

Conclusion

vereis commented Feb 2, 2023 •

edited

Loading