We're changing database #408

samuelcolvin · 2024-08-29T19:24:54Z

Rollout

We're gradually rolling out queries to the new database now. If you're affected, you'll see a banner like this:

If you notice queries taking longer or returning errors or different results, please let us know below or contact us via email or Slack.

If you need to continue querying the old database, you can do so by right-clicking on your profile picture in the top right and setting the query engine to 'TS' (Timescale, the old database):

To get rid of the warning banner, set the query engine to 'TS' and then back to 'FF' (FusionFire, the new database) again.

We will be increasing the percentage of users whose default query engine is FF over time and monitoring the impact. We may decrease it again if we notice problems. If you set a query engine explicitly to either TS or FF, this won't affect you. Otherwise, your query engine may switch back and forth. For most users, there shouldn't be a noticeable difference.
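As a rough illustration of the behaviour described above (an explicit setting always wins, otherwise the default depends on a rollout percentage), here is a minimal Python sketch. The hash-based bucketing and the names `rollout_bucket` and `choose_engine` are assumptions made for the example, not a description of how Logfire actually assigns users.

```python
import hashlib
from typing import Optional


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99 (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_engine(user_id: str, explicit: Optional[str], ff_percent: int) -> str:
    """Return 'TS' or 'FF'; an explicit choice overrides the rollout percentage."""
    if explicit in ("TS", "FF"):
        return explicit
    # Users in buckets below the current percentage default to the new engine.
    return "FF" if rollout_bucket(user_id) < ff_percent else "TS"


# An explicit setting is unaffected by changes to the percentage.
print(choose_engine("alice", "TS", ff_percent=100))  # TS
# Without an explicit setting, the default follows the rollout percentage.
print(choose_engine("alice", None, ff_percent=100))  # FF
print(choose_engine("alice", None, ff_percent=0))    # TS
```

With a scheme like this, raising or lowering the percentage only changes the default for users who have not picked an engine themselves.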

Most queries should be faster with FF, especially if they aggregate lots of data over a long time period. If your dashboards were timing out before with TS, try using FF. However, some specific queries that are very fast with TS are slower with FF. In particular, TS can look up trace and span IDs almost instantly without needing a specific time range. If you click on a link to a trace/span ID in a table, it will open the live view with a time range of 30 days because it doesn't know any better. If this doesn't load, reduce the time range.

Summary

We're changing the database that stores observability data in the Logfire platform from Timescale to a custom database built on Apache Datafusion.

This should bring big improvements in performance, but will lead to some SQL compatibility issues initially (details below).

Background

Timescale is great: it can be really performant when you know the kind of queries you regularly run (so you can set up continuous aggregates) and when you can enable their compression features (which both save money and make queries faster).

Unfortunately we can't use either of those features:

Our users can query their data however they like using SQL, so continuous aggregates aren't that helpful
Timescale's compression features are incompatible with row level permissions, and in Timescale/PostgreSQL we have to have row level permissions since we're running users' SQL directly against the database

Earlier this year, as the volume of data the Logfire platform received increased in the beta, these limitations became clearer and clearer.

The other more fundamental limitation of Timescale was their open/closed source business model.

The ideal data architecture for us (and any analytics database I guess) is separated storage and compute: data is stored in S3/GCS as parquet (or equivalent), with an external index used by the query/compute nodes. Timescale has this, but it's completely closed source. So we can either get a scalable architecture but be forced to use their SaaS, or run Timescale as a traditional "coupled storage and compute" database ourselves.

For lots of companies either of those solutions would be satisfactory, but if Logfire scales as we hope it does, we'd be scuppered with either.
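To make the "separated storage and compute" idea above a bit more concrete, here is a toy Python sketch of an external index that lets a compute node decide which parquet files it needs for a time range before reading anything from object storage. The `FileEntry` type, the `files_for_range` function and the `s3://example-bucket/...` paths are invented for illustration; this is not a description of the actual index.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileEntry:
    """One index row: where a file lives and which time span it covers."""
    path: str
    min_ts: datetime
    max_ts: datetime


def files_for_range(index: list[FileEntry], start: datetime, end: datetime) -> list[str]:
    """Return the paths of files whose time span overlaps [start, end]."""
    return [e.path for e in index if e.min_ts <= end and e.max_ts >= start]


def utc(day: int, hour: int) -> datetime:
    """Helper for building UTC timestamps in August 2024."""
    return datetime(2024, 8, day, hour, tzinfo=timezone.utc)


# Hypothetical index contents: one file per day.
INDEX = [
    FileEntry("s3://example-bucket/spans/2024-08-27.parquet", utc(27, 0), utc(27, 23)),
    FileEntry("s3://example-bucket/spans/2024-08-28.parquet", utc(28, 0), utc(28, 23)),
    FileEntry("s3://example-bucket/spans/2024-08-29.parquet", utc(29, 0), utc(29, 23)),
]

# Only the file covering 28 August needs to be fetched for this query.
selected = files_for_range(INDEX, utc(28, 6), utc(28, 18))
print(selected)  # ['s3://example-bucket/spans/2024-08-28.parquet']
```

The point of the sketch is only that the index lives outside the data files, so any number of compute nodes can consult it and then read just the files they need.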

Datafusion

We settled on Datafusion as the foundation for our new database for a few reasons:

It's completely open source so we can build the separated storage and compute solution we want
It's all Rust, and quite a few of our team are comfortable writing Rust, so the database isn't just a black box: we can dive in and improve it as we wish (as an example, Datafusion didn't have JSON querying support until we implemented it in datafusion-functions-json). Since starting to use Datafusion, our team has contributed 20 or 30 pull requests to Datafusion and associated projects like arrow-rs and sqlparser-rs
Datafusion is extremely extensible: we can adjust the SQL syntax and how queries are planned and run, and build indexes exactly as we need them
Datafusion's SQL parser has pretty good compatibility with Postgres, and again, it's just Rust so we can improve it fairly easily
The project is excellently run, part of Apache, leverages the Arrow/Parquet ecosystem, and is used by large organizations like InfluxDB, Apple and Nvidia

Transition

For the last couple of months we've been double-writing to Timescale and FusionFire (our cringey internal name for the new Datafusion-based database), working on improving reliability and performance of FusionFire for all types of queries.

FusionFire is now significantly (sometimes >10x) faster than Timescale for most queries. There are a few low-latency queries on very recent data that are still faster on Timescale, which we're working on improving.

Currently the live view, explore view, dashboards and alerts use Timescale by default. You can try FusionFire now for everything except alerts by right-clicking on your profile picture in the top right and selecting "FF" as the query engine.

In the next couple of weeks we'll migrate fully to FusionFire and retire Timescale.

We're working hard to make FusionFire more compatible with PostgreSQL (see apache/datafusion-sqlparser-rs#1398, apache/datafusion-sqlparser-rs#1394, apache/datafusion-sqlparser-rs#1360, apache/arrow-rs#6211, apache/datafusion#11896, apache/datafusion#11876, apache/datafusion#11849, apache/datafusion#11321, apache/arrow-rs#6319, apache/arrow-rs#6208, apache/arrow-rs#6197, apache/arrow-rs#6082, apache/datafusion#11307), but there are still a few expressions which currently don't run correctly (a lot related to intervals):

generate_series('2024-08-28 00:00:00'::timestamptz, '2024-08-28 00:00:60'::timestamptz, INTERVAL '10 seconds')
3 * interval '10 seconds'
end_timestamp - interval '1 second' > start_timestamp: will be fixed by apache/datafusion-sqlparser-rs#1398 ("Fix INTERVAL parsing to support expressions and units via dialect")
extract(seconds from end_timestamp - start_timestamp): second without the trailing s works thanks to apache/datafusion-sqlparser-rs#1394 ("allow DateTimeField::Custom with EXTRACT in Postgres")
JSON functions like jsonb_array_elements aren't available yet
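For reference, here is a small Python sketch of the results the first two expressions above are expected to produce, which can help when checking a query's output against either engine. It assumes that Postgres normalizes '00:00:60' to '00:01:00', that the session time zone is UTC, and that the step is positive; the `generate_series` function here is a local stand-in, not a database API.

```python
from datetime import datetime, timedelta, timezone


def generate_series(start: datetime, stop: datetime, step: timedelta):
    """Yield timestamps from start to stop inclusive, for a positive step."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current <= stop:
        yield current
        current += step


start = datetime(2024, 8, 28, 0, 0, 0, tzinfo=timezone.utc)
# '00:00:60' from the SQL above, written here as 00:01:00 (assumption).
stop = datetime(2024, 8, 28, 0, 1, 0, tzinfo=timezone.utc)

buckets = list(generate_series(start, stop, timedelta(seconds=10)))
print(len(buckets))  # 7

# Equivalent of: 3 * interval '10 seconds'
tripled = 3 * timedelta(seconds=10)
print(tripled)  # 0:00:30
```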

If you notice any other issues, please let us know on this issue or a new issue, and we'll let you know how quickly we can fix it.

samuelcolvin · 2024-08-30T14:25:02Z

Small update as I forgot to include this in the main issue:

We previously supported direct connection to the database using the PostgreSQL wire protocol, meaning you could connect with psql, pgcli or pandas, but also with BI tools that "talked Postgres" like Tableau, Google Looker Studio, Metabase etc.

(Side note: it wasn't actually a direct connection, but rather a pg wire protocol proxy we wrote which checked the query AST for functions we didn't want to call (like pg_sleep), managed authentication, then proxied the queries to Timescale.)

We've had to temporarily switch this off while we migrate to FusionFire.

Instead we're allowing users to query their data with SQL using an HTTP API (data can be returned as Arrow IPC, JSON or CSV); see #405. This should be available to use in the next few days.

We aim to reimplement the PG wire protocol connections with FusionFire in a few months. The hardest bit will be getting the information schemas to exactly match Postgres so the very complex schema introspection queries run by BI tools and pgcli work correctly. If you need this feature urgently, please let us know.
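To give a feel for the function check mentioned in the side note above, here is a deliberately simplified Python sketch. The real proxy inspected the query AST; this stand-in only scans the query text for identifiers followed by an opening parenthesis, after blanking out string literals. The names `DENIED_FUNCTIONS`, `called_functions` and `is_allowed` are hypothetical, and only pg_sleep comes from the description above.

```python
import re

# pg_sleep is the example from the side note; a real denylist would be longer.
DENIED_FUNCTIONS = {"pg_sleep"}

# Single-quoted SQL string literals, with '' as the escaped quote.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# An identifier immediately followed by an opening parenthesis.
FUNCTION_CALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def called_functions(sql: str) -> set[str]:
    """Return lower-cased names that look like function calls in the query."""
    # Blank out string literals so their contents are not mistaken for calls.
    without_strings = STRING_LITERAL.sub("''", sql)
    return {name.lower() for name in FUNCTION_CALL.findall(without_strings)}


def is_allowed(sql: str) -> bool:
    """Reject a query if it appears to call any denied function."""
    return called_functions(sql).isdisjoint(DENIED_FUNCTIONS)


print(is_allowed("SELECT count(*) FROM records"))     # True
print(is_allowed("SELECT pg_sleep(10)"))              # False
print(is_allowed("SELECT 'pg_sleep(10)' AS message"))  # True
```

A text scan like this misses cases such as quoted identifiers and also picks up keywords like IN, which is one reason to walk a parsed AST instead, as the proxy did.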

baggiponte · 2024-09-03T08:00:12Z

Well, congratulations first of all! (Though I'd call it logfusion 🪵⚛️)

but there are still a few expressions which currently don't run correctly (a lot related to intervals)

@MarcoGorelli must have worked on a lot of these features for Polars. I don't know if he can contribute, but he's a bit of a Time(zone) lord.

samuelcolvin · 2024-09-18T17:18:08Z

Thanks for reporting @frankie567, I've moved that to #433.

samuelcolvin pinned this issue Aug 29, 2024

samuelcolvin mentioned this issue Sep 18, 2024

WITHIN GROUP query syntax #433

Open

alexmojaki mentioned this issue Sep 23, 2024

Web Server Monitoring Dashboard: Requests Average Duration Query error #438

Closed
