VegaFusion SQL Database Integration #151

Open
jonmmease opened this issue Aug 1, 2022 · 0 comments
jonmmease commented Aug 1, 2022

Overview

VegaFusion currently has the ability to extract data transformations from Vega specifications for evaluation in Rust/Python using the DataFusion in-memory query engine. The initial focus of VegaFusion was to support reading data from the same data sources supported by Vega (JSON and CSV in particular) so as to allow VegaFusion to work with unmodified Vega specifications. Additionally, VegaFusion supports reading data from Arrow files, and this is used by the vegafusion-jupyter package to make pandas DataFrames available in the Vega-Lite specifications generated by Altair.

In addition, it would be desirable for VegaFusion to be capable of working with datasets stored in external SQL databases, and to have the ability to push data transformation operations from Rust/Python into the external database engine.

Prior Art

As far as I’m aware, there have been two efforts to automatically push Vega transform logic into external SQL databases.

ibis-vega-transform

ibis-vega-transform is a Python library and JupyterLab extension developed by Quansight. It translates pipelines of Vega transforms into Ibis query expressions, which can then be evaluated with a variety of Ibis database backends (in particular, OmniSci).

The JupyterLab extension makes two-way communication between the browser and the Python kernel possible, and this is used to support interactive visualizations like histogram brushing.

In contrast to the Planner approach used by VegaFusion, ibis-vega-transform replaces pipelines of Vega transforms with a custom transform type and then registers a JavaScript handler for this custom transform type. This JavaScript handler then uses Jupyter Comms to communicate with the Python portion of the library. The Python library converts the requested Vega transforms into an ibis query, evaluates the query, and sends the resulting dataset back to the browser using a Jupyter Comm.

An advantage of this approach is that the Vega JavaScript library remains in control of the entire specification, so the external ibis-vega-transform library does not need to maintain an independent task graph in order to support interactivity. A downside is that the result of every transform pipeline must be sent back to the client and stored in the Vega dataflow graph. Oftentimes this is not a problem, because the transform pipeline includes an aggregation stage that significantly reduces the dataset size. Sometimes, however, the result of a transform pipeline is quite large but is only used as input to other transform pipelines. In this case, it is advantageous to keep the large intermediary result cached on the server and not send it to the client at all. This use case is one of the reasons that VegaFusion uses the Planner+Runtime architecture described elsewhere.

VegaPlus

As described in its GitHub repo, VegaPlus is “a demo of how to run Vega by offloading computational-intensive operations to a separate database management system.”

In a similar fashion to ibis-vega-transform, VegaPlus registers a custom transform (transform-db) that is used to execute SQL queries. This transform accepts a query string, which can be dynamically generated using Vega signal expressions.

VegaPlus supports executing queries in the browser using duckdb-wasm, or by communicating with a Node.js middleware component that passes queries along to external databases.

A Spec Rewrite step is performed that converts standard Vega transforms into transform-db transforms.

Architectural Considerations

SQL Generation Time

An important architectural consideration is where and when the Vega transform to SQL generation logic takes place.

Generate SQL in Planner

Following VegaPlus, one option would be for the SQL string generation to happen as a rewriting step during the specification planning process.

An advantage of this approach is that the resulting specification can be inspected to understand how the transforms were converted into SQL.

A downside of this approach is that the planning process (at least in VegaFusion) does not have access to the input data schema. Without the data schema, it wouldn't be straightforward to accurately convert the JavaScript-like Vega expression language into the SQL expression language. Take, for example, the formula expression datum.a + datum.b: without information on the types of a and b, it's not straightforward (perhaps not possible) to decide whether this corresponds to SQL numeric addition or SQL string concatenation. Another common example would be a filter transform with the expression datum.c. SQL does not allow non-boolean columns to be used in a WHERE clause, so without the type of c it wouldn't be straightforward to determine how this should be converted to a boolean following JavaScript's type-based truthiness rules.
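To make the schema dependence concrete, here is a minimal hypothetical sketch (not VegaFusion code; all names are illustrative) of how the same Vega expressions would need to translate differently depending on the column types involved:

```rust
// Hypothetical sketch: translating Vega expressions to SQL requires
// knowing the column types. Names and rules here are illustrative only.

#[derive(Clone, Copy)]
enum ColType {
    Number,
    Text,
    Bool,
}

/// Translate `datum.a + datum.b` given the types of its inputs.
fn translate_plus(a: ColType, b: ColType) -> String {
    match (a, b) {
        // Two numbers: JavaScript `+` is numeric addition
        (ColType::Number, ColType::Number) => "a + b".to_string(),
        // Otherwise JavaScript `+` coerces to string concatenation
        _ => "CONCAT(a, b)".to_string(),
    }
}

/// Translate the filter expression `datum.c` into a boolean predicate,
/// following JavaScript's type-based truthiness rules.
fn translate_truthy_filter(c: ColType) -> String {
    match c {
        ColType::Bool => "c".to_string(),
        ColType::Number => "c IS NOT NULL AND c != 0".to_string(),
        ColType::Text => "c IS NOT NULL AND c != ''".to_string(),
    }
}

fn main() {
    println!("{}", translate_plus(ColType::Number, ColType::Number));
    println!("{}", translate_truthy_filter(ColType::Number));
}
```

Without the schema, neither function above could choose the correct branch, which is exactly the problem with generating SQL at planning time.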

Generate SQL in Runtime

An alternative approach is to have the Runtime dynamically generate SQL from transform pipelines during transform evaluation.

With this approach, the data schema is available during SQL generation and so it would be much easier to correctly generate SQL that corresponds to Vega expressions.

With this approach, it would also be possible to switch from external SQL evaluation to local DataFusion-based evaluation when a transform is encountered that cannot (yet) be converted to SQL.

SQL Dialect Support

It will be important to make it possible to generate SQL in a variety of dialects, where some dialects may only support a subset of the functionality of other dialects. As far as possible, I’d like to keep this dialect logic out of the core of VegaFusion.

SQL Connection Libraries

The Rust ecosystem's support for connecting to database systems is not nearly as mature as Python's, so it will be important to have a pluggable database connection paradigm that makes it possible to leverage custom database connection libraries (including from Python).

Design Proposal

Here is a sketch of the major components of a proposed design.

sqlgen-rs library for SQL dialect generation

I’ve created the start of a fork of the sqlparser-rs library (which DataFusion uses for parsing) currently called sqlgen-rs. This library will provide the following features:

  1. Parse SQL queries (but not DDL statements) in the DataFusion dialect of SQL into an AST
  2. JSON-serializable SQL query AST
  3. AST to SQL string generation logic that can be customized to a variety of output dialects. It may be helpful to look at sqlglot as a reference for some of the dialect translations we want to support.

(1) and (2) were simply a matter of forking sqlparser-rs and stripping out the non-query parser/AST logic. (3) is where the work will go to support custom dialects.

sqlgen-rs is a small library with a small set of dependencies. I haven't tried it yet, but it should be fairly straightforward to compile it to wasm and wrap it in a TypeScript API for eventual use in the browser.
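As an illustration of what dialect-customizable generation (feature 3 above) might look like, consider rendering a single AST node differently per output dialect. This is not the actual sqlgen-rs API, just a self-contained sketch of the idea:

```rust
// Illustrative sketch only; the real sqlgen-rs AST and API will differ.

enum Dialect {
    DataFusion,
    MySql,
    SqLite,
}

/// A tiny stand-in for one AST node: string concatenation of two columns.
struct Concat<'a> {
    left: &'a str,
    right: &'a str,
}

impl<'a> Concat<'a> {
    /// Generate SQL for this node in the requested output dialect.
    fn to_sql(&self, dialect: &Dialect) -> String {
        match dialect {
            // DataFusion and MySQL support the CONCAT function
            Dialect::DataFusion | Dialect::MySql => {
                format!("CONCAT({}, {})", self.left, self.right)
            }
            // SQLite traditionally uses the || operator instead
            Dialect::SqLite => format!("{} || {}", self.left, self.right),
        }
    }
}

fn main() {
    let node = Concat { left: "a", right: "b" };
    println!("{}", node.to_sql(&Dialect::SqLite));
}
```

The same AST is the single source of truth, and only the final string-rendering step varies by dialect, which is what keeps the dialect logic out of the VegaFusion core.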

SqlConnection trait as query evaluation abstraction

An SqlConnection trait will be added to VegaFusion (maybe as a new vegafusion-sql crate) that provides the following associated functions (aka methods):

  • fn tables(&self) -> (String, Schema)
    • The names and schemas of the tables that may be used in queries. The connection’s promise is that it can evaluate requested SQL queries in a context where the named tables are available. This could be a CTE table defined as the result of an upstream query.
  • fn fetch_query(&self, query: &str, schema: &Schema) -> Result<VegaFusionTable>
    • Evaluate a query where the provided query string should be evaluated in the context of the named tables above, and is expected to return a VegaFusionTable (which wraps Arrow RecordBatches) with the provided schema.
    • Initially, the query string should start with SELECT. Eventually, we could also support WITH CTEs. If it starts with WITH, it should be fairly straightforward for a connection to strip the leading WITH and concatenate the rest of the query onto a larger CTE.
  • fn dialect(&self) -> &Dialect
    • The sqlgen-rs dialect that should be used to construct the query string above
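The trait described above can be sketched as follows, with placeholder types standing in for the real Schema, VegaFusionTable, and Dialect. The signatures follow the issue text (except that tables returns a collection of pairs, since the description names multiple tables), and a trivial mock implementation shows how a connection might enforce the "queries start with SELECT" rule:

```rust
// Sketch of the proposed SqlConnection trait; placeholder types and the
// mock implementation are illustrative, not VegaFusion code.

struct Schema;          // stand-in for an Arrow schema
struct VegaFusionTable; // stand-in for a wrapper around Arrow RecordBatches
struct Dialect;         // stand-in for a sqlgen-rs dialect
type Result<T> = std::result::Result<T, String>;

trait SqlConnection {
    /// Names and schemas of the tables available to queries.
    fn tables(&self) -> Vec<(String, Schema)>;

    /// Evaluate `query` against the named tables; the result is expected
    /// to match the provided `schema`.
    fn fetch_query(&self, query: &str, schema: &Schema) -> Result<VegaFusionTable>;

    /// The sqlgen-rs dialect to use when constructing query strings.
    fn dialect(&self) -> &Dialect;
}

/// A mock connection that exposes one table and only accepts SELECT queries.
struct MockConnection {
    dialect: Dialect,
}

impl SqlConnection for MockConnection {
    fn tables(&self) -> Vec<(String, Schema)> {
        vec![("movies".to_string(), Schema)]
    }

    fn fetch_query(&self, query: &str, _schema: &Schema) -> Result<VegaFusionTable> {
        if query.trim_start().to_uppercase().starts_with("SELECT") {
            Ok(VegaFusionTable)
        } else {
            Err("only SELECT queries are supported initially".to_string())
        }
    }

    fn dialect(&self) -> &Dialect {
        &self.dialect
    }
}

fn main() {
    let conn = MockConnection { dialect: Dialect };
    println!("first table: {}", conn.tables()[0].0);
}
```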

For testing, we can define an implementation of SqlConnection that performs queries in SQLite using sqlx. The vegafusion-python-embed library can define an implementation that delegates to Python functions.

DataFusionConnection

A DataFusionConnection struct will implement the SqlConnection trait, providing a connection that utilizes DataFusion's SQL query engine.

SqlDataFrame struct as query builder

An SqlDataFrame struct is constructed from an SqlConnection and a table name and provides the following associated functions:

  • fn schema(&self) -> Schema
    • Return the schema of the query so far.
  • fn chain_query_str(&self, sql: &str) -> Result<Self>
    • Return a new SqlDataFrame that is the result of performing the requested query on the current SqlDataFrame.
    • The query should make use of a special table name provided by the parent_name associated function.
    • Regardless of the dialect of the underlying connection, this SQL string is to be provided in the native DataFusion/sqlgen-rs dialect of SQL.
    • This function will fail if the underlying connection dialect does not support some feature of the requested query.
    • To determine the schema of the resulting query, the query string is parsed and compiled by DataFusion.
  • fn chain_query(&self, query: &Query) -> Result<Self>
    • Same as chain_query_str, but accepts the query in AST form.
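A hypothetical sketch of how chain_query_str might compose queries, with each chained query written against the parent_name table and the accumulated query wrapped as a CTE. The real SqlDataFrame would also hold a connection and compile the query with DataFusion to determine its schema; everything here is illustrative:

```rust
// Hypothetical sketch of query chaining; not the real implementation.

struct SqlDataFrame {
    /// The SQL accumulated so far, always usable as a complete query.
    query: String,
    /// Counter used to generate the parent table alias for the next link.
    depth: usize,
}

impl SqlDataFrame {
    fn from_table(name: &str) -> Self {
        SqlDataFrame {
            query: format!("SELECT * FROM {}", name),
            depth: 0,
        }
    }

    /// The special table name the next chained query should refer to.
    fn parent_name(&self) -> String {
        format!("parent_{}", self.depth)
    }

    /// Chain `sql` (written against `parent_name()`) onto the current
    /// query by wrapping the current query in a CTE.
    fn chain_query_str(&self, sql: &str) -> SqlDataFrame {
        let query = format!("WITH {} AS ({}) {}", self.parent_name(), self.query, sql);
        SqlDataFrame {
            query,
            depth: self.depth + 1,
        }
    }
}

fn main() {
    let df = SqlDataFrame::from_table("movies")
        .chain_query_str("SELECT genre, COUNT(*) AS n FROM parent_0 GROUP BY genre");
    println!("{}", df.query);
}
```

This also illustrates why a connection that only accepts SELECT strings can still participate: as noted above, a chained query that begins with WITH can have the leading WITH stripped and its body concatenated onto a larger CTE.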

SQL Vega transform implementations

Currently, each supported Vega transform has an associated eval function that accepts and returns a DataFusion DataFrame. These functions will all be updated to accept an SqlDataFrame instead. By default, an SqlDataFrame will be constructed that wraps a DataFusionConnection. In this way, DataFusion will still be the built-in query engine, but we will use DataFusion through the SQL string API rather than through the DataFrame API.

For portability, the goal will be to use as few UDFs as possible. When UDFs are necessary, they should be written in such a way that sqlgen-rs can translate them into SQL in other dialects.

SqlConnection as “Inline Dataset”

The pre_transform_spec logic introduced the concept of an “inline dataset” that gets provided alongside the specification. Vega specifications can reference an inline dataset using a special URL format (e.g. vegafusion+dataset://dataset_name).

Currently these inline datasets are provided through the Python API as either pandas DataFrames or pyarrow tables. This approach can be extended to support using SqlConnection objects as inline datasets.
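Resolving the special URL scheme is straightforward; here is a minimal sketch. The scheme string comes from the issue text, while the lookup logic and names are illustrative:

```rust
// Sketch: resolving the vegafusion+dataset:// inline-dataset URL scheme.
// The scheme is from the proposal; the lookup logic is illustrative.

use std::collections::HashMap;

const INLINE_PREFIX: &str = "vegafusion+dataset://";

/// If `url` references an inline dataset, return its name.
fn inline_dataset_name(url: &str) -> Option<&str> {
    url.strip_prefix(INLINE_PREFIX)
}

fn main() {
    // Inline datasets registered alongside the spec (currently pandas
    // DataFrames or pyarrow tables; per this proposal, also SqlConnection
    // objects). The handle type here is a placeholder.
    let mut datasets: HashMap<&str, &str> = HashMap::new();
    datasets.insert("movies", "<table or connection handle>");

    if let Some(name) = inline_dataset_name("vegafusion+dataset://movies") {
        println!("resolved inline dataset: {}", name);
        assert!(datasets.contains_key(name));
    }
}
```

Ordinary URLs fall through the prefix check unchanged, so the same data-loading path can serve both inline datasets and regular remote files.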

@jonmmease jonmmease mentioned this issue Sep 22, 2022