
Background

Drill's internal data structures are vastly more complex than those of typical SQL engines because Drill, in an attempt to be more flexible, gave up the concepts that manage complexity in those engines.

A typical engine defines a schema which in turn defines a row format. Since SQL is tuple oriented, this is fairly easy to do: we simply define a table of columns, each of which points to its implementation. For example, for a row-based system:

| Name | Type    | Offset |
|------|---------|--------|
| a    | INT     | 0      |
| b    | VARCHAR | 4      |

```
    Row layout
     a         b                    Varchar buffer
| . . . . | . . . . |     ...     | . . . . . . .
r              |                  ^
               +------------------+
```

The table provides offsets into a tuple structure. The column b (VARCHAR) is a 4-byte pointer into a separate buffer area. This is just one model; there are many.

The key point is that, given a schema, we can follow indirections into the row. Here, given a row start address (r above), we can compute column locations.
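
To make the idea concrete, here is a minimal sketch of offset-based access for the two-column layout above (a: an INT at offset 0; b: a 4-byte pointer at offset 4). The class and method names are illustrative, not taken from any engine:

```java
import java.nio.ByteBuffer;

class RowReader {
  private static final int A_OFFSET = 0;   // INT a
  private static final int B_OFFSET = 4;   // VARCHAR b: 4-byte pointer
  private static final int ROW_WIDTH = 8;  // fixed-width portion of a row

  // Column a: simple offset arithmetic from the row start address "r".
  int readA(ByteBuffer rows, int rowIndex) {
    int r = rowIndex * ROW_WIDTH;
    return rows.getInt(r + A_OFFSET);
  }

  // Column b: the fixed-width slot holds a pointer into the varchar buffer.
  int readBPointer(ByteBuffer rows, int rowIndex) {
    int r = rowIndex * ROW_WIDTH;
    return rows.getInt(r + B_OFFSET);
  }
}
```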

In Drill, however, things are far more complex.

  • Data is stored in vectors. Given a row index, we can locate the column value within a vector, which is good. But there is no good way to correlate row offsets across vectors. (Drill has multiple ways to do this.)
  • There is no good way to go from a schema to a vector. Vectors are stored in unordered "containers". The most general case requires a name lookup for each column access, which is slow (see the sketch after this list).
  • Drill allows nested types: MAP, UNION (or LIST), DICT. As a result, indexing into a column is complex: we can't simply know that column "a" corresponds to some vector; each vector has structure (and so nested columns require name lookups).
  • Drill allows arrays (including arrays of MAPs). Thus, the row index for a set of vectors is not consistent: scalar a uses one index, but scalar b (within a map within an array) uses a different index.
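
The name-lookup cost in the second bullet is easy to see in a sketch. The interfaces below are hypothetical stand-ins for Drill's containers, not actual Drill APIs:

```java
// Hypothetical stand-ins for a vector container; not Drill APIs.
interface IntColumn { int getInt(int row); }
interface Batch {
  IntColumn columnByName(String name);  // hash lookup by column name
  int rowCount();
}

class LookupCost {
  // Slow: resolves the column name once per row.
  long slowSum(Batch batch) {
    long sum = 0;
    for (int row = 0; row < batch.rowCount(); row++) {
      sum += batch.columnByName("a").getInt(row);
    }
    return sum;
  }

  // Fast: resolves the name once, then uses positional access only.
  long fastSum(Batch batch) {
    IntColumn a = batch.columnByName("a");
    long sum = 0;
    for (int row = 0; row < batch.rowCount(); row++) {
      sum += a.getInt(row);
    }
    return sum;
  }
}
```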

The net result is that, over time, Drill has gotten ever more complex. One cannot simply create parallel structures which all reference the same column. For example, for a PROJECT operation:

| Name | Type    | Offset | Source | Transform |
|------|---------|--------|--------|-----------|
| a    | INT     | 0      | $in    | $copy     |
| b    | VARCHAR | 4      | $fn    | $concat   |
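
If columns were addressable this way, the PROJECT table could be a plain data structure that an interpreter walks once per row. A hypothetical sketch (none of these names exist in Drill):

```java
import java.util.List;

// One row of the projection table above.
class ProjectEntry {
  final String name;       // output column name
  final int offset;        // offset in the output tuple
  final String source;     // $in, $fn, ...
  final String transform;  // $copy, $concat, ...

  ProjectEntry(String name, int offset, String source, String transform) {
    this.name = name;
    this.offset = offset;
    this.source = source;
    this.transform = transform;
  }
}

class ProjectTable {
  // The interpreter simply walks this table for each row.
  static final List<ProjectEntry> TABLE = List.of(
      new ProjectEntry("a", 0, "$in", "$copy"),
      new ProjectEntry("b", 4, "$fn", "$concat"));
}
```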

Instead, since one cannot easily address either vectors or values, Drill uses complex structures. Each structure must be a tree (to handle arbitrary data structures). Each column in the table corresponds to a separate tree structure, and each tree structure must link to the next one in line. Something like:

```
    Projection           Accessor          Vector             DrillBuf
       () -------------->  () ---------> (Container)
     /    \              /    \            /    \
   (a)    (b) --------> (a)   (b) ------ (vv)   (vv) -----> (buf)  (buf)
```

And so on through as many layers of MAP, ARRAY, DICT and other structures as are needed. Linkages are pointers. This causes odd behavior: operators bind to vectors, but the contents of those vectors change. Accessors contain pointers to vectors, and so work OK for single-input operators, but require complex pointer indirection for multiple-input operators (such as merge) or for hyper-batches.

Each column has different semantics, so each new layer requires that we re-implement a whole new set of structures: one for each vector type. Each time we add a new vector type, we must add a new class to each layer. This ends up with O(n²) complexity, which is not a "Good Thing."
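
The growth pattern looks roughly like this (the class names are hypothetical):

```java
// One class per (vector type, layer) pair.
class IntVectorReader     { /* reads an INT vector */ }
class VarCharVectorReader { /* reads a VARCHAR vector */ }
class IntVectorWriter     { /* writes an INT vector */ }
class VarCharVectorWriter { /* writes a VARCHAR vector */ }
// Adding a DECIMAL vector now requires DecimalVectorReader,
// DecimalVectorWriter, plus a counterpart in every other layer
// (projection, copier, ...).
```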

Summary of the Problem

SQL systems are designed to interpret programs written in SQL. To do that they apply a large number of transforms to each column value in each row. A tuple layout allows the interpreter to walk tables that say what to do for each row.

In Drill, we try to model arbitrary tree data structures. And, we allow the tree structures to change order and content at any time. We want to interpret operations on these trees. Doing so is very, very complex.

Since we cannot use simple tables, we end up adding a new tree structure at each level of abstraction. Since each column behaves differently, we reproduce the type hierarchy at each level. Each node points to its children in the current abstraction, and to its counterpart in a lower-level abstraction. And since the structure can change, the entire hierarchy must be dynamic, and can change at any batch boundary.

This is far too complex. SQL is not designed for arbitrary tree structures. So, even if we got it to work, SQL can't exploit the structure.

We need a simpler solution.

Goal

The goal of this analysis is to answer the following questions:

  • What are the major existing abstractions (the layers of trees)?
  • Can we find a way to convert the trees into tables via effective column indexing?
  • How can we simplify the abstraction layers?

Existing Abstractions

Containers

  • DrillBuf - Semi-typed, physical direct-memory representation of a column.
  • ValueVector - Typed collection of column values.
  • VectorAccessible - Iterator over a set of vectors.
  • VectorContainer - Vectors plus BatchSchema plus SV mode
  • RecordBatch - Container plus SVs plus all of the operator machinery
  • WritableBatch - "Specialized version of record batch that moves out buffers and preps them for writing."
  • RowSet and subclasses - Generalized wrapper around a set of vectors
  • BatchAccessor - Subset of RecordBatch methods for accessing a batch & SVs.

Accessors

  • ValueVector.Mutator, Accessor - Low level per-vector access.
  • ComplexWriter, ComplexReader - Older JSON-like accessors.
  • ColumnWriter, ColumnReader - Newer accessors (see the sketch after this list).
  • Generated code - Directly manipulates value vectors
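
For reference, a sketch of the newer accessors in use, following the RowSet test framework that accompanies them. Exact class names and package paths have moved across Drill versions, so imports are omitted and method names may differ:

```java
void buildAndRead(BufferAllocator allocator) {
  TupleMetadata schema = new SchemaBuilder()
      .add("a", MinorType.INT)
      .add("b", MinorType.VARCHAR)
      .buildSchema();

  // Writers populate the underlying vectors without exposing them.
  RowSet rows = new RowSetBuilder(allocator, schema)
      .addRow(1, "first")
      .addRow(2, "second")
      .build();

  // Resolve column readers once, then iterate positionally.
  RowSetReader reader = rows.reader();
  ScalarReader a = reader.scalar("a");
  ScalarReader b = reader.scalar("b");
  while (reader.next()) {
    int aValue = a.getInt();
    String bValue = b.getString();
    // ... use aValue, bValue ...
  }
  rows.clear();
}
```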

Metadata

  • MaterializedField - Description of a vector (and its children)
  • BatchSchema - List of MaterializedField in no specific order
  • ColumnMetadata - Enriched alternative to MaterializedField
  • TupleMetadata - Enriched form of BatchSchema (and schema for a map)
  • SerializedField - Protobuf column description for RPC
  • RecordBatchDef - Protobuf batch definition for RPC

Operators

  • RecordBatch - Combination of a batch of vectors plus a Volcano iterator. Is a VectorAccessible.
  • OperatorRecordBatch - Wrapper around newer, simpler operator structure to look like a RecordBatch
  • OperatorExec - Interface for the "guts" of an operator.
  • ?? - Shim around a RecordBatch to make it look like an OperatorExec

Desired Abstractions

  • Column schema - Description of one column.
  • Tuple schema - Description of a set of columns: indexed by name and position.
  • Column vector - Opaque (to clients) storage of column data
  • Column bundle - Collection of vectors, indexed by name (slow) or position.
  • Column accessors - Read/write access to a bundle of vectors
  • Record set - Vector bundle plus a record count and schema.
  • Record batch - Record set plus possible SV
  • Operator - Follows Volcano protocol, provides an output record batch

Rough structure:

Column vector o-- `ValueVector` o-- `DrillBuf`
                                o-- `Mutator`, `Accessor`
                                o-- `MaterializedField`
              o-- `ColumnMetadata`

Column bundle o-- Column vector *
              o-- `TupleMetadata`
              o-- Row count

Column batch o-- Column bundle
             o-- Selection vector mode
             o-- Selection vector

Conversions:

Column vector --> `ColumnWriter`
              --> `ColumnReader`

Column bundle --> `RowSetWriter`
              --> `RowSetReader`

Column batch --> `RowSetReader` (with SV)
             --> `WritableBatch`
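
A hypothetical sketch of the central new abstraction, the column bundle (named VectorBundle in the mapping below). The methods are assumptions, not an existing interface:

```java
interface VectorBundle {
  TupleMetadata schema();           // indexed by name and position
  int rowCount();
  ValueVector vector(int index);    // fast: positional access
  ValueVector vector(String name);  // slow: name lookup
}
```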

Challenges

  • Cannot change the client API: must return existing vector, metadata structures to existing JDBC, C++ clients.
  • An a-priori schema can include children; a vector-description schema should not duplicate vector structure.
  • Required stickiness of vectors: two operators share pointers to vectors, but this becomes awkward for scan and merge batches.
  • SV4 complexity: normal and hyper batches

External APIs that must not change:

  • Serialized form of batches: data, metadata
  • Protobuf form of metadata

Strategy

Leave existing structures alone: wrap them in new ones. Gradually migrate operators. Then, remove the old abstractions (which, at that point, are just shims on top of the new ones).

Gradually migrate all vector access to accessors, to allow eventual evolution of underlying physical storage.
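
As a concrete example of the wrap-then-migrate step, the new metadata class can simply delegate to the old one so that both APIs stay valid during migration. A minimal sketch, assuming a shim class of our own (only MaterializedField and its getName() are real):

```java
class ColumnMetadataShim {
  private final MaterializedField field;  // existing structure, unchanged

  ColumnMetadataShim(MaterializedField field) {
    this.field = field;
  }

  // The new, enriched API delegates to the old one...
  String name() { return field.getName(); }

  // ...while old clients can still obtain the original structure.
  MaterializedField materializedField() { return field; }
}
```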

Mapping

Very preliminary mapping:

| Current | Revised | Public API? | Notes |
|---------|---------|-------------|-------|
| DrillBuf | Same | Yes | Now holds buffer, not used for value read/write. |
| ValueVector | Same | Yes | Internally, value access is via accessors. |
| ValueVector.Accessor, Mutator | Same | Yes | Deprecated internally. |
| MaterializedField | Same | Yes | Deprecated internally in favor of ColumnMetadata. |
| ComplexWriter, ComplexReader | ColumnWriter, ColumnReader | Partial | Exposed in UDF definition. |
| VectorContainer | VectorBundle | Check | |

Steps

  • Define vector bundle as an internal helper abstraction
  • Refactor readers/writers per the svr-exp3 code
    • Readers & writers work with vector bundles
    • Builders convert multiple vector structures to a bundle internally
  • Rationalize the BatchAccessor
    • Original purpose: interface for batch-related methods of RecordBatch
    • Has morphed into a way of holding, tracking batches. Revise?
  • Remove SV4 from public API, use only internally
    • Copier handles SV4 cleanly
    • Remove old external sort
    • Managed sort does SVR internally so exposes SV2 or none
    • What other operators produce, consume SV4?
  • Migrate to batch accessor (see the sketch after this list)
    • Add a batch accessor to RecordBatch as a wrapper around itself.
    • Extend it to hold the container and SVs; RecordBatch methods defer to the batch accessor
    • Gradually replace direct RecordBatch calls with calls to the batch accessor
  • ColumnMetadata wrapper around a MaterializedField
  • TupleMetadata wrapper around a BatchSchema
  • VectorBundle to hold a set of vectors
    • In operators that use ad-hoc structures, replace them with the VectorBundle
    • VectorBundle wrapper around a VectorAccessible or VectorContainer
    • In operators that work with a VectorContainer, change to use VectorBundle instead
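
A sketch of the first batch-accessor migration step: an accessor that wraps a RecordBatch and forwards to it. The wrapper and its method names are illustrative; the real BatchAccessor interface differs:

```java
class RecordBatchAccessor {
  private final RecordBatch batch;

  RecordBatchAccessor(RecordBatch batch) {
    this.batch = batch;
  }

  // Forward the batch-related methods; the operator machinery stays behind.
  BatchSchema schema() { return batch.getSchema(); }
  int rowCount()       { return batch.getRecordCount(); }
  // The container and SVs would be exposed the same way as migration proceeds.
}
```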