
Background

Drill's internal data structures are vastly more complex than those of typical SQL engines because Drill, in an attempt to be more flexible, gave up the concepts that manage complexity in those engines.

A typical engine defines a schema which in turn defines a row format. Since SQL is tuple oriented, this is fairly easy to do: we simply define a table of columns, each of which points to its implementation. For example, for a row-based system:

| Name | Type    | Offset |
|------|---------|--------|
| a    | INT     | 0      |
| b    | VARCHAR | 4      |

```
    Row layout
     a         b                    Varchar buffer
| . . . . | . . . . |     ...     | . . . . . . .
r              |                  ^
               +------------------+
```

The table provides offsets into a tuple structure. The column b (VARCHAR) is a 4-byte pointer into a separate buffer area. This is just one model; there are many.

The key point is that, given a schema, we can follow indirections into the row. Here, given a row start address (r above), we can compute column locations.
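
To make the idea concrete, here is a minimal sketch of offset-based access for the two-column layout above (a: an INT at offset 0; b: a 4-byte pointer at offset 4). The class and method names are illustrative, not taken from any engine:

```java
import java.nio.ByteBuffer;

class RowReader {
  private static final int A_OFFSET = 0;   // INT a
  private static final int B_OFFSET = 4;   // VARCHAR b: 4-byte pointer
  private static final int ROW_WIDTH = 8;  // fixed-width portion of a row

  // Column a: simple offset arithmetic from the row start address "r".
  int readA(ByteBuffer rows, int rowIndex) {
    int r = rowIndex * ROW_WIDTH;
    return rows.getInt(r + A_OFFSET);
  }

  // Column b: the fixed-width slot holds a pointer into the varchar buffer.
  int readBPointer(ByteBuffer rows, int rowIndex) {
    int r = rowIndex * ROW_WIDTH;
    return rows.getInt(r + B_OFFSET);
  }
}
```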

In Drill, however, things are far more complex.

  • Data is stored in vectors. Given a row index, we can locate the column value within a vector, which is good. But there is no good way to correlate row offsets across vectors. (Drill has multiple ways to do this.)
  • There is no good way to go from a schema to a vector. Vectors are stored in unordered "containers". The most general case requires a name lookup for each column access, which is slow (see the sketch after this list).
  • Drill allows nested types: MAP, UNION (or LIST), DICT. As a result, indexing into a column is complex: we can't simply know that column "a" corresponds to some vector; each vector has structure (and so nested columns require name lookups).
  • Drill allows arrays (including arrays of MAPs). Thus, the row index for a set of vectors is not consistent: scalar a uses one index, but scalar b (within a map within an array) uses a different index.
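
The name-lookup cost in the second bullet is easy to see in a sketch. The interfaces below are hypothetical stand-ins for Drill's containers, not actual Drill APIs:

```java
// Hypothetical stand-ins for a vector container; not Drill APIs.
interface IntColumn { int getInt(int row); }
interface Batch {
  IntColumn columnByName(String name);  // hash lookup by column name
  int rowCount();
}

class LookupCost {
  // Slow: resolves the column name once per row.
  long slowSum(Batch batch) {
    long sum = 0;
    for (int row = 0; row < batch.rowCount(); row++) {
      sum += batch.columnByName("a").getInt(row);
    }
    return sum;
  }

  // Fast: resolves the name once, then uses positional access only.
  long fastSum(Batch batch) {
    IntColumn a = batch.columnByName("a");
    long sum = 0;
    for (int row = 0; row < batch.rowCount(); row++) {
      sum += a.getInt(row);
    }
    return sum;
  }
}
```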

The net result is that, over time, Drill has gotten ever more complex. One cannot simply create parallel structures which all reference the same column. For example, for a PROJECT operation:

| Name | Type    | Offset | Source | Transform |
|------|---------|--------|--------|-----------|
| a    | INT     | 0      | $in    | $copy     |
| b    | VARCHAR | 4      | $fn    | $concat   |
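
If columns were addressable this way, the PROJECT table could be a plain data structure that an interpreter walks once per row. A hypothetical sketch (none of these names exist in Drill):

```java
import java.util.List;

// One row of the projection table above.
class ProjectEntry {
  final String name;       // output column name
  final int offset;        // offset in the output tuple
  final String source;     // $in, $fn, ...
  final String transform;  // $copy, $concat, ...

  ProjectEntry(String name, int offset, String source, String transform) {
    this.name = name;
    this.offset = offset;
    this.source = source;
    this.transform = transform;
  }
}

class ProjectTable {
  // The interpreter simply walks this table for each row.
  static final List<ProjectEntry> TABLE = List.of(
      new ProjectEntry("a", 0, "$in", "$copy"),
      new ProjectEntry("b", 4, "$fn", "$concat"));
}
```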

Instead, since one cannot easily address either vectors or values, Drill uses complex structures. Each structure must be a tree (to handle arbitrary data structures). Each column in the table corresponds to a separate tree structure, and each tree structure must link to the next one in line. Something like:

```
    Projection           Accessor          Vector             DrillBuf
       () -------------->  () ---------> (Container)
     /    \              /    \            /    \
   (a)    (b) --------> (a)   (b) ------ (vv)   (vv) -----> (buf)  (buf)
```

And so on through as many layers of MAP, ARRAY, DICT and other structures as are needed. Linkages are pointers. This causes odd behavior: operators bind to vectors, but the contents of those vectors change. Accessors contain pointers to vectors, and so work OK for single-input operators, but require complex pointer indirection for multiple-input operators (such as merge) or for hyper-batches.

Each column has different semantics, so each new layer requires that we re-implement a whole new set of structures: one for each vector type. Each time we add a new vector type, we must add a new class to each layer. This ends up with O(n²) complexity, which is not a "Good Thing."
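
The growth pattern looks roughly like this (the class names are hypothetical):

```java
// One class per (vector type, layer) pair.
class IntVectorReader     { /* reads an INT vector */ }
class VarCharVectorReader { /* reads a VARCHAR vector */ }
class IntVectorWriter     { /* writes an INT vector */ }
class VarCharVectorWriter { /* writes a VARCHAR vector */ }
// Adding a DECIMAL vector now requires DecimalVectorReader,
// DecimalVectorWriter, plus a counterpart in every other layer
// (projection, copier, ...).
```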

Summary of the Problem

SQL systems are designed to interpret programs written in SQL. To do that they apply a large number of transforms to each column value in each row. A tuple layout allows the interpreter to walk tables that say what to do for each row.

In Drill, we try to model arbitrary tree data structures. And, we allow the tree structures to change order and content at any time. We want to interpret operations on these trees. Doing so is very, very complex.

Since we cannot use simple tables, we end up adding a new tree structure at each level of abstraction. Since each column behaves differently, we reproduce the type hierarchy at each level. Each node points to its children in the current abstraction, and to its counterpart in a lower-level abstraction. And since the structure can change, the entire hierarchy must be dynamic, and can change at any batch boundary.

This is far too complex. SQL is not designed for arbitrary tree structures. So, even if we got it to work, SQL can't exploit the structure.

We need a simpler solution.

Goal

The goal of this analysis is to answer the following questions:

  • What are the major existing abstractions (the layers of trees)?
  • Can we find a way to convert the trees into tables via effective column indexing?
  • How can we simplify the abstraction layers?

Existing Abstractions

Containers

  • DrillBuf - Semi-typed, physical direct-memory representation of a column.
  • ValueVector - Typed collection of column values.
  • VectorAccessible - Iterator over a set of vectors.
  • VectorContainer - Vectors plus BatchSchema plus SV mode
  • RecordBatch - Container plus SVs plus all of the operator machinery
  • WritableBatch - "Specialized version of record batch that moves out buffers and preps them for writing."
  • RowSet and subclasses - Generalized wrapper around a set of vectors
  • BatchAccessor - Subset of RecordBatch methods for accessing a batch & SVs.

Accessors

  • ValueVector.Mutator, Accessor - Low level per-vector access.
  • ComplexWriter, ComplexReader - Older JSON-like accessors.
  • ColumnWriter, ColumnReader - Newer accessors (see the sketch after this list).
  • Generated code - Directly manipulates value vectors
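
For reference, a sketch of the newer accessors in use, following the RowSet test framework that accompanies them. Exact class names and package paths have moved across Drill versions, so imports are omitted and method names may differ:

```java
void buildAndRead(BufferAllocator allocator) {
  TupleMetadata schema = new SchemaBuilder()
      .add("a", MinorType.INT)
      .add("b", MinorType.VARCHAR)
      .buildSchema();

  // Writers populate the underlying vectors without exposing them.
  RowSet rows = new RowSetBuilder(allocator, schema)
      .addRow(1, "first")
      .addRow(2, "second")
      .build();

  // Resolve column readers once, then iterate positionally.
  RowSetReader reader = rows.reader();
  ScalarReader a = reader.scalar("a");
  ScalarReader b = reader.scalar("b");
  while (reader.next()) {
    int aValue = a.getInt();
    String bValue = b.getString();
    // ... use aValue, bValue ...
  }
  rows.clear();
}
```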

Metadata

  • MaterializedField - Description of a vector (and its children)
  • BatchSchema - List of MaterializedField in no specific order
  • ColumnMetadata - Enriched alternative to MaterializedField
  • TupleMetadata - Enriched form of BatchSchema (and schema for a map)
  • SerializedField - Protobuf column description for RPC
  • RecordBatchDef - Protobuf batch definition for RPC

Operators

  • RecordBatch - Combination of a batch of vectors plus a Volcano iterator. Is a VectorAccessible.
  • OperatorRecordBatch - Wrapper around newer, simpler operator structure to look like a RecordBatch
  • OperatorExec - Interface for the "guts" of an operator.
  • ?? - Shim around a RecordBatch to make it look like an OperatorExec

Desired Abstractions

  • Column schema - Description of one column.
  • Tuple schema - Description of a set of columns: indexed by name and position.
  • Column vector - Opaque (to clients) storage of column data
  • Column bundle - Collection of vectors, indexed by name (slow) or position.
  • Column accessors - Read/write access to a bundle of vectors
  • Record set - Vector bundle plus a record count and schema.
  • Record batch - Record set plus possible SV
  • Operator - Follows Volcano protocol, provides an output record batch

Rough structure:

Column vector o-- `ValueVector` o-- `DrillBuf`
                                o-- `Mutator`, `Accessor`
                                o-- `MaterializedField`
              o-- `ColumnMetadata`

Column bundle o-- Column vector *
              o-- `TupleMetadata`
              o-- Row count

Column batch o-- Column bundle
             o-- Selection vector mode
             o-- Selection vector

Conversions:

Column vector --> `ColumnWriter`
              --> `ColumnReader`

Column bundle --> `RowSetWriter`
              --> `RowSetReader`

Column batch --> `RowSetReader` (with SV)
             --> `WritableBatch`
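
A hypothetical sketch of the central new abstraction, the column bundle (named VectorBundle in the mapping below). The methods are assumptions, not an existing interface:

```java
interface VectorBundle {
  TupleMetadata schema();           // indexed by name and position
  int rowCount();
  ValueVector vector(int index);    // fast: positional access
  ValueVector vector(String name);  // slow: name lookup
}
```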

Challenges

  • Cannot change the client API: must return existing vector, metadata structures to existing JDBC, C++ clients.
  • An a-priori schema can include children; a vector-description schema should not duplicate vector structure.
  • Required stickiness of vectors: two operators share pointers to vectors, but this becomes awkward for scan and merge batches.
  • SV4 complexity: normal and hyper batches

External APIs that must not change:

  • Serialized form of batches: data, metadata
  • Protobuf form of metadata

Strategy

Leave existing structures alone: wrap them in new ones. Gradually migrate operators. Then, remove the old abstractions (which, at that point, are just shims on top of the new ones).

Gradually migrate all vector access to accessors, to allow eventual evolution of underlying physical storage.
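
As a concrete example of the wrap-then-migrate step, the new metadata class can simply delegate to the old one so that both APIs stay valid during migration. A minimal sketch, assuming a shim class of our own (only MaterializedField and its getName() are real):

```java
class ColumnMetadataShim {
  private final MaterializedField field;  // existing structure, unchanged

  ColumnMetadataShim(MaterializedField field) {
    this.field = field;
  }

  // The new, enriched API delegates to the old one...
  String name() { return field.getName(); }

  // ...while old clients can still obtain the original structure.
  MaterializedField materializedField() { return field; }
}
```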

Mapping

Very preliminary mapping:

| Current | Revised | Public API? | Notes |
|---------|---------|-------------|-------|
| DrillBuf | Same | Yes | Now holds buffer, not used for value read/write. |
| ValueVector | Same | Yes | Internally, value access is via accessors. |
| ValueVector.Accessor, Mutator | Same | Yes | Deprecated internally. |
| MaterializedField | Same | Yes | Deprecated internally in favor of ColumnMetadata. |
| ComplexWriter, ComplexReader | ColumnWriter, ColumnReader | Partial | Exposed in UDF definition. |
| VectorContainer | VectorBundle | Check | |

Steps

  • Define vector bundle as an internal helper abstraction
  • Refactor readers/writers per the svr-exp3 code
    • Readers & writers work with vector bundles
    • Builders convert multiple vector structures to a bundle internally
  • Rationalize the BatchAccessor
    • Original purpose: interface for batch-related methods of RecordBatch
    • Has morphed into a way of holding, tracking batches. Revise?
  • Remove SV4 from public API, use only internally
    • Copier handles SV4 cleanly
    • Remove old external sort
    • Managed sort does SVR internally so exposes SV2 or none
    • What other operators produce, consume SV4?
  • Migrate to batch accessor (see the sketch after this list)
    • Add a batch accessor to RecordBatch as a wrapper around itself.
    • Extend it to hold the container and SVs; RecordBatch methods defer to the batch accessor
    • Gradually replace direct RecordBatch calls with calls to the batch accessor
  • ColumnMetadata wrapper around a MaterializedField
  • TupleMetadata wrapper around a BatchSchema
  • VectorBundle to hold a set of vectors
    • In operators that use ad-hoc structures, replace them with the VectorBundle
    • VectorBundle wrapper around a VectorAccessible or VectorContainer
    • In operators that work with a VectorContainer, change to use VectorBundle instead
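
A sketch of the first batch-accessor migration step: an accessor that wraps a RecordBatch and forwards to it. The wrapper and its method names are illustrative; the real BatchAccessor interface differs:

```java
class RecordBatchAccessor {
  private final RecordBatch batch;

  RecordBatchAccessor(RecordBatch batch) {
    this.batch = batch;
  }

  // Forward the batch-related methods; the operator machinery stays behind.
  BatchSchema schema() { return batch.getSchema(); }
  int rowCount()       { return batch.getRecordCount(); }
  // The container and SVs would be exposed the same way as migration proceeds.
}
```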