
DRILL-5688: Add repeated map support to column accessors #887

Closed
wants to merge 10 commits into from

Conversation

paul-rogers
Contributor

Restructures the existing "column accessor" code to adopt a JSON-like structure that works for all of Drill's data types. This PR focuses on the "repeated map" vector. (Still to come is support for repeated lists, but they fit into the revised JSON structure.)

This PR has four commits that highlight different parts of the changes:

  • The core accessors themselves
  • Changes to vector classes along with a new "tuple metadata" class
  • Revisions to the "row set" test framework which uses, and tests, the accessors
  • "Collateral damage" changes that pick up changes to the row set classes and add a number of small test framework improvements.

Accessors

The accessor structure is explained in package-info.java files in the accessor packages. Basically, the structure is:

  • The accessor types are: tuple, array and scalar
  • A tuple is a set of (name, type) pairs
  • Maps and rows are both tuples
  • Arrays are a series of one of the three types

The accessors add an "object" layer that represents any of the three types. So, a tuple is really a list of (name, object accessor) pairs, where the object accessor provides access to a scalar, an array or a tuple as appropriate for each column.
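A minimal sketch of this object layer, using hypothetical names rather than Drill's actual interfaces, might look like the following (scalars are int-only for brevity):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessorSketch {
  public enum ObjectType { SCALAR, TUPLE, ARRAY }

  // The "object" layer: one interface that can stand for any of the three kinds.
  public interface ObjectReader {
    ObjectType type();
    default int scalar() { throw new UnsupportedOperationException(); }
    default ObjectReader member(String name) { throw new UnsupportedOperationException(); }
  }

  // A leaf scalar accessor.
  public static ObjectReader scalarOf(int value) {
    return new ObjectReader() {
      public ObjectType type() { return ObjectType.SCALAR; }
      public int scalar() { return value; }
    };
  }

  // A tuple: a set of (name, object accessor) pairs.
  public static ObjectReader tupleOf(Map<String, ObjectReader> members) {
    return new ObjectReader() {
      public ObjectType type() { return ObjectType.TUPLE; }
      public ObjectReader member(String name) { return members.get(name); }
    };
  }

  public static int demo() {
    Map<String, ObjectReader> map = new LinkedHashMap<>();
    map.put("b", scalarOf(20));
    Map<String, ObjectReader> row = new LinkedHashMap<>();
    row.put("a", scalarOf(10));
    row.put("m", tupleOf(map));      // a map column is itself a tuple
    ObjectReader rowReader = tupleOf(row);
    return rowReader.member("m").member("b").scalar();
  }
}
```

Note how a map column nests naturally: the row is a tuple, and the "m" member is itself a tuple, giving the JSON-like shape described above.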

The structure appears complex (since it must model JSON), but an app using this code would use just the leaf scalar readers and writers. These classes currently access data via the value vector Mutator and Accessor classes. The goal, however, is to eventually access the Netty PlatformDependent methods directly so that there is a single layer between the application and the call into direct memory. (Today there are multiple layers.)

There is quite a bit of code change here to provide the new structure, but the core functionality of reading and writing vectors has not changed much. This code also has extensive unit tests, which should avoid the need to "mentally execute" each line of code.

Supporting Classes

A new TupleMetadata class is a superset of the existing BatchSchema, but also provides "classic" tuple-like access by position or name. Eventually, this will also hold additional information such as actual size and so on (information now laboriously rediscovered by the "record batch sizer.") Since the accessors use a "tuple" abstraction to model both rows and maps, the tuple metadata provides this same view. The top-most tuple models the row. Columns within the row can be maps, which have their own tuple schema, and so on.
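The "classic" tuple-like access by position or name can be illustrated with a small sketch (the class and field names here are hypothetical, not the actual TupleMetadata API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TupleMetadataSketch {
  public static class ColumnMetadata {
    public final String name;
    public final String type;
    public ColumnMetadata(String name, String type) { this.name = name; this.type = type; }
  }

  // A tuple schema: columns retrievable by position or by name.
  public static class TupleSchema {
    private final List<ColumnMetadata> columns = new ArrayList<>();
    private final Map<String, Integer> index = new HashMap<>();

    public void add(ColumnMetadata col) {
      index.put(col.name, columns.size());
      columns.add(col);
    }
    public ColumnMetadata column(int i) { return columns.get(i); }
    public ColumnMetadata column(String name) { return columns.get(index.get(name)); }
    public int size() { return columns.size(); }
  }

  public static String demo() {
    TupleSchema row = new TupleSchema();
    row.add(new ColumnMetadata("id", "INT"));
    row.add(new ColumnMetadata("name", "VARCHAR"));
    // Access by position and by name are interchangeable.
    return row.column(1).name + ":" + row.column("id").type;
  }
}
```

A map column would carry its own nested TupleSchema, mirroring the row/map recursion described above.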

TupleNameSpace moves locations (so it can be used in the vector package) but otherwise remains unchanged.

DrillBuf provides an experimental putInt() method that does bounds checking and sets a value, to minimize calls. This will probably move into the writer in a later PR.
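The combined bounds-check-and-set idea can be sketched as follows, using a heap ByteBuffer as a stand-in for DrillBuf (the real method's signature and failure behavior may differ):

```java
import java.nio.ByteBuffer;

public class BoundsCheckedPut {
  private final ByteBuffer buf;

  public BoundsCheckedPut(int capacity) { buf = ByteBuffer.allocate(capacity); }

  // One call that both bounds-checks and writes, instead of separate
  // check and set calls across layers.
  public boolean putInt(int offset, int value) {
    if (offset < 0 || offset + Integer.BYTES > buf.capacity()) {
      return false;                 // caller handles overflow or reallocation
    }
    buf.putInt(offset, value);
    return true;
  }

  public int getInt(int offset) { return buf.getInt(offset); }
}
```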

This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along Decimal scale and precision. See RepeatedValueVectors.java.

MaterializedField gains an isEquivalent() method to compare two fields, ignoring internal ($offset$, $bits$, etc.) vectors.

Row Set Classes and Tests

The RowSet family of classes changed in response to the accessor changes.

  • The reader and writer are moved to separate files.
  • Row sets now use a "parsed" form of "storage" classes to hold vectors (more below).
  • Static factory methods were added to hide constructor complexity.
  • The RowSetBuilder and RowSetComparison test tools added support for repeated maps.
  • Code to handle generic object writing moved from the RowSetBuilder into the accessors.
  • The old RowSetSchema evolved to become the TupleMetadata mentioned above.
  • Tests were greatly enhanced to test all modes of all supported scalar types, as well as the new JSON-like structure.

In the previous version, the row set classes had complex logic to figure out what kind of accessor to create for each vector. This became overly complex. In this version, the row set "parses" a vector container to create "storage" objects that represent tuples and columns. A column can, itself, be a tuple. (Note: there is no need to model lists since lists are just vectors at this level of abstraction, so need no special handling.)

With this change, accessor creation is a simple matter of walking a tree to assemble the JSON-structure.

This structure is also used to create a batch's vectors from a schema.
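The "parse, then walk" idea can be sketched with a toy storage tree (illustrative names, not Drill's actual classes): a column is either a leaf vector or a tuple of child columns, and assembling accessors is a simple recursive walk.

```java
import java.util.List;

public class StorageTreeSketch {
  public interface ColumnStorage { String buildAccessor(); }

  // A leaf column backed by a single vector.
  public static class PrimitiveStorage implements ColumnStorage {
    private final String name;
    public PrimitiveStorage(String name) { this.name = name; }
    public String buildAccessor() { return "scalar(" + name + ")"; }
  }

  // A tuple column (a row or a map) with child columns.
  public static class TupleStorage implements ColumnStorage {
    private final List<ColumnStorage> children;
    public TupleStorage(List<ColumnStorage> children) { this.children = children; }
    public String buildAccessor() {
      StringBuilder sb = new StringBuilder("tuple(");
      for (int i = 0; i < children.size(); i++) {
        if (i > 0) sb.append(", ");
        sb.append(children.get(i).buildAccessor());  // recursive walk
      }
      return sb.append(")").toString();
    }
  }

  public static String demo() {
    ColumnStorage row = new TupleStorage(List.of(
        new PrimitiveStorage("a"),
        new TupleStorage(List.of(new PrimitiveStorage("b")))));
    return row.buildAccessor();
  }
}
```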

Other Changes

The last commit contains various other changes, mostly reflecting the changes above.

Paul Rogers added 8 commits July 26, 2017 22:03
Includes the core JSON-like reader and writer interfaces and
implementations.
Includes changes to value vectors, DrillBuf and other low-level classes.
Modifications to the row set abstraction (used for testing) for the
changed accessors. Row sets also act as tests for the accessor classes,
including a number of tests that test the classes used for testing.
(Yes, somewhat recursive…)
Changes to unit tests, and the unit test framework, required by the
changes to the accessor and row set classes.
@paul-rogers
Contributor Author

This set of commits provides performance tuning of the column writer implementation. Discussion during the code review identified a number of tuning opportunities that led to this work. The commits divide the files by layer:

  • Value vectors
  • Column accessors
  • Row set classes used for testing
  • Secondary files affected by the above changes

The result is a very nice performance improvement relative to the original vector mutator methods. The writers now provide:

  • 42% faster writes for required values
  • 73% faster writes for nullable values
  • 408% (4x) faster writes for repeated values

Not a bad outcome from a casual code review conversation!

Merge Vector Overflow into Column Writers

DRILL-5688 and DRILL-5657 originally separated the tasks of detecting and handling vector overflow. (Vector overflow occurs when we are about to write past the maximum buffer size.) The result was that each “set” operation needed to traverse two layers: the “column loader” which handles overflow, and the “column writer” which writes the value to a vector.

Discussion identified an opportunity to combine these into a single layer. The revised column accessors in this PR achieve that: each accessor detects overflow, calls a handler to deal with it, then continues the write operation into the (presumably) new vector.

The “result set loader” (DRILL-5657) abstraction contains the logic to handle overflows. The original “row set” abstractions for tests don’t handle overflow; instead they throw an index out-of-bounds exception. (Tests seldom write enough data to trigger an overflow unless deliberately testing the overflow feature.)
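A sketch of the single-layer overflow handling described above (names, the handler interface, and the tiny vector size are all illustrative):

```java
public class OverflowWriterSketch {
  // Supplied by the "result set loader" layer; returns the replacement vector.
  public interface OverflowHandler { int[] onOverflow(); }

  public static final int MAX_SIZE = 4;   // tiny "vector" limit for illustration
  public int[] vector = new int[MAX_SIZE];
  public int writeIndex;
  private final OverflowHandler handler;

  public OverflowWriterSketch(OverflowHandler handler) { this.handler = handler; }

  // One layer does it all: detect overflow, let the handler swap in a new
  // vector, then continue the write into the (presumably) new vector.
  public void setInt(int value) {
    if (writeIndex >= vector.length) {
      vector = handler.onOverflow();
      writeIndex = 0;
    }
    vector[writeIndex++] = value;
  }
}
```

A test-only "row set" variant would simply pass a handler that throws, matching the behavior described above.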

Bypass the Vector Mutator

The big performance improvement was to avoid the overhead of the original vector mutator. In the original code, each write travels through many layers of code:

  • Vector mutator
  • DrillBuf
  • UDLE
  • Netty platform dependent
  • Java Unsafe

The revised column writers bypass the first three layers and instead use the Netty PlatformDependent methods to write to vector memory. In this, they follow the text reader implementation, which does something similar.
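The layer bypass can be illustrated with a writer that holds the low-level buffer directly, so each write is a single hop to memory. Here a direct ByteBuffer stands in for the Netty PlatformDependent calls; this is a sketch, not Drill's actual writer code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DirectWriterSketch {
  // The writer owns the low-level buffer, rather than going through
  // vector -> DrillBuf -> UDLE -> PlatformDependent on every write.
  private final ByteBuffer memory =
      ByteBuffer.allocateDirect(64).order(ByteOrder.LITTLE_ENDIAN);

  public void writeIntAt(int index, int value) {
    memory.putInt(index * Integer.BYTES, value);  // one hop to memory
  }

  public int readIntAt(int index) {
    return memory.getInt(index * Integer.BYTES);
  }
}
```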

Specialized Offset Vector Writer

For arrays (repeated values) and variable-width values, the original code is quite inefficient: it must do a direct memory read, plus two writes, for each value. The revised writer code for variable-width fields does two writes. For arrays, the writer code does just one write per value and one write per row. (This is why the writer code is so much faster: with five values per row, the test does 15 direct memory I/Os in the original code but just 6 in the new writer code.)

A new OffsetVectorWriter class manages the details of writing to the offset vector.
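The offset-vector bookkeeping can be sketched like this (a toy fixed-capacity version; the real OffsetVectorWriter differs). The key point is that offsets[i+1] marks the end of value i, so each value costs one data write plus one offset write, with no read-modify-write of the offsets:

```java
public class OffsetVectorSketch {
  public final byte[] data = new byte[64];
  public final int[] offsets = new int[16];   // offsets[0] is always 0
  public int valueCount;
  private int writePos;

  public void writeBytes(byte[] value) {
    System.arraycopy(value, 0, data, writePos, value.length);
    writePos += value.length;
    offsets[++valueCount] = writePos;  // single offset write per value
  }

  public String valueAt(int i) {
    // Value i spans [offsets[i], offsets[i+1]) in the data buffer.
    return new String(data, offsets[i], offsets[i + 1] - offsets[i]);
  }
}
```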

Highly-Optimized Writer Code

The unfortunate side-effect of the optimization is that the writers are no longer simple. The previous version (which should be reviewed first) delegated work to the vector mutator. The new version provides an integrated set of code that:

  • Does a single bounds check per value to
  • Check whether the vector size must increase, and if so
  • Check whether the growth would cause overflow
  • Also check whether the “fill empties” logic is needed to fill in missing values

On the other hand, it became clear that the writers could be simplified. The previous version, because it worked with the mutators in each vector class, needed a separate writer for required, optional and repeated vectors. The new code, because it works directly with Netty’s platform dependent layer, can use a single writer for all three data modes.

Now, one might note that the single implementation has some very slight overhead:

  • The nullable vector need not fill empty “data” values (null values can be garbage, empty filling is really only needed for the “bits” vector)
  • Repeated writers need not fill empty values (there may be empties in the offset vector, but not the data vector.)

However, the cost is a single comparison, so it did not seem worth the effort to generate three complex classes to avoid a single comparison. (The performance test results seem to back up this assumption.)
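The fill-empties portion of that single check might look like this sketch (illustrative only; the real writers fold the size and overflow checks into the same comparison):

```java
public class FillEmptiesSketch {
  public final int[] vector = new int[16];
  public int lastWriteIndex = -1;

  public void setInt(int index, int value) {
    // The one comparison: has the caller skipped past some row indexes?
    if (index > lastWriteIndex + 1) {
      for (int i = lastWriteIndex + 1; i < index; i++) {
        vector[i] = 0;              // fill skipped ("empty") slots
      }
    }
    vector[index] = value;
    lastWriteIndex = index;
  }
}
```

A required, nullable, or repeated variant would all share this code path, differing only in whether the fill loop ever has work to do.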

Because of the above, the ColumnAccessors.java template has become a bit more complex (as if it were not already plenty complex). It is best to look at the generated ColumnAccessors class, and its nested classes, to see the result of the template.

Non-zeroing Vector Reallocation

The existing reAlloc() methods double the vector size and zero-fill the upper half of the new vector. This forces the setSafe() methods to call reAlloc() in a loop, since the vector may need to quadruple or more.

The revised column writer code avoids these two performance penalties by

  • Computing the required new vector size so it can be allocated in one go and
  • Skipping the zeroing step.

Existing code continues to use the original implementation.
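The one-shot size computation can be sketched as follows (illustrative, not the actual Drill code): the required power-of-two capacity is computed up front, so the allocation happens once instead of inside a doubling loop, and the upper region is simply left to be overwritten by later writes rather than zero-filled.

```java
public class ReallocSketch {
  // Compute the power-of-two capacity needed to hold `required` entries,
  // starting from the current capacity, in one pass.
  public static int newCapacity(int current, int required) {
    int size = Math.max(current, 1);
    while (size < required) {
      size <<= 1;   // still doubling, but computed before allocating
    }
    return size;
  }
}
```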

Collateral Code Cleanup

Because vector overflow is now integrated into the writers, there is no longer a need to throw the prior VectorOverflowException, which is now removed. The code changes in prior commits to handle this exception are removed in this commit.

DRILL-5517 added “size aware” methods to value vectors to support the earlier version of the writers which used the vector mutators. Since that logic is now integrated into the writers themselves, this commit backs out those changes to tidy up the value vector code.

Previous versions of the row set writers flattened maps for easier creation of test data in the RowSetBuilder. Earlier commits changed this behavior to require an array of values to populate a map. A number of tests needed updating as a result.

Retrofitted a couple of tests to use the SubOperatorTest base class.

Split the MapWriter into two distinct base classes so that the code can set the value count, which is implemented in two different classes without a common parent class method definition.

Earlier versions had a way to set every column type from an int value for testing. Moved this code around to make use of the generic object-based set methods now available.

Added a PerformanceTool used to compare writer performance to the original vector & mutator implementations.

Added a TestFillEmpties test which writes every fifth value to every data type and every mode to ensure that the “fill empties” logic works for all. This also has the handy side-effect of verifying null handling and basic scalar writer behavior.

Readers

A logical question is, “if the column writers benefited from optimization, what about the column readers?” Writers are the focus of the “memory fragmentation” project which will replace the low-level vector writers in various operators.

For now, readers are only used in test code - which is not performance critical. We can apply the same technique to readers when they find themselves on the critical performance path.

Paul Rogers added 2 commits July 31, 2017 15:33
The first cut at initializing the offset vector had a bug when the
vector was not previously initialized. This commit fixes that problem.
@paul-rogers
Contributor Author

Closing as this PR is now superseded by #914.
