
(Submitted as a PR: /vector/src/main/java/org/apache/drill/exec/vector/accessor/*.md, content split into several files.)

The code for the mechanisms described here is complete and unit tested, except for the JSON issues described later. Some of the code is merged into Apache master; the rest remains in a private branch pending review. Knowing where the code lives will allow you to follow along in the code as we discuss the concepts.

It can be hard to understand complex code. To make doing so easier, we offer three paths:

  • This set of notes (plus additional material in the JIRA).
  • Comments in the code.
  • Unit tests which illustrate how to use each component.

The code itself has many comments and explanations; it is the best resource to understand the detailed algorithms and approaches.

Code in Master

The following pull requests have already been merged into master.

DRILL-5688: Add repeated map support to column accessors

DRILL-5657: Size-aware vector writer structure - Provides the basics of the "result set loader" layer, but excludes list, repeated list and union support. (A brief usage sketch of this layer appears below.)

DRILL-6049: Misc. hygiene and code cleanup changes - Includes all the non-core changes from the project to separate out non-core from core changes.
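
To make the "result set loader" layer from DRILL-5657 a bit more concrete, the following is a minimal write-path sketch. It is illustrative only: the class names (ResultSetLoaderImpl, RowSetLoader, ScalarWriter), package paths, and the schema setup are approximations of the accessor API described in these notes; consult the code and its unit tests for the exact signatures.

```java
// Illustrative sketch of the write path through the result set loader.
// Package paths and method names are approximate; see the accessor code
// and its unit tests for the real API.
import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
import org.apache.drill.exec.physical.rowSet.RowSetLoader;
import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl;
import org.apache.drill.exec.record.VectorContainer;

public class LoaderSketch {

  public VectorContainer loadBatch(BufferAllocator allocator) {
    // The loader owns the vectors and enforces the batch size limit.
    ResultSetLoader rsLoader = new ResultSetLoaderImpl(allocator);
    RowSetLoader writer = rsLoader.writer();
    // Schema setup is omitted: columns "a" and "b" are assumed to have been
    // declared via loader options or by adding columns through the writer.

    rsLoader.startBatch();
    while (!writer.isFull()) {        // The loader decides when the batch is full.
      writer.start();                 // Begin a row.
      writer.scalar("a").setInt(10);  // Column writers hide all vector details.
      writer.scalar("b").setString("some data");
      writer.save();                  // Commit the row.
    }
    return rsLoader.harvest();        // Hand the completed batch downstream.
  }
}
```

The key point is that a reader written against this layer never touches value vectors or worries about memory: the loader tells the writer when the batch is full, then hands back a completed vector container.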

Active PRs

None at the moment. Work in progress is a rebase, followed by a PR for metadata updates.

Private Branch

The bulk of the mechanism described in these notes still resides in a private branch on GitHub.

Commit 7b36208f0d1040a994da9214f3ed2c4ae5e0c748 is a roll-up of most of the work done in the five months since DRILL-5657 was issued.

Commits from 18ad03577811b748c253001ff3e2bea18b440542 onward are new work to add repeated list support for the JSON reader in the new mechanism.

Commits between these two represent the work to factor out non-core code into the DRILL-6049 PR.

Pull Request Plan

The work is large. Experience shows that PRs work best when they are compact. (DRILL-5657, for example, took about five months to review -- far too long, because the PR was far too large.) To better match the Drill PR process, the following is one possible PR plan, moving from the bottom up:

  • Metadata revisions
  • Column writers (minor revisions; excludes the advanced types)
  • Advanced column writers (Union, List, Repeated List)
  • Column readers (major changes: simplified structure, single/SV4 vector accessors)
  • Revision to the Result Set Loader layer (bug fixes, incorporating the above)
  • Operator framework
  • Scan operator
  • Scan framework (incorporates the projection revisions)
  • CSV and Mock data source revisions
  • JSON reader revision (assumes resolution of the design issues in DRILL-6035)

The code is integrated: each layer depends on the one below (though there are no upward dependencies). So, some temporary development work may be required to break those dependencies and allow the split above.
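
For orientation, the column reader layer in the plan above mirrors the writer side: rows and columns are presented through reader objects rather than through direct vector access. The sketch below is a rough illustration, assuming reader names (RowSetReader, scalar(), getInt()) along the lines of the column accessor work; the actual classes and packages may differ.

```java
// Rough sketch of the read side of the column accessors. RowSet, RowSetReader
// and the scalar() accessors are approximations of the API described here.
import org.apache.drill.test.rowSet.RowSet;
import org.apache.drill.test.rowSet.RowSetReader;

public class ReaderSketch {

  public void printRows(RowSet rowSet) {
    RowSetReader reader = rowSet.reader();
    while (reader.next()) {                       // Advance one row at a time.
      int a = reader.scalar("a").getInt();        // Typed column access...
      String b = reader.scalar("b").getString();  // ...with no vector code.
      System.out.println(a + ", " + b);
    }
  }
}
```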

Experience suggests that the best way to proceed, for each PR, is to:

  • Identify a set of files for a new PR.
  • Create a new branch for that PR off of master.
  • Use git checkout to grab each file, one by one, from RowSetRev3.
  • Build.
  • Fix issues, such as missing dependencies.
  • Run pre-commit tests.
  • Issue the PR.

Design Approach

To better understand the code, it helps to understand the design philosophy behind the code. Here are a few guiding principles.

  • Design for the long term. Determine what we'll need in a production system and work toward that. Avoid short-term hacks.
  • Minimize functional change. Our goal is to limit batch size by upgrading certain parts of Drill. We tried to minimize user-visible changes because such changes, even when useful, require extensive discussion and vetting. Better to make the core changes in one project (this one), then do a later project to roll out user-visible improvements (such as those to improve JSON).
  • Let the code tell us what works. The project tried various approaches. The one that was cleanest, simplest, and easiest to test won out.
  • Keep it simple. Complex, deeply nested functions, classes that do many things, and state maintained by many flags are all paths to chaos. Instead, this project used small, focused classes. Composition is used to build up complex functionality from simple pieces. Clean APIs are preferred over complex coupling of components.
  • Test. Unit testing is an integral part of the project. Every component is wrapped in tests. It is not quite TDD (test-driven development), but it is absolutely of the form "write a bit of code, test it, write a bit more and repeat." (A sample of this test style appears after this list.)
  • Loose Coupling. Each component is loosely coupled to others. This allows unit testing, but it also makes it much easier to reason about each component.
  • Document. This wiki. The specs. Extensive code comments. They are not just for you, the person who needs to understand the code. They also served as a way to think through the design issues when writing the code.
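
As a concrete illustration of the Test and Loose Coupling principles, below is a hedged sketch of the style of unit test used throughout this work: build an expected row set, exercise a component in isolation, and compare the results. The names (SchemaBuilder, RowSetBuilder, RowSetComparison, SubOperatorTest) follow the row set test utilities that accompany this project, but treat the packages and signatures as approximate.

```java
// Hedged sketch of a component-level unit test in the style used throughout
// this project; package paths and some method names are approximate.
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.TupleMetadata;
import org.apache.drill.test.SubOperatorTest;
import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
import org.apache.drill.test.rowSet.RowSetBuilder;
import org.apache.drill.test.rowSet.RowSetComparison;
import org.apache.drill.test.rowSet.schema.SchemaBuilder;
import org.junit.Test;

public class ExampleComponentTest extends SubOperatorTest {

  @Test
  public void testTwoColumnBatch() {
    // Declare the schema the component under test should produce.
    TupleMetadata schema = new SchemaBuilder()
        .add("a", MinorType.INT)
        .add("b", MinorType.VARCHAR)
        .buildSchema();

    // Build the expected results as a row set.
    SingleRowSet expected = new RowSetBuilder(fixture.allocator(), schema)
        .addRow(10, "fred")
        .addRow(20, "wilma")
        .build();

    // Exercise the component under test in isolation (details omitted); for
    // illustration, the "actual" batch is simply built the same way here.
    SingleRowSet actual = new RowSetBuilder(fixture.allocator(), schema)
        .addRow(10, "fred")
        .addRow(20, "wilma")
        .build();

    // Compare row by row and column by column, then release the buffers.
    new RowSetComparison(expected).verifyAndClearAll(actual);
  }
}
```

Because each component is loosely coupled, tests of this kind exercise a single layer directly, without a Drillbit, a query plan, or a real data source.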

Status

At present, the scheduled work is complete. An earlier version, which converted only the CSV (really, compliant text) reader, passed the full pre-commit test suite. The current version has not been run against pre-commit because of known issues with JSON (the new code does not support unions, for reasons discussed later).

The code has a robust set of unit tests: each component is exercised heavily. While more cases can always be added, the unit tests have proven their value. Once a module passes its unit tests, bugs are seldom found when the module is used at a higher level. Instead, what usually turns up are design changes required as additional use cases are discovered. (For example, the projection mechanism evolved constantly as it went from flat schemas, to the columns[] array used for CSV, to implicit columns, to maps, to maps of arrays, and so on. But at each stage, unit tests ensured that the new functionality worked.)

Indeed, a key challenge in this project has been the gradual uncovering of dusty corners of Drill functionality. In many areas, the original purpose is lost to time: the original developers have left, leaving behind no designs, documentation, or code comments. The result is that the code is its own spec, and several times "unnatural acts" have been performed to maintain compatibility with cryptic design decisions. There is an opportunity for the team to understand, revisit, and perhaps rationalize some of these decisions. (We'll discuss a few where they apply.)

Most layers have undergone repeated evolution and refinement. Perhaps the one area that could use a bit more refinement is the scan framework: it works, it satisfies its requirements, but it is just a tad awkward.
