Fast Decimal Type
Decimal literals should be represented as decimals (as opposed to `DOUBLE` currently) in order to be compliant with other databases and the SQL standard. To make that happen, decimal arithmetic has to be fast enough that there is no significant performance gap compared to the `BIGINT` or `DOUBLE` types.
To visualise the performance problem, consider the following examples:

- `a_bigint >= 2.5` => returns `boolean`
- `a_bigint >= 2.5 * b_bigint` => returns `boolean`
- `sqrt((a_bigint * 2.5)^3)` => returns `double`
- `a_decimal_5_2 * 2.5 <= b_bigint` => returns `boolean`
There are a number of issues which have to be addressed given that fixed point literals become `DECIMAL`s (currently they are interpreted as `DOUBLE`s):

- Doing fast comparisons where one side is `BIGINT` and the other `DECIMAL`. The problem is that `BIGINT` should be coerced to `DECIMAL(19, 0)`. However, decimals with precision `19` cannot always be represented by the `long` Java primitive (see the sketch after this list);
- Performing fast calculations when both `BIGINT`/`DOUBLE` and `DECIMAL` are involved. `BIGINT` would be cast to `DECIMAL(19, 0)` instead of `DOUBLE`. The performance degradation between `DOUBLE` and `DECIMAL` arithmetic should be as small as possible;
- Performance of small decimals (`precision <= 19`) should be comparable to the performance of the Java `long` or `double` primitives;
- Performance of large decimals (`precision >= 19`) should also be as good as possible (the same order of magnitude as for small decimals).
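A quick worked illustration of the coercion problem above: `Long.MAX_VALUE` is `9223372036854775807`, which has 19 digits, yet the largest unscaled value of `DECIMAL(19, 0)` is `9999999999999999999`. The snippet below is a minimal, self-contained demonstration, not Presto code:

```java
import java.math.BigInteger;

public class Precision19Demo
{
    public static void main(String[] args)
    {
        // Largest long value: 9223372036854775807 (19 digits)
        BigInteger longMax = BigInteger.valueOf(Long.MAX_VALUE);

        // Largest unscaled value of DECIMAL(19, 0): nineteen nines
        BigInteger decimal19Max = new BigInteger("9999999999999999999");

        // Some DECIMAL(19, 0) values exceed the long range, so a long
        // cannot represent every decimal of precision 19
        System.out.println(decimal19Max.compareTo(longMax) > 0); // true
    }
}
```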
Given those requirements, these are the proposed steps for implementing a fast `DECIMAL`:
- Add support in Presto for an output `Slice` argument in scalar operators. An example operator could be `decimalSum` with the signature `void decimalSum(Slice a, Slice b, Slice result)`. In that case the operator doesn't have to instantiate a `Slice` as a result. In this approach, a set of temporary accumulator variables `a1`, ..., `aN` would be allocated before processing rows. Those would store intermediate calculation results. For instance, the sum of columns `column1`, `column2`, ..., `columnN` could be calculated as `decimalSum(column1, column2, accumulator); decimalSum(column3, accumulator, accumulator); ...; decimalSum(columnN, accumulator, accumulator);`. `accumulator` would then store the total sum (see the sketch after this list).
- Implement large decimal (`precision >= 19`) arithmetic that is specifically designed for 128-bit decimals. Such arithmetic would work directly on `Slice`s. It wouldn't instantiate temporary objects or use Java `BigInteger`. A POC of such arithmetic is available here: https://github.com/Teradata/presto/commit/6aa867720dfb6e5b30ccda143442d76e9d84b621 and is based on Hive's Decimal128 (https://github.com/prongs/apache-hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java).
- Have a dual `Slice`-based representation of decimals with `precision >= 19`. The idea is that small decimal values can be represented using a `long` primitive. For larger decimals (up to 128 bits), four integers with dedicated arithmetic can be used (see the previous point). This allows fast, `long`-based computations for smaller decimals.
- (Optional, depending on performance) Have specific operators for `BIGINT` and `DECIMAL` logical and/or arithmetic operations.
- (Optional) Rewrite expressions, e.g. `a_bigint >= 2.5` could be rewritten to `a_bigint > 2`.
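Below is a minimal sketch of the output-`Slice` idea for small (`long`-backed) decimals, using Airlift's `Slice`. The `decimalSum` name and signature come from the description above; the single-unscaled-`long` layout and the driver code are illustrative assumptions, not the actual Presto operator:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class DecimalSumExample
{
    // Operator with an output Slice argument: the sum of the unscaled long
    // values in `a` and `b` is written into `result`, so no new Slice has
    // to be instantiated per operation. The single-long layout is an
    // assumption for illustration only.
    static void decimalSum(Slice a, Slice b, Slice result)
    {
        result.setLong(0, Math.addExact(a.getLong(0), b.getLong(0)));
    }

    public static void main(String[] args)
    {
        // Three "columns", each holding one unscaled decimal value
        Slice column1 = Slices.allocate(Long.BYTES);
        Slice column2 = Slices.allocate(Long.BYTES);
        Slice column3 = Slices.allocate(Long.BYTES);
        column1.setLong(0, 150); // e.g. 1.50 as a scale-2 decimal
        column2.setLong(0, 275); // 2.75
        column3.setLong(0, 25);  // 0.25

        // The accumulator is allocated once, before processing rows
        Slice accumulator = Slices.allocate(Long.BYTES);

        decimalSum(column1, column2, accumulator);
        decimalSum(column3, accumulator, accumulator);

        System.out.println(accumulator.getLong(0)); // 450, i.e. 4.50
    }
}
```

The key point is that `accumulator` is reused across calls, so summing `N` columns performs no per-operation allocation.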
Concepts behind those proposals were investigated and benchmarked. The benchmark code is available here: https://github.com/Teradata/presto/commit/e1dac35111cdc3cab807c763b7e43c0c2ed36f4e. The assumptions are the following:

- `N` rows contain `1, 10, 100` columns. Each column is stored in multiple formats (all wrapped in `Slice`): a `long` primitive, a `long` primitive with an additional byte flag, a `double` primitive, four integers (the fast large decimal POC), and a Java `BigInteger` based decimal.
- Column data is stored in a `Slice` to simulate the Presto workflow (e.g. extracting a value from a `Block` or reading it from a hard drive); see the sketch after this list.
- Each benchmark computes the sum of all columns for each row. The result is stored in a `Slice` to simulate the Presto workflow (e.g. storing a value using a `BlockBuilder` or writing it to disk).
- Column data is generated from the `extendedPrice` column of the `lineitem` TPCH table.
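As an illustration of that setup, a column of small decimals might be packed into a `Slice` at 8 bytes per row. This layout is an assumption for demonstration only, not necessarily what the linked benchmark code does:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public class ColumnInSlice
{
    public static void main(String[] args)
    {
        int numberOfRows = 10_000;

        // Pack one column of unscaled long decimals into a single Slice,
        // 8 bytes per row, simulating values extracted from a Block
        Slice column = Slices.allocate(numberOfRows * Long.BYTES);
        for (int row = 0; row < numberOfRows; row++) {
            column.setLong(row * Long.BYTES, row); // placeholder values
        }

        // Reading a row back is a single indexed access
        System.out.println(column.getLong(42 * Long.BYTES)); // 42
    }
}
```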
There are two ways of computing the column sums:

- `sum(a_column, sum(b_column, sum(c_column, ...)))` -> in this approach the result of one sum is the input to another sum. This works well with Java primitives, but causes overhead when the sum result is a `Slice`, because every sum operation has to return a new `Slice` instance.
- `sum(a_column, 0, result); sum(b_column, result, result); ...` -> in this approach the sum operator gets a reference to a result `Slice` variable where the result should be stored. This approach is best for `DECIMAL`s that work on `Slice`s because unnecessary objects are not created. Both styles are sketched below.
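The difference between the two styles can be sketched as follows; the `sum` helpers are hypothetical stand-ins for the benchmarked operators, assuming a single unscaled `long` per `Slice`:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class SumStylesExample
{
    // Style 1: each sum returns a new Slice, so every call allocates
    static Slice sum(Slice a, Slice b)
    {
        Slice result = Slices.allocate(Long.BYTES);
        result.setLong(0, a.getLong(0) + b.getLong(0));
        return result;
    }

    // Style 2: the caller supplies the result Slice; no allocation per call
    static void sum(Slice a, Slice b, Slice result)
    {
        result.setLong(0, a.getLong(0) + b.getLong(0));
    }

    public static void main(String[] args)
    {
        Slice a = Slices.allocate(Long.BYTES);
        Slice b = Slices.allocate(Long.BYTES);
        Slice c = Slices.allocate(Long.BYTES);
        a.setLong(0, 1);
        b.setLong(0, 2);
        c.setLong(0, 3);

        // Style 1: nested calls allocate a new Slice per sum
        Slice nested = sum(a, sum(b, c));

        // Style 2: a single preallocated result Slice is reused
        Slice result = Slices.allocate(Long.BYTES);
        sum(a, b, result);
        sum(c, result, result);

        System.out.println(nested.getLong(0) + " " + result.getLong(0)); // 6 6
    }
}
```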
Here are the benchmark results:
```
Benchmark (numberOfColumns) (numberOfRows) Mode Cnt Score Error Units
DecimalBenchmark.benchmarkAddDouble 100 10000 avgt 10 11009680,098 ± 136702,006 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 100 10000 avgt 10 10102069,534 ± 192094,824 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 100 10000 avgt 10 10000773,936 ± 314484,661 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 100 10000 avgt 10 12233633,786 ± 281106,921 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 100 10000 avgt 10 12851717,002 ± 649326,304 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 100 10000 avgt 10 27868271,539 ± 641941,350 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 100 10000 avgt 10 23306906,663 ± 906315,500 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 100 10000 avgt 10 45606653,390 ± 1319108,845 ns/op
DecimalBenchmark.benchmarkAddDouble 10 10000 avgt 10 2691757,513 ± 42252,904 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 10 10000 avgt 10 2294961,172 ± 45196,712 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 10 10000 avgt 10 2105299,290 ± 154218,157 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 10 10000 avgt 10 2468484,502 ± 40638,119 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 10 10000 avgt 10 2882945,564 ± 233256,605 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 10 10000 avgt 10 4018341,231 ± 78284,635 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 10 10000 avgt 10 4435123,345 ± 118476,732 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 10 10000 avgt 10 6486706,982 ± 125816,358 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 10 10000 avgt 10 25570694,178 ± 718172,389 ns/op
DecimalBenchmark.benchmarkAddDouble 2 10000 avgt 10 531261,601 ± 7925,572 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 2 10000 avgt 10 582688,160 ± 13036,141 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 2 10000 avgt 10 576614,844 ± 61817,158 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 2 10000 avgt 10 558155,381 ± 60957,820 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 2 10000 avgt 10 834472,680 ± 111274,151 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 2 10000 avgt 10 1368111,040 ± 40764,386 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 2 10000 avgt 10 1402975,001 ± 54223,765 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 2 10000 avgt 10 1658072,818 ± 20921,517 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 2 10000 avgt 10 5707279,463 ± 152062,044 ns/op
DecimalBenchmark.benchmarkAddDouble 1 10000 avgt 10 153647,332 ± 8232,550 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 1 10000 avgt 10 199451,630 ± 4178,584 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 1 10000 avgt 10 167010,986 ± 3828,009 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 1 10000 avgt 10 176952,276 ± 5064,431 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 1 10000 avgt 10 223903,387 ± 8515,021 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 1 10000 avgt 10 330760,579 ± 8886,068 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 1 10000 avgt 10 390820,057 ± 14403,589 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 1 10000 avgt 10 594768,346 ± 8167,884 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 1 10000 avgt 10 2527092,601 ± 85102,283 ns/op
```
Explanation of the benchmarks:

- `benchmarkAddExactBigInt` benchmarks operate on `long` Java primitives as inputs and results.
- `benchmarkAddExactSliceSmallDecimal` benchmarks operate on `long` Java primitives wrapped in `Slice` and passed as arguments. The result (accumulator) parameter is provided to the sum operator.
- `benchmarkAddExactSliceSmallDecimalWithCondition/Flag` benchmarks do additional checks on the input parameters. This simulates the dual representation of small and large decimals in a `Slice`. Condition checking means checking if the first bit of the extracted `long` primitive is set. Flag checking means checking an additional byte from the `Slice` (this requires one additional call to the `Slice.getByte()` method). Both variants are sketched below.
- `benchmarkAddExactSliceSmallDecimalNoAccumulator` simulates the case where a new `Slice` is returned as the result of the sum operator instead of using an accumulator.
- `benchmarkAddFastSliceDecimal` benchmarks the POC implementation of the fast large decimal.
- `benchmarkAddBigIntegerDecimal` benchmarks the `BigInteger` based decimal implementation.
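The two dual-representation checks could look roughly like this. The tag positions (the top bit of the first `long`, or a trailing flag byte) follow the description above, but the exact encodings here are assumptions for illustration:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class DualRepresentationExample
{
    // "Condition" variant: the top bit of the first long tags the
    // representation (bit layout is an assumption for illustration)
    static long addSmallWithCondition(Slice a, Slice b)
    {
        long rawA = a.getLong(0);
        long rawB = b.getLong(0);
        if ((rawA & Long.MIN_VALUE) != 0 || (rawB & Long.MIN_VALUE) != 0) {
            throw new UnsupportedOperationException("large decimal (four ints) path not shown");
        }
        return rawA + rawB;
    }

    // "Flag" variant: an extra byte after the long tags the representation,
    // which costs one additional Slice.getByte() call
    static long addSmallWithFlag(Slice a, Slice b)
    {
        if (a.getByte(Long.BYTES) != 0 || b.getByte(Long.BYTES) != 0) {
            throw new UnsupportedOperationException("large decimal (four ints) path not shown");
        }
        return a.getLong(0) + b.getLong(0);
    }

    public static void main(String[] args)
    {
        Slice a = Slices.allocate(Long.BYTES + 1);
        Slice b = Slices.allocate(Long.BYTES + 1);
        a.setLong(0, 100);
        b.setLong(0, 23);
        // flag bytes at offset 8 default to 0, i.e. the small representation

        System.out.println(addSmallWithCondition(a, b)); // 123
        System.out.println(addSmallWithFlag(a, b));      // 123
    }
}
```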
Conclusions:

- `benchmarkAddExactBigInt` and `benchmarkAddDouble` have performance comparable to `benchmarkAddExactSliceSmallDecimalWithCondition` and `benchmarkAddExactSliceSmallDecimal`. This confirms that it is possible to have a fast dual (as a `long` or four integers) representation of decimals in a `Slice`.
- `benchmarkAddExactSliceSmallDecimal` is about twice as fast as `benchmarkAddExactSliceSmallDecimalNoAccumulator`. This confirms that having an output `Slice` argument is beneficial.
- `benchmarkAddFastSliceDecimal` is 2-3 times slower than `benchmarkAddExactBigInt`, but about an order of magnitude faster than `benchmarkAddBigIntegerDecimal`, depending on the number of columns.