Fast Decimal Type
Decimal literals should be represented as decimals (as opposed to `DOUBLE` currently) in order to be compliant with other databases and the SQL standard. To make that happen, decimal arithmetic has to be fast enough that there is no significant performance gap compared to the `BIGINT` or `DOUBLE` types.
To visualise the performance problem, consider the following examples:

- `a_bigint >= 2.5` => returns `boolean`
- `a_bigint >= 2.5 * b_bigint` => returns `boolean`
- `sqrt((a_bigint * 2.5)^3)` => returns `double`
- `a_decimal_5_2 * 2.5 <= b_bigint` => returns `boolean`
There are a number of issues which have to be addressed given that fixed point literals become `DECIMAL`s (currently they are interpreted as `DOUBLE`s):

- Doing fast comparisons where one side is `BIGINT` and the other `DECIMAL`. The problem is that `BIGINT` should be coerced to `DECIMAL(19, 0)`. However, decimals with precision `19` cannot always be represented by the `long` Java primitive (see the sketch after this list);
- Performing fast calculations when both `BIGINT`/`DOUBLE` and `DECIMAL` are involved. `BIGINT` would be cast to `DECIMAL(19, 0)` instead of `DOUBLE`. The performance degradation between `DOUBLE` and `DECIMAL` arithmetic should be as small as possible;
- Performance of small decimals (`precision <= 19`) should be comparable to the performance of the Java `long` or `double` primitives;
- Performance of large decimals (`precision >= 19`) should also be as good as possible (the same order of magnitude as for small decimals).
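A quick worked illustration of the coercion problem above: `Long.MAX_VALUE` is `9223372036854775807`, which has 19 digits, yet the largest unscaled value of `DECIMAL(19, 0)` is `9999999999999999999`. The snippet below is a minimal, self-contained demonstration, not Presto code:

```java
import java.math.BigInteger;

public class Precision19Demo
{
    public static void main(String[] args)
    {
        // Largest long value: 9223372036854775807 (19 digits)
        BigInteger longMax = BigInteger.valueOf(Long.MAX_VALUE);

        // Largest unscaled value of DECIMAL(19, 0): nineteen nines
        BigInteger decimal19Max = new BigInteger("9999999999999999999");

        // Some DECIMAL(19, 0) values exceed the long range, so a long
        // cannot represent every decimal of precision 19
        System.out.println(decimal19Max.compareTo(longMax) > 0); // true
    }
}
```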
Given those requirements, these are the proposed steps for implementing a fast `DECIMAL`:
- Add support in Presto for an output `Slice` argument in scalar operators. An example operator could be `decimalSum` with the signature `void decimalSum(Slice a, Slice b, Slice result)`. In that case the operator doesn't have to instantiate a `Slice` as a result. In this approach, a set of temporary accumulator variables `a1`, ..., `aN` would be allocated before processing rows. Those would store intermediate calculation results. For instance, the sum of columns `column1`, `column2`, ..., `columnN` could be calculated as `decimalSum(column1, column2, accumulator); decimalSum(column3, accumulator, accumulator); ...; decimalSum(columnN, accumulator, accumulator);`. `accumulator` would then store the total sum (see the sketch after this list).
- Implement large decimal (`precision >= 19`) arithmetic that is specifically designed for 128-bit decimals. Such arithmetic would work directly on `Slice`s. It wouldn't instantiate temporary objects or use Java `BigInteger`. A POC of such arithmetic is available here: https://github.com/Teradata/presto/commit/6aa867720dfb6e5b30ccda143442d76e9d84b621 and is based on Hive's Decimal128 (https://github.com/prongs/apache-hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java).
- Have a dual `Slice`-based representation of decimals with `precision >= 19`. The idea is that small decimal values can be represented using a `long` primitive. For larger decimals (up to 128 bits), four integers with dedicated arithmetic can be used (see the previous point). This allows fast, `long`-based computations for smaller decimals.
- (Optional, depending on performance) Have specific operators for `BIGINT` and `DECIMAL` logical and/or arithmetic operations.
- (Optional) Rewrite expressions, e.g. `a_bigint >= 2.5` could be rewritten to `a_bigint > 2`.
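Below is a minimal sketch of the output-`Slice` idea for small (`long`-backed) decimals, using Airlift's `Slice`. The `decimalSum` name and signature come from the description above; the single-unscaled-`long` layout and the driver code are illustrative assumptions, not the actual Presto operator:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class DecimalSumExample
{
    // Operator with an output Slice argument: the sum of the unscaled long
    // values in `a` and `b` is written into `result`, so no new Slice has
    // to be instantiated per operation. The single-long layout is an
    // assumption for illustration only.
    static void decimalSum(Slice a, Slice b, Slice result)
    {
        result.setLong(0, Math.addExact(a.getLong(0), b.getLong(0)));
    }

    public static void main(String[] args)
    {
        // Three "columns", each holding one unscaled decimal value
        Slice column1 = Slices.allocate(Long.BYTES);
        Slice column2 = Slices.allocate(Long.BYTES);
        Slice column3 = Slices.allocate(Long.BYTES);
        column1.setLong(0, 150); // e.g. 1.50 as a scale-2 decimal
        column2.setLong(0, 275); // 2.75
        column3.setLong(0, 25);  // 0.25

        // The accumulator is allocated once, before processing rows
        Slice accumulator = Slices.allocate(Long.BYTES);

        decimalSum(column1, column2, accumulator);
        decimalSum(column3, accumulator, accumulator);

        System.out.println(accumulator.getLong(0)); // 450, i.e. 4.50
    }
}
```

The key point is that `accumulator` is reused across calls, so summing `N` columns performs no per-operation allocation.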
Concepts behind those proposals were investigated and benchmarked. The benchmark code is available here: https://github.com/Teradata/presto/commit/e1dac35111cdc3cab807c763b7e43c0c2ed36f4e. The assumptions are the following:

- `N` rows contain `1, 10, 100` columns. Each column is stored in multiple formats (all wrapped in `Slice`): a `long` primitive, a `long` primitive with an additional byte flag, a `double` primitive, four integers (the fast large decimal POC), and a Java `BigInteger` based decimal.
- Column data is stored in a `Slice` to simulate the Presto workflow (e.g. extracting a value from a `Block` or reading it from a hard drive); see the sketch after this list.
- Each benchmark computes the sum of all columns for each row. The result is stored in a `Slice` to simulate the Presto workflow (e.g. storing a value using a `BlockBuilder` or writing it to disk).
- Column data is generated from the `extendedPrice` column of the `lineitem` TPCH table.
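As an illustration of that setup, a column of small decimals might be packed into a `Slice` at 8 bytes per row. This layout is an assumption for demonstration only, not necessarily what the linked benchmark code does:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public class ColumnInSlice
{
    public static void main(String[] args)
    {
        int numberOfRows = 10_000;

        // Pack one column of unscaled long decimals into a single Slice,
        // 8 bytes per row, simulating values extracted from a Block
        Slice column = Slices.allocate(numberOfRows * Long.BYTES);
        for (int row = 0; row < numberOfRows; row++) {
            column.setLong(row * Long.BYTES, row); // placeholder values
        }

        // Reading a row back is a single indexed access
        System.out.println(column.getLong(42 * Long.BYTES)); // 42
    }
}
```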
There are two ways of computing the column sums:

- `sum(a_column, sum(b_column, sum(c_column, ...)))` -> in this approach the result of one sum is the input to another sum. This works well with Java primitives, but causes overhead when the sum result is a `Slice`, because every sum operation has to return a new `Slice` instance.
- `sum(a_column, 0, result); sum(b_column, result, result); ...` -> in this approach the sum operator gets a reference to a result `Slice` variable where the result should be stored. This approach is best for `DECIMAL`s that work on `Slice`s because unnecessary objects are not created. Both styles are sketched below.
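The difference between the two styles can be sketched as follows; the `sum` helpers are hypothetical stand-ins for the benchmarked operators, assuming a single unscaled `long` per `Slice`:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class SumStylesExample
{
    // Style 1: each sum returns a new Slice, so every call allocates
    static Slice sum(Slice a, Slice b)
    {
        Slice result = Slices.allocate(Long.BYTES);
        result.setLong(0, a.getLong(0) + b.getLong(0));
        return result;
    }

    // Style 2: the caller supplies the result Slice; no allocation per call
    static void sum(Slice a, Slice b, Slice result)
    {
        result.setLong(0, a.getLong(0) + b.getLong(0));
    }

    public static void main(String[] args)
    {
        Slice a = Slices.allocate(Long.BYTES);
        Slice b = Slices.allocate(Long.BYTES);
        Slice c = Slices.allocate(Long.BYTES);
        a.setLong(0, 1);
        b.setLong(0, 2);
        c.setLong(0, 3);

        // Style 1: nested calls allocate a new Slice per sum
        Slice nested = sum(a, sum(b, c));

        // Style 2: a single preallocated result Slice is reused
        Slice result = Slices.allocate(Long.BYTES);
        sum(a, b, result);
        sum(c, result, result);

        System.out.println(nested.getLong(0) + " " + result.getLong(0)); // 6 6
    }
}
```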
Here are the benchmark results:
```
Benchmark (numberOfColumns) (numberOfRows) Mode Cnt Score Error Units
DecimalBenchmark.benchmarkAddDouble 100 10000 avgt 10 11009680,098 ± 136702,006 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 100 10000 avgt 10 10102069,534 ± 192094,824 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 100 10000 avgt 10 10000773,936 ± 314484,661 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 100 10000 avgt 10 12233633,786 ± 281106,921 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 100 10000 avgt 10 12851717,002 ± 649326,304 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 100 10000 avgt 10 27868271,539 ± 641941,350 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 100 10000 avgt 10 23306906,663 ± 906315,500 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 100 10000 avgt 10 45606653,390 ± 1319108,845 ns/op
DecimalBenchmark.benchmarkAddDouble 10 10000 avgt 10 2691757,513 ± 42252,904 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 10 10000 avgt 10 2294961,172 ± 45196,712 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 10 10000 avgt 10 2105299,290 ± 154218,157 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 10 10000 avgt 10 2468484,502 ± 40638,119 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 10 10000 avgt 10 2882945,564 ± 233256,605 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 10 10000 avgt 10 4018341,231 ± 78284,635 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 10 10000 avgt 10 4435123,345 ± 118476,732 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 10 10000 avgt 10 6486706,982 ± 125816,358 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 10 10000 avgt 10 25570694,178 ± 718172,389 ns/op
DecimalBenchmark.benchmarkAddDouble 2 10000 avgt 10 531261,601 ± 7925,572 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 2 10000 avgt 10 582688,160 ± 13036,141 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 2 10000 avgt 10 576614,844 ± 61817,158 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 2 10000 avgt 10 558155,381 ± 60957,820 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 2 10000 avgt 10 834472,680 ± 111274,151 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 2 10000 avgt 10 1368111,040 ± 40764,386 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 2 10000 avgt 10 1402975,001 ± 54223,765 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 2 10000 avgt 10 1658072,818 ± 20921,517 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 2 10000 avgt 10 5707279,463 ± 152062,044 ns/op
DecimalBenchmark.benchmarkAddDouble 1 10000 avgt 10 153647,332 ± 8232,550 ns/op
DecimalBenchmark.benchmarkAddExactBigInt 1 10000 avgt 10 199451,630 ± 4178,584 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimal 1 10000 avgt 10 167010,986 ± 3828,009 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithCondition 1 10000 avgt 10 176952,276 ± 5064,431 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalWithFlag 1 10000 avgt 10 223903,387 ± 8515,021 ns/op
DecimalBenchmark.benchmarkAddExactSliceSmallDecimalNoAccumulator 1 10000 avgt 10 330760,579 ± 8886,068 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimal 1 10000 avgt 10 390820,057 ± 14403,589 ns/op
DecimalBenchmark.benchmarkAddFastSliceDecimalNoAccumulator 1 10000 avgt 10 594768,346 ± 8167,884 ns/op
DecimalBenchmark.benchmarkAddBigIntegerDecimal 1 10000 avgt 10 2527092,601 ± 85102,283 ns/op
```
Explanation of the benchmarks:

- `benchmarkAddExactBigInt` benchmarks operate on `long` Java primitives as inputs and results.
- `benchmarkAddExactSliceSmallDecimal` benchmarks operate on `long` Java primitives wrapped in `Slice` and passed as arguments. The result (accumulator) parameter is provided to the sum operator.
- `benchmarkAddExactSliceSmallDecimalWithCondition/Flag` benchmarks do additional checks on the input parameters. This simulates the dual representation of small and large decimals in a `Slice`. Condition checking means checking if the first bit of the extracted `long` primitive is set. Flag checking means checking an additional byte from the `Slice` (this requires one additional call to the `Slice.getByte()` method). Both variants are sketched below.
- `benchmarkAddExactSliceSmallDecimalNoAccumulator` simulates the case where a new `Slice` is returned as the result of the sum operator instead of using an accumulator.
- `benchmarkAddFastSliceDecimal` benchmarks the POC implementation of the fast large decimal.
- `benchmarkAddBigIntegerDecimal` benchmarks the `BigInteger` based decimal implementation.
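The two dual-representation checks could look roughly like this. The tag positions (the top bit of the first `long`, or a trailing flag byte) follow the description above, but the exact encodings here are assumptions for illustration:

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

public final class DualRepresentationExample
{
    // "Condition" variant: the top bit of the first long tags the
    // representation (bit layout is an assumption for illustration)
    static long addSmallWithCondition(Slice a, Slice b)
    {
        long rawA = a.getLong(0);
        long rawB = b.getLong(0);
        if ((rawA & Long.MIN_VALUE) != 0 || (rawB & Long.MIN_VALUE) != 0) {
            throw new UnsupportedOperationException("large decimal (four ints) path not shown");
        }
        return rawA + rawB;
    }

    // "Flag" variant: an extra byte after the long tags the representation,
    // which costs one additional Slice.getByte() call
    static long addSmallWithFlag(Slice a, Slice b)
    {
        if (a.getByte(Long.BYTES) != 0 || b.getByte(Long.BYTES) != 0) {
            throw new UnsupportedOperationException("large decimal (four ints) path not shown");
        }
        return a.getLong(0) + b.getLong(0);
    }

    public static void main(String[] args)
    {
        Slice a = Slices.allocate(Long.BYTES + 1);
        Slice b = Slices.allocate(Long.BYTES + 1);
        a.setLong(0, 100);
        b.setLong(0, 23);
        // flag bytes at offset 8 default to 0, i.e. the small representation

        System.out.println(addSmallWithCondition(a, b)); // 123
        System.out.println(addSmallWithFlag(a, b));      // 123
    }
}
```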
Conclusions:

- `benchmarkAddExactBigInt` and `benchmarkAddDouble` have performance comparable to `benchmarkAddExactSliceSmallDecimalWithCondition` and `benchmarkAddExactSliceSmallDecimal`. This confirms that it is possible to have a fast dual (as a `long` or four integers) representation of decimals in a `Slice`.
- `benchmarkAddExactSliceSmallDecimal` is about twice as fast as `benchmarkAddExactSliceSmallDecimalNoAccumulator`. This confirms that having an output `Slice` argument is beneficial.
- `benchmarkAddFastSliceDecimal` is 2-3 times slower than `benchmarkAddExactBigInt`, but about an order of magnitude faster than `benchmarkAddBigIntegerDecimal`, depending on the number of columns.