Skipping partial aggregation when it is not helping for high cardinality aggregates #11627

Merged · 5 commits · Aug 5, 2024

Conversation

@korowa (Contributor) commented Jul 23, 2024

Which issue does this PR close?

Related to #6937.
Closes #6937

Rationale for this change

Currently DataFusion (almost always) plans two aggregation operators -- Partial and Final -- executed one after another, with the Partial output serving as the Final input. When the aggregate input is close to unique, partial aggregation doesn't group the data well (the output row count is roughly the same as the input row count), and DataFusion ends up doing the same work twice.

The suggestion is to start skipping partial aggregation after some fixed number of input rows if, at that point, the ratio of accumulated unique groups to input rows exceeds some fixed threshold value (which by default is somewhere between 0.5 and 1, but closer to 1), and to produce batches "as-is", replacing aggregate accumulator inputs with the corresponding intermediate aggregate states (so as not to break the record batch schema for downstream operators -- specifically, for CoalesceBatches).
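
A minimal sketch of the probe described above (type, field, and method names are illustrative, not the exact implementation in this PR):

/// Decides, once, whether partial aggregation is worth continuing.
struct SkipAggregationProbe {
    /// Input rows to observe before making the decision
    rows_threshold: usize,
    /// Skip partial aggregation once unique groups / input rows reaches this value
    ratio_threshold: f64,
    input_rows: usize,
    num_groups: usize,
    /// None while still probing; Some(true) once skipping is enabled
    decision: Option<bool>,
}

impl SkipAggregationProbe {
    fn update(&mut self, batch_rows: usize, total_groups: usize) {
        if self.decision.is_some() {
            return; // the decision is made once and then sticks
        }
        self.input_rows += batch_rows;
        self.num_groups = total_groups;
        if self.input_rows >= self.rows_threshold {
            // Grouping barely reduces cardinality -- emit input converted
            // to intermediate aggregate state instead of aggregating it
            let ratio = self.num_groups as f64 / self.input_rows as f64;
            self.decision = Some(ratio >= self.ratio_threshold);
        }
    }

    fn should_skip(&self) -> bool {
        self.decision == Some(true)
    }
}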

What changes are included in this PR?

  • Execution configuration options skip_partial_aggregation_probe_rows_threshold and skip_partial_aggregation_probe_ratio_threshold -- the first controls how many input rows to aggregate before checking the aggregation ratio, the second sets the ratio threshold (see the sketch after this list)
  • GroupedHashAggregateStream.skip_aggregation_probe and related methods for updating its state / determining whether further input aggregation may be skipped
  • GroupsAccumulator.convert_to_state and its implementations for PrimitiveGroupsAccumulator (sum / min / max) and Count accumulators -- the method responsible for converting a RecordBatch to intermediate aggregate state without grouping the input data, and GroupsAccumulator.convert_to_state_supported, which indicates whether the accumulator is able to perform the conversion described above.
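
A sketch of how these options might be set at runtime (the values are illustrative, not recommendations; assumes a Tokio runtime and SET statement support in SessionContext):

use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Check the aggregation ratio only after this many input rows per partition
    ctx.sql("SET datafusion.execution.skip_partial_aggregation_probe_rows_threshold = 200000")
        .await?;
    // Skip partial aggregation once unique groups / input rows reaches 0.9
    ctx.sql("SET datafusion.execution.skip_partial_aggregation_probe_ratio_threshold = 0.9")
        .await?;
    Ok(())
}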

Are these changes tested?

Added tests for switching the aggregate stream to the SkippingAggregation state, and sqllogictests to validate the correctness of accumulators in skipping-aggregation mode.

Are there any user-facing changes?

Partial aggregation results may now contain records with duplicate values of the GROUP BY expressions

@github-actions bot added the documentation, logical-expr, and sqllogictest labels (Jul 23, 2024)
@alamb (Contributor) commented Jul 23, 2024

Thank you @korowa -- I think this is the right approach. The challenge when I tried it before was that it slowed down some queries. We should run some benchmarks (I can help maybe tomorrow)


match opt_filter {
Some(filter) => {
values
Contributor:

Can use filter kernel here instead of zipping?

Contributor Author:

Not sure about filter, but kernels sound like a good idea, I'll try to switch to using them.

Contributor:

I think @Dandandan is suggesting using https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html

However, that would likely require a second copy of the values (apply/collect filter result and then apply prim_fn)

@korowa (Contributor Author) Jul 28, 2024:

Yes, but the example shows that it filters values out of the source array, while conversion to state must produce the same number of elements, just placing nulls/zeros instead of the filtered values, so I'm planning to look for something like an "apply null mask" operation.

I've started with some benchmarks (criterion-based ones) and they show that the current code for nullable columns (at least for count) is significantly slower than for non-nullable ones (~15 times 😞); probably some part of this time can be recovered.

Contributor:

I see -- we can't use filter here, as we need to produce the values as-is.

I think we should be able to build the values based on the values buffer and handle nulls separately:

  • no filter: just pass null mask of values
  • filter present: bitwise_and both null masks

this should also be beneficial for the non-null case, as it avoids the iterator/builder

Contributor Author:

For "no filter" -- casting values.logical_nulls() to i64 helps a bit. Regarding bitwise_and -- I'll try (the problem with all logical functions is that filter may also contain nulls)

@alamb (Contributor) commented Jul 25, 2024

I am starting to run clickbench and tpch benchmarks on this PR. Will report results shortly.

It is a really neat idea to have the thresholds configurable

@alamb (Contributor) commented Jul 25, 2024

Here are my benchmark results - they look quite good. Other than ClickBench Q32 and TPCH Q17 they all look faster 😍

Details

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.84ms │                   0.86ms │     no change │
│ QQuery 1     │    95.11ms │                  96.03ms │     no change │
│ QQuery 2     │   200.81ms │                 209.94ms │     no change │
│ QQuery 3     │   208.67ms │                 207.07ms │     no change │
│ QQuery 4     │  2233.18ms │                2095.56ms │ +1.07x faster │
│ QQuery 5     │  2059.66ms │                2015.05ms │     no change │
│ QQuery 6     │    83.99ms │                  86.96ms │     no change │
│ QQuery 7     │    99.67ms │                 101.10ms │     no change │
│ QQuery 8     │  3235.66ms │                3017.30ms │ +1.07x faster │
│ QQuery 9     │  2419.16ms │                2350.53ms │     no change │
│ QQuery 10    │   848.81ms │                 857.76ms │     no change │
│ QQuery 11    │   926.94ms │                 933.87ms │     no change │
│ QQuery 12    │  2176.13ms │                2087.42ms │     no change │
│ QQuery 13    │  4677.48ms │                3770.29ms │ +1.24x faster │
│ QQuery 14    │  2938.45ms │                2845.23ms │     no change │
│ QQuery 15    │  2504.24ms │                2371.75ms │ +1.06x faster │
│ QQuery 16    │  6069.34ms │                5811.38ms │     no change │
│ QQuery 17    │  5991.68ms │                5856.53ms │     no change │
│ QQuery 18    │ 12199.74ms │               11468.73ms │ +1.06x faster │
│ QQuery 19    │   171.89ms │                 171.08ms │     no change │
│ QQuery 20    │  2693.33ms │                2795.76ms │     no change │
│ QQuery 21    │  3491.08ms │                3566.37ms │     no change │
│ QQuery 22    │  9438.41ms │                9598.53ms │     no change │
│ QQuery 23    │ 22160.51ms │               22473.59ms │     no change │
│ QQuery 24    │  1344.81ms │                1409.66ms │     no change │
│ QQuery 25    │  1167.37ms │                1182.06ms │     no change │
│ QQuery 26    │  1482.09ms │                1518.54ms │     no change │
│ QQuery 27    │  4044.47ms │                4035.97ms │     no change │
│ QQuery 28    │ 29023.37ms │               30566.78ms │  1.05x slower │
│ QQuery 29    │  1064.52ms │                1076.49ms │     no change │
│ QQuery 30    │  2553.83ms │                2598.63ms │     no change │
│ QQuery 31    │  3274.52ms │                3309.47ms │     no change │
│ QQuery 32    │ 17306.62ms │               18361.28ms │  1.06x slower │
│ QQuery 33    │  9624.79ms │                9860.60ms │     no change │
│ QQuery 34    │  9610.64ms │                9676.40ms │     no change │
│ QQuery 35    │  3800.23ms │                3819.15ms │     no change │
│ QQuery 36    │   352.06ms │                 351.07ms │     no change │
│ QQuery 37    │   238.56ms │                 238.28ms │     no change │
│ QQuery 38    │   196.30ms │                 204.93ms │     no change │
│ QQuery 39    │  1122.84ms │                1152.25ms │     no change │
│ QQuery 40    │   101.24ms │                  96.72ms │     no change │
│ QQuery 41    │    85.59ms │                  84.80ms │     no change │
│ QQuery 42    │   104.55ms │                 104.32ms │     no change │
└──────────────┴────────────┴──────────────────────────┴───────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 173423.19ms │
│ Total Time (skip-partial-aggregation)   │ 174436.08ms │
│ Average Time (main_base)                │   4033.10ms │
│ Average Time (skip-partial-aggregation) │   4056.65ms │
│ Queries Faster                          │           5 │
│ Queries Slower                          │           2 │
│ Queries with No Change                  │          36 │
└─────────────────────────────────────────┴─────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3850.08ms │                3858.37ms │     no change │
│ QQuery 1     │ 1558.99ms │                1493.93ms │     no change │
│ QQuery 2     │ 3150.28ms │                2935.05ms │ +1.07x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 8559.35ms │
│ Total Time (skip-partial-aggregation)   │ 8287.35ms │
│ Average Time (main_base)                │ 2853.12ms │
│ Average Time (skip-partial-aggregation) │ 2762.45ms │
│ Queries Faster                          │         1 │
│ Queries Slower                          │         0 │
│ Queries with No Change                  │         2 │
└─────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  203.55ms │                 194.22ms │     no change │
│ QQuery 2     │   38.30ms │                  35.41ms │ +1.08x faster │
│ QQuery 3     │   61.24ms │                  59.49ms │     no change │
│ QQuery 4     │   65.67ms │                  60.42ms │ +1.09x faster │
│ QQuery 5     │  102.54ms │                  92.60ms │ +1.11x faster │
│ QQuery 6     │   15.35ms │                  13.56ms │ +1.13x faster │
│ QQuery 7     │  210.69ms │                 202.60ms │     no change │
│ QQuery 8     │   39.94ms │                  39.42ms │     no change │
│ QQuery 9     │  115.73ms │                 107.84ms │ +1.07x faster │
│ QQuery 10    │  103.60ms │                 101.10ms │     no change │
│ QQuery 11    │   73.46ms │                  71.29ms │     no change │
│ QQuery 12    │   47.43ms │                  44.81ms │ +1.06x faster │
│ QQuery 13    │   80.77ms │                  74.77ms │ +1.08x faster │
│ QQuery 14    │   18.05ms │                  18.70ms │     no change │
│ QQuery 15    │   32.48ms │                  29.60ms │ +1.10x faster │
│ QQuery 16    │   42.73ms │                  37.71ms │ +1.13x faster │
│ QQuery 17    │  160.06ms │                 159.77ms │     no change │
│ QQuery 18    │  463.05ms │                 428.43ms │ +1.08x faster │
│ QQuery 19    │   48.21ms │                  46.87ms │     no change │
│ QQuery 20    │  102.42ms │                  80.20ms │ +1.28x faster │
│ QQuery 21    │  295.09ms │                 266.05ms │ +1.11x faster │
│ QQuery 22    │   23.35ms │                  21.86ms │ +1.07x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘

I am going to rerun the numbers to make sure they are reproducible and then give this PR a closer look

@alamb (Contributor) commented Jul 25, 2024

I am going to rerun the numbers to make sure they are reproducible and then give this PR a closer look

The subsequent runs look good (I don't think there is any slowdown in TPCH Q17, but there is still a slowdown in ClickBench Q32)

Details


--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.83ms │                   0.88ms │  1.06x slower │
│ QQuery 1     │    96.66ms │                  97.23ms │     no change │
│ QQuery 2     │   193.35ms │                 197.50ms │     no change │
│ QQuery 3     │   207.49ms │                 207.81ms │     no change │
│ QQuery 4     │  2230.56ms │                2259.35ms │     no change │
│ QQuery 5     │  2045.20ms │                2127.06ms │     no change │
│ QQuery 6     │    85.28ms │                  88.51ms │     no change │
│ QQuery 7     │   102.05ms │                  99.53ms │     no change │
│ QQuery 8     │  3233.59ms │                3276.54ms │     no change │
│ QQuery 9     │  2411.04ms │                2454.00ms │     no change │
│ QQuery 10    │   857.96ms │                 869.92ms │     no change │
│ QQuery 11    │   941.82ms │                 938.58ms │     no change │
│ QQuery 12    │  2162.34ms │                2202.18ms │     no change │
│ QQuery 13    │  4619.49ms │                3945.16ms │ +1.17x faster │
│ QQuery 14    │  2925.89ms │                2965.62ms │     no change │
│ QQuery 15    │  2504.40ms │                2503.25ms │     no change │
│ QQuery 16    │  6050.13ms │                6101.81ms │     no change │
│ QQuery 17    │  6006.98ms │                5982.81ms │     no change │
│ QQuery 18    │ 12183.46ms │               11770.51ms │     no change │
│ QQuery 19    │   176.35ms │                 178.67ms │     no change │
│ QQuery 20    │  2748.47ms │                2728.24ms │     no change │
│ QQuery 21    │  3540.80ms │                3529.89ms │     no change │
│ QQuery 22    │  9516.53ms │                9674.69ms │     no change │
│ QQuery 23    │ 22398.86ms │               22611.21ms │     no change │
│ QQuery 24    │  1363.24ms │                1404.88ms │     no change │
│ QQuery 25    │  1171.56ms │                1210.13ms │     no change │
│ QQuery 26    │  1505.58ms │                1535.15ms │     no change │
│ QQuery 27    │  4077.71ms │                4075.32ms │     no change │
│ QQuery 28    │ 28976.66ms │               30911.26ms │  1.07x slower │
│ QQuery 29    │  1022.97ms │                1047.79ms │     no change │
│ QQuery 30    │  2589.79ms │                2533.81ms │     no change │
│ QQuery 31    │  3310.10ms │                3238.71ms │     no change │
│ QQuery 32    │ 17074.56ms │               17987.52ms │  1.05x slower │
│ QQuery 33    │  9640.00ms │                9704.17ms │     no change │
│ QQuery 34    │  9720.05ms │                9635.63ms │     no change │
│ QQuery 35    │  3796.26ms │                3825.34ms │     no change │
│ QQuery 36    │   344.07ms │                 357.52ms │     no change │
│ QQuery 37    │   237.58ms │                 238.73ms │     no change │
│ QQuery 38    │   201.65ms │                 205.97ms │     no change │
│ QQuery 39    │  1150.94ms │                1203.07ms │     no change │
│ QQuery 40    │    94.45ms │                 100.77ms │  1.07x slower │
│ QQuery 41    │    87.01ms │                  84.64ms │     no change │
│ QQuery 42    │   104.06ms │                 104.01ms │     no change │
└──────────────┴────────────┴──────────────────────────┴───────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 173707.77ms │
│ Total Time (skip-partial-aggregation)   │ 176215.37ms │
│ Average Time (main_base)                │   4039.72ms │
│ Average Time (skip-partial-aggregation) │   4098.03ms │
│ Queries Faster                          │           1 │
│ Queries Slower                          │           4 │
│ Queries with No Change                  │          38 │
└─────────────────────────────────────────┴─────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3837.97ms │                3863.80ms │     no change │
│ QQuery 1     │ 1551.46ms │                1475.34ms │     no change │
│ QQuery 2     │ 3138.85ms │                2967.22ms │ +1.06x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 8528.27ms │
│ Total Time (skip-partial-aggregation)   │ 8306.35ms │
│ Average Time (main_base)                │ 2842.76ms │
│ Average Time (skip-partial-aggregation) │ 2768.78ms │
│ Queries Faster                          │         1 │
│ Queries Slower                          │         0 │
│ Queries with No Change                  │         2 │
└─────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  197.75ms │                 194.26ms │     no change │
│ QQuery 2     │   36.69ms │                  36.57ms │     no change │
│ QQuery 3     │   63.32ms │                  59.03ms │ +1.07x faster │
│ QQuery 4     │   71.51ms │                  65.53ms │ +1.09x faster │
│ QQuery 5     │   99.09ms │                  95.95ms │     no change │
│ QQuery 6     │   14.75ms │                  14.78ms │     no change │
│ QQuery 7     │  211.05ms │                 216.63ms │     no change │
│ QQuery 8     │   40.90ms │                  40.35ms │     no change │
│ QQuery 9     │  106.66ms │                 109.15ms │     no change │
│ QQuery 10    │  106.03ms │                 101.38ms │     no change │
│ QQuery 11    │   73.27ms │                  71.56ms │     no change │
│ QQuery 12    │   48.35ms │                  45.22ms │ +1.07x faster │
│ QQuery 13    │   81.21ms │                  80.28ms │     no change │
│ QQuery 14    │   19.21ms │                  18.23ms │ +1.05x faster │
│ QQuery 15    │   30.57ms │                  32.26ms │  1.06x slower │
│ QQuery 16    │   41.27ms │                  38.89ms │ +1.06x faster │
│ QQuery 17    │  153.12ms │                 159.38ms │     no change │
│ QQuery 18    │  451.90ms │                 447.09ms │     no change │
│ QQuery 19    │   47.87ms │                  46.85ms │     no change │
│ QQuery 20    │  107.44ms │                  87.75ms │ +1.22x faster │
│ QQuery 21    │  293.86ms │                 283.98ms │     no change │
│ QQuery 22    │   22.71ms │                  21.86ms │     no change │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 2318.52ms │
│ Total Time (skip-partial-aggregation)   │ 2266.97ms │
│ Average Time (main_base)                │  105.39ms │
│ Average Time (skip-partial-aggregation) │  103.04ms │
│ Queries Faster                          │         6 │
│ Queries Slower                          │         1 │
│ Queries with No Change                  │        15 │
└─────────────────────────────────────────┴───────────┘

@alamb (Contributor) left a comment:

This is really cool @korowa . Thank you so much

Not only is it cool that it improves performance in many cases, it is also cool that it takes an incremental approach (convert_to_state can be implemented for more GroupsAccumulators over time)

I have two concerns:

  1. That this approach may overfit the problem (aka that it isn't generalizable outside the context of the benchmark runs)
  2. That this approach might preclude making some larger changes (like simply turning off the intermediate generation)

and to produce batches "as-is", replacing aggregate accumulator inputs with the corresponding intermediate aggregate states (so as not to break the record batch schema for downstream operators -- specifically, for CoalesceBatches)

I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase (as in, avoid having to convert the input into the state)? The challenge, as you have pointed out, is that the state types may be different from the input types, so it would likely be a larger/more involved change 🤔

I want to think about this PR some more, but I think it is really nice and I am inclined to say we should proceed with this approach

I think to merge it I would like to see:

  1. Some more background comments on why this approach (the existing code in this PR is already very well commented about what it does 🥇 ) -- I plan to help with this
  2. Look into why the clickbench queries got slower (I am worried there is some tuning now required which will be hard to get totally optimal)

}

// Transforms input batch to intermediate aggregate state, without grouping it
fn transform_to_states(&self, batch: RecordBatch) -> Result<RecordBatch> {
Contributor:

This is quite clever

datafusion/functions-aggregate/src/count.rs:
});
builder.finish()
}
(None, None) => Int64Array::from_value(1, values.len()),
Contributor:

it is unfortunate that we need to create this over and over again 🤔

@korowa (Contributor Author) commented Jul 25, 2024

@alamb thank you for sharing the benchmark results -- I'll check whether any of them benefited from this feature (I suppose it shouldn't be triggered in many of them) and will look for the possible reasons for the q32 (and other queries) slowdown (actually this one, q32, should benefit the most once producing state is implemented for avg), plus I'll additionally check whether generating state can be done faster via kernels instead of loops.

Regarding your comments:

That this approach may overfit the problem (aka that it isn't generalizable outside the context of the benchmark runs)

Probably, but I intended this idea to be the opposite of overfitting, since it relies more on the input data rather than on fixed settings (I may be wrong here, however).

That this approach might preclude making some larger changes (like simply turning off the intermediate generation)
...
I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase

Initially I had been considering making partial aggregation just propagate input batches as-is and adding an internal flag to their schema metadata (pointing the final aggregation to use update_batch instead of merge_batch), but decided that it may be too "pipeline breaking" due to the differing batch schemas (as you've pointed out) -- I suppose it would require additional logic in CoalesceBatchesExec (it won't be able to concat batches with different schemas coming from partitions with partial aggregation on/off), it could be a blocker for any (as yet unplanned) optimizations of RepartitionExec (in case there are buffers with embedded batch concatenation), and it may also be a burden for DF-based projects that do data shuffling across nodes and have their own shuffle operators. Overall, I decided that the current approach is safer (at least at this moment), as it doesn't affect anything besides the aggregation operator.

let mut builder = Int64Builder::with_capacity(values.len());
nulls
.into_iter()
.for_each(|is_valid| builder.append_value(is_valid as i64));
Contributor:

I believe .collect() should be slightly faster and less verbose than a builder here.

Contributor:

even better, we should be able to cast the null array to int64

Contributor Author:

FWIW: into_iter().map().collect::<Int64Array>() seems to be slower than appending values to the builder 🤔

@alamb (Contributor) commented Jul 27, 2024

@alamb thank you for sharing the benchmark results -- I'll check whether any of them benefited from this feature (I suppose it shouldn't be triggered in many of them) and will look for the possible reasons for the q32 (and other queries) slowdown (actually this one, q32, should benefit the most once producing state is implemented for avg), plus I'll additionally check whether generating state can be done faster via kernels instead of loops.

Awesome -- I am planning to look into them as well

That this approach may overfit the problem (aka that it isn't generalizable outside the context of the benchmark runs)

Probably, but I intended this idea to be the opposite of overfitting, since it relies more on the input data rather than on fixed settings (I may be wrong here, however).

The more I think about it, the more I agree with you. While there are tuning knobs (e.g. the fraction of tuples aggregated) I do think they are general.

That this approach might preclude making some larger changes (like simply turning off the intermediate generation)
...
I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase

Initially I had been considering making partial aggregation just propagate input batches as-is and adding an internal flag to their schema metadata (pointing the final aggregation to use update_batch instead of merge_batch), but decided that it may be too "pipeline breaking" due to the differing batch schemas (as you've pointed out) -- I suppose it would require additional logic in CoalesceBatchesExec (it won't be able to concat batches with different schemas coming from partitions with partial aggregation on/off), it could be a blocker for any (as yet unplanned) optimizations of RepartitionExec (in case there are buffers with embedded batch concatenation), and it may also be a burden for DF-based projects that do data shuffling across nodes and have their own shuffle operators. Overall, I decided that the current approach is safer (at least at this moment), as it doesn't affect anything besides the aggregation operator.

I think this makes sense and I agree with your conclusion

@alamb (Contributor) commented Jul 27, 2024

My plan here is to spend time tomorrow morning doing some additional investigation / testing on the branch and unless I find any blockers I think we should proceed with it.

What I am thinking is that between this PR and the StringView PR #11667 we are going to be in pretty sweet shape.

The improvements with this change are so compelling in my opinion that I think we can document any potential performance regressions that this PR causes, and then work on them as a follow on before the release.

@korowa (Contributor Author) commented Jul 28, 2024

FWIW, regarding benchmarks -- running with target_partitions=4 shows that this feature kicks in for clickbench Q13 (count distinct is rewritten into a double group by) and tpch Q20 (one of the filters contains a correlated subquery with aggregation). Partial aggregation is also skipped on 1/4 partitions in clickbench Q18 and tpch Q16. As a result, I'd expect performance improvements only in clickbench Q13 and tpch Q20 (I don't think 1/4 partitions in the other two queries can have any effect), and I suppose the improvements shown by any other queries are just a matter of luck and fluctuations -- I wasn't able to find any stable regressions during local benchmark runs.

Regarding Q32 -- I've run it separately and got equal runtimes for both branches (due to AVG it's not yet able to skip partial aggregation)

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     master ┃ skip-partial-aggregation ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │ 19399.38ms │               19424.11ms │ no change │
└──────────────┴────────────┴──────────────────────────┴───────────┘

@alamb (Contributor) commented Jul 28, 2024

I spent some time this morning playing around with ClickBench query 32 locally and I agree any slowdown does not look significant or a blocker.

Q32

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Running from datafusion-cli:

./datafusion-cli-skip-partial -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Here are the timings I got:

skip-partial
4.570
4.280
4.528
main
4.564
4.444
4.441

@alamb (Contributor) commented Jul 28, 2024

I also tried out Q32 (which has AVG, so it can't use this optimization yet) but removed the AVG and set target partitions to something silly. I see this PR making a substantial difference (6s vs 7s)

1000 partitions, this PR

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-skip-partial -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Elapsed 0.001 seconds.

+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 9102894172721185728 | 1489622498  | 1 | 1                           |
| 8964981845434484863 | 1822336830  | 1 | 0                           |
| 6991883311913569583 | -745122562  | 1 | 0                           |
| 6787783378461221127 | -506600142  | 1 | 0                           |
| 6042898921955304644 | 2054220936  | 1 | 0                           |
| 5581365862985039198 | 104944290   | 1 | 0                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 6.378 seconds.

1000 partitions, main

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v40.0.0
0 row(s) fetched.
Elapsed 0.002 seconds.

+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 6780795588237729988 | 1894276368  | 1 | 1                           |
| 6158430646513894356 | -1557291761 | 1 | 0                           |
| 8433113762047612962 | 1214823432  | 1 | 0                           |
| 8783130976633619349 | 1072197582  | 1 | 0                           |
| 4959259883895284379 | 2023656393  | 1 | 0                           |
| 6328586531975293675 | 1549952556  | 1 | 1                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 7.771 seconds.

@alamb (Contributor) left a comment:

I spent a bunch more time reviewing this PR today and I think it is good and could be merged as is. Thank you so much @korowa and @Dandandan!

Before merging this PR I think we need

  • Run the benchmarks one more time
  • Give it a few more days to gather any more review comments

Here are the follow up items I suggest (and I can file tickets):

  • More documentation (I started here: Improve aggregation documentation for multi-phase aggregation #11695)
  • Add a metric to record when a group by switches to skip-partial-aggregation mode, so we can see when this happens in EXPLAIN ANALYZE plans
  • File tickets to support convert_to_state for other GroupsAccumulators (AVG, for example) -- I think this could be done more easily by the larger community after the additional documentation (and they can follow the test pattern you have in this PR)

FYI @kazuyukitanimura -- I wonder if you have time to review this change in the context of hash aggregate spilling as you originally contributed #7400

Context:

@@ -90,6 +94,69 @@ struct SpillState {
merging_group_by: PhysicalGroupBy,
}

struct SkipAggregationProbe {

@@ -484,6 +612,12 @@ impl Stream for GroupedHashAggregateStream {
(
if self.input_done {
ExecutionState::Done
} else if self
Contributor:

Nit: putting this into a function (like self.should_skip_aggregation()) would make this logic easier to follow

Contributor:

Filed #11821 with a proposal for this change
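
A self-contained sketch of the suggested refactor (the types here are stand-ins for the ones in this PR):

struct SkipAggregationProbe {
    skipping: bool,
}

impl SkipAggregationProbe {
    fn should_skip(&self) -> bool {
        self.skipping
    }
}

struct GroupedHashAggregateStream {
    skip_aggregation_probe: Option<SkipAggregationProbe>,
}

impl GroupedHashAggregateStream {
    /// Naming the poll-loop condition keeps the state machine readable
    fn should_skip_aggregation(&self) -> bool {
        self.skip_aggregation_probe
            .as_ref()
            .is_some_and(|probe| probe.should_skip())
    }
}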

@korowa (Contributor Author) commented Jul 28, 2024

1000 partitions

@alamb this is also a bit unexpected, since the default number of rows after which the check fires is 100_000, applied per partition (each partition will process at least 100k rows normally, without skipping aggregation), and the total number of rows in the file is ~100M (if I'm not mistaken). So this optimization should not help in this case, as with 1000 partitions each partition will read only ~100_000 rows anyway 🤔

@ozankabak (Contributor):
We will also take a look today or tomorrow

let mut builder = Int64Builder::with_capacity(values.len());
nulls.into_iter().zip(filter.iter()).for_each(
|(is_valid, filter_value)| {
builder.append_value(
Contributor:

bitwise_and + cast?

@alamb (Contributor) Jul 30, 2024:

Maybe we can use the nullif kernel here

Something like

let nulls = and(nulls, not(filter));
let output = nullif(values);

Update: or maybe we could just and the nulls from the input and the filter (as nulls is the validity mask) 🤔

Contributor:

I came up with this in #11734:

/// Converts a `BooleanBuffer` representing a filter to a `NullBuffer`
/// where the NullBuffer is true for all values that were true
/// in the filter and `null` for any values that were false or null
fn filter_to_nulls(filter: &BooleanArray) -> Option<NullBuffer> {
    let (filter_bools, filter_nulls) = filter.clone().into_parts();
    // Only keep values where the filter was true
    // convert all false to null
    let filter_bools = NullBuffer::from(filter_bools);
    NullBuffer::union(Some(&filter_bools), filter_nulls.as_ref())
}

/// Compute the final null mask for an array
///
/// The output null mask :
/// * is true (non null) for all values that were true in the filter and non null in the input
/// * is false (null) for all values that were false in the filter or null in the input
fn filtered_null_mask(
    opt_filter: Option<&BooleanArray>,
    input: &dyn Array,
) -> Option<NullBuffer> {
    let opt_filter = opt_filter.and_then(filter_to_nulls);
    NullBuffer::union(opt_filter.as_ref(), input.nulls())
}

And then you compute the final null mask without messing with the input:

        let nulls = filtered_null_mask(opt_filter, sums);
        let sums = PrimitiveArray::<T>::new(sums.values().clone(), nulls)
            .with_data_type(self.sum_data_type.clone());

Contributor:

Nice, using NullBuffer::union is much better for readability

Contributor Author:

That was the missing link for me (thank you!) -- we can operate directly on underlying buffers.

I've rewritten the state conversion for count to use bitand on buffers plus a cast to Int64 at the end, and according to the benchmarks from the commit it got 20-25% faster.

Just a suggestion -- wouldn't it be better to use BooleanBuffer + & (the bitand operator) instead of NullBuffer + union? NullBuffer is a bit confusing, so I've "pulled" the logic from union right into the state conversion function.

Additionally, I plan to prepare benches and minimize ArrayBuilder usage for min / max / sum tomorrow.

Contributor:

I've rewritten the state conversion for count to use bitand on buffers plus a cast to Int64 at the end, and according to the benchmarks from the commit it got 20-25% faster.

🎉

Just a suggestion -- wouldn't it be better to use BooleanBuffer + & (the bitand operator) instead of NullBuffer + union? NullBuffer is a bit confusing, so I've "pulled" the logic from union right into the state conversion function.

I think they are equivalent: NullBuffer just wraps BooleanBuffer and NullBuffer::union just calls & underneath: https://docs.rs/arrow-buffer/52.2.0/src/arrow_buffer/buffer/null.rs.html#76 (after replicating the match(nulls, filter) logic)

I don't have a strong opinion about which is more/less confusing
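
For illustration, a tiny self-contained check of that equivalence (a sketch against arrow-buffer's public API):

use arrow_buffer::{BooleanBuffer, NullBuffer};

fn main() {
    let a = NullBuffer::new(BooleanBuffer::from_iter([true, true, false, true]));
    let b = NullBuffer::new(BooleanBuffer::from_iter([true, false, true, true]));

    // NullBuffer::union of two present masks...
    let via_union = NullBuffer::union(Some(&a), Some(&b)).unwrap();
    // ...is a bitwise AND of the underlying BooleanBuffers
    let via_bitand: BooleanBuffer = a.inner() & b.inner();

    assert_eq!(via_union.inner(), &via_bitand);
}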

What I suggest we do is pull the logic to compute the output null mask based on the optional input null mask and the optional filter into a function (like fn filtered_null_mask), as it will be used in basically all of the convert_to_state implementations. As long as it is well documented, I think either implementation will work well

Additionally, I plan to prepare benches and minimize ArrayBuilder usage for min / max / sum tomorrow.

Sounds good -- would you like to keep updating this PR or shall we merge this PR and continue improvements with additional PRs on main?

Contributor Author:

I'd like to make these few changes in this PR (along with merging the docs update and review suggestions) -- I don't think it'll take long enough to accumulate any significant conflicts.

@alamb (Contributor) Aug 1, 2024:

Sounds good. We will wait for you to let us know when it is ready to merge

filter.into_iter().for_each(|filter_value| {
builder.append_value(filter_value.is_some_and(|val| val) as i64)
});
builder.finish()
Contributor:

cast?

@Dandandan (Contributor):
I added a couple of suggestions for performance

@alamb (Contributor) commented Jul 29, 2024

So this optimization should not help in this case, as with 1000 partitions each partition will read only ~100_000 rows anyway 🤔

@korowa If we added a metric that tracks when this mode kicks in, I think it would be easier to diagnose what is going on. I will make a PR to do so.

@alamb (Contributor) commented Jul 29, 2024

We will also take a look today or tomorrow

@ozankabak if I may toot my own horn a bit, I would personally suggest checking out the docs I wrote in korowa#172 (and #11695) before the code of this PR, as I tried to explain at a higher level what it is doing.

@korowa (Contributor Author) commented Aug 3, 2024

So if you are ok with that @korowa let's get this green and merge.

@alamb I'm totally fine with that -- taking into account that there are already some follow-ups/improvements for this feature, it's not worth blocking them (since making state conversion for COUNT faster will probably take me some time).

Please let me know if there are any changes/fixes that have to be done in order to make this PR ready for merging.

@alamb (Contributor) left a comment:

I took another look at this PR and I think it is looking very nice. Thank you again @korowa and all reviewers

I will plan to merge it tomorrow (Monday) and file follow on tickets to track additional work.

@alamb (Contributor) commented Aug 5, 2024

🚀

@alamb (Contributor) commented Aug 5, 2024

Thank you again everyone for all your work.

I am hoping this is the first step towards some significantly improved TPCH / ClickBench performance

I filed the following follow-on tickets / PRs:

Labels: api change, documentation, logical-expr, sqllogictest

Successfully merging this pull request may close these issues.

Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates