Backport: [SPARK-12213][SQL] use multiple partitions for single distinct query #148

mbautin · 2016-02-01T18:26:23Z

Original commit message by Davies Liu:

Currently, we can generate two different plans for a query with a single distinct aggregate (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other (the default) works better on high-cardinality columns.

This PR changes the planner to generate a single plan (three aggregations and two exchanges) that works better in both cases, so we can safely remove the flag spark.sql.specializeSingleDistinctAggPlanning (introduced in 1.6).

For a query like SELECT COUNT(DISTINCT a) FROM table, the plan will be:

AGG-4 (count distinct)
  Shuffle to a single reducer
    Partial-AGG-3 (count distinct, no grouping)
      Partial-AGG-2 (grouping on a)
        Shuffle by a
          Partial-AGG-1 (grouping on a)
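
To make the data flow in this plan concrete, here is a minimal plain-Python sketch of the same stages, read bottom-up. It is only an illustration, not Spark's implementation: the function name `count_distinct`, the `num_shuffle_partitions` parameter, and the use of Python lists as partitions are all assumptions made for the example.

```python
from collections import defaultdict


def count_distinct(input_partitions, num_shuffle_partitions=4):
    """Simulate COUNT(DISTINCT a) over a list of partitions (hypothetical helper)."""
    # Partial-AGG-1 (grouping on a): de-duplicate within each input partition.
    # None is dropped here because SQL COUNT ignores NULLs.
    locally_distinct = [
        {a for a in part if a is not None} for part in input_partitions
    ]

    # Shuffle by a: equal values always hash to the same shuffle partition.
    shuffled = defaultdict(list)
    for part in locally_distinct:
        for a in part:
            shuffled[hash(a) % num_shuffle_partitions].append(a)

    # Partial-AGG-2 (grouping on a): de-duplicate across input partitions.
    globally_distinct = [set(values) for values in shuffled.values()]

    # Partial-AGG-3 (count distinct, no grouping): one partial count per partition.
    partial_counts = [len(group) for group in globally_distinct]

    # Shuffle to a single reducer, then AGG-4: add up the partial counts.
    return sum(partial_counts)


print(count_distinct([[1, 2, 2], [2, 3], [3, 3, 4]]))  # prints 4
```

In this sketch the de-duplication work is spread over several shuffle partitions, and the single reducer at the end only has to add up one number per partition.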

This PR also includes a large refactoring of the aggregation code (removing 500+ lines).

cc yhuai nongli marmbrus

Author: Davies Liu [email protected]

Closes apache#10228 from davies/single_distinct.

mbautin assigned markhamstra Feb 1, 2016

markhamstra added a commit that referenced this pull request Feb 1, 2016

Merge pull request #148 from mbautin/csd-1.6_SPARK-12213

bea8845

Backport: [SPARK-12213][SQL] use multiple partitions for single distinct query

markhamstra merged commit bea8845 into alteryx:csd-1.6 Feb 1, 2016
