feat: Add Cube UDTF #3935

vpapavas · 2019-11-21T01:16:01Z

Description

The cube UDTF takes a list of d columns as its argument and creates up to 2^d rows (fewer when null values in the input would produce duplicate rows, which are excluded). Normally, cube is an aggregate operator (an extension to GROUP BY) used on multi-dimensional data. It is a popular feature supported by many RDBMSs as well as Spark and Flink. As we don't have support for Grouping Sets yet, this UDTF enables us to achieve the same result in two steps: first, apply the cube UDTF to create all combinations, then apply aggregations on the result.

Example:
SELECT cube_explode(as_array(col1, col2)) VAL FROM TEST;

Result:

VAL
[col1, col2]
[col1, null]
[null, col2]
[null, null]
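
To make the two-step idea concrete, here is a minimal plain-Java sketch of "cube first, then aggregate". It is an illustration only, not the KSQL implementation: the class name `CubeThenAggregateSketch`, the method `countByCubedKey`, and the sample rows are all hypothetical, and the aggregation shown is a simple count per cubed key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: not part of the PR.
final class CubeThenAggregateSketch {

  // Step 1: expand one row of d grouping columns into 2^d rows, where each
  // row keeps a different subset of the columns and nulls out the rest.
  static <T> List<List<T>> cube(final List<T> columns) {
    final int combinations = 1 << columns.size();
    final List<List<T>> result = new ArrayList<>(combinations);
    for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
      final List<T> row = new ArrayList<>(columns.size());
      for (int i = 0; i < columns.size(); i++) {
        row.add((bitmask & (1 << i)) == 0 ? null : columns.get(i));
      }
      result.add(row);
    }
    return result;
  }

  // Step 2: aggregate (here: count) over the cubed rows, grouped by the
  // cubed key. A null in the key stands for "any value" in that column.
  static <T> Map<List<T>, Integer> countByCubedKey(final List<List<T>> rows) {
    final Map<List<T>, Integer> counts = new LinkedHashMap<>();
    for (final List<T> row : rows) {
      for (final List<T> key : cube(row)) {
        counts.merge(key, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(final String[] args) {
    final List<List<String>> rows = Arrays.asList(
        Arrays.asList("US", "web"),
        Arrays.asList("US", "app"),
        Arrays.asList("UK", "web"));
    countByCubedKey(rows)
        .forEach((key, count) -> System.out.println(key + " -> " + count));
  }
}
```

In this sketch the key `[null, web]` collects every row whose second column is `web`, and `[null, null]` collects all rows, which is the kind of subtotal a GROUP BY CUBE would produce.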

Once we have support for variadic parameters and Object data type in UDTFs, I will change this to not take an array as parameter.

Testing done

Added unit test and QTT test

Reviewer checklist

Did not update docs yet!

vinothchandar

LGTM overall. Just a few rename/test suggestions.

vinothchandar · 2019-11-21T18:05:11Z

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/Cube.java

+import java.util.Collections;
+import java.util.List;
+
+@UdtfDescription(name = "cube", author = KsqlConstants.CONFLUENT_AUTHOR,


could we call this cube_explode, so that we could reserve cube as a keyword if we choose to add real support via grouping sets down the line?

This will also make it clearer that this is a UDTF, since explode is a well understood UDTF already across many projects.

Agree on reserving the cube name for the real thing. Perhaps this fn is more of a permute? Either way, it feels like kind of a hack compared with simply enhancing the recently-added explode fn to have a variant which takes multiple input column names (a variadic version), or another overload which takes a list of columns. Adding a whole new fn to achieve that feels like an unnecessary cognitive burden?

@blueedgenick The cube udtf actually does more than the explode one. Explode creates as many rows as tuples in the array. Cube creates 2^d rows.

perhaps i'm reading it wrong, that happens quite often :-) - but isn't that what explode would do, if you could pass it >1 array at once? (perhaps minus the permutations with null for one or more inputs)

happy to be wrong, perhaps a richer example would help?

Hey @blueedgenick! These null permutations are exactly what we want. I added an example and some links. Hope this makes it clearer.
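
The difference discussed in this thread can be sketched in plain Java. The two methods below are hypothetical stand-ins, not the KSQL functions themselves: `explode` emits one row per array element, while `cube` emits one row per subset of columns, with the omitted columns set to null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only.
final class ExplodeVersusCubeSketch {

  // Stand-in for explode: one output row per element of the input array.
  static <T> List<T> explode(final List<T> array) {
    return new ArrayList<>(array);
  }

  // Stand-in for cube: one output row per subset of the d input columns,
  // i.e. 2^d rows when no input value is null.
  static <T> List<List<T>> cube(final List<T> columns) {
    final int combinations = 1 << columns.size();
    final List<List<T>> result = new ArrayList<>(combinations);
    for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
      final List<T> row = new ArrayList<>(columns.size());
      for (int i = 0; i < columns.size(); i++) {
        row.add((bitmask & (1 << i)) == 0 ? null : columns.get(i));
      }
      result.add(row);
    }
    return result;
  }

  public static void main(final String[] args) {
    final List<String> input = Arrays.asList("a", "b", "c");
    System.out.println("explode rows: " + explode(input).size()); // 3 rows
    System.out.println("cube rows:    " + cube(input).size());    // 8 rows
  }
}
```

For three input values, the explode stand-in yields 3 rows and the cube stand-in yields 8, and every cube row has the same width as the input.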

vinothchandar · 2019-11-21T18:14:18Z

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/Cube.java

+    createAllCombinations(columns, pos + 1, current, result);
+
+    if (current.get(pos) == null) {
+      current.remove(pos);


seems like we remove in both if and else, for generating the null combination. can we do it once outside the block?

You could just pull forward the current.remove(pos); line and then do the else case by checking not null.
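
A sketch of what that suggestion might look like follows. The full body of `createAllCombinations` is not shown in this thread, so everything beyond the method name and the visible `current.remove(pos)` call is an assumed reconstruction: the base case, the `result` parameter type, and the `cube` wrapper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumed reconstruction for illustration; not the code from the PR.
final class RecursiveCubeSketch {

  static <T> List<List<T>> cube(final List<T> columns) {
    final List<List<T>> result = new ArrayList<>();
    final List<T> current = new ArrayList<>();
    createAllCombinations(columns, 0, current, result);
    return result;
  }

  private static <T> void createAllCombinations(
      final List<T> columns,
      final int pos,
      final List<T> current,
      final List<List<T>> result) {
    if (pos == columns.size()) {
      result.add(new ArrayList<>(current)); // snapshot of the finished row
      return;
    }

    // First branch: keep the input value at this position.
    current.add(columns.get(pos));
    createAllCombinations(columns, pos + 1, current, result);

    // Pulled forward: undo the first branch once, instead of in both
    // the if and the else block.
    current.remove(pos);

    // Second branch: null at this position. Skipped when the input value is
    // already null, since it would only repeat the rows produced above.
    if (columns.get(pos) != null) {
      current.add(null);
      createAllCombinations(columns, pos + 1, current, result);
      current.remove(pos); // backtrack the null branch
    }
  }

  public static void main(final String[] args) {
    System.out.println(cube(Arrays.asList("a", "b")));
  }
}
```

Because `pos` is an `int`, `current.remove(pos)` resolves to the index-based `List.remove(int)`, and at that point `pos` is always the last index of `current`.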

ksql-functional-tests/src/test/resources/query-validation-tests/cube.json

ksql-engine/src/test/java/io/confluent/ksql/function/udtf/CubeTest.java

AlanConfluent · 2019-11-21T21:30:04Z

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/Cube.java

+public class Cube {
+
+  @Udtf
+  public <T> List<List<T>> cube(final List<T> columns) {


Just for my own knowledge, what's the API for a UDTF? What types are used in practice for T, the columns?

agavra · 2019-11-22T23:11:17Z

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/Cube.java

+  }
+
+
+  private <T> void createAllCombinations(List<T> columns, int pos, List<T> current,


Here's a fun trick. I was wondering if we could write this iteratively without recursion, and this neat solution came to mind that avoids having 2^n stacks and (personally) I think is actually easier to read as well:

List<String> input = ImmutableList.of("1", "2", "3");
int combinations = 1 << input.size();
List<List<String>> result = new ArrayList<>(combinations);

// bitmask is a binary number where a set bit represents that
// the value at that index of input should be included - iterate
// backwards so that we start with a full row instead of an empty
// one
for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
  List<String> row = new ArrayList<>(input.size());
  for (int i = 0; i < input.size(); i++) {
    row.add((bitmask & (1 << i)) == 0 ? null : input.get(i));
  }
  result.add(row);
}

System.out.println(result);

output for this implementation:

[[1, 2, 3], [null, 2, 3], [1, null, 3], [null, null, 3], [1, 2, null], [null, 2, null], [1, null, null], [null, null, null]]

P.S. I spent a disgusting amount of time trying to figure this out, and was finally inspired by Guava's Sets#powerSet

That's really really cool! Thanks Almog! Incorporated your suggestion.
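
The PR description says duplicates caused by null input values are excluded, and the bitmask snippet as posted emits all 2^n rows regardless. One possible way to combine the two is sketched below. It is an assumption, not necessarily how the merged code does it: the class name `NullAwareBitmaskCubeSketch` and the `nullMask` variable are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical variant for illustration; the merged code may differ.
final class NullAwareBitmaskCubeSketch {

  static <T> List<List<T>> cube(final List<T> columns) {
    // Bits for positions whose input value is already null. Clearing such a
    // bit cannot change the row, so those bitmasks would only add duplicates.
    int nullMask = 0;
    for (int i = 0; i < columns.size(); i++) {
      if (columns.get(i) == null) {
        nullMask |= 1 << i;
      }
    }

    final int combinations = 1 << columns.size();
    final List<List<T>> result = new ArrayList<>();
    for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
      // Keep only bitmasks that have every null-position bit set, so each
      // distinct row is produced by exactly one bitmask.
      if ((bitmask & nullMask) != nullMask) {
        continue;
      }
      final List<T> row = new ArrayList<>(columns.size());
      for (int i = 0; i < columns.size(); i++) {
        row.add((bitmask & (1 << i)) == 0 ? null : columns.get(i));
      }
      result.add(row);
    }
    return result;
  }

  public static void main(final String[] args) {
    System.out.println(cube(Arrays.asList("1", null, "3")));
  }
}
```

With k null values among d inputs, this sketch produces 2^(d-k) rows. Like the original snippet, it relies on an `int` bitmask and so only suits small column counts.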

agavra

This LGTM! There are a few copy suggestions and test formatting, I'm glad the bitmask solution worked out 😂

ksql-engine/src/main/java/io/confluent/ksql/function/udtf/Cube.java

ksql-engine/src/test/java/io/confluent/ksql/function/udtf/CubeTest.java

vpapavas requested a review from a team as a code owner November 21, 2019 01:16

vpapavas requested a review from vinothchandar November 21, 2019 01:16

vinothchandar approved these changes Nov 21, 2019

agavra requested a review from a team November 21, 2019 18:54

agavra reviewed Nov 21, 2019

ksql-engine/src/test/java/io/confluent/ksql/function/udtf/CubeTest.java

AlanConfluent reviewed Nov 21, 2019

agavra reviewed Nov 22, 2019

agavra requested a review from a team November 22, 2019 23:11

vpapavas added 2 commits November 26, 2019 14:15

cube udtf sketch

20d0bf4

Add support for cube udtf change name, add test case fixed input for test

added almogs bitmask implementation

988fa27

bitmask

vpapavas force-pushed the cube-udf branch from 5a27a25 to 988fa27 Compare November 26, 2019 22:15

agavra approved these changes Nov 27, 2019

address almog's comments

3094fa8

vpapavas merged commit 6be8e7c into confluentinc:master Nov 27, 2019
