feat: Add Cube UDTF #3935
LGTM overall. Just a few rename/test suggestions.
import java.util.Collections;
import java.util.List;

@UdtfDescription(name = "cube", author = KsqlConstants.CONFLUENT_AUTHOR,
could we call this cube_explode, so that we could reserve cube as a keyword if we choose to add real support via grouping sets down the line?
This will also make it clearer that this is a UDTF, since explode is a well understood UDTF already across many projects.
Agree on reserving the cube name for the real thing. Perhaps this fn is more of a permute? Either way, it feels like kind of a hack for simply enhancing the recently-added explode fn to have a variant which takes multiple input column names (variadic version) or another override which takes a list of columns? Adding a whole new fn to achieve that feels like an unnecessary cognitive burden?
@blueedgenick The cube udtf actually does more than the explode one. Explode creates as many rows as tuples in the array. Cube creates 2^d rows.
perhaps i'm reading it wrong, that happens quite often :-) - but isn't that what explode would do, if you could pass it >1 array at once? (perhaps minus the permutations with null for one or more inputs)
happy to be wrong, perhaps a richer example would help?
Hey @blueedgenick! These null permutations are exactly what we want. I added an example and some links. Hope this makes it clearer.
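To make the difference discussed in this thread concrete, here is a minimal standalone sketch. It is not the PR's code: the class `ExplodeVsCube` and the helpers `explode`, `cube`, and `expand` are hypothetical names chosen for illustration, and the sketch does not handle null values in the input. It contrasts an explode-style expansion (one row per array element) with a cube-style expansion (every column either kept or replaced by null, giving 2^d rows for d columns), written recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExplodeVsCube {

  // Explode-style expansion: one output row per element of the array.
  static <T> List<T> explode(List<T> values) {
    return new ArrayList<>(values);
  }

  // Cube-style expansion: 2^d rows, each column either kept or set to null.
  static <T> List<List<T>> cube(List<T> columns) {
    List<List<T>> out = new ArrayList<>();
    expand(columns, 0, new ArrayList<>(), out);
    return out;
  }

  // Hypothetical recursive helper: at each position, first keep the value,
  // then replace it with null, then backtrack.
  private static <T> void expand(
      List<T> columns, int pos, List<T> current, List<List<T>> out) {
    if (pos == columns.size()) {
      out.add(new ArrayList<>(current));
      return;
    }
    current.add(columns.get(pos));
    expand(columns, pos + 1, current, out);
    current.set(pos, null);
    expand(columns, pos + 1, current, out);
    current.remove(pos); // pos is an int, so this removes by index
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a", "b", "c");
    System.out.println(explode(input).size()); // 3 rows
    System.out.println(cube(input).size());    // 8 rows
    System.out.println(cube(Arrays.asList("a", "b")));
  }
}
```

For the two-column input, the last line prints `[[a, b], [a, null], [null, b], [null, null]]`: the rows containing null are the ones an explode over several arrays would not produce.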
createAllCombinations(columns, pos + 1, current, result);

if (current.get(pos) == null) {
  current.remove(pos);
seems like we remove in both if and else, for generating the null combination. can we do it once outside the block?
You could just pull forward the current.remove(pos); line and then do the else case by checking not null.
ksql-functional-tests/src/test/resources/query-validation-tests/cube.json
public class Cube {

  @Udtf
  public <T> List<List<T>> cube(final List<T> columns) {
Just for my own knowledge, what's the API for a UDTF? What types are used in practice for T, the columns?
  }

  private <T> void createAllCombinations(List<T> columns, int pos, List<T> current,
Here's a fun trick. I was wondering if we could write this iteratively without recursion, and this neat solution came to mind that avoids having 2^n stacks and (personally) I think is actually easier to read as well:
List<String> input = ImmutableList.of("1", "2", "3");
int combinations = 1 << input.size();
List<List<String>> result = new ArrayList<>(combinations);
// bitmask is a binary number where a set bit represents that
// the value at that index of input should be included - iterate
// backwards so that we start with a full row instead of an empty
// one
for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
  List<String> row = new ArrayList<>(input.size());
  for (int i = 0; i < input.size(); i++) {
    row.add((bitmask & (1 << i)) == 0 ? null : input.get(i));
  }
  result.add(row);
}
System.out.println(result);
output for this implementation:
[[1, 2, 3], [null, 2, 3], [1, null, 3], [null, null, 3], [1, 2, null], [null, 2, null], [1, null, null], [null, null, null]]
P.S. I spent a disgusting amount of time trying to figure this out, and was finally inspired by Guava's Sets#powerSet
That's really really cool! Thanks Almog! Incorporated your suggestion.
Add support for cube udtf change name, add test case fixed input for test
This LGTM! There's a few copy suggestions and test formatting, I'm glad the bitmask solution worked out 😂
Description
The cube UDTF takes as argument a list of d columns and creates up to 2^d rows (it excludes duplicates created by null values in the input). Normally, cube is an aggregate operator (an extension to Group By) used with multi-dimensional data. It is a popular feature supported by many RDBMSs as well as Spark and Flink. As we don't have support for Grouping Sets yet, this UDTF enables us to achieve the same result in two steps: first, apply the cube UDTF to create all combinations, then apply aggregations on the result.
Example:
SELECT cube_explode(as_array(col1, col2)) VAL FROM TEST;
Result:
Once we have support for variadic parameters and Object data type in UDTFs, I will change this to not take an array as parameter.
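The two-step approach described above can be sketched in plain Java as follows. This is an illustrative sketch under stated assumptions, not the UDTF's implementation: the class `CubeThenAggregate`, the helpers `cubeExplode` and `sumByCube`, and the sample data are all hypothetical, and the way duplicates from null inputs are skipped is one plausible reading of "up to 2^d rows".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CubeThenAggregate {

  // Step 1: expand one row's dimension columns into up to 2^d combinations.
  // Assumption: a mask that "keeps" a column whose input is already null
  // would repeat the row produced by the mask that drops it, so it is skipped.
  static <T> List<List<T>> cubeExplode(List<T> columns) {
    int d = columns.size();
    List<List<T>> rows = new ArrayList<>();
    for (int mask = (1 << d) - 1; mask >= 0; mask--) {
      List<T> row = new ArrayList<>(d);
      boolean duplicate = false;
      for (int i = 0; i < d; i++) {
        boolean keep = (mask & (1 << i)) != 0;
        if (keep && columns.get(i) == null) {
          duplicate = true;
          break;
        }
        row.add(keep ? columns.get(i) : null);
      }
      if (!duplicate) {
        rows.add(row);
      }
    }
    return rows;
  }

  // Step 2: aggregate (here: sum) a measure over every expanded key.
  static Map<List<String>, Integer> sumByCube(
      List<List<String>> dims, List<Integer> measures) {
    Map<List<String>, Integer> totals = new LinkedHashMap<>();
    for (int r = 0; r < dims.size(); r++) {
      for (List<String> key : cubeExplode(dims.get(r))) {
        totals.merge(key, measures.get(r), Integer::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    // Hypothetical sample rows: (country, channel) dimensions plus a measure.
    List<List<String>> dims = Arrays.asList(
        Arrays.asList("US", "web"),
        Arrays.asList("US", "app"));
    List<Integer> measures = Arrays.asList(10, 5);
    sumByCube(dims, measures)
        .forEach((key, total) -> System.out.println(key + " -> " + total));
  }
}
```

In this sketch a null in a key plays the role of "all values" for that dimension, so `[US, null]` and `[null, null]` both total 15, while `[US, web]` stays at 10 and `[US, app]` at 5. An input row such as `("a", null)` expands to only two rows, not four.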
Testing done
Added unit test and QTT test
Reviewer checklist
Did not update docs yet!