Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Cube UDTF #3935

Merged
merged 3 commits into from
Nov 27, 2019
Merged

feat: Add Cube UDTF #3935

merged 3 commits into from
Nov 27, 2019

Conversation

vpapavas
Copy link
Member

@vpapavas vpapavas commented Nov 21, 2019

Description

The cube UDTF takes as argument a list of d columns and creates up tp 2^d rows (excludes duplicates created by null values in the input). Normally, cube is as an aggregate operator (extension to Group By) used in multi-dimensional data. It is a popular feature supported by all RDBMSs and Spark, Flink As we don't have support for Grouping Sets yet, this UDTF enables us to achieve the same result in two steps: First, apply the cube UDTF to create all combinations, then apply aggregations on the result.

Example:
SELECT cube_explode(as_array(col1, col2)) VAL FROM TEST;

Result:

VAL
[col1 , col2]
[col1 , null]
[null ,col2]
[null , null]

Once we have support for variadic parameters and Object data type in UDTFs, I will change this to not take an array as parameter.

Testing done

Added unit test and QTT test

Reviewer checklist

Did not update docs yet!

@vpapavas vpapavas requested a review from a team as a code owner November 21, 2019 01:16
Copy link
Contributor

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Just few rename/test suggestions.

import java.util.Collections;
import java.util.List;

@UdtfDescription(name = "cube", author = KsqlConstants.CONFLUENT_AUTHOR,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we call this cube_explode , so that we could reserve cube as a keyword if we choose to add real support via grouping sets down the line?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also make it clearer that this is a UDTF, since explode is a well understood UDTF already across many projects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on reserving cube name for the real thing. Perhaps this fn is more of a permute ? Either way, it feels like kind of a hack for simply enhancing the recently-added explode fn to have a variant which takes multiple input column names (variadic version) or another override which takes a list of columns ? Adding a whole new fn to achieve that feels like an unnecessary cognitive burden ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blueedgenick The cube udtf actually does more than the explode one. Explode creates as many rows as tuples in the array. Cube creates 2^d rows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps i'm reading it wrong, that happens quite often :-) - but isn't that what explode would do, if you could pass it >1 array at once ? (perhaps minus the permutations with null for one or more inputs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to be wrong, perhaps a richer example would help?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @blueedgenick! These null permutations is exactly what we want. I added an example and some links. Hope this makes it more clear.

createAllCombinations(columns, pos + 1, current, result);

if (current.get(pos) == null) {
current.remove(pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we remove in both if and else, for generating the null combination. can we do it once outside the block?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just pull forward the current.remove(pos); line and then do the else case by checking not null.

@agavra agavra requested a review from a team November 21, 2019 18:54
public class Cube {

@Udtf
public <T> List<List<T>> cube(final List<T> columns) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own knowledge, what's the API for a UDTF? What types are used in practice for T, the columns?

}


private <T> void createAllCombinations(List<T> columns, int pos, List<T> current,
Copy link
Contributor

@agavra agavra Nov 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a fun trick. I was wondering if we could write this iteratively without recursion, and this neat solution came to mind that avoids having 2^n stacks and (personally) I think is actually easier to read as well:

    List<String> input = ImmutableList.of("1", "2", "3");
    int combinations = 1 << input.size();

    List<List<String>> result = new ArrayList<>(combinations);
    // bitmask is a binary number where a set bit represents that
    // the value at that index of input should be included - iterate
    // backwards so that we start with a full row instead of an empty
    // one
    for (int bitmask = combinations - 1; bitmask >= 0; bitmask--) {
      List<String> row = new ArrayList<>(input.size());
      for (int i = 0; i < input.size(); i++) {
        row.add((bitmask & (1 << i)) == 0 ? null : input.get(i));
      }
      result.add(row);
    }

    System.out.println(result);

output for this implementation:

[[1, 2, 3], [null, 2, 3], [1, null, 3], [null, null, 3], [1, 2, null], [null, 2, null], [1, null, null], [null, null, null]]

P.S. I spent a disgusting amount of time trying to figure this out, and was finally inspired by Guava's Sets#powerSet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's really really cool! Thanks Almog! Incorporated your suggestion.

@agavra agavra requested a review from a team November 22, 2019 23:11
Add support for cube udtf

change name, add test case

fixed input for test
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! There's a few copy suggestions and test formatting, I'm glad the bitmask solution worked out 😂

@vpapavas vpapavas merged commit 6be8e7c into confluentinc:master Nov 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants