-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SQL functions for working with Maps and Arrays #1611
Add SQL functions for working with Maps and Arrays #1611
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @blueedgenick, I've left a few comments.
| ARRAY_DISTINCT | ``ARRAY_DISTINCT(array_col)`` | Returns an array of all the distinct values from | | ||
| | | the input array, or NULL if the input is NULL. | | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| ARRAY_EXCEPT | ``ARRAY_EXCEPT(array1, array2)`` | Returns an array of all the distinct values from | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this one be returning distinct values? Or just the array1 filtering out values in array2? That would make the function more single purpose. They can always compose it with ARRAY_DISTINCT if they want a distinct output.
@@ -983,13 +983,46 @@ Scalar functions | |||
+========================+============================================================+===================================================+ | |||
| ABS | ``ABS(col1)`` | The absolute value of a value. | | |||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | |||
| ARRAY_DISTINCT | ``ARRAY_DISTINCT(array_col)`` | Returns an array of all the distinct values from | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is order maintained? Might be worth documenting, (and testing!), order for this and other methods.
| | | array1 except for those also presnet in array2. | | ||
| | | NULL is returned if either input array is NULL. | | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| ARRAY_INTERSECT | ``ARRAY_INTERSECT(array1, array2)`` | Returns an array of all the distinct elements | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, I'm not sure this should be distinct. This is the intersect of the array, not set. ARRAY_INTERSECT([1,2,2,2,3],[2,2,3])
should, IMHO, return [2,2,3]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presto and Teradata have this similar array_intersect() function, and they specify that the returned array does not have duplicates (same for array_except). I think it makes sense to follow the same standard from both databases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, that's why it was this way to begin with ;)
| | | the rqeuested length or offset are negative the | | ||
| | | entire array is returend. | | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| ARRAY_UNION | ``ARRAY_UNION(array1, array2)`` | Returns an array of all the distinct values from | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, don't think this should be distinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, this is modelled after the comparable functions in other systems (teradata nand presto). This case is also more "normal" from the rdbms world - consider the difference between UNION
and UNION ALL
in standard sql (the first returns a distinct result list, the second requires an additional keyword to specify "please keep all the duplicates")
| ARRAY_UNION | ``ARRAY_UNION(array1, array2)`` | Returns an array of all the distinct values from | | ||
| | | all input arrays, or NULL if either input is NULL.| | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| ARRAYS_TO_MAP | ``ARRAYS_TO_MAP(key_array, map_array)`` | Creates a map from a passed array of keys and an | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would benefit from documentation of NULL behaviour, as you have for others.
String thisKey = inputKeys.get(i); | ||
Object thisValue = inputValues.get(i); | ||
if (thisValue != null) { | ||
outputMap.put(inputKeys.get(i), thisValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what your intention is here... the if/else branches both do this same thing.
Does your map type allow null keys? Good one to test & document!
} | ||
|
||
@SuppressWarnings("rawtypes") | ||
@Udf(description = "Returns the element at the specified index (counting from 0) in an array.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't SQL convention normally to be base-1, rather than base-0? Or has this changed with newer products?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes indeed - this was done just to align with the rest of ksql back at the time. seeing as we've subsequently updated other parts of the code to be 1-based, the same change should be applied here too
import io.confluent.ksql.function.udf.Udf; | ||
import io.confluent.ksql.function.udf.UdfDescription; | ||
|
||
@UdfDescription(name = "map_concat", author = "Confluent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's MAP_UNION
in the docs...
@Test | ||
public void shouldDistinctArray() { | ||
final List<Object> result = udf.distinct(Arrays.asList("foo", " ", "foo", "bar")); | ||
assertThat(result, containsInAnyOrder(" ", "bar", "foo")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should maintain order.
@SuppressWarnings("rawtypes") | ||
@Test | ||
public void shouldReturnNullForNullInput() { | ||
List result = udf.except(null, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you split into two tests, each with one null param, please? (It catches places where one null check is hiding a second NPE).
Note: this PR was put on hold some months ago due to issues with getting complex types (map/array/struct) into and out of the UDF invocation framework. I think a more recent PR has at least partially alleviated that situation, so i'd be happy to take another pass at this. I do also recall addressing several of @big-andy-coates 's code comments from above, not sure what happened to those fixes - likely i have them in a private branch somewhere. Will go dig those up. |
| | | those elements which are present in both inputs. | | ||
| | | NULL is returned if either input array is NULL. | | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| ARRAY_SLICE | ``ARRAY_SLICE(array_col, start, length)`` | Returns a subsection of an array, of requested | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presto and Terdata has a similar function called 'slice(x, start, length)' which does the same thing.
Snowflake has a similar function too called 'array_slice(x, from, to)' which does the same thing (but using end of array instead of length).
This function for KSQL looks pretty similar to Presto and Teradata. Should we use the same name 'slice' instead of 'array_slice'?
| ARRAYCONTAINS | ``ARRAYCONTAINS('[1, 2, 3]', 3)`` | Given JSON or AVRO array checks if a search | | ||
| | | value contains in it. | | ||
+------------------------+------------------------------------------------------------+---------------------------------------------------+ | ||
| CARDINALITY | ``CARDINALITY(array | map)`` | Returns the number of entries in the specified | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presto and Teradata has a similar function that returns the size of the array (no map). Snowflake has only 'array_size(x)' for the same result.
For KSQL, should we split this function into 'array_size(x)' and 'map_size(x)' for a better meaning? Or just 'size(x)', for both, size(array) or size(map)?
if (input == null) { | ||
return null; | ||
} | ||
return Lists.newArrayList(Sets.newHashSet(input)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Set will not preserve the order of the original array. What should the result be? Keeping the order (just removing duplicates), or ignoring the order of the array?
If could use a stream, like input.stream().distinct().collect(Collectors.toList()), which I think will preserve the order. But I read that will not perform very good. How do others databases work?
.filter(e -> !rhs.contains(e)) | ||
.collect(Collectors.toSet()); | ||
final List result = Lists.newArrayList(distinct); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be reduced to one line using the Collectors.toList() instead?
Like:
return lhs.stream().filter(e -> !rhs.contains(e)).collect(Collectors.toList());
Also, the above rhs.contains(e) is linear, so if you have a large array on rhs, then the performance won't be great. What about converting the rhs to a set, and then use the .contains() in the filter method?
investigation reveals this PR to still be blocked on #1791 , unable to bind UDFs which declare non-concrete parameter and return types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blueedgenic, Very cool stuff! This is a very old version of the Syntax Reference topic. Can you please move this content to the new topic at https://github.com/confluentinc/ksql/blob/master/docs/developer-guide/ksqldb-reference/scalar-functions.md? Note that all of the new ksqlDB content is in markdown. Thanks!
hey @JimGalasyn - apologies, i have no idea how you got requested to review this again, it's totally superseded by the 2 PRs mentioned just above (#5536 and #5548) which already added these functions nad their docs (to the right docs file!). I'll close out this dangling PR |
@blueedgenick Heh, no worries! |
Description
A new set of SQL functions to make it possible to manipulate columns of type map and array:
All of the above added to syntax guide.
Testing done
Unit tests and a quick manual test of some queries.
Reviewer checklist