feat: Adds udf regexp_split_to_array #5501
ddebccd
to
b440e6e
Compare
Maybe consider renaming to regexp_split_to_array and adding a regexp_split_to_table UDTF (as per Postgres)? Or at least rename it now so that the naming aligns when we add the other one later?
description = "Splits a string into an array of substrings based on a regexp. "
    + "If the regexp is found at the beginning of the string, end of the string, or there "
    + "are contiguous matches in the string, then empty strings are added to the array. "
    + "If the regexp is not found, then the original string is returned as the only "
    + "element in the array. If the regexp is empty, then all characters in the string are "
    + "split.")
Add the bit about adding empty elements from syntax-reference.md in here too?
If the regexp is found at the beginning of the string, end of the string, or there
are contiguous matches in the string, then empty strings are added to the array.
It's there in the beginning.
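The splitting rules in the description string above can be illustrated with a minimal sketch built on `java.util.regex.Pattern.split` with a negative limit. This is not the PR's implementation: the class name `RegexpSplitSketch` is hypothetical, and returning null when either argument is null is an assumption based on the null-input test case in this thread. The empty-regexp case is left out on purpose, because plain `Pattern.split` may not behave the way the description specifies for it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the code under review.
final class RegexpSplitSketch {

  // Assumption: a null input or a null regexp yields a null result.
  static List<String> split(final String input, final String regexp) {
    if (input == null || regexp == null) {
      return null;
    }
    // A negative limit keeps trailing empty strings, which the
    // default limit of 0 would silently drop.
    return Arrays.asList(Pattern.compile(regexp).split(input, -1));
  }

  public static void main(final String[] args) {
    // Contiguous matches ("ab" directly followed by "cd") leave an empty element.
    System.out.println(split("aabcda", "(ab|cd)"));
    // A match at the start or the end adds an empty element on that side.
    System.out.println(split("abx", "(ab|cd)"));
    System.out.println(split("xab", "(ab|cd)"));
    // No match: the original string is the only element.
    System.out.println(split("zxy", "(ab|cd)"));
  }
}
```

The negative limit is the important design choice here: without it, a match at the end of the string would not produce the trailing empty element that the description promises.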
private Pattern getPattern(final String regexp) {
  try {
    return Pattern.compile(regexp);
Compiling what might be the same pattern on every invocation ain't great, but I guess we can address this when we enhance the UDF framework to detect/support literals.
Yeah, that seems reasonable since it would be great if the system knew that the same value would be passed in every time.
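Until the framework can detect literals, one possible stopgap for the repeated compilation discussed above is to memoize compiled patterns by their source string. The sketch below is only an illustration of that idea, not something this PR does: `PatternCacheSketch` and its `CACHE` field are hypothetical names, and the map is unbounded, so real code would need some eviction policy if the regexp can vary per row.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Hypothetical memoizing wrapper around Pattern.compile; illustration only.
final class PatternCacheSketch {

  // Unbounded on purpose to keep the sketch short (assumption: few distinct regexps).
  private static final ConcurrentMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

  // Compiles each distinct regexp once; later calls reuse the compiled Pattern.
  // A PatternSyntaxException from an invalid regexp propagates to the caller
  // and nothing is stored for that key. The regexp must not be null.
  static Pattern getPattern(final String regexp) {
    return CACHE.computeIfAbsent(regexp, Pattern::compile);
  }

  public static void main(final String[] args) {
    final Pattern first = getPattern("(ab|cd)");
    final Pattern second = getPattern("(ab|cd)");
    // Both calls return the same compiled instance.
    System.out.println(first == second);
  }
}
```

Compiled `Pattern` instances are immutable and safe to share between threads, which is what makes this kind of sharing workable.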
"name": "regexp_split",
"statements": [
  "CREATE STREAM TEST (K STRING KEY, input_string VARCHAR) WITH (kafka_topic='test_topic', value_format='JSON');",
  "CREATE STREAM OUTPUT AS SELECT K, REGEXP_SPLIT(input_string, '(ab|cd)') AS EXTRACTED FROM TEST;"
],
"inputs": [
  {"topic": "test_topic", "value": {"input_string": "aabcda"}},
  {"topic": "test_topic", "value": {"input_string": "aabdcda"}},
  {"topic": "test_topic", "value": {"input_string": "zxy"}},
  {"topic": "test_topic", "value": {"input_string": null}}
],
"outputs": [
  {"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "", "a"]}},
  {"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "d", "a"]}},
  {"topic": "OUTPUT", "value": {"EXTRACTED": ["zxy"]}},
  {"topic": "OUTPUT", "value": {"EXTRACTED": null}}
It would be nice if this test case covered the second param being null.
Done
If the regular expression is found at the beginning or end
of the string, or there are contiguous delimiters,
then an empty space is added to the array.
then an empty space is added to the array.
an empty space is added to the array.
LGTM, with a few suggestions.
I renamed it to |
Co-authored-by: Jim Galasyn <[email protected]> Co-authored-by: Andy Coates <[email protected]>
cb1f768
to
7104eb2
Compare
Description
Adds a new UDF, regexp_split, that splits a string into an array of substrings based on a regexp.
Fixes: #5492
Testing done
Wrote unit tests.
Reviewer checklist