Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds udf regexp_split_to_array #5501

Merged
merged 9 commits into from
Jun 1, 2020

Conversation

AlanConfluent
Copy link
Member

Description

Adds the a new UDF regexp_split that splits a string into an array of substrings based on a regexp.

Fixes: #5492

Testing done

Wrote unit tests.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider renaming to regexp_split_to_array and adding a regexp_split_to_table UDTF? (As per Postges). Or at least renaming so that naming aligns when we add the other one later?

docs/developer-guide/ksqldb-reference/scalar-functions.md Outdated Show resolved Hide resolved
Comment on lines 31 to 36
description = "Splits a string into an array of substrings based on a regexp. "
+ "If the regexp is found at the beginning of the string, end of the string, or there "
+ "are contiguous matches in the string, then empty strings are added to the array. "
+ "If the regexp is not found, then the original string is returned as the only "
+ "element in the array. If the regexp is empty, then all characters in the string are "
+ "split.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the bit about empty adding empty elements from the syntax-reference.md in here to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the regexp is found at the beginning of the string, end of the string, or there
are contiguous matches in the string, then empty strings are added to the array.

It's there in the beginning.


private Pattern getPattern(final String regexp) {
try {
return Pattern.compile(regexp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compiling what might be the same pattern on every invocation ain't great, but I guess we can address this when we enhance the UDF framework to detect/support liternals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems reasonable since it would be great if the system knew that the same value would be passed in every time.

Comment on lines 89 to 106
"name": "regexp_split",
"statements": [
"CREATE STREAM TEST (K STRING KEY, input_string VARCHAR) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE STREAM OUTPUT AS SELECT K, REGEXP_SPLIT(input_string, '(ab|cd)') AS EXTRACTED FROM TEST;"
],
"inputs": [
{"topic": "test_topic", "value": {"input_string": "aabcda"}},
{"topic": "test_topic", "value": {"input_string": "aabdcda"}},
{"topic": "test_topic", "value": {"input_string": "zxy"}},
{"topic": "test_topic", "value": {"input_string": null}}
],
"outputs": [
{"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "", "a"]}},
{"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "d", "a"]}},
{"topic": "OUTPUT", "value": {"EXTRACTED": ["zxy"]}},
{"topic": "OUTPUT", "value": {"EXTRACTED": null}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if this test case covered the second param being null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


If the regular expression is found at the beginning or end
of the string, or there are contiguous delimiters,
then an empty space is added to the array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
then an empty space is added to the array.
an empty space is added to the array.

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a few suggestions.

@AlanConfluent
Copy link
Member Author

Maybe consider renaming to regexp_split_to_array and adding a regexp_split_to_table UDTF? (As per Postges). Or at least renaming so that naming aligns when we add the other one later?

I renamed it to regexp_split_to_array

@AlanConfluent AlanConfluent changed the title feat: Adds udf regexp_split feat: Adds udf regexp_split_to_array Jun 1, 2020
@AlanConfluent AlanConfluent merged commit 3766129 into confluentinc:master Jun 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add regexp_split function
3 participants