feat: Adds udf regexp_split_to_array #5501

AlanConfluent · 2020-05-28T19:22:39Z

Description

Adds a new UDF, regexp_split, that splits a string into an array of substrings based on a regexp.

Fixes: #5492

Testing done

Wrote unit tests.

Reviewer checklist

Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

big-andy-coates

Maybe consider renaming to regexp_split_to_array and adding a regexp_split_to_table UDTF? (As per Postgres). Or at least renaming so that naming aligns when we add the other one later?

docs/developer-guide/ksqldb-reference/scalar-functions.md

big-andy-coates · 2020-06-01T11:02:27Z

ksqldb-engine/src/main/java/io/confluent/ksql/function/udf/string/RegexpSplit.java

+    description = "Splits a string into an array of substrings based on a regexp. "
+        + "If the regexp is found at the beginning of the string, end of the string, or there "
+        + "are contiguous matches in the string, then empty strings are added to the array. "
+        + "If the regexp is not found, then the original string is returned as the only "
+        + "element in the array. If the regexp is empty, then all characters in the string are "
+        + "split.")


Add the bit about adding empty elements from the syntax-reference.md in here too?

If the regexp is found at the beginning of the string, end of the string, or there
are contiguous matches in the string, then empty strings are added to the array.

It's there in the beginning.
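
For readers following this thread, here is a minimal sketch of the semantics the description string spells out, written against plain `java.util.regex`. It is not the code from this PR: the class `RegexpSplitSketch` and its `split` method are hypothetical names, returning null when either argument is null is an assumption, and the special case for an empty regexp is just one way to get the "all characters are split" behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the RegexpSplit class from this PR.
final class RegexpSplitSketch {

  private RegexpSplitSketch() {
  }

  static List<String> split(final String input, final String regexp) {
    // Assumption: a null in either argument yields a null result.
    if (input == null || regexp == null) {
      return null;
    }

    final Pattern pattern = Pattern.compile(regexp);

    if (regexp.isEmpty()) {
      // With the default limit of 0, trailing empty strings are dropped,
      // so a non-empty input yields one element per character.
      return Arrays.asList(pattern.split(input));
    }

    // A negative limit keeps trailing empty strings, so matches at the
    // start, at the end, or back to back all produce empty elements.
    return Arrays.asList(pattern.split(input, -1));
  }

  public static void main(final String[] args) {
    System.out.println(split("aabcda", "(ab|cd)"));
    System.out.println(split("aabdcda", "(ab|cd)"));
    System.out.println(split("zxy", "(ab|cd)"));
  }
}
```

Under these assumptions the sketch returns `["a", "", "a"]`, `["a", "d", "a"]` and `["zxy"]` for the three non-null inputs used in the split.json test in this PR.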

big-andy-coates · 2020-06-01T11:05:09Z

ksqldb-engine/src/main/java/io/confluent/ksql/function/udf/string/RegexpSplit.java

+
+  private Pattern getPattern(final String regexp) {
+    try {
+      return Pattern.compile(regexp);


Compiling what might be the same pattern on every invocation ain't great, but I guess we can address this when we enhance the UDF framework to detect/support literals.

Yeah, that seems reasonable since it would be great if the system knew that the same value would be passed in every time.
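
Until the framework can detect literals, one possible stopgap is to memoize compiled patterns inside the UDF instance. The sketch below is only an illustration of that idea, not something this PR does: `PatternCacheSketch` is a hypothetical class, and `IllegalArgumentException` stands in for whatever exception type the engine would actually use for a bad regexp.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration only; not part of this PR.
final class PatternCacheSketch {

  // Compiled patterns keyed by their source regexp.
  private final Map<String, Pattern> cache = new ConcurrentHashMap<>();

  public Pattern getPattern(final String regexp) {
    // Compiles on first use; later calls with the same regexp reuse the result.
    // If compilation throws, no entry is recorded for that regexp.
    return cache.computeIfAbsent(regexp, PatternCacheSketch::compile);
  }

  private static Pattern compile(final String regexp) {
    try {
      return Pattern.compile(regexp);
    } catch (final PatternSyntaxException e) {
      // Assumption: a generic exception type stands in for the engine's own.
      throw new IllegalArgumentException("Invalid regular expression: " + regexp, e);
    }
  }
}
```

The cache here is unbounded, so it only helps when the regexp is a literal or takes a small number of distinct values. If the regexp comes from a column with many distinct values, a bounded cache would be the safer choice.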

big-andy-coates · 2020-06-01T11:05:33Z

ksqldb-functional-tests/src/test/resources/query-validation-tests/split.json

+      "name": "regexp_split",
+      "statements": [
+        "CREATE STREAM TEST (K STRING KEY, input_string VARCHAR) WITH (kafka_topic='test_topic', value_format='JSON');",
+        "CREATE STREAM OUTPUT AS SELECT K, REGEXP_SPLIT(input_string, '(ab|cd)') AS EXTRACTED FROM TEST;"
+      ],
+      "inputs": [
+        {"topic": "test_topic", "value": {"input_string": "aabcda"}},
+        {"topic": "test_topic", "value": {"input_string": "aabdcda"}},
+        {"topic": "test_topic", "value": {"input_string": "zxy"}},
+        {"topic": "test_topic", "value": {"input_string": null}}
+      ],
+      "outputs": [
+        {"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "", "a"]}},
+        {"topic": "OUTPUT", "value": {"EXTRACTED": ["a", "d", "a"]}},
+        {"topic": "OUTPUT", "value": {"EXTRACTED": ["zxy"]}},
+        {"topic": "OUTPUT", "value": {"EXTRACTED": null}}


would be nice if this test case covered the second param being null.

JimGalasyn · 2020-06-01T18:53:04Z

docs/developer-guide/ksqldb-reference/scalar-functions.md

+
+If the regular expression is found at the beginning or end
+of the string, or there are contiguous delimiters,
+then an empty space is added to the array.


Suggested change

then an empty space is added to the array.

an empty space is added to the array.

JimGalasyn

LGTM, with a few suggestions.

AlanConfluent · 2020-06-01T21:52:24Z

Maybe consider renaming to regexp_split_to_array and adding a regexp_split_to_table UDTF? (As per Postgres). Or at least renaming so that naming aligns when we add the other one later?

I renamed it to regexp_split_to_array

Co-authored-by: Jim Galasyn <[email protected]> Co-authored-by: Andy Coates <[email protected]>

AlanConfluent requested review from a team and JimGalasyn as code owners May 28, 2020 19:22

AlanConfluent force-pushed the udf_regexp_split branch from ddebccd to b440e6e Compare May 29, 2020 17:48

big-andy-coates approved these changes Jun 1, 2020

JimGalasyn reviewed Jun 1, 2020

docs/developer-guide/ksqldb-reference/scalar-functions.md (outdated)

JimGalasyn reviewed Jun 1, 2020

docs/developer-guide/ksqldb-reference/scalar-functions.md (outdated)

JimGalasyn reviewed Jun 1, 2020

JimGalasyn approved these changes Jun 1, 2020

AlanConfluent changed the title ~~feat: Adds udf regexp_split~~ feat: Adds udf regexp_split_to_array Jun 1, 2020

AlanConfluent and others added 9 commits June 1, 2020 15:31

feat: Adds udf regexp_split

e16b18e

Added test case

dc3e90c

Makes linter pass

1fb25a5

Adds doc and more tests

483e377

Historical plans

947c339

Apply suggestions from code review

532134c

Co-authored-by: Jim Galasyn <[email protected]> Co-authored-by: Andy Coates <[email protected]>

Feedback

a029a78

Fix doc

d81ed46

Fix doc again

7104eb2

AlanConfluent force-pushed the udf_regexp_split branch from cb1f768 to 7104eb2 Compare June 1, 2020 22:36

AlanConfluent merged commit 3766129 into confluentinc:master Jun 1, 2020
