Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add json_records, to_json_string, and json_concat UDFs #8632

Merged
merged 9 commits into from
Jan 27, 2022

Conversation

Gerrrr
Copy link
Contributor

@Gerrrr Gerrrr commented Jan 20, 2022

Description

This PR introduces 2 new UDFs - json_records and to_json_string::

  • json_records returns an array of structs with JSON keys and values. See the corresponding section in KLIP-59 for more details.
  • to_json_string serializes given argument into a JSON string. See the corresponding section in KLIP-59 for more details.
  • json_concat concatenates given JSON strings. See the corresponding section in KLIP-59 for more details.

This PR is rebased on top of #8600 to get UdfJsonMapper#parseJson - this is the reason why the change log contains is_json_string and friends.

Testing done

Added unit and functional tests.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@Gerrrr Gerrrr changed the title WIP: feat: Add json_records UDF feat: Add json_records and to_json_string UDFs Jan 21, 2022
@Gerrrr Gerrrr marked this pull request as ready for review January 21, 2022 13:30
@Gerrrr Gerrrr requested review from JimGalasyn and a team as code owners January 21, 2022 13:30
@Gerrrr Gerrrr force-pushed the feat-udf-json_records branch 3 times, most recently from 1e50b6a to 1a1a182 Compare January 21, 2022 13:35
Since: 0.25.0

```sql
json_records(json_string) -> Array<Struct<json_key:String, json_value:String>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may have already be covered earlier, but could this return a Map<Key,Key> instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion about return types is in the KLIP - #8550 (comment). If you prefer, we can discuss it there with a wider audience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice; after I started reviewing, I realized that there was a KLIP for this. (And that I'm late to the party; thanks for the link!)

I'm not sure where the discussion should be between the two places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are definitely not too late! We can continue the discussion in the KLIP or here. In my view, a Struct is a strictly better type than a Map<String, String> in this specific case because the former provides more comprehensive information and leaves less room for errors (and therefore error checking.)

For example, consider a user program that receives Map<String, String> from a function. A careful programmer has to check the size of the map and that it contains promised key-value pairs. On the other hand, the struct type is self-documenting - there will be only 2 keys of the given type. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I think that Map<String,String> and Array<Struct<String,String>> are equivalent types in some sense. If I have a map, I ought to be able to iterate over it like an Array, and if given an Array, one will invariably want to do something that looks like a map.get().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that your argument is about using Map<String,String> as the return type rather than a replacement for the Struct in the array, i.e. Array<Map<String, String>>. Sorry about that!

I agree that Map<String, String> has as much type information as Array<Struct<String,String>>, so it could be used instead. I really don't have any strong arguments one way or another. As a kinda weak argument, its neighbour function, json_keys, returns an array, so it is probably more consistent to return an array here as well assuming everything else is equal.

I don't have a strong opinion here, so if you think that Map<String, String is a better return type, let's discuss it with a wider audience in this thread in the KLIP.

json_records(json_string) -> Array<Struct<json_key:String, json_value:String>>
```

Given a string, parses it as a JSON object and returns a ksqlDB array of structs containing 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this may be likely out-of-scope for this ticket, but I'm wondering if there are future functions which would produce JSON which should be inverse of this.

E.g., I think there's interest in struct->json functions once other issues are resolved. Have we thought through the consistency issues there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAST is arguably the inverse of this. It does work for the primitive types. Containers are not yet supported, unfortunately:

// STRING:
.put(key(STRING, BOOLEAN), nonNullSafeCode("SqlBooleans.parseBoolean(%s.trim())"))
.put(key(STRING, INTEGER), nonNullSafeCode("Integer.parseInt(%s.trim())"))
.put(key(STRING, BIGINT), nonNullSafeCode("Long.parseLong(%s.trim())"))
.put(key(STRING, DECIMAL), CastEvaluator::castToDecimal)
.put(key(STRING, DOUBLE), nonNullSafeCode("SqlDoubles.parseDouble(%s.trim())"))
.put(key(STRING, TIMESTAMP), nonNullSafeCode("SqlTimeTypes.parseTimestamp(%s.trim())"))
.put(key(STRING, TIME), nonNullSafeCode("SqlTimeTypes.parseTime(%s.trim())"))
.put(key(STRING, DATE), nonNullSafeCode("SqlTimeTypes.parseDate(%s.trim())"))

Here is a test I tried to see how this will work out:

ksql> CREATE STREAM test (k STRING KEY, v INT)  WITH (kafka_topic='test_topic', value_format='JSON');
ksql> CREATE STREAM test (k STRING KEY, v INT)  WITH (kafka_topic='test_topic', value_format='JSON');

 Message
----------------
 Stream created
----------------
ksql> CREATE STREAM to_json AS SELECT k, to_json_STRING(v) AS JSON_v FROM test;

 Message
---------------------------------------
 Created query with ID CSAS_TO_JSON_17
---------------------------------------
ksql> insert into test (k, v) VALUES ('a', 1);
ksql> select * from to_json;
+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|K                                                                              |JSON_V                                                                         |
+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|a                                                                              |[1,2,3]                                                                        |
|a                                                                              |1                                                                              |

NOTE: the [1,2,3] is a leftover from a previous test that I didn't cut out to show how does an unsupported cast look.

ksql> select json_v, cast(json_v as int) as back_to_int from to_json;
+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|JSON_V                                                                         |BACK_TO_INT                                                                    |
+-------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
|[1,2,3]                                                                        |null                                                                           |
|1                                                                              |1                                                                              |

@jnh5y
Copy link
Member

jnh5y commented Jan 21, 2022

Ok, it looks like json_records and to_json_string ought to be close to being inverses of each other.

Is it worth verifying that? Or seeing if there are failures doing that?

@Gerrrr Gerrrr changed the title feat: Add json_records and to_json_string UDFs feat: Add json_records, to_json_string, and json_concat UDFs Jan 24, 2022
@Gerrrr Gerrrr force-pushed the feat-udf-json_records branch 3 times, most recently from 094382b to eeb1066 Compare January 25, 2022 18:14
@Gerrrr
Copy link
Contributor Author

Gerrrr commented Jan 25, 2022

@jnh5y @JimGalasyn this PR is ready for review.

Copy link
Member

@jnh5y jnh5y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!


Given N strings, parse them as JSON values and return a string representing their concatenation.
Concatenation rules are identical to PostgreSQL's [|| operator](https://www.postgresql.org/docs/14/functions-json.html):
* If all strings deserialize into JSON objects, then return an object with a union of the input key,
Copy link
Member

@JimGalasyn JimGalasyn Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If all strings deserialize into JSON objects, then return an object with a union of the input key,
* If all strings deserialize into JSON objects, return an object with a union of the input key.

Given N strings, parse them as JSON values and return a string representing their concatenation.
Concatenation rules are identical to PostgreSQL's [|| operator](https://www.postgresql.org/docs/14/functions-json.html):
* If all strings deserialize into JSON objects, then return an object with a union of the input key,
taking values from the last object in the case of duplicates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
taking values from the last object in the case of duplicates.
If there are duplicate objects, take values from the last object.

Concatenation rules are identical to PostgreSQL's [|| operator](https://www.postgresql.org/docs/14/functions-json.html):
* If all strings deserialize into JSON objects, then return an object with a union of the input key,
taking values from the last object in the case of duplicates.
* If all strings deserialize into JSON arrays, then return the result of array concatenation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If all strings deserialize into JSON arrays, then return the result of array concatenation.
* If all strings deserialize into JSON arrays, return the result of array concatenation.

* If all strings deserialize into JSON objects, then return an object with a union of the input key,
taking values from the last object in the case of duplicates.
* If all strings deserialize into JSON arrays, then return the result of array concatenation.
* If at least one of the deserialized values is not an object, then convert non-array inputs to a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If at least one of the deserialized values is not an object, then convert non-array inputs to a
* If at least one of the deserialized values is not an object, convert non-array inputs to a

* If all strings deserialize into JSON arrays, then return the result of array concatenation.
* If at least one of the deserialized values is not an object, then convert non-array inputs to a
single-element array and return the result of array concatenation.
* If at least one of the input strings is `NULL` or can't be deserialized as JSON, then return `NULL`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* If at least one of the input strings is `NULL` or can't be deserialized as JSON, then return `NULL`.
* If at least one of the input strings is `NULL` or can't be deserialized as JSON, return `NULL`.

single-element array and return the result of array concatenation.
* If at least one of the input strings is `NULL` or can't be deserialized as JSON, then return `NULL`.

Akin to PostgreSQL's `||` operator, this function merges only top-level object keys or arrays.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Akin to PostgreSQL's `||` operator, this function merges only top-level object keys or arrays.
Similar to the PostgreSQL `||` operator, this function merges only top-level object keys or arrays.

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a few suggestions.

@Gerrrr
Copy link
Contributor Author

Gerrrr commented Jan 27, 2022

Test failure is unrelated and should be fixed by #8664.

@Gerrrr Gerrrr merged commit 3b6d828 into master Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants