-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: klip-59: JSON functions #8550
Conversation
7aa72b4
to
bca057e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks @Gerrrr
Given a string, parses it as a JSON object and returns a ksqlDB array of two-element arrays | ||
representing the top-level keys and values. Returns `NULL` if the string can't be interpreted as a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json objects don't have to have the same value type, right? what would happen if we passed in {"a": 1, "b": "X"}
(subtext: ksqlDB arrays are strongly typed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the example below, I think the result in ksql types would be: Array[Array['a', '1'], Array['b', '"X"']]
. Note: double-quotes surrounding the value of b
. Essentially each inner value (meaning the second item in the nested arrays) in the result is equivalent to calling to_json_string
on the JSON value input.
I assume the relevant the UDF's return type is Array<Array<string>>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, the return type should be Array<Array<string>>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update the KLIP to call out the exact type of the result array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw: Why do we return an array of 2-element arrays? Would it be better to return an array of structs with two string field instead, ie, Array<Struct<json_key:String,json_value:String>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This SO answer helped me see that things could be tricky: https://stackoverflow.com/a/53529204
I don't know what to recommend here. This is definitely a sharp edge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about duplicate keys?
Duplicat keys is either not an issue or a way bigger issue, depending on how we look at them. Jackson does not support duplicate keys, as ObjectNode
maintains its fields in LinkedHashMap
that does not allow duplicates (link) and it is unlikely that they will support them in the near future (FasterXML/jackson-databind#2809).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the options are:
- Go with Array<Struct<String,String>> (since that's the existing code and it'd cover things if there was a desire to support the JSON spec to the letter).
- Switch to Map<String, String> (since this is what the library supports).
Thoughts? At this point, I am unsure what to suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a product call, if we should follow the spec or the limitations of Jackson... \cc @colinhicks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this offline and agreed not to support duplicate keys for now and proceed with Map<String, String>
return type here. I updated the doc in 85de9ab.
* A JSON datatype | ||
* Binary JSON support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't immediately see any problems with omitting this from scope, but if we do ever introduce these would we expect the semantics of any of these to change at all (e.g. return a JSON
type instead of a string or somethign else)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather introduce a separate set of functions or operators (see PostgreSQL syntax) for the JSON
type instead of changing the return type of the existing functions. The reasons are:
- We can provide functions for built-in
JSON
type with more elegant syntax AND avoid wrapping the output in strings. - Having a function that behaves differently given different argument types might confuse the users. Consider
json_concat
that kinda does it already. Its behavior description is three times longer than descriptions for any other proposed function, which is a good proxy for cognitive complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we would add a JSON
type, how would we name functions that accept JSON
compared to the function names that are proposed in this KLIP? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about your question, I realized that function names we picked in this KLIP would actually be perfect for the JSON
type. I propose to rename functions here in the following way:
is_json
->is_json_string
. The former name would better correspond to a function that returnstrue
if the argument is ofJSON
type.json_array_length
->json_string_array_length
json_keys
->json_string_keys
json_records
->json_string_records
json_concat
->json_string_concat
Thoughts @mjsax @agavra @colinhicks @suhas-satish ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for having different function names depending on the argument type? From an implementation perspective, if we were to implement a JSON type, wouldn't this just mean adding an overload that accepts an argument of that type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have json_concat(string, string) -> string and json_concat(json, json) -> json it seems to be well defined?
I think that it might be confusing for users if a function returns a different type based on its arguments type.
Besides, this would be inconsistent with our current practice - e.g., we have ARRAY_UNION and MAP_UNION rather than a single function UNION
that is overloaded for different containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it might be confusing for users if a function returns a different type based on its arguments type.
Not really (at least not for me). And we do this already in ksqlDB. For example FLOOR()
(https://github.com/confluentinc/ksql/blob/master/ksqldb-engine/src/main/java/io/confluent/ksql/function/udf/math/Floor.java) is defined for int,long,double,decimal and it's return type matches it's input type (many other function work similar).
I am not sure why we have ARRAY_UNION
and MAP_UNION
. It does not make too much sense to me to have both, instead of UNION()
(even if UNION()
is a little bit tricky as there is also the SQL UNION
keyword -- even if we don't support UNION
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as the original author of MAP_UNION and ARRAY_UNION i will chime in to say that is exactly (one of) the reasons why: i.e. we don't want to introduce a function whose name is the same as a SQL keyword. Secondary reason is that it follows the naming convention established in other database-like systems with comparable functionality (Presto in this case, IIRC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So are you saying you would prefer an overloaded json_concat
or to add json_string_concat
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should override json_concat
for json and string types. Thank you for the discussion!
is_json("1") // returns true | ||
is_json("\"abc\"") // returns true | ||
is_json("null") // returns true | ||
is_json("") // returns false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own education: why does this return false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty string does not contain valid JSON because it can't be decoded into any of the primitive or structured types JSON can represent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's just not in the json spec? (I guess the spec could define what it mean if they wanted to, eg, ""
== "null"
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is not in the spec.
``` | ||
json_concat("{\"a\": 1}", "{\"b\": 2}") // returns "{\"a\": 1, \"b\": 2}" | ||
json_concat("{\"a\": {\"3\": 4}}", "{\"a\": {\"5\": 6}}") // returns "{\"a\": {\"5\": 6}}" | ||
json_concat("{\"a\": {\"5\": 6}}", "{\"a\": {\"3\": 4}}") // returns "{\"a\": {\"3\": 4}}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example seems to be redundant to the previous one in L165?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was non-intuitive to me. I would have expected union of json values as the result. ie, "{"a": [{"3": 4}, {"5": 6}] }". Is this too difficult and error-prone to implement and not how its done in postgres?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@suhas-satish In my experience, this behavior for map union is widespread. Some examples are:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@suhas-satish I agree that it not totally intuitive, but we don't need to have two example for the same case. L165 and L166 are just the same.
json_concat("{\"a\": {\"3\": 4}}", "{\"a\": {\"5\": 6}}") // returns "{\"a\": {\"5\": 6}}" | ||
json_concat("{\"a\": {\"5\": 6}}", "{\"a\": {\"3\": 4}}") // returns "{\"a\": {\"3\": 4}}" | ||
json_concat("{}", "{}") // returns "{}" | ||
json_concat("[1, 2]", "[3, 4]") // returns "[1,2,3,4]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this case is described via second bullet-point above -- but it's not really explicit. Should we update the bullet points above to be easier to follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a separate bullet point for the case when both strings deserialize into lists in f519ea9.
|
||
Given any ksqlDB type returns the equivalent JSON string. | ||
|
||
#### Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add examples for all types? Missing: boolean, time, date, timestamp.
What about custom types? (Guess it's straightforward, but maybe worth to mention).
Might also be good to have a few example with nested complex types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 4cd7926.
While implementing the first functions from this KLIP, I noticed that existing functions throw on deserialization errors. I updated the KLIP so that new functions are consistent with the existing ones - 515f8f5. |
I guess from a user point of view, it does not really make a difference if a function returns |
Good point @mjsax! I will revert the change. |
5538b98
to
a3d1430
Compare
a3d1430
to
50efc86
Compare
### json_concat | ||
|
||
``` | ||
json_concat(json_string, json_string) -> String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Aleks suggested I comment here.). Should/could this method take varargs?
As one thing to think about, should we have a UDAF to concat multiple entries? (Which would/should likely be a separate function.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is great idea. ksqlDB's UDFs support varargs, so we can just change this method to json_concat(json_strings...)
without any downsides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the UDF frameworks supports it, sure!
I am going to merge the PR; happy to continue the discussion here and create follow-up PRs if necessary. |
Description
KLIP for adding new JSON functions.
Reviewer checklist