Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: klip-59: JSON functions #8550

Merged
merged 2 commits into from
Jan 27, 2022
Merged

docs: klip-59: JSON functions #8550

merged 2 commits into from
Jan 27, 2022

Conversation

Gerrrr
Copy link
Contributor

@Gerrrr Gerrrr commented Dec 22, 2021

Description

KLIP for adding new JSON functions.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@Gerrrr Gerrrr marked this pull request as ready for review December 22, 2021 15:30
@Gerrrr Gerrrr requested a review from a team as a code owner December 22, 2021 15:30
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks @Gerrrr

Comment on lines 114 to 115
Given a string, parses it as a JSON object and returns a ksqlDB array of two-element arrays
representing the top-level keys and values. Returns `NULL` if the string can't be interpreted as a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json objects don't have to have the same value type, right? what would happen if we passed in {"a": 1, "b": "X"} (subtext: ksqlDB arrays are strongly typed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the example below, I think the result in ksql types would be: Array[Array['a', '1'], Array['b', '"X"']]. Note: double-quotes surrounding the value of b. Essentially each inner value (meaning the second item in the nested arrays) in the result is equivalent to calling to_json_string on the JSON value input.

I assume the relevant the UDF's return type is Array<Array<string>>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, the return type should be Array<Array<string>>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the KLIP to call out the exact type of the result array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw: Why do we return an array of 2-element arrays? Would it be better to return an array of structs with two string field instead, ie, Array<Struct<json_key:String,json_value:String>>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SO answer helped me see that things could be tricky: https://stackoverflow.com/a/53529204

I don't know what to recommend here. This is definitely a sharp edge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about duplicate keys?

Duplicat keys is either not an issue or a way bigger issue, depending on how we look at them. Jackson does not support duplicate keys, as ObjectNode maintains its fields in LinkedHashMap that does not allow duplicates (link) and it is unlikely that they will support them in the near future (FasterXML/jackson-databind#2809).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the options are:

  • Go with Array<Struct<String,String>> (since that's the existing code and it'd cover things if there was a desire to support the JSON spec to the letter).
  • Switch to Map<String, String> (since this is what the library supports).

Thoughts? At this point, I am unsure what to suggest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a product call, if we should follow the spec or the limitations of Jackson... \cc @colinhicks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this offline and agreed not to support duplicate keys for now and proceed with Map<String, String> return type here. I updated the doc in 85de9ab.

Comment on lines +34 to +35
* A JSON datatype
* Binary JSON support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't immediately see any problems with omitting this from scope, but if we do ever introduce these would we expect the semantics of any of these to change at all (e.g. return a JSON type instead of a string or somethign else)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather introduce a separate set of functions or operators (see PostgreSQL syntax) for the JSON type instead of changing the return type of the existing functions. The reasons are:

  • We can provide functions for built-in JSON type with more elegant syntax AND avoid wrapping the output in strings.
  • Having a function that behaves differently given different argument types might confuse the users. Consider json_concat that kinda does it already. Its behavior description is three times longer than descriptions for any other proposed function, which is a good proxy for cognitive complexity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would add a JSON type, how would we name functions that accept JSON compared to the function names that are proposed in this KLIP? Just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about your question, I realized that function names we picked in this KLIP would actually be perfect for the JSON type. I propose to rename functions here in the following way:

  • is_json -> is_json_string. The former name would better correspond to a function that returns true if the argument is of JSON type.
  • json_array_length -> json_string_array_length
  • json_keys -> json_string_keys
  • json_records -> json_string_records
  • json_concat -> json_string_concat

Thoughts @mjsax @agavra @colinhicks @suhas-satish ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for having different function names depending on the argument type? From an implementation perspective, if we were to implement a JSON type, wouldn't this just mean adding an overload that accepts an argument of that type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have json_concat(string, string) -> string and json_concat(json, json) -> json it seems to be well defined?

I think that it might be confusing for users if a function returns a different type based on its arguments type.

Besides, this would be inconsistent with our current practice - e.g., we have ARRAY_UNION and MAP_UNION rather than a single function UNION that is overloaded for different containers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it might be confusing for users if a function returns a different type based on its arguments type.

Not really (at least not for me). And we do this already in ksqlDB. For example FLOOR() (https://github.com/confluentinc/ksql/blob/master/ksqldb-engine/src/main/java/io/confluent/ksql/function/udf/math/Floor.java) is defined for int,long,double,decimal and it's return type matches it's input type (many other function work similar).

I am not sure why we have ARRAY_UNION and MAP_UNION. It does not make too much sense to me to have both, instead of UNION() (even if UNION() is a little bit tricky as there is also the SQL UNION keyword -- even if we don't support UNION).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as the original author of MAP_UNION and ARRAY_UNION i will chime in to say that is exactly (one of) the reasons why: i.e. we don't want to introduce a function whose name is the same as a SQL keyword. Secondary reason is that it follows the naming convention established in other database-like systems with comparable functionality (Presto in this case, IIRC)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are you saying you would prefer an overloaded json_concat or to add json_string_concat?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should override json_concat for json and string types. Thank you for the discussion!

is_json("1") // returns true
is_json("\"abc\"") // returns true
is_json("null") // returns true
is_json("") // returns false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education: why does this return false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty string does not contain valid JSON because it can't be decoded into any of the primitive or structured types JSON can represent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's just not in the json spec? (I guess the spec could define what it mean if they wanted to, eg, "" == "null").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is not in the spec.

```
json_concat("{\"a\": 1}", "{\"b\": 2}") // returns "{\"a\": 1, \"b\": 2}"
json_concat("{\"a\": {\"3\": 4}}", "{\"a\": {\"5\": 6}}") // returns "{\"a\": {\"5\": 6}}"
json_concat("{\"a\": {\"5\": 6}}", "{\"a\": {\"3\": 4}}") // returns "{\"a\": {\"3\": 4}}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example seems to be redundant to the previous one in L165?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was non-intuitive to me. I would have expected union of json values as the result. ie, "{"a": [{"3": 4}, {"5": 6}] }". Is this too difficult and error-prone to implement and not how its done in postgres?

Copy link
Contributor Author

@Gerrrr Gerrrr Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjsax removed in 41e8e75.

@suhas-satish In my experience, this behavior for map union is widespread. Some examples are:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@suhas-satish I agree that it not totally intuitive, but we don't need to have two example for the same case. L165 and L166 are just the same.

json_concat("{\"a\": {\"3\": 4}}", "{\"a\": {\"5\": 6}}") // returns "{\"a\": {\"5\": 6}}"
json_concat("{\"a\": {\"5\": 6}}", "{\"a\": {\"3\": 4}}") // returns "{\"a\": {\"3\": 4}}"
json_concat("{}", "{}") // returns "{}"
json_concat("[1, 2]", "[3, 4]") // returns "[1,2,3,4]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this case is described via second bullet-point above -- but it's not really explicit. Should we update the bullet points above to be easier to follow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a separate bullet point for the case when both strings deserialize into lists in f519ea9.


Given any ksqlDB type returns the equivalent JSON string.

#### Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add examples for all types? Missing: boolean, time, date, timestamp.

What about custom types? (Guess it's straightforward, but maybe worth to mention).

Might also be good to have a few example with nested complex types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 4cd7926.

@Gerrrr
Copy link
Contributor Author

Gerrrr commented Jan 18, 2022

While implementing the first functions from this KLIP, I noticed that existing functions throw on deserialization errors. I updated the KLIP so that new functions are consistent with the existing ones - 515f8f5.

@mjsax
Copy link
Member

mjsax commented Jan 19, 2022

I guess from a user point of view, it does not really make a difference if a function returns null or throws an exception? SQL does not support exceptions, and errors (like exceptions) result in NULL for the user? -- For the purpose of a KLIP, it seems that throwing an exception is an implementation detail?

@Gerrrr
Copy link
Contributor Author

Gerrrr commented Jan 20, 2022

Good point @mjsax! I will revert the change.

### json_concat

```
json_concat(json_string, json_string) -> String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Aleks suggested I comment here.). Should/could this method take varargs?

As one thing to think about, should we have a UDAF to concat multiple entries? (Which would/should likely be a separate function.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great idea. ksqlDB's UDFs support varargs, so we can just change this method to json_concat(json_strings...) without any downsides.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the UDF frameworks supports it, sure!

@Gerrrr
Copy link
Contributor Author

Gerrrr commented Jan 27, 2022

I am going to merge the PR; happy to continue the discussion here and create follow-up PRs if necessary.

@Gerrrr Gerrrr merged commit 76f02b7 into master Jan 27, 2022
@jnh5y jnh5y deleted the docs-json-functions-klip branch February 10, 2022 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants