JSON Case Sensitivity needs to be addressed #3666

agavra · 2019-10-24T18:44:12Z

Currently, this is the behavior when deserializing JSON in KsqlJsonDeserializer:

If there's an exact case-sensitive match, this does nothing because the value isn't used
If there's only one case-insensitive match and no case-sensitive match (e.g. schema foo VARCHAR value {"foo": "bar"} then it maps field foo to "bar".
If there's more than one case-insensitive match and no case-sensitive match for either (e.g. schema foo VARCHAR and value {"foo": "bar", "fOO": "baz"}) then the behavior is undefined - it will choose the latter of the two

I think this behavior is probably OK. It's unfortunate that JSON allows case sensitive field names because we have two options:

We force case sensitive names for KSQL when using JSON. This is crappy user experience for the 99% of use cases that don't hit this issue
We accept this limitation, but have the workaround that you can specify case sensitive names if you need to.
We don't accept the case-insensitive fields when it's ambiguous (throw an error) but accept it in all other scenarios. The workaround in 2 is still available.

I prefer the third because it makes the simple things simple and the complicated things possible, but it's backwards incompatible technically - I think it might be OK because we'd be changing a super buggy behavior. I'll create a separate PR for the test and change.

Originally posted by @agavra in #3588

The text was updated successfully, but these errors were encountered:

big-andy-coates · 2019-10-25T14:38:43Z

We force case sensitive names for KSQL when using JSON. This is crappy user experience for the 99% of use cases that don't hit this issue

That's not going to float ;)

We accept this limitation, but have the workaround that you can specify case sensitive names if you need to.

Surely, this falls under the remit of the quoted identifier work? As part of that it's probably worth checking that if a user supplies a quoted column name that we pick the right one.

We don't accept the case-insensitive fields when it's ambiguous (throw an error) but accept it in all other scenarios. The workaround in 2 is still available.

This is not really an option. We don't know what the data looks like when the user is providing us with the schema. What's more, the data can evolve and change over time. So maybe the first record we see only has foo in it, but the next record has foo and Foo.

Also, while I'm a fan of 'fail early and fail hard' for many things, I don't think throwing an error here is the right option. If an upstream system suddenly starts publishing a second case-variant of one of the fields you're using, you'll be mighty pissed if KSQL suddenly starts rejecting all the records.

I don't think there is a perfect answer here. But I'd probably think about going with the approach that:

if the user supplies a quoted column name, then only that case should match. (Same for avro).
If the user supplies an unquoted column name, then it should match one of the fields. Which field should be deterministic, e.g. the last encountered (as it is now), or maybe first/last given some other sort order - though I'm not sure what a different sort order would give us above 'the last one we encounter'.

Does our recent work on quoted identifiers get us anywhere close to these two points?

big-andy-coates · 2019-10-25T14:42:23Z

Also.. doesn't the same issue exist in Avro or does Avro have more strict naming rules?

Regardless, we should consider fixing this in a more generic way as other formats may encounter the same issues.

agavra · 2019-10-25T15:04:56Z

Surely, this falls under the remit of the quoted identifier work? As part of that it's probably worth checking that if a user supplies a quoted column name that we pick the right one.

I've got a test for this and will publish a PR that shows it!

Does our recent work on quoted identifiers get us anywhere close to these two points?

The first point (if the user supplies a quoted column name, then only that case should match) is true except if you supply a quoted column name that is all uppercase. I'm not sure there is a good way to solve that problem efficiently.

Also, while I'm a fan of 'fail early and fail hard' for many things, I don't think throwing an error here is the right option. If an upstream system suddenly starts publishing a second case-variant of one of the fields you're using, you'll be mighty pissed if KSQL suddenly starts rejecting all the records.

You might also be pissed of KSQL started picking that one up and didn't tell you 😉

Also.. doesn't the same issue exist in Avro or does Avro have more strict naming rules?

I think you are right there. For some reason i assumed Avro wasn't case sensitive but it looks like it is.

agavra mentioned this issue Oct 24, 2019

feat: support numeric json serde for decimals #3588

Merged

2 tasks

agavra self-assigned this Oct 24, 2019

agavra mentioned this issue Oct 28, 2019

JSON stream/table with non-uppercase field names returns nulls #2551

Closed

agavra closed this as completed Sep 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON Case Sensitivity needs to be addressed #3666

JSON Case Sensitivity needs to be addressed #3666

agavra commented Oct 24, 2019

big-andy-coates commented Oct 25, 2019 •

edited

Loading

big-andy-coates commented Oct 25, 2019 •

edited

Loading

agavra commented Oct 25, 2019

JSON Case Sensitivity needs to be addressed #3666

JSON Case Sensitivity needs to be addressed #3666

Comments

agavra commented Oct 24, 2019

big-andy-coates commented Oct 25, 2019 • edited Loading

big-andy-coates commented Oct 25, 2019 • edited Loading

agavra commented Oct 25, 2019

big-andy-coates commented Oct 25, 2019 •

edited

Loading

big-andy-coates commented Oct 25, 2019 •

edited

Loading