Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON Case Sensitivity needs to be addressed #3666

Closed
agavra opened this issue Oct 24, 2019 · 3 comments
Closed

JSON Case Sensitivity needs to be addressed #3666

agavra opened this issue Oct 24, 2019 · 3 comments
Assignees

Comments

@agavra
Copy link
Contributor

agavra commented Oct 24, 2019

Currently, this is the behavior when deserializing JSON in KsqlJsonDeserializer:

  • If there's an exact case-sensitive match, this does nothing because the value isn't used
  • If there's only one case-insensitive match and no case-sensitive match (e.g. schema foo VARCHAR value {"foo": "bar"} then it maps field foo to "bar".
  • If there's more than one case-insensitive match and no case-sensitive match for either (e.g. schema foo VARCHAR and value {"foo": "bar", "fOO": "baz"}) then the behavior is undefined - it will choose the latter of the two

I think this behavior is probably OK. It's unfortunate that JSON allows case sensitive field names because we have two options:

  1. We force case sensitive names for KSQL when using JSON. This is crappy user experience for the 99% of use cases that don't hit this issue
  2. We accept this limitation, but have the workaround that you can specify case sensitive names if you need to.
  3. We don't accept the case-insensitive fields when it's ambiguous (throw an error) but accept it in all other scenarios. The workaround in 2 is still available.

I prefer the third because it makes the simple things simple and the complicated things possible, but it's backwards incompatible technically - I think it might be OK because we'd be changing a super buggy behavior. I'll create a separate PR for the test and change.

Originally posted by @agavra in #3588

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Oct 25, 2019

  • We force case sensitive names for KSQL when using JSON. This is crappy user experience for the 99% of use cases that don't hit this issue

That's not going to float ;)

  • We accept this limitation, but have the workaround that you can specify case sensitive names if you need to.

Surely, this falls under the remit of the quoted identifier work? As part of that it's probably worth checking that if a user supplies a quoted column name that we pick the right one.

  • We don't accept the case-insensitive fields when it's ambiguous (throw an error) but accept it in all other scenarios. The workaround in 2 is still available.

This is not really an option. We don't know what the data looks like when the user is providing us with the schema. What's more, the data can evolve and change over time. So maybe the first record we see only has foo in it, but the next record has foo and Foo.

Also, while I'm a fan of 'fail early and fail hard' for many things, I don't think throwing an error here is the right option. If an upstream system suddenly starts publishing a second case-variant of one of the fields you're using, you'll be mighty pissed if KSQL suddenly starts rejecting all the records.

I don't think there is a perfect answer here. But I'd probably think about going with the approach that:

  • if the user supplies a quoted column name, then only that case should match. (Same for avro).
  • If the user supplies an unquoted column name, then it should match one of the fields. Which field should be deterministic, e.g. the last encountered (as it is now), or maybe first/last given some other sort order - though I'm not sure what a different sort order would give us above 'the last one we encounter'.

Does our recent work on quoted identifiers get us anywhere close to these two points?

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Oct 25, 2019

Also.. doesn't the same issue exist in Avro or does Avro have more strict naming rules?

Regardless, we should consider fixing this in a more generic way as other formats may encounter the same issues.

@agavra
Copy link
Contributor Author

agavra commented Oct 25, 2019

Surely, this falls under the remit of the quoted identifier work? As part of that it's probably worth checking that if a user supplies a quoted column name that we pick the right one.

I've got a test for this and will publish a PR that shows it!

Does our recent work on quoted identifiers get us anywhere close to these two points?

The first point (if the user supplies a quoted column name, then only that case should match) is true except if you supply a quoted column name that is all uppercase. I'm not sure there is a good way to solve that problem efficiently.

Also, while I'm a fan of 'fail early and fail hard' for many things, I don't think throwing an error here is the right option. If an upstream system suddenly starts publishing a second case-variant of one of the fields you're using, you'll be mighty pissed if KSQL suddenly starts rejecting all the records.

You might also be pissed of KSQL started picking that one up and didn't tell you 😉

Also.. doesn't the same issue exist in Avro or does Avro have more strict naming rules?

I think you are right there. For some reason i assumed Avro wasn't case sensitive but it looks like it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants