
fix: standardize KSQL up-casting #3516

Merged

Conversation

Contributor

@big-andy-coates big-andy-coates commented Oct 9, 2019

Description

The logic controlling which types can be up-cast to another, e.g. up-casting an INT to a BIGINT, was spread across different parts of the code base, had no tests to ensure all parts were consistent, and hence was inconsistent.

Up-casting is primarily used during arithmetic binary operations, e.g. adding two numbers together, and when trying to coerce user-supplied values in INSERT VALUES statements to the required SQL types.

The numeric up-casting rules for KSQL are:

  • INT can be up-cast to BIGINT
  • BIGINT can be up-cast to DECIMAL
  • DECIMAL can be up-cast to DOUBLE.
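As a rough sketch (not the actual KSQL source; the enum members, their declaration order, and the assumption that the chain applies transitively, e.g. INT -> DOUBLE, are mine), the rules boil down to something like:

```java
// Sketch only: every type can be up-cast to itself; only numeric types can be
// widened to other numeric types, following INT -> BIGINT -> DECIMAL -> DOUBLE.
enum SqlBaseType {
  BOOLEAN, INTEGER, BIGINT, DECIMAL, DOUBLE, STRING, ARRAY, MAP, STRUCT;

  public boolean isNumber() {
    return this == INTEGER || this == BIGINT || this == DECIMAL || this == DOUBLE;
  }

  public boolean canUpCast(final SqlBaseType to) {
    // Numeric members are declared in widening order, so ordinal comparison works.
    return this == to
        || (isNumber() && to.isNumber() && this.ordinal() <= to.ordinal());
  }
}
```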

In the existing code:

  • SqlBaseType has a canUpCast method, but it doesn't take DECIMALs into account.
  • SqlBaseType has an isNumber method, but it doesn't treat DECIMAL as a numeric type.
  • SchemaUtil has the logic on how to resolve the resulting (connect) schema given an Operation and two input types.
  • DefaultSqlValueCoercer allowed coercion of DOUBLE -> DECIMAL, which is against our up-casting rules.

This PR looks to make the code a bit more object oriented and hopefully better structured.

With this change:

  • SqlBaseType's canUpCast and isNumber methods now correctly handle DECIMAL.
  • Any type can be up-cast to itself. Only numeric types can be up-cast to other types, and the rules are encapsulated in SqlBaseType.canUpCast.
  • The logic on how to resolve the resulting SQL type given an Operation and two input types now lives in Operation, making use of the decimal-specific handling which now lives in SqlDecimal.
  • The SqlDecimal type an INT or BIGINT is up-cast to is now stored in SqlTypes.
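For illustration only (the constant names and the SqlDecimal.of factory are assumptions for this sketch, not necessarily the actual SqlTypes members), those widened decimal shapes follow from the digit counts of the integer types: a 32-bit INT needs at most 10 digits, a 64-bit BIGINT at most 19:

```java
// Hypothetical sketch: the smallest decimals that can hold any INT / BIGINT value.
public final class SqlTypes {
  public static final SqlDecimal INT_UPCAST_TO_DECIMAL = SqlDecimal.of(10, 0);    // 2_147_483_647 has 10 digits
  public static final SqlDecimal BIGINT_UPCAST_TO_DECIMAL = SqlDecimal.of(19, 0); // 9_223_372_036_854_775_807 has 19 digits
}
```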

However, the main benefit of this commit is the addition of tests in DefaultSqlValueCoercer and OperatorTest to ensure that these two classes follow the up-casting rules.

In the future we can look to move more behaviour into our type system.

Testing done

Suitable tests added, including awesome metatests to ensure different parts of the system are consistent.
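A property-style sketch of what such a metatest might look like (illustrative only; it assumes the SqlBaseType API described above is imported, and the real tests live alongside DefaultSqlValueCoercer and Operator):

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;

import org.junit.Test;

// Assumes the project's SqlBaseType enum (canUpCast / isNumber) is on the classpath.
public class UpCastRulesMetaTest {

  @Test
  public void shouldUpCastEveryTypeToItself() {
    for (final SqlBaseType type : SqlBaseType.values()) {
      assertThat(type + " should up-cast to itself", type.canUpCast(type), is(true));
    }
  }

  @Test
  public void shouldOnlyUpCastNumbersToOtherTypes() {
    for (final SqlBaseType from : SqlBaseType.values()) {
      for (final SqlBaseType to : SqlBaseType.values()) {
        if (from != to && from.canUpCast(to)) {
          // Any cross-type widening must be between two numeric types.
          assertThat(from + " is numeric", from.isNumber(), is(true));
          assertThat(to + " is numeric", to.isNumber(), is(true));
        }
      }
    }
  }
}
```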

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Outstanding discussion points on decimals (non-blocking):

  • Scale and precision of operations like division
  • Implicit conversion of DECIMAL to DOUBLE
  • STRING -> DECIMAL implicit parsing

@big-andy-coates big-andy-coates requested a review from a team as a code owner October 9, 2019 17:09
@@ -47,7 +47,7 @@
     @JsonProperty("window") final WindowData window
 ) {
   this.topicName = topicName == null ? "" : topicName;
-  this.key = key == null ? "" : key;
+  this.key = key;
Contributor Author

In case you're wondering... a null key is valid and so we shouldn't be converting it to an empty string. Without this change the new tests in insert-values.json would fail.

@purplefox
Contributor

purplefox commented Oct 10, 2019

A general question: When upcasting from DECIMAL to DOUBLE how do we deal with cases where there will be a loss of precision? (I.e. some values that can be represented exactly as decimals can't be represented exactly as doubles). Do we just silently do the conversion, or do we warn?

Perhaps we could log a warning when the query is parsed "Possible loss of precision in conversion from DECIMAL->DOUBLE"

Contributor

@purplefox purplefox left a comment

Perhaps I am misreading the intent of the code, but regarding the approach taken here, there seems to be an assumption that it's possible to calculate the required scale and precision for a decimal to hold the result of an operation based just on the scale and precision of the operands.

I don't think that holds in general (divide is an example); the actual values need to be known.

I'm thinking that for an expression involving a decimal we should always assume the maximum possible scale and precision throughout the calculation, then only round right at the end, if necessary, when the user has specified a specific decimal scale and precision for a column type.

+ Math.max(left.precision - left.scale, right.precision - right.scale)
+ 1;

final int scale = Math.max(left.scale, right.scale);
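Worked example from the snippet above, assuming the truncated first term of the precision expression is Math.max(left.scale, right.scale) (the usual SQL Server rule for addition): for DECIMAL(10,2) + DECIMAL(8,4), precision = max(2,4) + max(10-2, 8-4) + 1 = 4 + 8 + 1 = 13 and scale = max(2,4) = 4, i.e. the result type is DECIMAL(13,4).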
Contributor

I wonder why we don't leave it up to BigDecimal to figure out the scale and precision?

Contributor Author

See my comment below your review.

final int precision = left.precision - left.scale + right.scale
+ Math.max(6, left.scale + right.precision + 1);

final int scale = Math.max(6, left.scale + right.precision + 1);
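Worked example from the snippet above: for DECIMAL(10,2) / DECIMAL(8,4), scale = max(6, 2 + 8 + 1) = 11 and precision = 10 - 2 + 4 + 11 = 23, i.e. the result type is DECIMAL(23,11).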
Contributor

Why do we limit to 6 decimal places?
If we were dividing 1 by 3 and the result was a recurring fraction, wouldn't it be better to have as many dps as possible to give the best overall precision? 0.33333333333333333333333

Contributor

My gut feeling would be to leave all the scale and precision calculations up to BigDecimal and only readjust the scale/precision at the end of the calculation if too many trailing zeros aren't desired.

Contributor

As I understand it, this code is trying to calculate the required scale and precision of a decimal for a particular operation given the scale and precision of the operands.
While that is possible for some operations (add, subtract and multiply), I don't think it's possible in general for all operations, e.g. for divide, as the required scale and precision for the most exact number depends on the actual numbers and not just the types of the numbers.

Contributor Author

See my comment below your review.

@Test
public void shouldNotDownCastDouble() {
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.INTEGER), is(false));
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.BIGINT), is(false));
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.DECIMAL), is(false));
Contributor

We should perhaps allow casting of double to decimal in the case the double is smaller than a certain value as it can be exactly represented as a decimal.
BTW, imho I don't like the term "upcasting", why not just "casting"?

Contributor Author

We should perhaps allow casting of double to decimal in the case the double is smaller than a certain value as it can be exactly represented as a decimal.

Not sure I follow: Casting of double to decimal is supported. Do you mean implicit down casting? I don't think we can support that as the casting happens when calculating the output type of a mathematical binary operation on two types, not values. So we can't take the value into account and must always have a deterministic result type based purely on the input types.

Again, see my comment below your review.

BTW, imho I don't like the term "upcasting", why not just "casting"?

Not a term I introduced, this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another type, e.g. INT to BIGINT.

Might be better if the method was called canImplicitlyUpCast?

Contributor

In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema long before processing any data.

I would argue that we shouldn't attempt to calculate a fixed scale and precision for the expression, and should instead just use the maximum precision if there is any arithmetic where we cannot calculate precision properly in advance. Perhaps I am misunderstanding what is going on here, but rounding division to 6 dps seems just wrong to me.

Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D

Sure, but I think this should be captured in another issue :)

Contributor

BTW, imho I don't like the term "upcasting", why not just "casting"?

Not a term I introduced, this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another type, e.g. INT to BIGINT.

Might be better if the method was called canImplicitlyUpCast?

Ok, so I think "upcasting" just means "implicitly cast". (I still don't like the term 'upcasting')

And we should only do that where we can guarantee it's safe to do so. However, implicitly casting from DECIMAL to DOUBLE is not always safe. I think we currently assume it's safe because a double can hold bigger values than a decimal, but for some conversions a loss of precision might occur. Hence my previous comment about logging a warning?

Contributor

Another way of dealing with this would be to not allow DECIMAL->DOUBLE implicit conversions, and require the user to do an explicit cast. That's another way of telling the user "hey this isn't safe, but if you really want to do it, go ahead and cast"

Contributor

I think you're right about not using upcast/downcast - this was probably a terminology error on my end. Standard terminology should be widening and narrowing https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.2

Contributor Author

I think calling it canImplicitlyCastTo would be better as we're talking about implicit casting, not casting in general.

Contributor

@agavra agavra Oct 10, 2019

I think canWiden is probably the technically correct term.

@big-andy-coates
Contributor Author

big-andy-coates commented Oct 10, 2019

Hey @purplefox, thanks for the review!

However... all of your comments are on existing code that's just been moved in this PR. Don't get me wrong, you may have valid points and concerns and we should not ignore these. But we should look to discuss and address outside of the scope of this PR, which is just refactoring the code and, for the most part, maintaining the existing functionality.

In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema of a table long before processing any data.

Of course, what scale and precision KSQL determines for any specific input can be up for debate. @agavra was the original developer of the decimal code and may be able to comment more. I believe his implementation is heavily influenced by the approach taken by other DBs, e.g. Postgres.

A general question: When upcasting from DECIMAL to DOUBLE how do we deal with cases where there will be a loss of precision? (I.e. some values that can be represented exactly as decimals can't be represented exactly as doubles). Do we just silently do the conversion, or do we warn?

Perhaps we could log a warning when the query is parsed "Possible loss of precision in conversion from DECIMAL->DOUBLE"

Again, this is not something I'm proposing in this PR, only enforcing and standardizing on. This rule preexisted this PR. As an aside, I had a long discussion with @agavra on this very subject as I'm not convinced by this rule either. But that's a discussion outside the scope of this PR.

Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D

Contributor

@purplefox purplefox left a comment

Approving on the basis that this is just a code refactor, but I think there should be a discussion on the points raised here regarding:

  1. Scale and precision of operations like division
  2. Implicit conversion of DECIMAL to DOUBLE

@purplefox
Contributor

Hey @agavra - do you have some insight on the comments raised in this PR?

@big-andy-coates
Contributor Author

Another point for discussion is that the INSERT VALUE logic does implicit conversion from STRING to DECIMAL, but not STRING to any other numeric type.

@agavra
Contributor

agavra commented Oct 10, 2019

Responding to the two big points:

With regards to the scale/precision of mathematical operations on decimals, I took this directly from SqlServer semantics
[image: SQL Server rules for the precision and scale of decimal operation results]

With regards to considering a DECIMAL -> DOUBLE conversion widening and not narrowing, there were a few things that convinced me it should be that way:

  • Decimals are exact representations of numbers while doubles are usually intended as approximations (e.g. there is no way to represent even something simple like 0.1 exactly in a double; it's likely that anyone who typed 0.1d did not intend 0.1000000000000000055511151231257827021181583404541015625). I think it would be misleading to convert a double to its decimal representation and then do math on them as if it were DECIMAL(55,55), and anything else would be incorrect.
  • There's no way to predict what the output precision and scale should be if we're converting a double to a decimal (e.g. 1.0 can be represented exactly as a double, so it would be DECIMAL(2,1) but 0.1 needs to be DECIMAL(55,55) to keep its exact value) so even if we did decide to convert the double to exactly a decimal, it would be different every time for every decimal value unless we always chose the biggest possible decimal scale/precision - which is impractical.
  • While there are DECIMAL values that cannot be expressed as DOUBLE, the opposite is also true (namely, NaN, Infinity and negative infinity)

I've given it a lot of thought, and I think what really draws me to my conclusion is that I think it's "safer" to output a double to indicate it is imprecise than do some magic and call it exact.
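A small illustration of the 0.1 point above (just plain JDK BigDecimal, not KSQL code):

```java
import java.math.BigDecimal;

public class DoubleVsDecimal {
  public static void main(String[] args) {
    // The exact value of the double literal 0.1:
    System.out.println(new BigDecimal(0.1));
    // -> 0.1000000000000000055511151231257827021181583404541015625

    // The decimal "0.1" really is 0.1:
    System.out.println(new BigDecimal("0.1"));
    // -> 0.1
  }
}
```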

Test fixed by switching the RecordNode to use a custom parser that differentiates between a missing node and a node explicitly set to null.
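A minimal sketch of that missing-vs-null distinction (illustrative only; this is not the actual RecordNode parser, and the "key" field name is just an example), using Jackson's tree model, where an absent field reads as Java null while an explicit JSON null reads as a NullNode:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Optional;

public final class MissingVsNull {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Returns Optional.empty() if "key" is absent, Optional.of(NullNode) if it is
  // explicitly null, and Optional.of(value) otherwise.
  static Optional<JsonNode> keyField(final String json) throws Exception {
    final JsonNode key = MAPPER.readTree(json).get("key");
    return Optional.ofNullable(key);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(keyField("{}"));                // Optional.empty  (missing)
    System.out.println(keyField("{\"key\": null}"));   // Optional[null]  (explicit null)
    System.out.println(keyField("{\"key\": \"k1\"}")); // Optional["k1"]
  }
}
```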
@big-andy-coates big-andy-coates merged commit 7fe8772 into confluentinc:master Oct 11, 2019
@big-andy-coates big-andy-coates deleted the consisent_up_cast branch October 11, 2019 09:49
@purplefox
Contributor

purplefox commented Oct 12, 2019

Responding to the two big points:

With regards to the scale/precision of mathematical operations on decimals, I took this directly from SqlServer semantics
[image: SQL Server rules for the precision and scale of decimal operation results]

I wonder what the rationale for SQLServer rounding to 6 dps is? It seems odd to me. If there are multiple operations in a single statement, the errors are going to multiply fast.

I still believe that it would be better to do any calculation at maximum precision and only round at the end of the calculation, if necessary to make the result conform to the column schema.

With regards to consider a DECIMAL -> DOUBLE conversion widening and not narrowing, there were a few things that convinced me it should be that way:

  • Decimals are exact representation of numbers while doubles are usually intended as approximations (e.g. there is no way to represent even something simple like 0.1 exactly in a double, it's likely that anyone who typed 0.1d did not intend 0.1000000000000000055511151231257827021181583404541015625). I think it would be misleading to convert a double to its decimal representation and then do math on them as if it were DECIMAL(55,55), and anything else would be incorrect.

Perhaps we are talking at cross purposes? ;)

The point I raised wasn't about converting from doubles to decimals. It was about automatically converting from decimals to doubles and losing precision.

I agree the calculation should be performed using doubles, but I think either a) the CAST from decimal to double should be explicit or b) a warning should be logged ("possible loss of precision in implicit cast from decimal to double").

@purplefox
Contributor

I got up too early and was a bit bored so I dug into why SQLServer does this rounding.

It seems that SQLServer does this to avoid exceeding the max precision of its decimal type, which is 38. https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=92608

But in our case, we have BigDecimal for calculations, which has no such limitation on precision. So I don't think it makes a lot of sense to adopt the SQL Server way of doing things (unless there's a requirement for our behaviour to be exactly like SQL Server!)

(of course we might need to round down at the end of the calculation to conform to the schema of the result column, but I'm referring to during the calculation here)

@agavra
Contributor

agavra commented Oct 14, 2019

But in our case, we have BigDecimal for calculations, which has no such limitation on precision. So I don't think it makes a lot of sense to adopt the SQL Server way of doing things (unless there's a requirement for our behaviour to be exactly like SQL Server!)

@purplefox I think this makes sense - my calculus, in order to avoid decision paralysis and move quickly, is usually to just do what other established systems do, because they set a decent precedent 95% of the time. If there are examples of that precedent being less than ideal (as you pointed out in that thread), I'm more than happy to have a discussion on what the "correct" behavior should be.

I agree the calculation should be performed using doubles, but I think either a) the CAST from decimal to double should be explicit or b) a warning should be logged ("possible loss of precision in implicit cast from decimal to double").

I think logging in our case doesn't really work well because it would happen per-event (maybe we could log once when materializing the query; we should have the schemas at that point 🤔), so the only other option is to fail hard. Failing hard is pretty rare in SQL systems, which seem to try hard to make something work (albeit magically) rather than failing. As an infrastructure engineer that goes against all of my intuitions, but I think if we're developing a SQL product we should develop something that "feels" familiar to that audience.
