
fix: standardize KSQL up-casting #3516

Merged

Conversation

Contributor

@big-andy-coates big-andy-coates commented Oct 9, 2019

Description

The logic controlling which types can be up-cast to another, e.g. up-casting an INT to a BIGINT, was spread across different parts of the code base, had no tests to ensure all parts were consistent, and hence was inconsistent.

Up-casting is primarily used during arithmetic binary operations, e.g. adding two numbers together, and when trying to coerce user-supplied values in INSERT VALUES statements to the required SQL types.

The numeric up-casting rules for KSQL are:

  • INT can be up-cast to BIGINT
  • BIGINT can be up-cast to DECIMAL
  • DECIMAL can be up-cast to DOUBLE.
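As a rough sketch (not the actual KSQL source; the enum members, their declaration order, and the assumption that the chain applies transitively, e.g. INT -> DOUBLE, are mine), the rules boil down to something like:

```java
// Sketch only: every type can be up-cast to itself; only numeric types can be
// widened to other numeric types, following INT -> BIGINT -> DECIMAL -> DOUBLE.
enum SqlBaseType {
  BOOLEAN, INTEGER, BIGINT, DECIMAL, DOUBLE, STRING, ARRAY, MAP, STRUCT;

  public boolean isNumber() {
    return this == INTEGER || this == BIGINT || this == DECIMAL || this == DOUBLE;
  }

  public boolean canUpCast(final SqlBaseType to) {
    // Numeric members are declared in widening order, so ordinal comparison works.
    return this == to
        || (isNumber() && to.isNumber() && this.ordinal() <= to.ordinal());
  }
}
```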

In the existing code:

  • SqlBaseType has a canUpCast method, but it doesn't take DECIMALs into account.
  • SqlBaseType has an isNumber method, but it doesn't treat DECIMAL as a numeric type.
  • SchemaUtil has the logic on how to resolve the resulting (connect) schema given an Operation and two input types.
  • DefaultSqlValueCoercer allowed coercion of DOUBLE -> DECIMAL, which is against our up-casting rules.

This PR looks to make the code a bit more object oriented and hopefully better structured.

With this change:

  • SqlBaseType's canUpCast and isNumber methods now correctly handle DECIMAL.
  • Any type can be up-cast to itself. Only numeric types can be up-cast to other types, and the rules are encapsulated in SqlBaseType.canUpCast.
  • The logic on how to resolve the resulting SQL type given an Operation and two input types now lives in Operation, making use of the decimal-specific handling which now lives in SqlDecimal.
  • The SqlDecimal type an INT or BIGINT is up-cast to is now stored in SqlTypes.
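For illustration only (the constant names and the SqlDecimal.of factory are assumptions for this sketch, not necessarily the actual SqlTypes members), those widened decimal shapes follow from the digit counts of the integer types: a 32-bit INT needs at most 10 digits, a 64-bit BIGINT at most 19:

```java
// Hypothetical sketch: the smallest decimals that can hold any INT / BIGINT value.
public final class SqlTypes {
  public static final SqlDecimal INT_UPCAST_TO_DECIMAL = SqlDecimal.of(10, 0);    // 2_147_483_647 has 10 digits
  public static final SqlDecimal BIGINT_UPCAST_TO_DECIMAL = SqlDecimal.of(19, 0); // 9_223_372_036_854_775_807 has 19 digits
}
```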

However, the main benefit of this commit is the addition of tests in DefaultSqlValueCoercer and OperatorTest to ensure that these two classes follow the up-casting rules.

In the future we can look to move more behaviour into our type system.

Testing done

Suitable tests added, including awesome metatests to ensure different parts of the system are consistent.
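A property-style sketch of what such a metatest might look like (illustrative only; it assumes the SqlBaseType API described above is imported, and the real tests live alongside DefaultSqlValueCoercer and Operator):

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;

import org.junit.Test;

// Assumes the project's SqlBaseType enum (canUpCast / isNumber) is on the classpath.
public class UpCastRulesMetaTest {

  @Test
  public void shouldUpCastEveryTypeToItself() {
    for (final SqlBaseType type : SqlBaseType.values()) {
      assertThat(type + " should up-cast to itself", type.canUpCast(type), is(true));
    }
  }

  @Test
  public void shouldOnlyUpCastNumbersToOtherTypes() {
    for (final SqlBaseType from : SqlBaseType.values()) {
      for (final SqlBaseType to : SqlBaseType.values()) {
        if (from != to && from.canUpCast(to)) {
          // Any cross-type widening must be between two numeric types.
          assertThat(from + " is numeric", from.isNumber(), is(true));
          assertThat(to + " is numeric", to.isNumber(), is(true));
        }
      }
    }
  }
}
```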

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

Outstanding discussion points on decimals (non-blocking):

  • Scale and precision of operations like division
  • Implicit conversion of DECIMAL to DOUBLE
  • STRING -> DECIMAL implicit parsing

@big-andy-coates big-andy-coates requested a review from a team as a code owner October 9, 2019 17:09
@@ -47,7 +47,7 @@
     @JsonProperty("window") final WindowData window
 ) {
   this.topicName = topicName == null ? "" : topicName;
-  this.key = key == null ? "" : key;
+  this.key = key;
Contributor Author

In case you're wondering... a null key is valid and so we shouldn't be converting it to an empty string. Without this change the new tests in insert-values.json would fail.

@purplefox
Contributor

purplefox commented Oct 10, 2019

A general question: When upcasting from DECIMAL to DOUBLE how do we deal with cases where there will be a loss of precision? (I.e. some values that can be represented exactly as decimals can't be represented exactly as doubles). Do we just silently do the conversion, or do we warn?

Perhaps we could log a warning when the query is parsed "Possible loss of precision in conversion from DECIMAL->DOUBLE"

Contributor

@purplefox purplefox left a comment

Perhaps I am misreading the intent of the code, but regarding the approach taken here, there seems to be an assumption that it's possible to calculate the required scale and precision for a decimal to hold the result of an operation based just on the scale and precision of the operands.

I don't think that holds in general (divide is an example); the actual values need to be known.

I'm thinking that for an expression involving a decimal we should always assume the maximum possible scale and precision throughout the calculation, then only round right at the end, if necessary, when the user has specified a specific decimal scale and precision for a column type.

+ Math.max(left.precision - left.scale, right.precision - right.scale)
+ 1;

final int scale = Math.max(left.scale, right.scale);
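Worked example from the snippet above, assuming the truncated first term of the precision expression is Math.max(left.scale, right.scale) (the usual SQL Server rule for addition): for DECIMAL(10,2) + DECIMAL(8,4), precision = max(2,4) + max(10-2, 8-4) + 1 = 4 + 8 + 1 = 13 and scale = max(2,4) = 4, i.e. the result type is DECIMAL(13,4).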
Contributor

I wonder why we don't leave it up to BigDecimal to figure out the scale and precision?

Contributor Author

See my comment below your review.

final int precision = left.precision - left.scale + right.scale
+ Math.max(6, left.scale + right.precision + 1);

final int scale = Math.max(6, left.scale + right.precision + 1);
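Worked example from the snippet above: for DECIMAL(10,2) / DECIMAL(8,4), scale = max(6, 2 + 8 + 1) = 11 and precision = 10 - 2 + 4 + 11 = 23, i.e. the result type is DECIMAL(23,11).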
Contributor

Why do we limit to 6 decimal places?
If we were dividing 1 by 3 and the result was a recurring fraction, wouldn't it be better to have as many dps as possible to give the best overall precision? 0.33333333333333333333333

Contributor

My gut feeling would be to leave all the scale and precision calculations up to BigDecimal and only readjust the scale/precision at the end of the calculation if too many trailing zeros aren't desired.

Contributor

As I understand it, this code is trying to calculate the required scale and precision of a decimal for a particular operation given the scale and precision of the operands.
While that is possible for some operations (add, subtract and multiply), I don't think it's possible in general for all operations, e.g. for divide, as the required scale and precision for the most exact number depends on the actual numbers and not just the types of the numbers.

Contributor Author

See my comment below your review.

@Test
public void shouldNotDownCastDouble() {
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.INTEGER), is(false));
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.BIGINT), is(false));
assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.DECIMAL), is(false));
Contributor

We should perhaps allow casting of double to decimal in the case the double is smaller than a certain value as it can be exactly represented as a decimal.
BTW, imho I don't like the term "upcasting", why not just "casting"?

Contributor Author

We should perhaps allow casting of double to decimal in the case the double is smaller than a certain value as it can be exactly represented as a decimal.

Not sure I follow: Casting of double to decimal is supported. Do you mean implicit down casting? I don't think we can support that as the casting happens when calculating the output type of a mathematical binary operation on two types, not values. So we can't take the value into account and must always have a deterministic result type based purely on the input types.

Again, see my comment below your review.

BTW, imho I don't like the term "upcasting", why not just "casting"?

Not a term I introduced, this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another type, e.g. INT to BIGINT.

Might be better if the method was called canImplicitlyUpCast?

Contributor

In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema long before processing any data.

I would argue that we shouldn't attempt to calculate a fixed scale and precision for the expression, and should instead just use the maximum precision if there is any arithmetic where we cannot calculate precision properly in advance. Perhaps I am misunderstanding what is going on here, but rounding division to 6 dps seems just wrong to me.

Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D

Sure, but I think this should be captured in another issue :)

Contributor

BTW, imho I don't like the term "upcasting", why not just "casting"?

Not a term I introduced, this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another type, e.g. INT to BIGINT.

Might be better if the method was called canImplicitlyUpCast?

Ok, so I think "upcasting" just means "implicitly cast". (I still don't like the term 'upcasting')

And we should only do that where we can guarantee it's safe to do so. However, implicitly casting from DECIMAL to DOUBLE is not always safe. I think we currently assume it's safe because a double can hold bigger values than a decimal, but for some conversions a loss of precision might occur. Hence my previous comment about logging a warning?

Contributor

Another way of dealing with this would be to not allow DECIMAL->DOUBLE implicit conversions, and require the user to do an explicit cast. That's another way of telling the user "hey this isn't safe, but if you really want to do it, go ahead and cast"

Contributor

I think you're right about not using upcast/downcast - this was probably a terminology error on my end. Standard terminology should be widening and narrowing https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.2

Contributor Author

I think calling it canImplicitlyCastTo would be better as we're talking about implicit casting, not casting in general.

Contributor

@agavra agavra Oct 10, 2019

I think canWiden is probably the technically correct term.

@big-andy-coates
Contributor Author

big-andy-coates commented Oct 10, 2019

Hey @purplefox, thanks for the review!

However... all of your comments are on existing code that's just been moved in this PR. Don't get me wrong, you may have valid points and concerns and we should not ignore these. But we should look to discuss and address outside of the scope of this PR, which is just refactoring the code and, for the most part, maintaining the existing functionality.

In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema of a table long before processing any data.

Of course, what scale and precision KSQL determines for any specific input can be up for debate. @agavra was the original developer of the decimal code and may be able to comment more. I believe his implementation is heavily influenced by the approach taken by other DBs, e.g. Postgres.

A general question: When upcasting from DECIMAL to DOUBLE how do we deal with cases where there will be a loss of precision? (I.e. some values that can be represented exactly as decimals can't be represented exactly as doubles). Do we just silently do the conversion, or do we warn?

Perhaps we could log a warning when the query is parsed "Possible loss of precision in conversion from DECIMAL->DOUBLE"

Again, this is not something I'm proposing in this PR, only enforcing and standardizing on. This rule preexisted this PR. As an aside, I had a long discussion with @agavra on this very subject as I'm not convinced by this rule either. But that's a discussion outside the scope of this PR.

Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D

Contributor

@purplefox purplefox left a comment

Approving on the basis that this is just a code refactor, but I think there should be a discussion on the points raised here regarding:

  1. Scale and precision of operations like division
  2. Implicit conversion of DECIMAL to DOUBLE

@purplefox
Contributor

Hey @agavra - do you have some insight on the comments raised in this PR?

@big-andy-coates
Contributor Author

Another point for discussion is that the INSERT VALUE logic does implicit conversion from STRING to DECIMAL, but not STRING to any other numeric type.

@agavra
Contributor

agavra commented Oct 10, 2019

Responding to the two big points:

With regards to the scale/precision of mathematical operations on decimals, I took this directly from SqlServer semantics
[image: SQL Server rules for the precision and scale of decimal operation results]

With regards to considering a DECIMAL -> DOUBLE conversion widening and not narrowing, there were a few things that convinced me it should be that way:

  • Decimals are exact representations of numbers while doubles are usually intended as approximations (e.g. there is no way to represent even something simple like 0.1 exactly in a double; it's likely that anyone who typed 0.1d did not intend 0.1000000000000000055511151231257827021181583404541015625). I think it would be misleading to convert a double to its decimal representation and then do math on them as if it were DECIMAL(55,55), and anything else would be incorrect.
  • There's no way to predict what the output precision and scale should be if we're converting a double to a decimal (e.g. 1.0 can be represented exactly as a double, so it would be DECIMAL(2,1) but 0.1 needs to be DECIMAL(55,55) to keep its exact value) so even if we did decide to convert the double to exactly a decimal, it would be different every time for every decimal value unless we always chose the biggest possible decimal scale/precision - which is impractical.
  • While there are DECIMAL values that cannot be expressed as DOUBLE, the opposite is also true (namely, NaN, Infinity and negative infinity)

I've given it a lot of thought, and I think what really draws me to my conclusion is that I think it's "safer" to output a double to indicate it is imprecise than do some magic and call it exact.
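A small illustration of the 0.1 point above (just plain JDK BigDecimal, not KSQL code):

```java
import java.math.BigDecimal;

public class DoubleVsDecimal {
  public static void main(String[] args) {
    // The exact value of the double literal 0.1:
    System.out.println(new BigDecimal(0.1));
    // -> 0.1000000000000000055511151231257827021181583404541015625

    // The decimal "0.1" really is 0.1:
    System.out.println(new BigDecimal("0.1"));
    // -> 0.1
  }
}
```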

Test fixed by switching the RecordNode to use a custom parser that differentiates between a missing node and a node explicitly set to null.
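A minimal sketch of that missing-vs-null distinction (illustrative only; this is not the actual RecordNode parser, and the "key" field name is just an example), using Jackson's tree model, where an absent field reads as Java null while an explicit JSON null reads as a NullNode:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Optional;

public final class MissingVsNull {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Returns Optional.empty() if "key" is absent, Optional.of(NullNode) if it is
  // explicitly null, and Optional.of(value) otherwise.
  static Optional<JsonNode> keyField(final String json) throws Exception {
    final JsonNode key = MAPPER.readTree(json).get("key");
    return Optional.ofNullable(key);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(keyField("{}"));                // Optional.empty  (missing)
    System.out.println(keyField("{\"key\": null}"));   // Optional[null]  (explicit null)
    System.out.println(keyField("{\"key\": \"k1\"}")); // Optional["k1"]
  }
}
```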
@big-andy-coates big-andy-coates merged commit 7fe8772 into confluentinc:master Oct 11, 2019
@big-andy-coates big-andy-coates deleted the consisent_up_cast branch October 11, 2019 09:49
@purplefox
Contributor

purplefox commented Oct 12, 2019

Responding to the two big points:

With regards to the scale/precision of mathematical operations on decimals, I took this directly from SqlServer semantics
[image: SQL Server rules for the precision and scale of decimal operation results]

I wonder what the rationale for SQLServer rounding to 6 dps is? It seems odd to me. If there are multiple operations in a single statement, the errors are going to multiply fast.

I still believe that it would be better to do any calculation at maximum precision and only round at the end of the calculation, if necessary to make the result conform to the column schema.

With regards to consider a DECIMAL -> DOUBLE conversion widening and not narrowing, there were a few things that convinced me it should be that way:

  • Decimals are exact representation of numbers while doubles are usually intended as approximations (e.g. there is no way to represent even something simple like 0.1 exactly in a double, it's likely that anyone who typed 0.1d did not intend 0.1000000000000000055511151231257827021181583404541015625). I think it would be misleading to convert a double to its decimal representation and then do math on them as if it were DECIMAL(55,55), and anything else would be incorrect.

Perhaps we are talking at cross purposes? ;)

The point I raised wasn't about converting from doubles to decimals. It was about automatically converting from decimals to doubles and losing precision.

I agree the calculation should be performed using doubles, but I think either a) the CAST from decimal to double should be explicit or b) a warning should be logged ("possible loss of precision in implicit cast from decimal to double").

@purplefox
Contributor

I got up too early and was a bit bored so I dug into why SQLServer does this rounding.

It seems that SQLServer does this to avoid exceeding the max precision of its decimal type, which is 38. https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=92608

But in our case, we have BigDecimal for calculations, which has no such limitation on precision. So I don't think it makes a lot of sense to adopt the SQL Server way of doing things (unless there's a requirement for our behaviour to be exactly like SQL Server!)

(of course we might need to round down at the end of the calculation to conform to the schema of the result column, but I'm referring to during the calculation here)

@agavra
Contributor

agavra commented Oct 14, 2019

But in our case, we have BigDecimal for calculations, which has no such limitation on precision. So I don't think it makes a lot of sense to adopt the SQL Server way of doing things (unless there's a requirement for our behaviour to be exactly like SQL Server!)

@purplefox I think this makes sense - my calculus, in order to avoid decision paralysis and move quickly, is usually to just do what other established systems do, because they set a decent precedent 95% of the time. If there are examples of that precedent being less than ideal (as you pointed out in that thread), I'm more than happy to have a discussion on what the "correct" behavior should be.

I agree the calculation should be performed using doubles, but I think either a) the CAST from decimal to double should be explicit or b) a warning should be logged ("possible loss of precision in implicit cast from decimal to double").

I think logging in our case doesn't really work well because it would happen per-event (maybe we could log once when materializing the query; we should have the schemas at that point 🤔), so the only other option is to fail hard. Failing hard is pretty rare in SQL systems, which seem to try hard to make something work (albeit magically) rather than failing. As an infrastructure engineer that goes against all of my intuitions, but I think if we're developing a SQL product we should develop something that "feels" familiar to that audience.
