Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add precision to IntervalDay and new IntervalCompound type #665

Merged
merged 9 commits into from
Aug 10, 2024

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jul 16, 2024

  • Update IntervalDay to support multiple subsecond precisions (0-9). This is done in a way that should be effectively compatible with old systems. Old plans should be able to be continue to be consumed treating the old values as microsecond precision.
  • Add new IntervalCompound type which is a combination of IntervalMonth and IntervalDay

BREAKING CHANGE: The encoding of IntervalDay literals has changed in a strictly backwards incompatible way. However, the logical meaning across encoding is maintained using a oneof. Moving a field into a oneof makes unset/set to zero unclear with older messages but the fields are defined such that the logical meaning of the two is indistinct. If neither microseconds nor precision is set, the value can be considered a precision 6 value. If you aren't using IntervalDay type, you will not need to make any changes.

BREAKING CHANGE: TypeExpression and Parametertized type protobufs (used to serialize output derivation) are updated to match the now compound nature of IntervalDay. If you use protobuf to serialize output derivation that refer to IntervalDay type, you will need to rework that logic.

fixes #664

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've posted my main comments on #664 . One more cosmetic comment here.

site/docs/types/type_classes.md Outdated Show resolved Hide resolved
@Blizzara Blizzara force-pushed the avo/interval-month-day-nano branch from 421012d to 53d48dc Compare July 29, 2024 07:49
@jacques-n
Copy link
Contributor

Per other comment, I think the discussion is complete in #664. Would be great to get PR update.

…seconds, deprecating the previous microsecond field. Consumers of IntervalDay literals now need to inspect which of the two oneof options in literals are used (microsecond or subsecond).

BREAKING CHANGE: IntervalDay type is now parameterized by subsecond precision.
@jacques-n jacques-n changed the title feat: add IntervalMonthToNano literal and IntervalMonth type feat: add precision to IntervalDay and new IntervalCompound type Aug 8, 2024
jacques-n
jacques-n previously approved these changes Aug 8, 2024
@jacques-n
Copy link
Contributor

I've updated the PR here. @westonpace @EpsilonPrime @vbarua , can you all take a look?

This is technically a breaking change although the way I did it, it isn't logically one. Nonetheless, I think we should have sufficient +1s before merging.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/type.proto Outdated Show resolved Hide resolved
proto/substrait/type.proto Outdated Show resolved Hide resolved
int64 subseconds = 5;
}

message IntervalCompound {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially this is struct<iyear, iday>. To specify it in text one would need to use the nested struct syntax. Would this be better flattened with the fields from both types included at the top level?

Copy link
Contributor

@jacques-n jacques-n Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about the JSON representation or something else? I wasn't really worried about the json representation. Felt like this more clearly communicated the nature as opposed to field copying. It's a weak preference though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both JSON and the Substrait Text format (which was what I was thinking about since it'll require implementation) would be affected. I prefer struct<iyear, iday> to the compound approach. I would also classify my preference as weak.

Copy link
Contributor

@jacques-n jacques-n Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the substrait text format might have the fields be flattened. I don't think it is critical for it to follow the same structure as the protobufs as long as lossible translation from one to the other is possible, right?

EpsilonPrime
EpsilonPrime previously approved these changes Aug 8, 2024
jacques-n
jacques-n previously approved these changes Aug 8, 2024
westonpace
westonpace previously approved these changes Aug 8, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially make this a non-breaking change if we documented something like:

If the microseconds value is non-zero then ignore the other two fields and assume microsecond precision.

However, that burdens consumers with (a slight amount) of technical debt.

From a quick glance it appears that:

  • Arrow is incapable of producing IntervalDayToSecond
  • Arrow is consuming IntervalDayToSecond incorrectly already (it is completely ignoring the microseconds field)
  • Datafusion can produce IntervalDayToSecond with a microseconds value (this change will break anyone using this behavior)
  • Datafusion consumes IntervalDayToSecond but, since DF only has millisecond precision, it silently discards some precision. (this change will break users relying on this behavior but at least we have some positive benefit here of turning that silent refusal into a noisy refusal).

Overall, I think I'm on the fence but will lean towards progress.

@jacques-n jacques-n dismissed stale reviews from westonpace, EpsilonPrime, and themself via 090b43f August 8, 2024 18:47
@jacques-n
Copy link
Contributor

Given @westonpace seeing a lack of clarity about the fact that the literal is backwards compatible, I added an additional comment to the IntervalDay literal to better clarify that consumers need to understand that two different variations are supported. @westonpace and @EpsilonPrime, if you're both good, I'll merge this. Thanks

@jacques-n
Copy link
Contributor

Just need an approval from @vbarua or @EpsilonPrime and/or @cpcloud and then we can merge.

Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it and the text format should be fine reaching down into the other two types as an implementation detail. This approach is good enough.

@vbarua
Copy link
Member

vbarua commented Aug 9, 2024

Taking a look at this today. I want to verify that backwards compatibility of this internally.

@@ -43,6 +42,8 @@ Compound type classes are type classes that need to be configured by means of a
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| PRECISIONTIMESTAMP&lt;P&gt; | A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
| PRECISIONTIMESTAMPTZ&lt;P&gt; | A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. | `int64` seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC
| INTERVAL_DAY&lt;P&gt; | Interval day to second. Supports a range of [-3,650,000..3,650,000] days with fractional second precision (P, number of digits) 0 <= P <= 9. Usually stored as separate integers for various components, but only the total number of fractional seconds is significant, i.e. `1d 0s` is considered equal to `0d 86400s`. | `int32` days, `int32` seconds, and `int64` fractional seconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. `3650001d -86400s 0us` is **not** allowed)
| INTERVAL_COMPOUND&lt;P&gt; | A compound interval type that is composed of elements of the underlying elements and rules of both interval_month and interval_day to express arbitrary durations across multiple grains. Substrait gives no definition for the conversion of values between independent grains (e.g. months to days).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

across multiple grains

What is a grain in this context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

year/month vs day/second. there are no consistent reliable translations between each.

@@ -43,6 +42,8 @@ Compound type classes are type classes that need to be configured by means of a
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| PRECISIONTIMESTAMP&lt;P&gt; | A timestamp with fractional second precision (P, number of digits) 0 <= P <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
| PRECISIONTIMESTAMPTZ&lt;P&gt; | A timezone-aware timestamp, with fractional second precision (P, number of digits) 0 <= P <= 9. Similar to aware datetime in Python. | `int64` seconds, milliseconds, microseconds or nanoseconds since 1970-01-01 00:00:00.000000000 UTC
| INTERVAL_DAY&lt;P&gt; | Interval day to second. Supports a range of [-3,650,000..3,650,000] days with fractional second precision (P, number of digits) 0 <= P <= 9. Usually stored as separate integers for various components, but only the total number of fractional seconds is significant, i.e. `1d 0s` is considered equal to `0d 86400s`. | `int32` days, `int32` seconds, and `int64` fractional seconds, with the added constraint that each component can never independently specify more than 10,000 years, even if the components have opposite signs (e.g. `3650001d -86400s 0us` is **not** allowed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this is a parameterised type, do we need to update all the function extensions that refer to it as interval_day to use interval_day<P> instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm inclined to do in a follow-along once this is merged.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for an interval type covering both months and (sub)seconds
5 participants