Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34615][SQL] Support java.time.Period as an external type of the year-month interval type #31765

Closed
wants to merge 1 commit into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Mar 6, 2021

What changes were proposed in this pull request?

In the PR, I propose to extend Spark SQL API to accept java.time.Period as an external type of recently added new Catalyst type - YearMonthIntervalType (see #31614). The Java class java.time.Period has similar semantic to ANSI SQL year-month interval type, and it is the most suitable to be an external type for YearMonthIntervalType. In more details:

  1. Added PeriodConverter which converts java.time.Period instances to/from internal representation of the Catalyst type YearMonthIntervalType (to Int type). The PeriodConverter object uses new methods of IntervalUtils:
    • periodToMonths() converts the input period to the total length in months. If this period is too large to fit Int, the method throws the exception ArithmeticException. Note: the input period has "days" precision, the method just ignores the days unit.
    • monthToPeriod() obtains a java.time.Period representing a number of months.
  2. Support new type YearMonthIntervalType in RowEncoder via the methods createDeserializerForPeriod() and createSerializerForJavaPeriod().
  3. Extended the Literal API to construct literals from java.time.Period instances.

Why are the changes needed?

  1. To allow users parallelization of java.time.Period collections, and construct year-month interval columns. Also to collect such columns back to the driver side.
  2. This will allow to write tests in other sub-tasks of SPARK-27790.

Does this PR introduce any user-facing change?

The PR extends existing functionality. So, users can parallelize instances of the java.time.Duration class and collect them back:

scala> val ds = Seq(java.time.Period.ofYears(10).withMonths(2)).toDS
ds: org.apache.spark.sql.Dataset[java.time.Period] = [value: yearmonthinterval]

scala> ds.collect
res0: Array[java.time.Period] = Array(P10Y2M)

How was this patch tested?

  • Added a few tests to CatalystTypeConvertersSuite to check conversion from/to java.time.Period.
  • Checking row encoding by new tests in RowEncoderSuite.
  • Making literals of YearMonthIntervalType are tested in LiteralExpressionSuite.
  • Check collecting by DatasetSuite and JavaDatasetSuite.
  • New tests in IntervalUtilsSuites to check conversions java.time.Period <-> months.

@github-actions github-actions bot added the SQL label Mar 6, 2021
@SparkQA
Copy link

SparkQA commented Mar 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40414/

@SparkQA
Copy link

SparkQA commented Mar 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40414/

@MaxGekk MaxGekk changed the title [WIP][SPARK-34615][SQL] Support java.time.Period as an external type of the year-month interval type [SPARK-34615][SQL] Support java.time.Period as an external type of the year-month interval type Mar 7, 2021
@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 7, 2021

@cloud-fan @yaooqinn @srielau @HyukjinKwon Could you review this PR, please

* @return The total number of months in the period, may be negative
* @throws ArithmeticException If numeric overflow occurs
*/
def periodToMonths(period: Period): Int = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we fail if the day field is not 0? Or at least give a warning?

Copy link
Member Author

@MaxGekk MaxGekk Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't fail when we convert:

  1. java.sql.Date has time component with millisecond precision but we ignore it when we convert to days at
    val julianDays = Math.toIntExact(Math.floorDiv(millisLocal, MILLIS_PER_DAY))
  2. java.sql.Timestamp which has nanoseconds precision:
    val micros = millisToMicros(t.getTime) + (t.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS
  3. java.time.Instant which contains nanoseconds, and we don't fail when we convert it to microseconds:
    val result = Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))

To be consistent with current implementation for other types, I do believe we should not fail.

Or at least give a warning?

This will just fill in the logs by useless records, and this is again inconsistent with current implementation.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e10bf64 Mar 8, 2021
@yaooqinn
Copy link
Member

yaooqinn commented Mar 8, 2021

Late LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants