Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35111][SQL] Support Cast string to year-month interval #32266

Closed
wants to merge 28 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Apr 21, 2021

What changes were proposed in this pull request?

Support Cast string to year-month interval
Supported format as below

ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH
HIVE_STYLE like
10-1 or -10-1


Rules from the SQL standard about ANSI_STYLE:

<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
  <years value> [ <minus sign> <months value> ]
  | <months value>
<years value> ::=
  <datetime value>
<months value> ::=
  <datetime value>
<datetime value> ::=
  <unsigned integer>
<unsigned integer> ::= <digit>...

Why are the changes needed?

Support Cast string to year-month interval

Does this PR introduce any user-facing change?

User can cast year month interval string to YearMonthIntervalType

How was this patch tested?

Added UT

@AngersZhuuuu
Copy link
Contributor Author

FYI @MaxGekk

@github-actions github-actions bot added the SQL label Apr 21, 2021
@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Test build #137721 has finished for PR 32266 at commit 9546a0f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1774,6 +1774,19 @@ class CastSuite extends CastSuiteBase {
assert(e3.contains("Casting 2147483648 to int causes overflow"))
}
}

test("SPARK-35111: Cast string to year-month interval") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add round trip tests: string -> year-month interval -> string, and year-month interval -> string -> year-month interval

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you test corner cases when arithmetic overflow happens. Also test upper and lower cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you test corner cases when arithmetic overflow happens. Also test upper and lower cases.

Current test is not correct according to #32281, will update after #32281 merged

Copy link
Member

@MaxGekk MaxGekk Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, check the input in lower case. For example:

    checkEvaluation(cast(Literal.create("  interval '1-0' YEAR TO MONTH  "),
      YearMonthIntervalType), 12)

if (input == null || input.toString == null) {
throw new IllegalArgumentException("Interval year-month string must be not null")
} else {
val regex = "INTERVAL '([-|+]?[0-9]+-[-|+]?[0-9]+)' YEAR TO MONTH".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keywords INTERVAL and YEAR TO MONTH are not mandatory by SQL standard. We should be able to parse only payload too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you move the regexp out of the method. So, we can avoid its "compilation" per every call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding to the pattern '([-|+]?[0-9]+-[-|+]?[0-9]+)', should we handle the second sign before month? Take a look at the val above yearMonthPattern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding to the pattern '([-|+]?[0-9]+-[-|+]?[0-9]+)', should we handle the second sign before month? Take a look at the val above yearMonthPattern

How about current

Copy link
Contributor Author

@AngersZhuuuu AngersZhuuuu Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk Can you help to list the case we need to support here,

INTERVAL [-|+]'[+|-]0-0' YEAR TO MONTH
[+|-]0-0

Any more case need to support here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, we must support our own output (in case insensitive manner by default but ideally we should respect the SQL config spark.sql.caseSensitive):

  1. ANSI_STYLE, like
    INTERVAL -'-10-1' YEAR TO MONTH
  2. HIVE_STYLE like
    10-1 or -10-1

Rules from the SQL standard:

<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
  <years value> [ <minus sign> <months value> ]
  | <months value>
<years value> ::=
  <datetime value>
<months value> ::=
  <datetime value>
<datetime value> ::=
  <unsigned integer>
<unsigned integer> ::= <digit>...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ideally we should respect the SQL config spark.sql.caseSensitive):

Spark Catalyst parser not respect spark.sql.caseSensitive too, should we respect it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For parsing I think it's OK to be always case insensitive. The Spark SQL parser is case insensitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, how about current approach

@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42248/

@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42248/

@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42252/

@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42252/

@SparkQA
Copy link

SparkQA commented Apr 21, 2021

Test build #137725 has finished for PR 32266 at commit 691c1f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42320/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42320/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42332/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42332/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137791 has finished for PR 32266 at commit 62d175b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137803 has finished for PR 32266 at commit 6d14414.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -93,6 +93,23 @@ object IntervalUtils {

private val yearMonthPattern = "^([+|-])?(\\d+)-(\\d+)$".r

private val yearMonthFuzzyPattern = "[^-|+]*?([+|-]?[\\d]+-[\\d]+).*".r
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a new regex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a new regex?

Since yearMonthPattern only support [+|-]yy-mmmm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we want to support here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am discussing this here #32266 (comment) with @MaxGekk ,
Do you have any other supplementary suggestions?

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42393/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42393/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42593/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42593/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42592/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42595/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42595/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42596/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42596/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42599/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42599/

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Test build #138072 has finished for PR 32266 at commit ca19c09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Test build #138073 has finished for PR 32266 at commit 3adde87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Test build #138075 has finished for PR 32266 at commit 2c8785b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Test build #138076 has finished for PR 32266 at commit 253c70e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 29, 2021

Test build #138079 has finished for PR 32266 at commit 9c70b88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r

def castStringToYMInterval(input: UTF8String): Int = {
input.trimAll().toString.toUpperCase(Locale.ROOT) match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can specify that your pattern is case insensitive, and remove .toUpperCase(Locale.ROOT)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't see any tests that check proper spaces handling - I mean trimAll(). Could you add such checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wonder why don't you trim whitespaces by regular expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regular

Match case -1-0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't see any tests that check proper spaces handling - I mean trimAll(). Could you add such checks.

done

@@ -92,6 +93,25 @@ object IntervalUtils {
}

private val yearMonthPattern = "^([+|-])?(\\d+)-(\\d+)$".r
private val yearMonthStringPattern =
"^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r
"(?i)^(INTERVAL\\s+)([+|-])?(')([+|-])?(\\d+)-(\\d+)(')(\\s+YEAR\\s+TO\\s+MONTH)$".r

The (?i) is placed at the beginning of the pattern to enable case-insensitivity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you share

private[this] def castToYearMonthIntervalCode(from: DataType): CastFunction = from match {
case StringType =>
val util = IntervalUtils.getClass.getCanonicalName.stripSuffix("$")
(c, evPrim, evNull) => code"$evPrim = $util.castStringToYMInterval($c);"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: evNull is not used

Suggested change
(c, evPrim, evNull) => code"$evPrim = $util.castStringToYMInterval($c);"
(c, evPrim, _) => code"$evPrim = $util.castStringToYMInterval($c);"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general except of a few comments.

@SparkQA
Copy link

SparkQA commented Apr 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42616/

@SparkQA
Copy link

SparkQA commented Apr 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42616/

@MaxGekk
Copy link
Member

MaxGekk commented Apr 30, 2021

+1, LGTM. GA passed. Merging to master.
Thank you @AngersZhuuuu , and @cloud-fan for review.

@MaxGekk MaxGekk closed this in 11ea255 Apr 30, 2021
@SparkQA
Copy link

SparkQA commented Apr 30, 2021

Test build #138096 has finished for PR 32266 at commit 5ca83ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants