
Merged master branch from apache #1

Merged 2,831 commits on May 15, 2021
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Apr 15, 2021

  1. [SPARK-34225][CORE][FOLLOWUP] Replace Hadoop's Path with Utils.resolveURI to make the way to get URI simple
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to replace Hadoop's `Path` with `Utils.resolveURI` to make the way to get URI simple in `SparkContext`.
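
    A minimal sketch of the simplification (`Utils` is Spark-internal, so this only compiles inside Spark's own code base):

    ```scala
    import org.apache.spark.util.Utils

    // Before (sketch): going through Hadoop's Path just to build a URI.
    //   val uri = new org.apache.hadoop.fs.Path(path).toUri
    // After: Utils.resolveURI resolves a possibly scheme-less path directly,
    // falling back to a file: URI when no scheme is given.
    val uri = Utils.resolveURI("/tmp/jars/app.jar") // e.g. file:/tmp/jars/app.jar
    ```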
    
    ### Why are the changes needed?
    
    Keep the code simple.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32164 from sarutak/followup-SPARK-34225.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sarutak authored and dongjoon-hyun committed Apr 15, 2021
    Commit: 767ea86
  2. [SPARK-35070][SQL] TRANSFORM not support alias in inputs

    ### What changes were proposed in this pull request?
    Normal function parameters should not support aliases; Hive does not support them either.
    ![image](https://user-images.githubusercontent.com/46485123/114645556-4a7ff400-9d0c-11eb-91eb-bc679ea0039a.png)
    In this PR we forbid the use of aliases in `TRANSFORM`'s inputs.
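
    A hypothetical illustration of the new behavior (table and column names are made up), run e.g. in the Spark shell:

    ```scala
    // An alias in TRANSFORM's input list is now rejected, matching Hive:
    spark.sql("SELECT TRANSFORM(a AS b) USING 'cat' FROM t")  // now fails to analyze
    // Plain column references keep working:
    spark.sql("SELECT TRANSFORM(a) USING 'cat' FROM t")
    ```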
    
    ### Why are the changes needed?
    Fix bug
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32165 from AngersZhuuuu/SPARK-35070.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 15, 2021
    Commit: 71133e1
  3. [MINOR][CORE] Correct the number of started fetch requests in log

    ### What changes were proposed in this pull request?
    
    When counting the number of started fetch requests, we should exclude the deferred requests.
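
    A tiny sketch of the counting fix (names are illustrative, not the actual `ShuffleBlockFetcherIterator` fields): requests sitting in the deferred queue have not been sent yet, so they must be excluded from the started count.

    ```scala
    // Deferred requests are queued but not yet sent, so they don't count as started.
    def startedFetchCount(totalRequests: Int, deferredRequests: Int): Int =
      totalRequests - deferredRequests

    println(startedFetchCount(10, 3)) // logs 7 started requests, not 10
    ```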
    
    ### Why are the changes needed?
    
    Fix the wrong number in the log.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users see the correct number of started requests in logs.
    
    ### How was this patch tested?
    
    Manually tested.
    
    Closes #32180 from Ngone51/count-deferred-request.
    
    Lead-authored-by: yi.wu <[email protected]>
    Co-authored-by: wuyi <[email protected]>
    Signed-off-by: attilapiros <[email protected]>
    Ngone51 authored and attilapiros committed Apr 15, 2021
    Commit: 2cb962b
  4. [SPARK-34995] Port/integrate Koalas remaining codes into PySpark

    ### What changes were proposed in this pull request?
    
    There are some more changes in Koalas, such as [databricks/koalas#2141](databricks/koalas@c8f803d) and [databricks/koalas#2143](databricks/koalas@913d688), made after the main code porting; this PR synchronizes those changes with `pyspark.pandas`.
    
    ### Why are the changes needed?
    
    We should port the whole Koalas code base into PySpark and keep it synchronized.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
    
    ### How was this patch tested?
    
    Manually tested locally.
    
    Closes #32154 from itholic/SPARK-34995.
    
    Authored-by: itholic <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    itholic authored and HyukjinKwon committed Apr 15, 2021
    Commit: 9689c44
  5. Commit: 637f593
  6. [SPARK-34843][SQL][FOLLOWUP] Fix a test failure in OracleIntegrationSuite
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a test failure in `OracleIntegrationSuite`.
    After SPARK-34843 (#31965), the way to divide partitions was changed, and `OracleIntegrationSuite` is affected.
    ```
    [info] - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** (230 milliseconds)
    [info]   Set(""D" < '2018-07-11' or "D" is null", ""D" >= '2018-07-11' AND "D" < '2018-07-15'", ""D" >= '2018-07-15'") did not equal Set(""D" < '2018-07-10' or "D" is null", ""D" >= '2018-07-10' AND "D" < '2018-07-14'", ""D" >= '2018-07-14'") (OracleIntegrationSuite.scala:448)
    [info]   Analysis:
    [info]   Set(missingInLeft: ["D" < '2018-07-10' or "D" is null, "D" >= '2018-07-10' AND "D" < '2018-07-14', "D" >= '2018-07-14'], missingInRight: ["D" < '2018-07-11' or "D" is null, "D" >= '2018-07-11' AND "D" < '2018-07-15', "D" >= '2018-07-15'])
    ```
    
    ### Why are the changes needed?
    
    To follow the previous change.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The modified test.
    
    Closes #32186 from sarutak/fix-oracle-date-error.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sarutak authored and dongjoon-hyun committed Apr 15, 2021
    Commit: ba92de0
  7. [SPARK-35032][PYTHON] Port Koalas Index unit tests into PySpark

    ### What changes were proposed in this pull request?
    Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.
    
    ### Why are the changes needed?
    Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Enable Index unit tests.
    
    Closes #32139 from xinrong-databricks/port.indexes_tests.
    
    Authored-by: Xinrong Meng <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    xinrong-meng authored and HyukjinKwon committed Apr 15, 2021
    Commit: 4aee19e

Commits on Apr 16, 2021

  1. [SPARK-35099][SQL] Convert ANSI interval literals to SQL string in ANSI style
    
    ### What changes were proposed in this pull request?
    Handle `YearMonthIntervalType` and `DayTimeIntervalType` in the `sql()` and `toString()` methods of `Literal`, and format ANSI intervals in the ANSI style.
    
    ### Why are the changes needed?
    To improve readability and UX with Spark SQL. For example, a test output before the changes:
    ```
    -- !query
    select timestamp'2011-11-11 11:11:11' - interval '2' day
    -- !query schema
    struct<TIMESTAMP '2011-11-11 11:11:11' - 172800000000:timestamp>
    -- !query output
    2011-11-09 11:11:11
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Should not since the new intervals haven't been released yet.
    
    ### How was this patch tested?
    By running new tests:
    ```
    $ ./build/sbt "test:testOnly *LiteralExpressionSuite"
    ```
    
    Closes #32196 from MaxGekk/literal-ansi-interval-sql.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 16, 2021
    Commit: 3f4c32b
  2. [SPARK-35083][CORE] Support remote scheduler pool files

    ### What changes were proposed in this pull request?
    
    Use hadoop FileSystem instead of FileInputStream.
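
    A sketch of the idea, assuming standard Hadoop `FileSystem` API usage: resolving the pool file through `FileSystem` works for local and remote paths (hdfs://, s3a://, ...) alike, unlike `java.io.FileInputStream`, which only handles local files.

    ```scala
    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val poolFile = "hdfs:///tmp/fairscheduler.xml"
    // Resolve the FileSystem from the URI scheme, then open the XML as a stream.
    val fs = FileSystem.get(new URI(poolFile), new Configuration())
    val in = fs.open(new Path(poolFile)) // replaces new FileInputStream(poolFile)
    ```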
    
    ### Why are the changes needed?
    
    Make `spark.scheduler.allocation.file` support remote files. When using Spark as a server (e.g. SparkThriftServer), it's hard for users to specify a local path as the scheduler pool.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a minor feature.
    
    ### How was this patch tested?
    
    Passed `core/src/test/scala/org/apache/spark/scheduler/PoolSuite.scala` and manual test.
    After adding the config `spark.scheduler.allocation.file=hdfs:///tmp/fairscheduler.xml`, we introduce the configured pool.
    ![pool1](https://user-images.githubusercontent.com/12025282/114810037-df065700-9ddd-11eb-8d7a-54b59a07ee7b.jpg)
    
    Closes #32184 from ulysses-you/SPARK-35083.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    ulysses-you authored and dongjoon-hyun committed Apr 16, 2021
    Commit: 345c380
  3. [SPARK-35104][SQL] Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue where the indentation of multiple output JSON records in a single split file is broken for all but the first record in the split when the `pretty` option is `true`.
    ```
    // Run in the Spark Shell.
    // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
    // Or set spark.default.parallelism for the previous releases.
    spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
    val df = Seq("a", "b", "c").toDF
    df.write.option("pretty", "true").json("/path/to/output")
    
    # Run in a Shell
    $ cat /path/to/output/*.json
    {
      "value" : "a"
    }
     {
      "value" : "b"
    }
     {
      "value" : "c"
    }
    ```
    
    ### Why are the changes needed?
    
    It's not pretty even though the `pretty` option is true.
    
    ### Does this PR introduce _any_ user-facing change?
    
    I think "No". Indentation style is changed but JSON format is not changed.
    
    ### How was this patch tested?
    
    New test.
    
    Closes #32203 from sarutak/fix-ugly-indentation.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    sarutak authored and MaxGekk committed Apr 16, 2021
    Commit: 95db7e6
  4. [SPARK-34995] Port/integrate Koalas remaining codes into PySpark

    ### What changes were proposed in this pull request?
    
    There are some more changes in Koalas, such as [databricks/koalas#2141](databricks/koalas@c8f803d) and [databricks/koalas#2143](databricks/koalas@913d688), made after the main code porting; this PR synchronizes those changes with `pyspark.pandas`.
    
    ### Why are the changes needed?
    
    We should port the whole Koalas code base into PySpark and keep it synchronized.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
    
    ### How was this patch tested?
    
    Manually tested locally.
    
    Closes #32197 from itholic/SPARK-34995-fix.
    
    Authored-by: itholic <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    itholic authored and HyukjinKwon committed Apr 16, 2021
    Commit: 91bd384

Commits on Apr 17, 2021

  1. [MINOR][DOCS] Soften security warning and keep it in cluster management docs only
    
    ### What changes were proposed in this pull request?
    
    Soften security warning and keep it in cluster management docs only, not in the main doc page, where it's not necessarily relevant.
    
    ### Why are the changes needed?
    
    The statement is perhaps unnecessarily 'frightening' as the first section of the main docs page. It applies to clusters, not local mode, anyhow.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Just a docs change.
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32206 from srowen/SecurityStatement.
    
    Authored-by: Sean Owen <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    srowen committed Apr 17, 2021
    Commit: 2e1e1f8
  2. [SPARK-34787][CORE] Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
    
    ### What changes were proposed in this pull request?
    Make the attemptId in the history server log easier to read.
    
    ### Why are the changes needed?
    Option variables in the Spark history server log should be displayed as their actual value instead of Some(XX).
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    manual test
    
    Closes #32189 from kyoty/history-server-print-option-variable.
    
    Authored-by: kyoty <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    echohlne authored and dongjoon-hyun committed Apr 17, 2021
    Commit: 94849af

Commits on Apr 18, 2021

  1. [SPARK-35101][INFRA] Add GitHub status check in PR instead of a comment

    ### What changes were proposed in this pull request?
    
    TL;DR: it now shows a green/yellow/red status for tests instead of relying on a comment in a PR, **see HyukjinKwon#41 for an example**.
    
    This PR proposes GitHub status checks instead of a comment that links to the build (from the forked repository) in PRs.
    
    This is how it works:
    
    1. **forked repo**: the "Build and test" workflow is triggered when you create a branch to open a PR, which uses your own GitHub Actions resources.
    2. **main repo**: the "Notify test workflow" (which previously created a comment) now creates an in-progress (yellow) status as a GitHub Actions check on your current PR.
    3. **main repo**: the "Update build status workflow" regularly (every 15 mins) checks open PRs and updates the status of the GitHub Actions checks according to the status of the workflows in the forked repositories (status sync).
    
    **NOTE** that creating/updating statuses in the PRs is only allowed from the main repo. That's why the flow is as above.
    
    ### Why are the changes needed?
    
    The GitHub status shows green even while the tests are still running, which is confusing.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested at:
    - HyukjinKwon#41
    - HyukjinKwon#42
    - HyukjinKwon#43
    - HyukjinKwon#37
    
    **queued**:
    <img width="861" alt="Screen Shot 2021-04-16 at 10 56 03 AM" src="https://user-images.githubusercontent.com/6477701/114960831-c9a73080-9ea2-11eb-8442-ddf3f6008a45.png">
    
    **in progress**:
    <img width="871" alt="Screen Shot 2021-04-16 at 12 14 39 PM" src="https://user-images.githubusercontent.com/6477701/114966359-59ea7300-9ead-11eb-98cb-1e63323980ad.png">
    
    **passed**:
    ![Screen Shot 2021-04-16 at 2 04 07 PM](https://user-images.githubusercontent.com/6477701/114974045-a12c3000-9ebc-11eb-9be5-653393a863e6.png)
    
    **failure**:
    ![Screen Shot 2021-04-16 at 10 46 10 PM](https://user-images.githubusercontent.com/6477701/115033584-90ec7300-9f05-11eb-8f2e-0fc2ef986a70.png)
    
    Closes #32193 from HyukjinKwon/update-checks-pr-poc.
    
    Lead-authored-by: HyukjinKwon <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Co-authored-by: Yikun Jiang <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    HyukjinKwon and Yikun committed Apr 18, 2021
    Commit: 2bdb26b
  2. [MINOR][INFRA] Upgrade Jira client to 2.0.0

    ### What changes were proposed in this pull request?
    
    SPARK-10498 added the initial Jira client requirement with 1.0.3 five years ago (January 2016). As of today, it causes a `dev/merge_spark_pr.py` failure with Python 3.9.4 due to this old dependency. This PR aims to upgrade it to the latest version, 2.0.0. The latest version is also a little old (July 2018).
    - https://pypi.org/project/jira/#history
    
    ### Why are the changes needed?
    
    `Jira==2.0.0` works well with both Python 3.8/3.9 while `Jira==1.0.3` fails with Python 3.9.
    
    **BEFORE**
    ```
    $ pyenv global 3.9.4
    $ pip freeze | grep jira
    jira==1.0.3
    $ dev/merge_spark_pr.py
    Traceback (most recent call last):
      File "/Users/dongjoon/APACHE/spark-merge/dev/merge_spark_pr.py", line 39, in <module>
        import jira.client
      File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/__init__.py", line 5, in <module>
        from .config import get_jira
      File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/config.py", line 17, in <module>
        from .client import JIRA
      File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/client.py", line 165
        validate=False, get_server_info=True, async=False, logging=True, max_retries=3):
                                              ^
    SyntaxError: invalid syntax
    ```
    
    **AFTER**
    ```
    $ pip install jira==2.0.0
    $ dev/merge_spark_pr.py
    git rev-parse --abbrev-ref HEAD
    Which pull request would you like to merge? (e.g. 34):
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a committer-only script.
    
    ### How was this patch tested?
    
    Manually.
    
    Closes #32215 from dongjoon-hyun/jira.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    dongjoon-hyun authored and HyukjinKwon committed Apr 18, 2021
    Commit: 7f6dee8
  3. [SPARK-35116][SQL][TESTS] The generated data fits the precision of DayTimeIntervalType in Spark
    
    ### What changes were proposed in this pull request?
    The precision of `java.time.Duration` is nanoseconds, but when it is used as `DayTimeIntervalType` in Spark, it is microseconds.
    At present, the `DayTimeIntervalType` data generated by `RandomDataGenerator` is accurate to the nanosecond, so converting such a value to long and back to `DayTimeIntervalType` loses precision and makes the test fail. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137390/testReport/org.apache.spark.sql.hive.execution/HashAggregationQueryWithControlledFallbackSuite/udaf_with_all_data_types/
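
    A small sketch of the precision mismatch (plain Java time API, nothing Spark-specific):

    ```scala
    import java.time.Duration
    import java.time.temporal.ChronoUnit

    // A generated Duration with nanosecond precision does not survive the
    // round trip through microseconds that DayTimeIntervalType implies.
    val d = Duration.ofSeconds(1, 123456789)        // 1.123456789s
    val micros = d.toNanos / 1000                   // stored as microseconds
    val back = Duration.of(micros, ChronoUnit.MICROS)
    assert(back != d) // 1.123456s != 1.123456789s, hence the test failures
    // The fix: generate test data directly at microsecond precision.
    ```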
    
    ### Why are the changes needed?
    Improve `RandomDataGenerator` so that the generated data fits the precision of `DayTimeIntervalType` in Spark.
    
    ### Does this PR introduce _any_ user-facing change?
    'No'. Just change the test class.
    
    ### How was this patch tested?
    Jenkins test.
    
    Closes #32212 from beliefer/SPARK-35116.
    
    Authored-by: beliefer <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    beliefer authored and MaxGekk committed Apr 18, 2021
    Commit: 03191e8
  4. [SPARK-35114][SQL][TESTS] Add checks for ANSI intervals to `LiteralExpressionSuite`
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to add additional checks for ANSI interval types `YearMonthIntervalType` and `DayTimeIntervalType` to `LiteralExpressionSuite`.
    
    Also, I replaced some long literal values with `CalendarInterval` values so the tests check `CalendarIntervalType` as they were supposed to.
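
    A sketch of the kind of check added, assuming `Literal.apply`'s `Period`/`Duration` support for the new ANSI interval types (`checkEvaluation` is the suite's test helper):

    ```scala
    import java.time.{Duration, Period}
    import org.apache.spark.sql.catalyst.expressions.Literal

    val ym = Literal(Period.ofMonths(13))             // YearMonthIntervalType
    val dt = Literal(Duration.ofDays(1).plusHours(1)) // DayTimeIntervalType
    // checkEvaluation(ym, Period.ofMonths(13))
    // checkEvaluation(dt, Duration.ofDays(1).plusHours(1))
    ```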
    
    ### Why are the changes needed?
    To improve test coverage and have the same checks for ANSI types as for `CalendarIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the modified test suite:
    ```
    $ build/sbt "test:testOnly *LiteralExpressionSuite"
    ```
    
    Closes #32213 from MaxGekk/interval-literal-tests.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 18, 2021
    Commit: d04b467
  5. [SPARK-34716][SQL] Support ANSI SQL intervals by the aggregate function `sum`
    
    ### What changes were proposed in this pull request?
    Extend the `Sum` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.
    
    Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.
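
    A hypothetical end-to-end usage in the Spark shell (column name made up), assuming the `Duration` encoder added for ANSI intervals:

    ```scala
    import java.time.Duration
    import spark.implicits._

    val df = Seq(Duration.ofDays(1), Duration.ofHours(12)).toDF("d")
    // Sums to a 1.5-day day-time interval; overflow raises an exception
    // regardless of spark.sql.ansi.enabled, per the note above.
    df.selectExpr("sum(d)").show()
    ```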
    
    ### Why are the changes needed?
    Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Sum` to support `DayTimeIntervalType` and `YearMonthIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    'No'.
    Should not since new types have not been released yet.
    
    ### How was this patch tested?
    Jenkins test
    
    Closes #32107 from beliefer/SPARK-34716.
    
    Lead-authored-by: gengjiaan <[email protected]>
    Co-authored-by: beliefer <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    3 people authored and MaxGekk committed Apr 18, 2021
    Commit: 12abfe7
  6. [SPARK-35115][SQL][TESTS] Check ANSI intervals in `MutableProjectionSuite`
    
    ### What changes were proposed in this pull request?
    Add checks for `YearMonthIntervalType` and `DayTimeIntervalType` to `MutableProjectionSuite`.
    
    ### Why are the changes needed?
    To improve test coverage and have the same checks as for `CalendarIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the modified test suite:
    ```
    $ build/sbt "test:testOnly *MutableProjectionSuite"
    ```
    
    Closes #32225 from MaxGekk/test-ansi-intervals-in-MutableProjectionSuite.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    MaxGekk authored and maropu committed Apr 18, 2021
    Commit: 074f770

Commits on Apr 19, 2021

  1. [SPARK-35092][UI] the auto-generated rdd's name in the storage tab should be truncated if it is too long
    
    ### What changes were proposed in this pull request?
    The auto-generated RDD's name in the storage tab should be truncated to a single line if it is too long.
    
    ### Why are the changes needed?
    To make the UI display more friendly.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Just a simple modification in CSS; manual tests work well, as shown below.
    
    Before the change:
    ![the rdd title in storage page shows too long](https://user-images.githubusercontent.com/52202080/115009655-17da2500-9edf-11eb-86a7-088bed7ef8f7.png)
    
    After the change, the title needs just one line:
    
    ![storage title truncated after the change](https://user-images.githubusercontent.com/52202080/114872091-8c07c080-9e2c-11eb-81a8-0c097b1a77bf.png)
    
    Closes #32191 from kyoty/storage-rdd-titile-display-improve.
    
    Authored-by: kyoty <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    echohlne authored and sarutak committed Apr 19, 2021
    Commit: 978cd0b
  2. [SPARK-35109][SQL] Fix minor exception messages of HashedRelation and HashJoin
    
    ### What changes were proposed in this pull request?
    
    It seems that we missed classifying one `SparkOutOfMemoryError` in `HashedRelation`. Add the error classification for it. In addition, clean up two error definitions in `HashJoin`, as they are not used.
    
    ### Why are the changes needed?
    
    Better error classification.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32211 from c21/error-message.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    c21 authored and maropu committed Apr 19, 2021
    Commit: fd08c93
  3. [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
    
    ### What changes were proposed in this pull request?
    This PR:
    - Adds a new expression `GroupingExprRef` that can be used in aggregate expressions of `Aggregate` nodes to refer to grouping expressions by index. These expressions capture the data type and nullability of the referred grouping expression.
    - Adds a new rule `EnforceGroupingReferencesInAggregates` that inserts the references at the beginning of the optimization phase.
    - Adds a new rule `UpdateGroupingExprRefNullability` to update the nullability of `GroupingExprRef` expressions, as the nullability of the referred grouping expression can change during optimization.
    
    ### Why are the changes needed?
    If the aggregate expressions (without aggregate functions) in an `Aggregate` node are complex, then the `Optimizer` can optimize grouping expressions out of them, making the aggregate expressions invalid.
    
    Here is a simple example:
    ```
    SELECT not(t.id IS NULL) , count(*)
    FROM t
    GROUP BY t.id IS NULL
    ```
    In this case the `BooleanSimplification` rule does this:
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
    !Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
     +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
        +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
    ```
    where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.
    
    Before this PR:
    ```
    == Optimized Logical Plan ==
    Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
    +- Project [value#219 AS id#222]
       +- LocalRelation [value#219]
    ```
    and running the query throws an error:
    ```
    Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
    java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
    ```
    
    After this PR:
    ```
    == Optimized Logical Plan ==
    Aggregate [isnull(id#222)], [NOT groupingexprref(0) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
    +- Project [value#219 AS id#222]
       +- LocalRelation [value#219]
    ```
    and the query works.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the query works.
    
    ### How was this patch tested?
    Added new UT.
    
    Closes #31913 from peter-toth/SPARK-34581-keep-grouping-expressions.
    
    Authored-by: Peter Toth <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    peter-toth authored and cloud-fan committed Apr 19, 2021
    Commit: c8d78a7
  4. [SPARK-35122][SQL] Migrate CACHE/UNCACHE TABLE to use AnalysisOnlyCommand
    
    ### What changes were proposed in this pull request?
    
    Now that `AnalysisOnlyCommand` is introduced in #32032, `CacheTable` and `UncacheTable` can extend `AnalysisOnlyCommand` to simplify the code base. For example, the logic to handle these commands such that the tables are only analyzed is scattered across different places.
    
    ### Why are the changes needed?
    
    To simplify the code base to handle these two commands.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, just internal refactoring.
    
    ### How was this patch tested?
    
    The existing tests (e.g., `CachedTableSuite`) cover the changes in this PR. For example, if I make `CacheTable`/`UncacheTable` extend `LeafCommand`, there are a few failures in `CachedTableSuite`.
    
    Closes #32220 from imback82/cache_cmd_analysis_only.
    
    Authored-by: Terry Kim <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    imback82 authored and cloud-fan committed Apr 19, 2021
    Commit: 7a06cdd
  5. [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform
    
    ### What changes were proposed in this pull request?
    Support ArrayType/MapType/StructType data in no-serde mode script transform.
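
    A hypothetical illustration in the Spark shell (table and column names are made up):

    ```scala
    // Complex-typed columns can now flow through a no-serde TRANSFORM;
    // they are converted to strings on the way to the script.
    spark.sql(
      """SELECT TRANSFORM(arr_col, map_col, struct_col)
        |  USING 'cat' AS (a string, b string, c string)
        |FROM complex_tbl""".stripMargin)
    ```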
    
    ### Why are the changes needed?
    Let users process array/map/struct data.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, users can process array/map/struct data in script transform `no-serde` mode.
    
    ### How was this patch tested?
    Added UT
    
    Closes #30957 from AngersZhuuuu/SPARK-31937.
    
    Lead-authored-by: Angerszhuuuu <[email protected]>
    Co-authored-by: angerszhu <[email protected]>
    Co-authored-by: AngersZhuuuu <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    AngersZhuuuu authored and HyukjinKwon committed Apr 19, 2021
    Commit: a74f601
  6. [SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size
    
    ### What changes were proposed in this pull request?
    
    This PR makes the input buffer configurable (as an internal configuration). This is mainly to work around the regression in uniVocity/univocity-parsers#449.
    
    This is particularly useful for SQL workloads that require rewriting the `CREATE TABLE` with options.
    
    ### Why are the changes needed?
    
    To work around uniVocity/univocity-parsers#449.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's an internal-only option.
    
    ### How was this patch tested?
    
    Manually tested by modifying the unittest added in #31858 as below:
    
    ```diff
    diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
    index fd25a79619d..705f38dbfbd 100644
    --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
    +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
     -2456,6 +2456,7  abstract class CSVSuite
       test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
         val bufSize = 128
         val line = "X" * (bufSize - 1) + "| |"
    +    spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128)
         withTempPath { path =>
           Seq(line).toDF.write.text(path.getAbsolutePath)
           assert(spark.read.format("csv")
    ```
    
    Closes #32231 from HyukjinKwon/SPARK-35045-followup.
    
    Authored-by: HyukjinKwon <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    HyukjinKwon committed Apr 19, 2021
    Commit: 70b606f
  7. [SPARK-34837][SQL] Support ANSI SQL intervals by the aggregate function `avg`
    
    ### What changes were proposed in this pull request?
    Extend the `Average` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.
    
    Note: the expressions can throw the overflow exception independently from the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in the ANSI mode for the intervals.
    
    ### Why are the changes needed?
    Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Average` to support `DayTimeIntervalType` and `YearMonthIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    'No'.
    Should not since new types have not been released yet.
    
    ### How was this patch tested?
    Jenkins test
    
    Closes #32229 from beliefer/SPARK-34837.
    
    Authored-by: gengjiaan <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    beliefer authored and MaxGekk committed Apr 19, 2021
    Commit: 8dc455b
  8. [SPARK-35107][SQL] Parse unit-to-unit interval literals to ANSI intervals
    
    ### What changes were proposed in this pull request?
    Parse the year-month interval literals like `INTERVAL '1-1' YEAR TO MONTH` to values of `YearMonthIntervalType`, and day-time interval literals to `DayTimeIntervalType` values. Currently, Spark SQL supports:
    - DAY TO HOUR
    - DAY TO MINUTE
    - DAY TO SECOND
    - HOUR TO MINUTE
    - HOUR TO SECOND
    - MINUTE TO SECOND
    
    All such interval literals are converted to `DayTimeIntervalType`, and `YEAR TO MONTH` to `YearMonthIntervalType`, while losing info about the `from` and `to` units.
    
    **Note**: the new behavior is under the SQL config `spark.sql.legacy.interval.enabled`, which is `false` by default. When the config is set to `true`, the interval literals are parsed to `CalendarIntervalType` values.
    
    Closes #32176
    
    ### Why are the changes needed?
    To conform the ANSI SQL standard which assumes conversions of interval literals to year-month or day-time interval but not to mixed interval type like Catalyst's `CalendarIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    Before:
    ```sql
    spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
    1 days 1 hours 2 minutes 3.123 seconds
    spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
    interval
    ```
    
    After:
    ```sql
    spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
    1 01:02:03.123000000
    spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
    day-time interval
    ```
    
    ### How was this patch tested?
    1. By running the affected test suites:
    ```
    $ ./build/sbt "test:testOnly *.ExpressionParserSuite"
    $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
    $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z create_view.sql"
    $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
    $ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
    ```
    2. PostgreSQL tests are executed with `spark.sql.legacy.interval.enabled` set to `true` to keep compatibility with PostgreSQL output:
    ```sql
    > SELECT interval '999' second;
    0 years 0 mons 0 days 0 hours 16 mins 39.00 secs
    ```
    
    Closes #32209 from MaxGekk/parse-ansi-interval-literals.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 19, 2021
    Commit: 1d1ed3e
  9. [SPARK-34715][SQL][TESTS] Add round trip tests for period <-> month and duration <-> micros
    
    ### What changes were proposed in this pull request?
    Similarly to the test from the PR #31799, add tests (see the sketch after this list):
    1. Months -> Period -> Months
    2. Period -> Months -> Period
    3. Duration -> micros -> Duration
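
    A sketch of the round trips being tested, assuming the `IntervalUtils` converters used elsewhere in Catalyst:

    ```scala
    import org.apache.spark.sql.catalyst.util.IntervalUtils._

    val months = 13
    assert(periodToMonths(monthsToPeriod(months)) == months)     // tests 1 and 2
    val micros = 1234567L
    assert(durationToMicros(microsToDuration(micros)) == micros) // test 3
    ```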
    
    ### Why are the changes needed?
    Add round trip tests for period <-> month and duration <-> micros
    
    ### Does this PR introduce _any_ user-facing change?
    'No'. Just test cases.
    
    ### How was this patch tested?
    Jenkins test
    
    Closes #32234 from beliefer/SPARK-34715.
    
    Authored-by: gengjiaan <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    beliefer authored and MaxGekk committed Apr 19, 2021
    Commit: 7f34035
  10. [SPARK-35125][K8S] Upgrade K8s client to 5.3.0 to support K8s 1.20

    ### What changes were proposed in this pull request?
    
    Although the AS-IS master branch already works with K8s 1.20, this PR aims to upgrade the K8s client to 5.3.0 to support K8s 1.20 officially.
    - https://github.com/fabric8io/kubernetes-client#compatibility-matrix
    
    The following are the notable breaking API changes.
    
    1. Remove Doneable (5.0+):
        - fabric8io/kubernetes-client#2571
    2. Change Watcher.onClose signature (5.0+):
        - fabric8io/kubernetes-client#2616
    3. Change Readiness (5.1+)
        - fabric8io/kubernetes-client#2796
    
    ### Why are the changes needed?
    
    According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this is a dev dependency change which affects K8s cluster extension users.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    This is manually tested with K8s IT.
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    - Launcher client dependencies
    - SPARK-33615: Launcher client archives
    - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
    - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
    - Launcher python client dependencies using a zip file
    - Test basic decommissioning
    - Test basic decommissioning with shuffle cleanup
    - Test decommissioning with dynamic allocation & shuffle cleanups
    - Test decommissioning timeouts
    - Run SparkR on simple dataframe.R example
    Run completed in 17 minutes, 44 seconds.
    Total number of tests run: 27
    Suites: completed 2, aborted 0
    Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Closes #32221 from dongjoon-hyun/SPARK-K8S-530.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed Apr 19, 2021
    Commit: 425dc58
  11. [SPARK-35102][SQL] Make spark.sql.hive.version read-only, not deprecated and meaningful
    
    ### What changes were proposed in this pull request?
    
    Firstly let's take a look at the definition and comment.
    
    ```
    // A fake config which is only here for backward compatibility reasons. This config has no effect
    // to Spark, just for reporting the builtin Hive version of Spark to existing applications that
    // already rely on this config.
    val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
      .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive version in Spark.")
      .version("1.1.1")
      .fallbackConf(HIVE_METASTORE_VERSION)
    ```
    It is used for reporting the built-in Hive version, but the current status is unsatisfactory, as it could be changed in many ways, e.g. via --conf/SET syntax.
    
    It is marked as deprecated but has been kept all the way until now. I guess it is hard for us to remove it, and not even necessary.
    
    On second thought, it's actually good for us to keep it working with `spark.sql.hive.metastore.version`. When `spark.sql.hive.metastore.version` is changed, this config can statically report the compiled Hive version, which is useful when an error occurs in that case. So this parameter should be fixed to the compiled Hive version.
    
    ### Why are the changes needed?
    
    `spark.sql.hive.version` is useful in certain cases and should be read-only
    
    ### Does this PR introduce _any_ user-facing change?
    
    `spark.sql.hive.version` is now read-only
    
    ### How was this patch tested?
    
    new test cases
    
    Closes #32200 from yaooqinn/SPARK-35102.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    yaooqinn authored and cloud-fan committed Apr 19, 2021
    Commit: 2d161cb
  12. [SPARK-35136] Remove initial null value of LiveStage.info

    ### What changes were proposed in this pull request?
    To prevent potential NullPointerExceptions, this PR changes the `LiveStage` constructor to take `info` as a constructor parameter and adds a null check in `AppStatusListener.activeStages`.
    
    ### Why are the changes needed?
    `AppStatusListener.getOrCreateStage` would create a `LiveStage` object with the `info` field set to null and, right after that, set it to a specific `StageInfo` object. This can lead to a race condition when the live stages are read in between those calls, which could then result in a null pointer exception in, for instance, `AppStatusListener.activeStages`.
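
    A minimal sketch of the shape of the fix (not the actual class definition):

    ```scala
    import org.apache.spark.scheduler.StageInfo

    // Before: class LiveStage { var info: StageInfo = null; ... }
    // After: info is a constructor parameter, so a LiveStage can never be
    // observed without a StageInfo attached.
    class LiveStage(var info: StageInfo)
    ```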
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Regular CI/CD tests
    
    Closes #32233 from sander-goos/SPARK-35136-livestage.
    
    Authored-by: Sander Goos <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    sander-goos authored and cloud-fan committed Apr 19, 2021
    Commit: d37d18d
  13. [SPARK-35138][SQL] Remove Antlr4 workaround

    ### What changes were proposed in this pull request?
    
    Remove Antlr 4.7 workaround.
    
    ### Why are the changes needed?
    
    antlr/antlr4@ac9f7530 has been fixed upstream, so remove the workaround to simplify the code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing UTs.
    
    Closes #32238 from pan3793/antlr-minor.
    
    Authored-by: Cheng Pan <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    pan3793 authored and dongjoon-hyun committed Apr 19, 2021
    Commit: 0c2e9b9
  14. [SPARK-35120][INFRA] Guide users to sync branch and enable GitHub Actions in their forked repository
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to add messages when the workflow fails to find the workflow run in a forked repository, for example as below:
    
    **Before**
    
    ![Screen Shot 2021-04-19 at 9 41 52 PM](https://user-images.githubusercontent.com/6477701/115238011-28e19b00-a158-11eb-8c5c-6374ca1e9790.png)
    
    ![Screen Shot 2021-04-19 at 9 42 00 PM](https://user-images.githubusercontent.com/6477701/115237984-22ebba00-a158-11eb-9b0f-11fe11072830.png)
    
    **After**
    
    ![Screen Shot 2021-04-19 at 9 25 32 PM](https://user-images.githubusercontent.com/6477701/115237507-9c36dd00-a157-11eb-8ba7-f5f88caa1058.png)
    
    ![Screen Shot 2021-04-19 at 9 23 13 PM](https://user-images.githubusercontent.com/6477701/115236793-c2a84880-a156-11eb-98fc-1bb7d4bc31dd.png)
    (typo `foce` in the image was fixed)
    
    See this example: https://github.com/HyukjinKwon/spark/runs/2380644793
    
    ### Why are the changes needed?
    
    To guide users to enable Github Actions in their forked repositories (and sync their branch to the latest `master` in Apache Spark).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested in:
    - HyukjinKwon#47
    - HyukjinKwon#46
    
    Closes #32235 from HyukjinKwon/test-test-test.
    
    Authored-by: HyukjinKwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and dongjoon-hyun committed Apr 19, 2021
    Commit: dc7d41e
  15. [SPARK-35131][K8S] Support early driver service clean-up during app termination
    
    ### What changes were proposed in this pull request?
    
    This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up `Driver Service` resource during app termination.
    
    ### Why are the changes needed?
    
    The K8s service is one of the important resources and sometimes it's controlled by quota.
    ```
    $ k describe quota
    Name:       service
    Namespace:  default
    Resource    Used  Hard
    --------    ----  ----
    services    1     3
    ```
    
    Apache Spark creates a service for the driver whose lifecycle is the same as the driver pod's.
    It means a new Spark job submission fails if the number of completed Spark jobs reaches the service quota.
    
    **BEFORE**
    ```
    $ k get pod
    NAME                                                        READY   STATUS      RESTARTS   AGE
    org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver   0/1     Completed   0          31m
    org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver   0/1     Completed   0          78s
    
    $ k get svc
    NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
    kubernetes                                                      ClusterIP   10.96.0.1    <none>        443/TCP                      80m
    org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   31m
    org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   80s
    
    $ k describe quota
    Name:       service
    Namespace:  default
    Resource    Used  Hard
    --------    ----  ----
    services    3     3
    
    $ bin/spark-submit...
    Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException:
    Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services.
    Message: Forbidden! User minikube doesn't have permission.
    services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden:
    exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
    ```
    
    **AFTER**
    ```
    $ k get pod
    NAME                                                        READY   STATUS      RESTARTS   AGE
    org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver   0/1     Completed   0          26s
    org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver   0/1     Completed   0          67s
    org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver   0/1     Completed   0          44s
    
    $ k get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   172m
    
    $ k describe quota
    Name:       service
    Namespace:  default
    Resource    Used  Hard
    --------    ----  ----
    services    1     3
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default.
    The change is documented at the migration guide.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    This is tested with K8s IT manually.
    
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    - Launcher client dependencies
    - SPARK-33615: Launcher client archives
    - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
    - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
    - Launcher python client dependencies using a zip file
    - Test basic decommissioning
    - Test basic decommissioning with shuffle cleanup
    - Test decommissioning with dynamic allocation & shuffle cleanups
    - Test decommissioning timeouts
    - Run SparkR on simple dataframe.R example
    Run completed in 19 minutes, 9 seconds.
    Total number of tests run: 27
    Suites: completed 2, aborted 0
    Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Closes #32226 from dongjoon-hyun/SPARK-35131.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed Apr 19, 2021
    Commit: 00f06dd
  16. [SPARK-35103][SQL] Make TypeCoercion rules more efficient

    ## What changes were proposed in this pull request?
    This PR fixes a couple of things in TypeCoercion rules:
    - Only run the propagate-types step if the children of a node have output attributes with changed dataTypes and/or nullability. This is implemented as a custom tree transformation; the TypeCoercion rules now only implement a partial function.
    - Combine multiple type coercion rules into a single rule, so that multiple rules are applied in a single tree traversal.
    - Reduce calls to conf.get in DecimalPrecision. This now happens once per tree traversal, instead of once per matched expression.
    - Reduce the use of withNewChildren.
    
    This brings down the number of CPU cycles spent in analysis by ~28% (benchmark: 10 iterations of all TPC-DS queries on SF10).
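
    A toy model of the first optimization (none of this is the Catalyst API): each rule is a partial function, and the propagate-types step runs only for nodes whose children's output changed during the traversal.

    ```scala
    case class Node(tpe: String, children: List[Node])

    def coerce(n: Node, rule: PartialFunction[Node, Node]): Node = {
      val newChildren = n.children.map(coerce(_, rule))
      val childChanged = newChildren.zip(n.children).exists {
        case (nc, oc) => nc.tpe != oc.tpe // in Spark: dataType and/or nullability
      }
      val node = n.copy(children = newChildren)
      // Skip the propagation step entirely when nothing changed below.
      if (childChanged) rule.applyOrElse(node, identity[Node]) else node
    }
    ```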
    
    ## How was this patch tested?
    Existing tests.
    
    Closes #32208 from sigmod/coercion.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: herman <[email protected]>
    sigmod authored and hvanhovell committed Apr 19, 2021
    Commit: 9a6d773

Commits on Apr 20, 2021

  1. [SPARK-35117][UI] Change progress bar back to highlight ratio of tasks in progress
    
    ### What changes were proposed in this pull request?
    Small UI update to highlight the number of tasks in progress in a stage/job instead of highlighting the whole in-progress stage/job. This was the behavior before Spark 3.1 and the Bootstrap 4 upgrade.
    
    ### Why are the changes needed?
    
    To add back functionality lost between 3.0 and 3.1. This provides a great visual cue of how much of a stage/job is currently being run.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Small UI change.
    
    Before:
    ![image](https://user-images.githubusercontent.com/3536454/115216189-3fddaa00-a0d2-11eb-88e0-e3be925c92f0.png)
    
    After (and pre Spark 3.1):
    ![image](https://user-images.githubusercontent.com/3536454/115216216-48ce7b80-a0d2-11eb-9953-2adb3b377133.png)
    
    ### How was this patch tested?
    
    Updated existing UT.
    
    Closes #32214 from Kimahriman/progress-bar-started.
    
    Authored-by: Adam Binford <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    Kimahriman authored and sarutak committed Apr 20, 2021
    Commit: e55ff83
  2. [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated
    
    ### What changes were proposed in this pull request?
    This PR updated the `foundNonEqualCorrelatedPred` logic for correlated subqueries in `CheckAnalysis` to only allow correlated equality predicates that guarantee one-to-one mapping between inner and outer attributes, instead of all equality predicates.
    
    ### Why are the changes needed?
    To fix correctness bugs. Before this fix, Spark can give wrong results for certain correlated subqueries that pass CheckAnalysis:
    Example 1:
    ```sql
    create or replace view t1(c) as values ('a'), ('b')
    create or replace view t2(c) as values ('ab'), ('abc'), ('bc')
    
    select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1
    ```
    Correct results: [(a, 2), (b, 1)]
    Spark results:
    ```
    +---+-----------------+
    |c  |scalarsubquery(c)|
    +---+-----------------+
    |a  |1                |
    |a  |1                |
    |b  |1                |
    +---+-----------------+
    ```
    Example 2:
    ```sql
    create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
    create or replace view t2(c) as values (6);
    
    select c, (select count(*) from t1 where a + b = c) from t2;
    ```
    Correct results: [(6, 4)]
    Spark results:
    ```
    +---+-----------------+
    |c  |scalarsubquery(c)|
    +---+-----------------+
    |6  |1                |
    |6  |1                |
    |6  |1                |
    |6  |1                |
    +---+-----------------+
    ```
    ### Does this PR introduce _any_ user-facing change?
    Yes. Users will not be able to run queries that contain unsupported correlated equality predicates.
    
    ### How was this patch tested?
    Added unit tests.
    
    Closes #32179 from allisonwang-db/spark-35080-subquery-bug.
    
    Lead-authored-by: allisonwang-db <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    allisonwang-db and cloud-fan committed Apr 20, 2021
    bad4b6f
  3. [SPARK-35052][SQL] Use static bits for AttributeReference and Literal

    ### What changes were proposed in this pull request?
    
    - Share a static ImmutableBitSet for `treePatternBits` in all object instances of AttributeReference.
    - Share three static ImmutableBitSets for  `treePatternBits` in three kinds of Literals.
    - Add an ImmutableBitSet as a subclass of BitSet.
    
    ### Why are the changes needed?
    
    Reduce the additional memory usage caused by `treePatternBits`.
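    
    A minimal sketch of the sharing idea, using `java.util.BitSet` and a made-up pattern index for illustration (the real `ImmutableBitSet` subclasses Spark's own `BitSet`):
    
    ```scala
    import java.util.BitSet
    
    // An immutable view over BitSet: mutators are disabled, so one instance
    // can safely be shared by every object of a leaf expression class.
    class ImmutableBitSet(numBits: Int, bitsToSet: Int*) extends BitSet(numBits) {
      bitsToSet.foreach(b => super.set(b))
      override def set(bitIndex: Int): Unit =
        throw new UnsupportedOperationException("ImmutableBitSet is read-only")
      override def clear(bitIndex: Int): Unit =
        throw new UnsupportedOperationException("ImmutableBitSet is read-only")
    }
    
    object SharedPatternBits {
      // Computed once and shared by all AttributeReference instances, instead
      // of allocating a fresh BitSet per expression object. The index 7 is a
      // hypothetical pattern id, not the real enum ordinal.
      val attributeReferenceBits = new ImmutableBitSet(64, 7)
    }
    ```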
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32157 from sigmod/leaf.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed Apr 20, 2021
    f4926d1
  4. [SPARK-35134][BUILD][TESTS] Manually exclude redundant netty jars in SparkBuild.scala to avoid version conflicts in test
    
    ### What changes were proposed in this pull request?
    The following logs will print  when Jenkins execute [PySpark pip packaging tests](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137500/console):
    
    ```
    copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    ```
    
    Two different versions of Netty 4 jars are copied to the jars directory, even though the `netty-xxx-4.1.50.Final.jar` artifacts do not appear in the Maven `dependency:tree` output and Spark only needs to rely on `netty-all-xxx.jar`.
    
    So this PR adds new `ExclusionRule`s to `SparkBuild.scala` to exclude the unnecessary Netty 4 dependencies, for example as sketched below.
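    
    For illustration only, such an exclusion in sbt could look like this (the dependency shown and the exact module list are assumptions, not copied from the actual change):
    
    ```scala
    // In the sbt build definition: keep netty-all and drop the individual
    // netty-* modules that a transitive dependency would otherwise pull in.
    libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "3.2.0")
      .excludeAll(
        ExclusionRule("io.netty", "netty-buffer"),
        ExclusionRule("io.netty", "netty-codec"),
        ExclusionRule("io.netty", "netty-common"),
        ExclusionRule("io.netty", "netty-handler"),
        ExclusionRule("io.netty", "netty-resolver"),
        ExclusionRule("io.netty", "netty-transport")
      )
    ```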
    
    ### Why are the changes needed?
    Make sure that only `netty-all-xxx.jar` is used in the test to avoid possible jar conflicts.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    
    - Pass the Jenkins or GitHub Action
    - Check Jenkins log manually, there should be only
    
    `copying deps/jars/netty-all-4.1.51.Final.jar -> pyspark-3.2.0.dev0/deps/jars`
    
    and there should be no such logs as
    
    ```
    copying deps/jars/netty-buffer-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-codec-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-common-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-handler-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-resolver-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-transport-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    copying deps/jars/netty-transport-native-epoll-4.1.50.Final.jar -> pyspark-3.2.0.dev0/deps/jars
    ```
    
    Closes #32230 from LuciferYang/SPARK-35134.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    LuciferYang authored and HyukjinKwon committed Apr 20, 2021
    670c365
  5. [SPARK-35018][SQL][TESTS] Check transferring of year-month intervals via Hive Thrift server
    
    ### What changes were proposed in this pull request?
    1. Add a test to check that Thrift server is able to collect year-month intervals and transfer them via thrift protocol.
    2. Improve the similar test for day-time intervals. After the changes, the test doesn't depend on the result of date subtraction. The result type of date subtraction may change in the future, so this PR makes the test tolerant to such changes.
    
    ### Why are the changes needed?
    To improve test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the modified test suite:
    ```
    $ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
    ```
    
    Closes #32240 from MaxGekk/year-month-interval-thrift-protocol.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 20, 2021
    aa0d00d
  6. [SPARK-34974][SQL] Improve subquery decorrelation framework

    ### What changes were proposed in this pull request?
    This PR implements the decorrelation technique in the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper
    (http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It currently supports Filter, Project, Aggregate, Join, and UnaryNode that passes CheckAnalysis.
    
    This feature can be controlled by the config `spark.sql.optimizer.decorrelateInnerQuery.enabled` (default: true).
    
    A few notes:
    1. This PR does not relax any constraints in CheckAnalysis for correlated subqueries, even though some cases can be supported by this new framework, such as aggregate with correlated non-equality predicates. This PR focuses on adding the new framework and making sure all existing cases can be supported. Constraints can be relaxed gradually in the future via separate PRs.
    2. The new framework is only enabled for correlated scalar subqueries, as the first step. EXISTS/IN subqueries can be supported in the future.
    
    ### Why are the changes needed?
    Currently, Spark has limited support for correlated subqueries. It only allows `Filter` to reference outer query columns and does not support non-equality predicates when the subquery is aggregated. This new framework will allow more operators to host outer column references and support correlated non-equality predicates and more types of operators in correlated subqueries.
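    
    A hedged usage sketch (tables `t1` and `t2` are assumed to exist):
    
    ```scala
    // Enabled by default; set to false to fall back to the old rewrite.
    spark.conf.set("spark.sql.optimizer.decorrelateInnerQuery.enabled", "true")
    
    // A correlated scalar subquery that the new framework decorrelates.
    spark.sql("""
      SELECT c, (SELECT count(*) FROM t2 WHERE t2.c = t1.c) AS cnt
      FROM t1
    """).show()
    ```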
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing unit and SQL query tests and new optimizer plan tests.
    
    Closes #32072 from allisonwang-db/spark-34974-decorrelation.
    
    Authored-by: allisonwang-db <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    allisonwang-db authored and cloud-fan committed Apr 20, 2021
    b6bb24c
  7. [SPARK-35068][SQL] Add tests for ANSI intervals to HiveThriftBinaryServerSuite
    
    ### What changes were proposed in this pull request?
    After PR #32209, this is now possible.
    We can add test cases for ANSI intervals to HiveThriftBinaryServerSuite.
    
    ### Why are the changes needed?
    Add more test cases
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32250 from AngersZhuuuu/SPARK-35068.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 20, 2021
    b219e37
  8. [SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause

    ### What changes were proposed in this pull request?
    Add documentation for the `TRANSFORM` clause and related functions.
    
    ![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png)
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed
    
    Closes #31010 from AngersZhuuuu/SPARK-33976.
    
    Lead-authored-by: Angerszhuuuu <[email protected]>
    Co-authored-by: angerszhu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 20, 2021
    9c956ab
  9. [SPARK-34877][CORE][YARN] Add the code change for adding the Spark AM log link in spark UI
    
    ### What changes were proposed in this pull request?
    When running a Spark job on YARN in client deploy mode, the Spark driver and the Spark application master (AM) are launched in two separate containers. In various scenarios there is a need to see the Spark AM logs for resource allocation, decommissioning status, and other information shared between the YARN RM and the Spark AM.
    
    In cluster mode, the Spark driver and the Spark AM run in the same container, so the driver's log link is already available in the Spark UI.
    
    This PR adds the Spark AM log link for Spark jobs running in client mode on YARN. Instead of searching for the container id and then finding the logs, we can check directly in the Spark UI.
    
    This change only shows the AM log links in client mode when the resource manager is YARN.
    
    ### Why are the changes needed?
    Until now, the only way to check this was to find the container id of the AM and look up the logs either with the YARN utility or the YARN RM Application History server. With this PR, the AM log link is available directly in the Spark UI.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added the unit test also checked the Spark UI
    **In Yarn Client mode**
    Before Change
    
    ![image](https://user-images.githubusercontent.com/34540906/112644861-e1733200-8e6b-11eb-939b-c76ca9902a4e.png)
    
    After the Change - The AM info is there
    
    ![image](https://user-images.githubusercontent.com/34540906/115264198-b7075280-a153-11eb-98f3-2aed66ffad2a.png)
    
    AM Log
    
    ![image](https://user-images.githubusercontent.com/34540906/112645680-c0f7a780-8e6c-11eb-8b82-4ccc0aee927b.png)
    
    **In Yarn Cluster Mode**  - The AM log link will not be there
    
    ![image](https://user-images.githubusercontent.com/34540906/112649512-86900980-8e70-11eb-9b37-69d5c4b53ffa.png)
    
    Closes #31974 from SaurabhChawla100/SPARK-34877.
    
    Authored-by: SaurabhChawla <[email protected]>
    Signed-off-by: Thomas Graves <[email protected]>
    SaurabhChawla100 authored and tgravescs committed Apr 20, 2021
    1e64b4f
  10. [SPARK-34035][SQL] Refactor ScriptTransformation to remove input parameter and replace it by child.output
    
    ### What changes were proposed in this pull request?
    Refactor ScriptTransformation to remove input parameter and replace it by child.output
    
    ### Why are the changes needed?
    refactor code
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UTs
    
    Closes #32228 from AngersZhuuuu/SPARK-34035.
    
    Lead-authored-by: Angerszhuuuu <[email protected]>
    Co-authored-by: AngersZhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 20, 2021
    3614448
  11. [SPARK-34338][SQL] Report metrics from Datasource v2 scan

    ### What changes were proposed in this pull request?
    
    This patch proposes to leverage `CustomMetric`, `CustomTaskMetric` API to report custom metrics from DS v2 scan to Spark.
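    
    A hedged sketch of the two interfaces (the class and metric names are made up; package and method shapes follow the API as it appears in released Spark versions):
    
    ```scala
    import org.apache.spark.sql.connector.metric.{CustomMetric, CustomTaskMetric}
    
    // Driver-side metric: aggregates per-task values into a display string.
    class BytesReadMetric extends CustomMetric {
      override def name(): String = "bytesRead"
      override def description(): String = "number of bytes read by this scan"
      override def aggregateTaskMetrics(taskMetrics: Array[Long]): String =
        s"total: ${taskMetrics.sum} bytes"
    }
    
    // Task-side metric: reported by each partition reader at runtime.
    class BytesReadTaskMetric(bytes: Long) extends CustomTaskMetric {
      override def name(): String = "bytesRead"
      override def value(): Long = bytes
    }
    ```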
    
    ### Why are the changes needed?
    
    This is related to #31398. In SPARK-34297, we want to add a couple of metrics when reading from Kafka in SS. We need some public API changes in DS v2 to make it possible. This PR extracts only the DS v2 change and makes it general for DS v2 instead of the micro-batch DS v2 API.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test.
    
    Implement a simple test DS v2 class locally and run it:
    
    ```scala
    scala> import org.apache.spark.sql.execution.datasources.v2._
    import org.apache.spark.sql.execution.datasources.v2._
    
    scala> classOf[CustomMetricDataSourceV2].getName
    res0: String = org.apache.spark.sql.execution.datasources.v2.CustomMetricDataSourceV2
    
    scala> val df = spark.read.format(res0).load()
    df: org.apache.spark.sql.DataFrame = [i: int, j: int]
    
    scala> df.collect
    ```
    
    <img width="703" alt="Screen Shot 2021-03-30 at 11 07 13 PM" src="https://user-images.githubusercontent.com/68855/113098080-d8a49800-91ac-11eb-8681-be408a0f2e69.png">
    
    Closes #31451 from viirya/dsv2-metrics.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    viirya authored and cloud-fan committed Apr 20, 2021
    eb9a439
  12. [SPARK-35145][SQL] CurrentOrigin should support nested invoking

    ### What changes were proposed in this pull request?
    
    `CurrentOrigin` is a thread-local variable to track the original SQL line position in plan/expression. Usually, we set `CurrentOrigin`, create `TreeNode` instances, and reset `CurrentOrigin`.
    
    This PR updates the last step to set `CurrentOrigin` to its previous value, instead of resetting it. This is necessary when we invoke `CurrentOrigin` in a nested way, like with subqueries.
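    
    A minimal sketch of the save-and-restore pattern (simplified: the real `CurrentOrigin` tracks a full origin, not just a line number):
    
    ```scala
    object CurrentOriginSketch {
      private val value = new ThreadLocal[Option[Int]] {
        override def initialValue(): Option[Int] = None
      }
    
      def get: Option[Int] = value.get()
    
      // Restore the *previous* origin instead of resetting to the initial
      // value, so nested calls (e.g. for subqueries) keep the outer position.
      def withOrigin[A](line: Int)(body: => A): A = {
        val previous = value.get()
        value.set(Some(line))
        try body finally value.set(previous)
      }
    }
    ```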
    
    ### Why are the changes needed?
    
    To keep the original SQL line position in the error message in more cases.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, only minor error message changes.
    
    ### How was this patch tested?
    
    existing tests
    
    Closes #32249 from cloud-fan/origin.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan committed Apr 20, 2021
    e08c40f
  13. [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode

    ### What changes were proposed in this pull request?
    
    In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths.
    
    ### Why are the changes needed?
    
    SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at https://github.com/apache/spark/blob/b26e7b510bbaee63c4095ab47e75ff2a70e377d7/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1280. However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it.
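    
    A hedged usage sketch, assuming an active SparkContext `sc` (the path and coordinates are illustrative):
    
    ```scala
    // Submitted with something like:
    //   spark-submit --deploy-mode cluster \
    //     --conf spark.jars.ivySettings=/path/to/ivysettings.xml ...
    // With this change, the settings file is shipped to the YARN driver, so an
    // Ivy-scheme addJar resolves through the custom settings there as well.
    sc.addJar("ivy://org.apache.commons:commons-lang3:3.12.0")
    ```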
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find and load the `ivySettings` file.
    
    Closes #31591 from shardulm94/SPARK-34472.
    
    Authored-by: Shardul Mahadik <[email protected]>
    Signed-off-by: Thomas Graves <[email protected]>
    shardulm94 authored and tgravescs committed Apr 20, 2021
    83f753e
  14. [SPARK-35153][SQL] Make textual representation of ANSI interval operators more readable
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to override the `sql` and `toString` methods of the expressions that implement operators over ANSI intervals (`YearMonthIntervalType`/`DayTimeIntervalType`), and replace internal expression class names by operators like `*`, `/` and `-`.
    
    ### Why are the changes needed?
    The proposed methods should make the textual representation of such operators more readable, and potentially parsable by the Spark SQL parser.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. This can influence column names.
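    
    For example (the exact rendering is illustrative, not copied from the PR):
    
    ```scala
    // Before, the generated column name leaked internal expression class names;
    // after this change it reads like the operator itself, along the lines of
    // "(INTERVAL '1-2' YEAR TO MONTH * 3)".
    spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH * 3").columns.foreach(println)
    ```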
    
    ### How was this patch tested?
    By running existing test suites for interval and datetime expressions, and re-generating the `*.sql` tests:
    ```
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
    ```
    
    Closes #32262 from MaxGekk/interval-operator-sql.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 20, 2021
    e8d6992
  15. [SPARK-35132][BUILD][CORE] Upgrade netty-all to 4.1.63.Final

    ### What changes were proposed in this pull request?
    Three CVEs were found in netty after 4.1.51.Final, as follows:
    
    - [CVE-2021-21409](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21409)
    - [CVE-2021-21295](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21295)
    - [CVE-2021-21290](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-21290)
    
    So the main change of this PR is to upgrade netty-all to 4.1.63.Final to avoid these potential risks.
    
    Another change is to clean up deprecated API usage: [tiny caches have been merged into small caches](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L447-L455) (after [netty#10267](netty/netty#10267)), so the [PooledByteBufAllocator(boolean, int, int, int, int, int, int, boolean, int)](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L227-L239) constructor should be used to create a `PooledByteBufAllocator`, as sketched below.
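    
    A hedged sketch of the constructor switch (default values taken from Netty's own accessors; a cache alignment of 0 is an assumption):
    
    ```scala
    import io.netty.buffer.PooledByteBufAllocator
    
    // Tiny caches were merged into the small caches, so the 9-argument
    // constructor without a tiny-cache-size parameter is used.
    val allocator = new PooledByteBufAllocator(
      true,                                                  // preferDirect
      PooledByteBufAllocator.defaultNumHeapArena(),          // nHeapArena
      PooledByteBufAllocator.defaultNumDirectArena(),        // nDirectArena
      PooledByteBufAllocator.defaultPageSize(),              // pageSize
      PooledByteBufAllocator.defaultMaxOrder(),              // maxOrder
      PooledByteBufAllocator.defaultSmallCacheSize(),        // smallCacheSize
      PooledByteBufAllocator.defaultNormalCacheSize(),       // normalCacheSize
      PooledByteBufAllocator.defaultUseCacheForAllThreads(), // useCacheForAllThreads
      0                                                      // directMemoryCacheAlignment
    )
    ```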
    
    ### Why are the changes needed?
    Upgrade netty-all to 4.1.63.Final to avoid CVE problems.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes #32227 from LuciferYang/SPARK-35132.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    LuciferYang authored and srowen committed Apr 20, 2021
    c7e18ad

Commits on Apr 21, 2021

  1. [SPARK-35044][SQL][FOLLOWUP][TEST-HADOOP2.7] Fix hadoop 2.7 test due to diff between hadoop 2.7 and hadoop 3
    
    ### What changes were proposed in this pull request?
    
    dfs.replication is inconsistent between Hadoop 2.x and 3.x, so in this PR we use `dfs.hosts` to verify, per #32144 (comment)
    
    ```
    == Results ==
    !== Correct Answer - 1 ==        == Spark Answer - 1 ==
    !struct<>                        struct<key:string,value:string>
    ![dfs.replication,<undefined>]   [dfs.replication,3]
    ```
    
    ### Why are the changes needed?
    
    fix Jenkins job with Hadoop 2.7
    
    ### Does this PR introduce _any_ user-facing change?
    
    test only change
    ### How was this patch tested?
    
    test only change
    
    Closes #32263 from yaooqinn/SPARK-35044-F.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    yaooqinn authored and HyukjinKwon committed Apr 21, 2021
    81c3cc2
  2. [SPARK-35113][SQL] Support ANSI intervals in the Hash expression

    ### What changes were proposed in this pull request?
    Support ANSI interval in HashExpression and add UT
    
    ### Why are the changes needed?
    Support ANSI interval in HashExpression
    
    ### Does this PR introduce _any_ user-facing change?
    Users can pass ANSI intervals to the hash expression functions.
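    
    For example (a hedged sketch; the literal values are arbitrary):
    
    ```scala
    // hash() is backed by a HashExpression and now accepts ANSI intervals.
    spark.sql(
      "SELECT hash(INTERVAL '1-2' YEAR TO MONTH), " +
      "hash(INTERVAL '1 02:03:04' DAY TO SECOND)"
    ).show()
    ```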
    
    ### How was this patch tested?
    Added UT
    
    Closes #32259 from AngersZhuuuu/SPARK-35113.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 21, 2021
    d259f93
  3. [SPARK-35120][INFRA][FOLLOW-UP] Try catch an error to show the correct guidance
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to handle 404 not found, see https://github.com/apache/spark/pull/32255/checks?check_run_id=2390446579 as an example.
    
    If a fork does not have any previous workflow runs, it seems to throw a 404 error instead of returning empty runs.
    
    ### Why are the changes needed?
    
    To show the correct guidance to contributors.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested at HyukjinKwon#48. See https://github.com/HyukjinKwon/spark/runs/2391469416 as an example.
    
    Closes #32258 from HyukjinKwon/SPARK-35120-followup.
    
    Authored-by: HyukjinKwon <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    HyukjinKwon authored and gengliangwang committed Apr 21, 2021
    97ec57e
  4. [SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config
    
    ### What changes were proposed in this pull request?
    
    As a part of SPARK-26837, pruning of nested fields from object serializers is supported. But it missed handling the case-insensitive nature of Spark.
    
    In this PR, I have resolved the column names to be pruned based on the `spark.sql.caseSensitive` config.
    **Exception Before Fix**
    
    ```
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
      at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
      at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
      at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
      at scala.collection.immutable.List.foreach(List.scala:392)
      at scala.collection.TraversableLike.map(TraversableLike.scala:238)
      at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
      at scala.collection.immutable.List.map(List.scala:298)
      at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
      at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
      at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
      at
    ```
    
    ### Why are the changes needed?
    After upgrading to Spark 3, the `foreachBatch` API throws `java.lang.ArrayIndexOutOfBoundsException`. This issue is fixed by this PR.
    
    ### Does this PR introduce _any_ user-facing change?
    No. In fact, it fixes a regression.
    
    ### How was this patch tested?
    Added tests and also verified manually.
    
    Closes #32194 from sandeep-katta/SPARK-35096.
    
    Authored-by: sandeep.katta <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    sandeep-katta authored and cloud-fan committed Apr 21, 2021
    4f309ce
  5. [SPARK-35152][SQL] ANSI mode: IntegralDivide throws exception on overflow
    
    ### What changes were proposed in this pull request?
    
    IntegralDivide should throw an exception on overflow in ANSI mode.
    There is only one case that can cause that:
    ```
    Long.MinValue div -1
    ```
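    
    A hedged illustration (Long.MinValue is spelled via subtraction to keep the literal in range):
    
    ```scala
    spark.conf.set("spark.sql.ansi.enabled", "true")
    
    // Long.MinValue has no positive 64-bit counterpart, so this quotient
    // overflows: ANSI mode throws an ArithmeticException, while non-ANSI
    // mode wraps around and returns Long.MinValue itself.
    spark.sql("SELECT (CAST(-9223372036854775807 AS BIGINT) - 1) div -1").show()
    ```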
    
    ### Why are the changes needed?
    
    ANSI compliance
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, IntegralDivide throws an exception on overflow in ANSI mode
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #32260 from gengliangwang/integralDiv.
    
    Authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    gengliangwang committed Apr 21, 2021
    Configuration menu
    Copy the full SHA
    43ad939 View commit details
    Browse the repository at this point in the history
  6. [SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel`
    
    ### What changes were proposed in this pull request?
    
    Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.
    
    ### Why are the changes needed?
    Bugfix
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test.
    
    Closes #32245 from harupy/SPARK-35142.
    
    Authored-by: harupy <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    harupy authored and WeichenXu123 committed Apr 21, 2021
    b6350f5
  7. [SPARK-35171][R] Declare the markdown package as a dependency of the SparkR package
    
    ### What changes were proposed in this pull request?
    Declare the markdown package as a dependency of the SparkR package
    
    ### Why are the changes needed?
    If pandoc is not installed locally, running make-distribution.sh fails with the following message:
    ```
    — re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
    Warning in engine$weave(file, quiet = quiet, encoding = enc) :
    Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
    Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
    The 'markdown' package should be declared as a dependency of the 'SparkR' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see yihui/knitr#1864 for more information.
    — failed re-building ‘sparkr-vignettes.Rmd’
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Workaround for R packaging.
    
    ### How was this patch tested?
    Manually test. After the fix, the command `sh dev/make-distribution.sh -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` in the environment without pandoc will pass.
    
    Closes #32270 from xuanyuanking/SPARK-35171.
    
    Authored-by: Yuanjian Li <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    xuanyuanking authored and HyukjinKwon committed Apr 21, 2021
    8e9e700
  8. [SPARK-35140][INFRA] Add error message guidelines to PR template

    ### What changes were proposed in this pull request?
    
    Adds a link to the [error message guidelines](https://spark.apache.org/error-message-guidelines.html) to the PR template to increase visibility.
    
    ### Why are the changes needed?
    
    Increases visibility of the error message guidelines, which are otherwise hidden in the [Contributing guidelines](https://spark.apache.org/contributing.html).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Not needed.
    
    Closes #32241 from karenfeng/spark-35140.
    
    Authored-by: Karen Feng <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    karenfeng authored and HyukjinKwon committed Apr 21, 2021
    355c399
  9. [SPARK-34692][SQL] Support Not(Int) and Not(InSet) propagate null in predicate
    
    ### What changes were proposed in this pull request?
    
    * Add `Not(In)` and `Not(InSet)` check in `NullPropagation` rule.
    * Add more test for `In` and `Not(In)` in `Project` level.
    
    ### Why are the changes needed?
    
    The semantics of `Not(In)` can be seen as `And(a != b, a != c)`, which matches `NullIntolerant`.
    
    As we already simplify a `NullIntolerant` expression to null if its children contain null (e.g. `a != null` => `null`), it is safe to do the same with `Not(In)`/`Not(InSet)`.
    
    Note that we can only do this simplification in predicates, which is what the `ReplaceNullWithFalseInPredicate` rule does.
    
    Let's say we have two sqls:
    ```
    select 1 not in (2, null);
    select 1 where 1 not in (2, null);
    ```
    We cannot optimize the first SQL since it would return `NULL` instead of `false`. The second one can be optimized.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes #31797 from ulysses-you/SPARK-34692.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    ulysses-you authored and cloud-fan committed Apr 21, 2021
    81dbaed
  10. [SPARK-34897][SQL] Support reconcile schemas based on index after nested column pruning
    
    ### What changes were proposed in this pull request?
    
    It will remove `StructField` when [pruning nested columns](https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L42). For example:
    ```scala
    spark.sql(
      """
        |CREATE TABLE t1 (
        |  _col0 INT,
        |  _col1 STRING,
        |  _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>)
        |USING ORC
        |""".stripMargin)
    
    spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))")
    
    spark.sql("SELECT _col0, _col2.c1 FROM t1").show
    ```
    
    Before this PR, the returned schema is ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` and it will throw an exception:
    ```
    java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
    	at scala.Predef$.assert(Predef.scala:223)
    	at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160)
    ```
    
    After this PR, the returned schema is ``` `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING> ```.
    
    The final schema is ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` after the complete column pruning:
    https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213
    
    https://github.com/apache/spark/blob/e64eb75aede71a5403a4d4436e63b1fcfdeca14d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L96-L97
    
    ### Why are the changes needed?
    
    Fix bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #31993 from wangyum/SPARK-34897.
    
    Authored-by: Yuming Wang <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    wangyum authored and viirya committed Apr 21, 2021
    e609395

Commits on Apr 22, 2021

  1. [SPARK-35178][BUILD] Use new Apache 'closer.lua' syntax to obtain Maven

    ### What changes were proposed in this pull request?
    
    Use new Apache 'closer.lua' syntax to obtain Maven
    
    ### Why are the changes needed?
    
    The current closer.lua redirector, which redirects Maven downloads to a local mirror, has a new syntax. Without this change, build/mvn no longer works properly.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual testing.
    
    Closes #32277 from srowen/SPARK-35178.
    
    Authored-by: Sean Owen <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    srowen authored and dongjoon-hyun committed Apr 22, 2021
    6860efe
  2. [SPARK-34692][SQL][FOLLOWUP] Add INSET to ReplaceNullWithFalseInPredicate's pattern
    
    ### What changes were proposed in this pull request?
    
    The test added by #31797 has the [failure](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137741/testReport/org.apache.spark.sql.catalyst.optimizer/ReplaceNullWithFalseInPredicateSuite/SPARK_34692__Support_Not_Int__and_Not_InSet__propagate_null/). This is a followup to fix it.
    
    ### Why are the changes needed?
    
    Due to #32157, the rule `ReplaceNullWithFalseInPredicate` checks tree patterns before actually doing the transformation. As `null` in `INSET` does not match the `NULL_LITERAL` pattern, we miss it and fail the newly added `not inset ...` check in `replaceNullWithFalse`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing unit tests.
    
    Closes #32278 from viirya/SPARK-34692-followup.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    viirya authored and dongjoon-hyun committed Apr 22, 2021
    548e66c
  3. [SPARK-34674][CORE][K8S] Close SparkContext after the Main method has finished
    
    ### What changes were proposed in this pull request?
    Close the SparkContext after the main method has finished, to allow a SparkApplication on K8S to complete.
    This is a fixed version of the [merged and reverted PR](#32081).
    
    ### Why are the changes needed?
    If the sparkContext.stop() method is not called explicitly, the Spark driver process does not terminate even after its main method has completed. This behaviour differs from Spark on YARN, where manually stopping the SparkContext is not required. The problem appears to be the use of non-daemon threads, which prevent the driver JVM process from terminating.
    So I have inserted code that closes the SparkContext automatically, as sketched below.
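    
    A minimal sketch of the idea (not the actual SparkSubmit code path):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // After the user's main method returns, stop any remaining session so
    // non-daemon threads do not keep the driver JVM alive.
    def runMain(userMain: () => Unit): Unit = {
      try userMain()
      finally SparkSession.getDefaultSession.foreach(_.stop())
    }
    ```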
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manually on the production AWS EKS environment in my company.
    
    Closes #32283 from kotlovs/close-spark-context-on-exit-2.
    
    Authored-by: skotlov <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    kotlovs authored and dongjoon-hyun committed Apr 22, 2021
    b17a0e6
  4. [SPARK-35177][SQL] Fix arithmetic overflow in parsing the minimal interval by `IntervalUtils.fromYearMonthString`
    
    ### What changes were proposed in this pull request?
    IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly.
    In the current logic, using `Math.addExact(Math.multiplyExact(years, 12), months)` to calculate a negative total number of months overflows when the actual total is Int.MinValue; this PR fixes that bug.
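    
    A small illustration of the arithmetic for the minimal interval '-178956970-8':
    
    ```scala
    val years = 178956970
    val months = 8
    // Building the magnitude first overflows: 178956970 * 12 = 2147483640 and
    // 2147483640 + 8 = 2147483648 = Int.MaxValue + 1, so
    // Math.addExact(Math.multiplyExact(years, 12), months) throws.
    // Accumulating with the negative sign applied first stays in range:
    val total = Math.subtractExact(Math.multiplyExact(years, -12), months)
    assert(total == Int.MinValue) // -2147483648
    ```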
    
    ### Why are the changes needed?
    IntervalUtils.fromYearMonthString should handle Int.MinValue months correctly
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32281 from AngersZhuuuu/SPARK-35177.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 22, 2021
    bb5459f
  5. [SPARK-35180][BUILD] Allow to build SparkR with SBT

    ### What changes were proposed in this pull request?
    
    This PR proposes a change that allows us to build SparkR with SBT.
    
    ### Why are the changes needed?
    
    In the current master, SparkR can be built only with Maven.
    It would be helpful if we could build it with SBT.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
    ```
    build/sbt -Psparkr package
    ```
    
    Closes #32285 from sarutak/sbt-sparkr.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    sarutak authored and HyukjinKwon committed Apr 22, 2021
    c0972de
  6. [SPARK-35127][UI] When we switch between different stage-detail pages, the entry item in the newly-opened page may be blank
    
    ### What changes were proposed in this pull request?
    
    To make sure that pageSize is not shared between different stage pages.
    The screenshots of the problem are placed in the attachment of [JIRA](https://issues.apache.org/jira/browse/SPARK-35127)
    
    ### Why are the changes needed?
    fix the bug.
    
    According to the reference `https://datatables.net/reference/option/lengthMenu`:
    `-1` represents displaying all rows, but we now use `totalTasksToShow`, which causes the select item to show as empty when we switch between different stage-detail pages.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manual test. It is a small UI problem; the modification does not affect functionality, it is just an adjustment of the JS configuration.
    
    the gif below shows how the problem can be reproduced:
    ![reproduce](https://user-images.githubusercontent.com/52202080/115204351-f7060f80-a12a-11eb-8900-a009ad0c8870.gif)
    
    ![微信截图_20210419162849](https://user-images.githubusercontent.com/52202080/115205675-629cac80-a12c-11eb-9cb8-1939c7450e99.png)
    
    the gif below shows the result after modified:
    
    ![after_modified](https://user-images.githubusercontent.com/52202080/115204886-91fee980-a12b-11eb-9ccb-d5900a99095d.gif)
    
    Closes #32223 from kyoty/stages-task-empty-pagesize.
    
    Authored-by: kyoty <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    echohlne authored and sarutak committed Apr 22, 2021
    7242d7f
  7. [SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
    
    ### What changes were proposed in this pull request?
    PG and Oracle both support using CUBE/ROLLUP/GROUPING SETS inside a GROUPING SETS grouping set as sugar syntax.
    ![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)
    
    In this PR, we support it in Spark SQL too
    
    ### Why are the changes needed?
    Keep consistent with PG and Oracle.
    
    ### Does this PR introduce _any_ user-facing change?
    User can write grouping analytics like
    ```
    SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
    SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
    SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
    ```
    
    ### How was this patch tested?
    Added Test
    
    Closes #32201 from AngersZhuuuu/SPARK-35026.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 22, 2021
    b22d54a
  8. [SPARK-35183][SQL] Use transformAllExpressions in CombineConcats

    ### What changes were proposed in this pull request?
    
    Use transformAllExpressions instead of transformExpressionsDown in CombineConcats. The latter only transforms the root plan node.
    
    ### Why are the changes needed?
    
    It allows CombineConcats to cover more cases where `concat` expressions are not in the root plan node, as sketched below.
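    
    A hedged sketch of the difference (`flattenConcats` is a hypothetical stand-in for the rule body):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{Concat, Expression}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    
    def combine(plan: LogicalPlan, flattenConcats: Concat => Expression): LogicalPlan = {
      // transformExpressionsDown rewrites only the expressions held directly
      // by `plan` itself; transformAllExpressions recurses into every node.
      plan.transformAllExpressions { case c: Concat => flattenConcats(c) }
    }
    ```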
    
    ### How was this patch tested?
    
    Unit test. The updated tests would fail without the code change.
    
    Closes #32290 from sigmod/concat.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    sigmod authored and cloud-fan committed Apr 22, 2021
    7f7a3d8
  9. [SPARK-35110][SQL] Handle ANSI intervals in WindowExecBase

    ### What changes were proposed in this pull request?
    This PR makes the window frame support `YearMonthIntervalType` and `DayTimeIntervalType`.
    
    ### Why are the changes needed?
    Extend the functionality of the window frame.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Users can use `YearMonthIntervalType` or `DayTimeIntervalType` as the sort expression for a window frame.
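    
    A hedged example (the table and its columns are assumed; `ym` is a year-month interval column used as the frame's sort key):
    
    ```scala
    spark.sql("""
      SELECT id, ym,
             count(*) OVER (
               ORDER BY ym
               RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
             ) AS cnt
      FROM interval_table
    """).show()
    ```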
    
    ### How was this patch tested?
    New tests
    
    Closes #32294 from beliefer/SPARK-35110.
    
    Authored-by: beliefer <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    beliefer authored and MaxGekk committed Apr 22, 2021
    6c587d2
  10. [SPARK-35187][SQL] Fix failure on the minimal interval literal

    ### What changes were proposed in this pull request?
    If the sign '-' is inside the interval string, everything is fine after bb5459f:
    ```
    spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
    -178956970-8
    ```
    but a sign outside the interval string is not handled properly:
    ```
    spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
    Error in query:
    Error parsing interval year-month string: integer overflow(line 1, pos 16)
    
    == SQL ==
    SELECT INTERVAL -'178956970-8' YEAR TO MONTH
    ----------------^^^
    ```
    This PR fixes this issue
    
    ### Why are the changes needed?
    Fix bug
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32296 from AngersZhuuuu/SPARK-35187.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 22, 2021
    04e2305
  11. [SPARK-34999][PYTHON] Consolidate PySpark testing utils

    ### What changes were proposed in this pull request?
    Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`.
    
    ### Why are the changes needed?
    
    `python/pyspark/pandas/testing` holds test utilities for pandas-on-Spark, and `python/pyspark/testing` contains test utilities for PySpark. Consolidating them makes the code cleaner and easier to maintain.
    
    Updated import statements are as shown below:
    - from pyspark.testing.sqlutils import SQLTestUtils
    - from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils
    (PandasOnSparkTestCase is the original ReusedSQLTestCase in `python/pyspark/pandas/testing/utils.py`)
    
    Minor improvements include:
    - Usage of missing library's requirement_message
    - `except ImportError` rather than `except`
    - import pyspark.pandas alias as `ps` rather than `pp`
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit tests under python/pyspark/pandas/tests.
    
    Closes #32177 from xinrong-databricks/port.merge_utils.
    
    Authored-by: Xinrong Meng <[email protected]>
    Signed-off-by: Takuya UESHIN <[email protected]>
    xinrong-meng authored and ueshin committed Apr 22, 2021
    4d2b559

Commits on Apr 23, 2021

  1. [SPARK-35182][K8S] Support driver-owned on-demand PVC

    ### What changes were proposed in this pull request?
    
    This PR aims to support driver-owned on-demand PVC(Persistent Volume Claim)s. It means dynamically-created PVCs will have the `ownerReference` to `driver` pod instead of `executor` pod.
    
    ### Why are the changes needed?
    
    This allows the K8s backend scheduler to reuse these PVCs later.
    
    **BEFORE**
    ```
    $ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
    ...
      ownerReferences:
      - apiVersion: v1
        controller: true
        kind: Pod
        name: tpcds-pvc-exec-1
    ```
    
    **AFTER**
    ```
    $ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
    ...
      ownerReferences:
      - apiVersion: v1
        controller: true
        kind: Pod
        name: tpcds-pvc
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. (The default is `false`)
    
    ### How was this patch tested?
    
    Manually check the above and pass K8s IT.
    
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    - Launcher client dependencies
    - SPARK-33615: Launcher client archives
    - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
    - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
    - Launcher python client dependencies using a zip file
    - Test basic decommissioning
    - Test basic decommissioning with shuffle cleanup
    - Test decommissioning with dynamic allocation & shuffle cleanups
    - Test decommissioning timeouts
    - Run SparkR on simple dataframe.R example
    Run completed in 16 minutes, 40 seconds.
    Total number of tests run: 27
    Suites: completed 2, aborted 0
    Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Closes #32288 from dongjoon-hyun/SPARK-35182.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed Apr 23, 2021
    6ab0048
  2. [SPARK-35040][PYTHON] Remove Spark-version related codes from test codes

    ### What changes were proposed in this pull request?
    
    Removes PySpark version dependent codes from pyspark.pandas test codes.
    
    ### Why are the changes needed?
    
    There are several places that check the PySpark version and switch the logic, but those checks are no longer necessary.
    We should remove them.
    
    We will do the same thing after we finish porting tests.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests.
    
    Authored-by: Xinrong Meng <[email protected]>
    Signed-off-by: Takuya UESHIN <[email protected]>
    xinrong-meng authored and ueshin committed Apr 23, 2021
    4fcbf59
  3. [SPARK-35075][SQL] Add traversal pruning for subquery related rules

    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - DYNAMIC_PRUNING_SUBQUERY
    - EXISTS_SUBQUERY
    - IN_SUBQUERY
    - LIST_SUBQUERY
    - PLAN_EXPRESSION
    - SCALAR_SUBQUERY
    - FILTER
    
    Used them in the following rules:
    - ResolveSubquery
    - UpdateOuterReferences
    - OptimizeSubqueries
    - RewritePredicateSubquery
    - PullupCorrelatedPredicates
    - RewriteCorrelatedScalarSubquery (not the rule itself but an internal transform call, the full support is in SPARK-35148)
    - InsertAdaptiveSparkPlan
    - PlanAdaptiveSubqueries
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32247 from sigmod/subquery.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed Apr 23, 2021
    47f8687
  4. [SPARK-35195][SQL][TEST] Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
    
    ### What changes were proposed in this pull request?
    
    Move the following classes:
    - `InMemoryAtomicPartitionTable`
    - `InMemoryPartitionTable`
    - `InMemoryPartitionTableCatalog`
    - `InMemoryTable`
    - `InMemoryTableCatalog`
    - `StagingInMemoryTableCatalog`
    
    from `org.apache.spark.sql.connector` to `org.apache.spark.sql.connector.catalog`.
    
    ### Why are the changes needed?
    
    These classes implement catalog related interfaces but reside in `org.apache.spark.sql.connector`. A more suitable place should be `org.apache.spark.sql.connector.catalog`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32302 from sunchao/SPARK-35195.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    sunchao authored and viirya committed Apr 23, 2021
    86238d0
  5. [SPARK-35141][SQL] Support two level of hash maps for final hash aggregation
    
    ### What changes were proposed in this pull request?
    
    For partial hash aggregation (code-gen path), we have two levels of hash maps for aggregation. The first level comes from `RowBasedHashMapGenerator` and is computationally faster than the second level from `UnsafeFixedWidthAggregationMap`. Introducing the two-level hash map helps improve the CPU performance of queries, as the first-level hash map normally fits in the hardware cache and has a cheaper hash function for key lookups.
    
    For final hash aggregation, we can also support two levels of hash maps to improve query performance further.
    The original two-level hash map code works for final aggregation mostly out of the box. The major change here is to support testing fallback of final aggregation (see the change related to `bitMaxCapacity` and `checkFallbackForGeneratedHashMap`).
    
    Example:
    
    An aggregation query:
    
    ```
    spark.sql(
      """
        |SELECT key, avg(value)
        |FROM agg1
        |GROUP BY key
      """.stripMargin)
    ```
    
    The generated code for final aggregation is [here](https://gist.github.com/c21/20c10cc8e2c7e561aafbe9b8da055242).
    
    An aggregation query with testing fallback:
    ```
    withSQLConf("spark.sql.TungstenAggregate.testFallbackStartsAt" -> "2, 3") {
      spark.sql(
        """
          |SELECT key, avg(value)
          |FROM agg1
          |GROUP BY key
        """.stripMargin)
    }
    ```
    The generated code for final aggregation is [here](https://gist.github.com/c21/dabf176cbc18a5e2138bc0a29e81c878). Note that there is no longer a counter condition for the first-level fast map.
    
    ### Why are the changes needed?
    
    Improve the CPU performance of hash aggregation query in general.
    
    For `AggregateBenchmark."Aggregate w multiple keys"`, query performance improved by 10%.
    `codegen = T` means whole stage code-gen is enabled.
    `hashmap = T` means two level maps is enabled for partial aggregation.
    `finalhashmap = T` means two level maps is enabled for final aggregation.
    
    ```
    Running benchmark: Aggregate w multiple keys
      Running case: codegen = F
      Stopped after 2 iterations, 8284 ms
      Running case: codegen = T hashmap = F
      Stopped after 2 iterations, 5424 ms
      Running case: codegen = T hashmap = T finalhashmap = F
      Stopped after 2 iterations, 4753 ms
      Running case: codegen = T hashmap = T finalhashmap = T
      Stopped after 2 iterations, 4508 ms
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.7
    Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
    Aggregate w multiple keys:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
    codegen = F                                        3881           4142         370          5.4         185.1       1.0X
    codegen = T hashmap = F                            2701           2712          16          7.8         128.8       1.4X
    codegen = T hashmap = T finalhashmap = F           2363           2377          19          8.9         112.7       1.6X
    codegen = T hashmap = T finalhashmap = T           2252           2254           3          9.3         107.4       1.7X
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit test in `HashAggregationQuerySuite` and `HashAggregationQueryWithControlledFallbackSuite` already cover the test.
    
    Closes #32242 from c21/agg.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    c21 authored and cloud-fan committed Apr 23, 2021
    Commit cab205e
  6. [SPARK-35143][SQL][SHELL] Add default log level config for spark-sql

    ### What changes were proposed in this pull request?
    Add default log config for spark-sql
    
    ### Why are the changes needed?
    The default log level for spark-sql is `WARN`, and how to change it is confusing, so we need a default config.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Set `log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=INFO` in log4j.properties and verify that spark-sql's default log level changes accordingly.
    
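    A minimal sketch of the corresponding `log4j.properties` entries (the root-level line is illustrative; only the `SparkSQLCLIDriver` logger line comes from the test note above):
    ```
    # Illustrative default: keep everything else at WARN
    log4j.rootCategory=WARN, console
    # From the test note above: surface the spark-sql CLI logs at INFO
    log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=INFO
    ```
    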
    Closes #32248 from hddong/spark-35413.
    
    Lead-authored-by: hongdongdong <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    hddong and HyukjinKwon committed Apr 23, 2021
    Commit 7582dc8
  7. [SPARK-35159][SQL][DOCS] Extract hive format doc

    ### What changes were proposed in this pull request?
    Extract the common documentation about Hive format so that `sql-ref-syntax-ddl-create-table-hiveformat.md` and `sql-ref-syntax-qry-select-transform.md` can refer to it.
    
    ![image](https://user-images.githubusercontent.com/46485123/115802193-04641800-a411-11eb-827d-d92544881842.png)
    
    ### Why are the changes needed?
    Improve doc
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes #32264 from AngersZhuuuu/SPARK-35159.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 23, 2021
    Commit 20d68dc
  8. Revert "[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function"
    
    This reverts commit c8d78a7.
    cloud-fan committed Apr 23, 2021
    Commit fdccd88
  9. [SPARK-35078][SQL] Add tree traversal pruning in expression rules

    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - AND_OR
    - BINARY_ARITHMETIC
    - BINARY_COMPARISON
    - CASE_WHEN
    - CAST
    - CONCAT
    - COUNT
    - IF
    - LIKE_FAMLIY
    - NOT
    - NULL_CHECK
    - UNARY_POSITIVE
    - UPPER_OR_LOWER
    
    Used them in the following rules:
    - ConstantPropagation
    - ReorderAssociativeOperator
    - BooleanSimplification
    - SimplifyBinaryComparison
    - SimplifyCaseConversionExpressions
    - SimplifyConditionals
    - PushFoldableIntoBranches
    - LikeSimplification
    - NullPropagation
    - SimplifyCasts
    - RemoveDispensableExpressions
    - CombineConcats
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32280 from sigmod/expression.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed Apr 23, 2021
    Commit 9af338c
  10. [SPARK-35201][SQL] Format empty grouping set exception in CUBE/ROLLUP

    ### What changes were proposed in this pull request?
    Format empty grouping set exception in CUBE/ROLLUP
    
    ### Why are the changes needed?
    Format empty grouping set exception in CUBE/ROLLUP
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes #32307 from AngersZhuuuu/SPARK-35201.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    AngersZhuuuu authored and maropu committed Apr 23, 2021
    Commit e503b9c
  11. [SPARK-35204][SQL] CatalystTypeConverters of date/timestamp should accept both the old and new Java time classes
    
    ### What changes were proposed in this pull request?
    
    `CatalystTypeConverters` is useful when the types of the input data classes are not known statically (otherwise we can use `ExpressionEncoder`). However, the current `CatalystTypeConverters` requires the datetime data class to be known statically, which makes it hard to use.
    
    This PR improves the `CatalystTypeConverters` for date/timestamp, to support the old and new Java time classes at the same time.
    
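    A hedged usage sketch (`CatalystTypeConverters` is an internal API; the variable names and date value below are made up for illustration):
    ```
    import java.time.LocalDate
    import org.apache.spark.sql.catalyst.CatalystTypeConverters
    import org.apache.spark.sql.types.DateType

    // One converter should now accept both the old and the new Java date classes.
    val toCatalyst = CatalystTypeConverters.createToCatalystConverter(DateType)
    val fromOldClass = toCatalyst(java.sql.Date.valueOf("2021-04-23")) // days since epoch
    val fromNewClass = toCatalyst(LocalDate.of(2021, 4, 23))           // same value
    ```
    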
    ### Why are the changes needed?
    
    Make `CatalystTypeConverters` easier to use.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    new test
    
    Closes #32312 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    cloud-fan authored and MaxGekk committed Apr 23, 2021
    Commit a9345a0
  12. [SPARK-34297][SQL][SS] Add metrics for data loss and offset out of range for KafkaMicroBatchStream
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to add a couple of metrics in scan node for Kafka batch streaming query.
    
    ### Why are the changes needed?
    
    When testing SS, I found it hard to track data loss when SS reads from Kafka. The micro-batch scan node has only one metric, the number of output rows. Users have no idea how many offsets to fetch fall outside the range available in Kafka, or how many times data loss happens. These metrics are important for users to know the quality of the running SS query.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, adding two metrics to micro batch scan node for Kafka batch streaming.
    
    ### How was this patch tested?
    
    Currently I tested on internal cluster with Kafka:
    
    <img width="1193" alt="Screen Shot 2021-04-22 at 7 16 29 PM" src="https://user-images.githubusercontent.com/68855/115808460-61bf8100-a39f-11eb-99a9-65d22c3f5fb0.png">
    
    I tried to add a unit test, but our batch streaming query disallows specifying ending offsets. If I only specify an out-of-range starting offset, any negative-size range is filtered out when we compute offset ranges in `getRanges`, so it cannot actually exercise the case of fetching a non-existing offset.
    
    Closes #31398 from viirya/micro-batch-metrics.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed Apr 23, 2021
    Commit b2a2b5d

Commits on Apr 24, 2021

  1. [SPARK-35210][BUILD] Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET issue
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to upgrade Jetty to 9.4.40.
    
    ### Why are the changes needed?
    
    SPARK-34988 (#32091) upgraded Jetty to 9.4.39 for CVE-2021-28165.
    But after the upgrade, Jetty 9.4.40 was released to fix the ERR_CONNECTION_RESET issue (jetty/jetty.project#6152).
    This issue seems to affect Jetty 9.4.39 when POST method is used with SSL.
    For Spark, job submission using REST and ThriftServer with HTTPS protocol can be affected.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. No released version uses Jetty 9.4.39.
    
    ### How was this patch tested?
    
    CI.
    
    Closes #32318 from sarutak/upgrade-jetty-9.4.40.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    sarutak committed Apr 24, 2021
    Commit 44c1387
  2. [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite

    ### What changes were proposed in this pull request?
    
    A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).
    
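    A hedged sketch of the round trip the suite exercises, assuming Parquet's properties-driven crypto factory and its in-memory mock KMS (the class names, keys, and path below are illustrative, not the exact suite code):
    ```
    // Illustrative setup: a mock in-memory KMS with two base64-encoded master keys.
    spark.conf.set("parquet.crypto.factory.class",
      "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    spark.conf.set("parquet.encryption.kms.client.class",
      "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    spark.conf.set("parquet.encryption.key.list",
      "key1: AAECAwQFBgcICQoLDA0ODw==, key2: AAECAAECAAECAAECAAECAA==")

    // Encrypt columns a and b with key1 and the footer with key2, then read back.
    spark.range(10).selectExpr("id AS a", "id * 2 AS b").write
      .option("parquet.encryption.column.keys", "key1: a, b")
      .option("parquet.encryption.footer.key", "key2")
      .parquet("/tmp/encrypted-parquet")
    spark.read.parquet("/tmp/encrypted-parquet").show()
    ```
    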
    ### Why are the changes needed?
    
    To provide a test coverage for Parquet encryption.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    - [x] [SBT / Hadoop 3.2 / Java8 (the default)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137785/testReport)
    - [ ] ~SBT / Hadoop 3.2 / Java11 by adding [test-java11] to the PR title.~ (Jenkins Java11 build is broken due to missing JDK11 installation)
    - [x] [SBT / Hadoop 2.7 / Java8 by adding [test-hadoop2.7] to the PR title.](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137836/testReport)
    - [x] Maven / Hadoop 3.2 / Java8 by adding [test-maven] to the PR title.
    - [x] Maven / Hadoop 2.7 / Java8 by adding [test-maven][test-hadoop2.7] to the PR title.
    
    Closes #32146 from andersonm-ibm/pme_testing.
    
    Authored-by: Maya Anderson <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    andersonm-ibm authored and dongjoon-hyun committed Apr 24, 2021
    Commit 166cc62
  3. [SPARK-35200][CORE] Avoid recomputing the pending speculative tasks in the ExecutorAllocationManager and remove some unnecessary code
    
    ### What changes were proposed in this pull request?
    Avoid recomputing the pending speculative tasks in the ExecutorAllocationManager, and remove some unnecessary code.
    
    ### Why are the changes needed?
    
    The number of pending speculative tasks is recomputed in the ExecutorAllocationManager when calculating the maximum number of executors required, but it only needs to be computed once, which improves performance.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing tests.
    
    Closes #32306 from weixiuli/SPARK-35200.
    
    Authored-by: weixiuli <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    weixiuli authored and dongjoon-hyun committed Apr 24, 2021
    Commit bcac733

Commits on Apr 25, 2021

  1. [SPARK-35024][ML] Refactor LinearSVC - support virtual centering

    ### What changes were proposed in this pull request?
    1. Remove the existing aggregator and use a new aggregator that supports virtual centering.
    2. Add related test suites.
    
    ### Why are the changes needed?
    Centering vectors should accelerate convergence and generate solutions closer to R's.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Updated existing test suites and added new ones.
    
    Closes #32124 from zhengruifeng/svc_agg_refactor.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng committed Apr 25, 2021
    Commit 1f150b9
  2. [SPARK-33913][SS] Upgrade Kafka to 2.8.0

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade Kafka client to 2.8.0.
    Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.
    
    ### Why are the changes needed?
    
    This will bring the latest client-side improvement and bug fixes like the following examples.
    
    - KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
    - KAFKA-10134 High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
    - KAFKA-12193 Re-resolve IPs when a client is disconnected
    - KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a known config
    - KAFKA-9263 The new hw is added to incorrect log when  ReplicaAlterLogDirsThread is replacing log
    - KAFKA-10607 Ensure the error counts contains the NONE
    - KAFKA-10458 Need a way to update quota for TokenBucket registered with Sensor
    - KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition for topic
    
    **RELEASE NOTE**
    - https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
    - https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the existing tests because this is a dependency change.
    
    Closes #32325 from dongjoon-hyun/SPARK-33913.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    dongjoon-hyun authored and HyukjinKwon committed Apr 25, 2021
    Commit b108e7f
  3. [SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum
    
    ### What changes were proposed in this pull request?
    
    ```sql
    spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1;
    spark.sql.adaptive.coalescePartitions.initialPartitionNum	1
    Time taken: 2.18 seconds, Fetched 1 row(s)
    spark-sql> set mapred.reduce.tasks;
    21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
    spark.sql.shuffle.partitions	1
    Time taken: 0.03 seconds, Fetched 1 row(s)
    spark-sql> set spark.sql.shuffle.partitions;
    spark.sql.shuffle.partitions	200
    Time taken: 0.024 seconds, Fetched 1 row(s)
    spark-sql> set mapred.reduce.tasks=2;
    21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead.
    spark.sql.shuffle.partitions	2
    Time taken: 0.017 seconds, Fetched 1 row(s)
    spark-sql> set mapred.reduce.tasks;
    21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
    spark.sql.shuffle.partitions	1
    Time taken: 0.017 seconds, Fetched 1 row(s)
    spark-sql>
    ```
    
    `mapred.reduce.tasks` maps to `spark.sql.shuffle.partitions` at the write side, but `spark.sql.adaptive.coalescePartitions.initialPartitionNum` might take precedence over `spark.sql.shuffle.partitions`.
    
    ### Why are the changes needed?
    
    Round-trip consistency for `mapred.reduce.tasks`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `mapred.reduce.tasks` will always report `spark.sql.shuffle.partitions`, whether `spark.sql.adaptive.coalescePartitions.initialPartitionNum` is set or not.
    
    ### How was this patch tested?
    
    a new test
    
    Closes #32265 from yaooqinn/SPARK-35168.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn committed Apr 25, 2021
    Commit 5b1353f

Commits on Apr 26, 2021

  1. [SPARK-35220][SQL] DayTimeIntervalType/YearMonthIntervalType shown differently between Hive SerDe and row format delimited
    
    ### What changes were proposed in this pull request?
    DayTimeIntervalType/YearMonthIntervalType values are shown differently between Hive SerDe and row format delimited.
    This PR adds a test and opens the discussion.
    
    For this problem I think we have two directions:
    
    1. Leave it as is and add an item explaining this in the migration guide docs.
    2. Since we should not change Hive SerDe's behavior, make Spark's row format delimited behavior cast DayTimeIntervalType/YearMonthIntervalType to HIVE_STYLE.
    
    ### Why are the changes needed?
    Add UT
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    added ut
    
    Closes #32335 from AngersZhuuuu/SPARK-35220.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    AngersZhuuuu authored and HyukjinKwon committed Apr 26, 2021
    Commit 6f782ef
  2. [SPARK-35087][UI] Some columns in the Aggregated Metrics by Executor table of the stage-detail page sort incorrectly.
    
    ### What changes were proposed in this pull request?
    
    Columns like 'Shuffle Read Size / Records' and 'Output Size / Records' in the `Aggregated Metrics by Executor` table of the stage-detail page should be sorted in numerical order instead of lexicographical order.
    
    ### Why are the changes needed?
    Bug fix: the sorting style should be consistent across columns.
    
    The correspondence between the table and the index is shown below (it is defined in stagespage-template.html):
    | index | column name                            |
    | ----- | -------------------------------------- |
    | 0     | Executor ID                            |
    | 1     | Logs                                   |
    | 2     | Address                                |
    | 3     | Task Time                              |
    | 4     | Total Tasks                            |
    | 5     | Failed Tasks                           |
    | 6     | Killed Tasks                           |
    | 7     | Succeeded Tasks                        |
    | 8     | Excluded                               |
    | 9     | Input Size / Records                   |
    | 10    | Output Size / Records                  |
    | 11    | Shuffle Read Size / Records            |
    | 12    | Shuffle Write Size / Records           |
    | 13    | Spill (Memory)                         |
    | 14    | Spill (Disk)                           |
    | 15    | Peak JVM Memory OnHeap / OffHeap       |
    | 16    | Peak Execution Memory OnHeap / OffHeap |
    | 17    | Peak Storage Memory OnHeap / OffHeap   |
    | 18    | Peak Pool Memory Direct / Mapped       |
    
    I constructed some data to simulate the sorting of index columns 9 to 18.
    As shown below, the sorting results of columns 9-12 are wrong:
    
    ![simulate-result](https://user-images.githubusercontent.com/52202080/115120775-c9fa1580-9fe1-11eb-8514-71f29db3a5eb.png)
    
    The reason is that the real data behind columns 9-12 (note that this is not the data displayed on the page) are **all strings similar to `94685/131` (bytes/records), while the real data behind columns 13-18 are all numbers**, so the sorting of columns 13-18 looks fine, but the results for columns 9-12 are incorrect because the strings are sorted in lexicographical order.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Only JS was modified, and the manual test result works well.
    
    **before modified:**
    ![looks-illegal](https://user-images.githubusercontent.com/52202080/115120812-06c60c80-9fe2-11eb-9ada-fa520fe43c4e.png)
    
    **after modified:**
    ![sort-result-corrent](https://user-images.githubusercontent.com/52202080/114865187-7c847980-9e24-11eb-9fbc-39ee224726d6.png)
    
    Closes #32190 from kyoty/aggregated-metrics-by-executor-sorted-incorrectly.
    
    Authored-by: kyoty <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    echohlne authored and sarutak committed Apr 26, 2021
    Commit 2d6467d
  3. [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
    
    ### What changes were proposed in this pull request?
    This is one of the patches for SPIP SPARK-30602 for push-based shuffle.
    Summary of changes:
    
    - Introduce `MergeStatus` which tracks the partition level metadata for a merged shuffle partition in the Spark driver
    - Unify `MergeStatus` and `MapStatus` under a single trait to allow code reusing inside `MapOutputTracker`
    - Extend `MapOutputTracker` to support registering / unregistering `MergeStatus`, calculate preferred locations for a shuffle taking into consideration of merged shuffle partitions, and serving reducer requests for block fetching locations with merged shuffle partitions.
    
    The added APIs in `MapOutputTracker` will be used by `DAGScheduler` in SPARK-32920 and by `ShuffleBlockFetcherIterator` in SPARK-32922
    
    ### Why are the changes needed?
    Refer to SPARK-30602
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added unit tests.
    
    Lead-authored-by: Min Shen mshen@linkedin.com
    Co-authored-by: Chandni Singh chsingh@linkedin.com
    Co-authored-by: Venkata Sowrirajan vsowrirajan@linkedin.com
    
    Closes #30480 from Victsm/SPARK-32921.
    
    Lead-authored-by: Venkata krishnan Sowrirajan <[email protected]>
    Co-authored-by: Min Shen <[email protected]>
    Co-authored-by: Chandni Singh <[email protected]>
    Co-authored-by: Chandni Singh <[email protected]>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
    4 people authored and Mridul Muralidharan committed Apr 26, 2021
    Commit 38ef477
  4. [SPARK-35224][SQL][TESTS] Fix buffer overflow in `MutableProjectionSuite`
    
    ### What changes were proposed in this pull request?
    In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix the unsafe buffer size calculation so that all input fields plus metadata fit without buffer overflow.
    
    ### Why are the changes needed?
    To make the test suite `MutableProjectionSuite` more stable.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt "test:testOnly *MutableProjectionSuite"
    ```
    
    Closes #32339 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    MaxGekk committed Apr 26, 2021
    Commit d572a85
  5. [SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations
    
    ### What changes were proposed in this pull request?
    
    Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.
    
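    A hypothetical shape of the affected pattern (not the exact JIRA repro): chained `withField` calls that update fields of the same nested struct in an order that differs from the schema.
    ```
    import org.apache.spark.sql.functions._

    val df = spark.range(1).select(
      struct(struct(lit(1).as("a"), lit(2).as("b")).as("inner")).as("s"))

    // Update b first, then a; the optimized plan must keep a before b
    // so that the schema still matches the actual data.
    df.select(
        col("s").withField("inner.b", lit(20))
                .withField("inner.a", lit(10)).as("s"))
      .select("s.inner.a", "s.inner.b")
      .show()
    ```
    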
    ### Why are the changes needed?
    
    Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.
    
    ### Does this PR introduce _any_ user-facing change?
    
    It fixes exceptions and incorrect results for valid uses in the latest Spark release.
    
    ### How was this patch tested?
    
    Added new unit tests for these edge cases.
    
    Closes #32338 from Kimahriman/bug/optimize-with-fields.
    
    Authored-by: Adam Binford <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    Kimahriman authored and viirya committed Apr 26, 2021
    Commit 74afc68
  6. [SPARK-35088][SQL] Accept ANSI intervals by the Sequence expression

    ### What changes were proposed in this pull request?
    This PR makes the `Sequence` expression support ANSI intervals as the step expression.
    If the start and stop expressions are `TimestampType`, then the step expression can be a year-month or day-time interval.
    If the start and stop expressions are `DateType`, then the step expression must be a year-month interval.
    
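    A hedged illustration of the new behavior (the literals are mine, using ANSI interval syntax; output elided):
    ```
    // Timestamps can step by a day-time (or year-month) interval.
    spark.sql("SELECT sequence(TIMESTAMP'2021-01-01', TIMESTAMP'2021-01-03', INTERVAL '0 12:00:00' DAY TO SECOND)").show(false)
    // Dates must step by a year-month interval.
    spark.sql("SELECT sequence(DATE'2021-01-01', DATE'2021-04-01', INTERVAL '0-1' YEAR TO MONTH)").show(false)
    ```
    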
    ### Why are the changes needed?
    Extends the function of `Sequence` expression.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Users can use ANSI intervals as the step expression for the `Sequence` expression.
    
    ### How was this patch tested?
    New tests.
    
    Closes #32311 from beliefer/SPARK-35088.
    
    Lead-authored-by: beliefer <[email protected]>
    Co-authored-by: gengjiaan <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    2 people authored and MaxGekk committed Apr 26, 2021
    Commit c0a3c0c
  7. [SPARK-35223] Add IssueNavigationLink

    ### What changes were proposed in this pull request?
    
    Add `IssueNavigationLink` so that IDEA's Git plugin supports hyperlinks to JIRA tickets and GitHub PRs.
    
    ![image](https://user-images.githubusercontent.com/26535726/115997353-5ecdc600-a615-11eb-99eb-6acbf15d8626.png)
    
    ### Why are the changes needed?
    
    Make it friendlier for developers who use IDEA.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Closes #32337 from pan3793/SPARK-35223.
    
    Authored-by: Cheng Pan <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    pan3793 authored and yaooqinn committed Apr 26, 2021
    Commit 84026d7
  8. [SPARK-35230][SQL] Move custom metric classes to proper package

    ### What changes were proposed in this pull request?
    
    This patch moves the DS v2 custom metric classes to the `org.apache.spark.sql.connector.metric` package. It also moves `CustomAvgMetric` and `CustomSumMetric` to that package and makes them public Java abstract classes.
    
    ### Why are the changes needed?
    
    `CustomAvgMetric` and `CustomSumMetric`  should be public APIs for developers to extend. As there are a few metric classes, we should put them together in one package.
    
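    A minimal sketch of extending the now-public base class (the metric name and description are made up for illustration):
    ```
    import org.apache.spark.sql.connector.metric.CustomSumMetric

    // Sums the per-task metric values reported by the data source.
    class BytesWrittenMetric extends CustomSumMetric {
      override def name(): String = "bytesWritten"
      override def description(): String = "total bytes written"
    }
    ```
    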
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev only and they are not released yet.
    
    ### How was this patch tested?
    
    Unit tests.
    
    Closes #32348 from viirya/move-custom-metric-classes.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    viirya authored and dongjoon-hyun committed Apr 26, 2021
    Commit bdac191
  9. [SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType shown differently between Hive SerDe and row format delimited
    
    ### What changes were proposed in this pull request?
    Add a note to the migration guide about DayTimeIntervalType/YearMonthIntervalType being shown differently between Hive SerDe and row format delimited.
    
    ### Why are the changes needed?
    Add note
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes #32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 26, 2021
    Commit 1db031f
  10. [SPARK-34638][SQL] Single field nested column prune on generator output

    ### What changes were proposed in this pull request?
    
    This patch proposes an improvement to nested column pruning when the pruning target is a generator's output. Previously we disallowed such a case. This patch allows pruning when only a single nested column is accessed after `Generate`.
    
    E.g., `df.select(explode($"items").as('item)).select($"item.itemId")`. As we only need `itemId` from `item`, we can prune other fields out and only keep `itemId`.
    
    In this patch, we only address explode-like generators. We will address other generators in followups.
    
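    A runnable sketch around the example above (the `Item` schema is made up for illustration):
    ```
    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    case class Item(itemId: Long, name: String, price: Double)
    val df = Seq((1, Seq(Item(1L, "a", 1.0), Item(2L, "b", 2.0)))).toDF("id", "items")

    // Only item.itemId is used after the explode, so the scan can now
    // prune name and price from the nested struct.
    df.select(explode($"items").as("item")).select($"item.itemId").explain()
    ```
    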
    ### Why are the changes needed?
    
    This helps to extend the availability of nested column pruning.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #31966 from viirya/SPARK-34638.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed Apr 26, 2021
    Commit c59988a
  11. [SPARK-33985][SQL][TESTS] Add query tests for combined usage of TRANSFORM and CLUSTER BY/ORDER BY
    
    ### What changes were proposed in this pull request?
    Hive's documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform describes many usages of TRANSFORM with CLUSTER BY/ORDER BY; this PR adds tests for these cases.
    
    ### Why are the changes needed?
    Add UT
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32333 from AngersZhuuuu/SPARK-33985.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 26, 2021
    Commit f009046
  12. [SPARK-35060][SQL] Group exception messages in sql/types

    ### What changes were proposed in this pull request?
    This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/types`.
    
    ### Why are the changes needed?
    It will largely help with standardization of error messages and its maintenance.
    
    ### Does this PR introduce _any_ user-facing change?
    No. Error messages remain unchanged.
    
    ### How was this patch tested?
    No new tests - pass all original tests to make sure it doesn't break any existing behavior.
    
    Closes #32244 from beliefer/SPARK-35060.
    
    Lead-authored-by: beliefer <[email protected]>
    Co-authored-by: gengjiaan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and cloud-fan committed Apr 26, 2021
    Commit 1b609c7
  13. [SPARK-28247][SS][TEST] Fix flaky test "query without test harness" on ContinuousSuite
    
    ### What changes were proposed in this pull request?
    
    This is another attempt to fix the flaky test "query without test harness" on ContinuousSuite.
    
    `query without test harness` is flaky because it starts a continuous query with two partitions but assumes they will run at the same speed.
    
    In this test, 0 and 2 will be written to partition 0, 1 and 3 will be written to partition 1. It assumes when we see 3, 2 should be written to the memory sink. But this is not guaranteed. We can add `if (currentValue == 2) Thread.sleep(5000)` at this line https://github.com/apache/spark/blob/b2a2b5d8206b7c09b180b8b6363f73c6c3fdb1d8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala#L135 to reproduce the failure: `Result set Set([0], [1], [3]) are not a superset of Set(0, 1, 2, 3)!`
    
    The fix is changing `waitForRateSourceCommittedValue` to wait until all partitions reach the desired values before stopping the query.
    
    ### Why are the changes needed?
    
    Fix a flaky test.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests. Manually verify the reproduction I mentioned above doesn't fail after this change.
    
    Closes #32316 from zsxwing/SPARK-28247-fix.
    
    Authored-by: Shixiong Zhu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    zsxwing authored and HeartSaVioR committed Apr 26, 2021
    Commit 0df3b50

Commits on Apr 27, 2021

  1. [SPARK-35227][BUILD] Update the resolver for spark-packages in SparkSubmit
    
    ### What changes were proposed in this pull request?
    This change is to use repos.spark-packages.org instead of Bintray as the repository service for spark-packages.
    
    ### Why are the changes needed?
    The change is needed because Bintray will no longer be available from May 1st.
    
    ### Does this PR introduce _any_ user-facing change?
    This should be transparent for users who use SparkSubmit.
    
    ### How was this patch tested?
    Tested running spark-shell with --packages manually.
    
    Closes #32346 from bozhang2820/replace-bintray.
    
    Authored-by: Bo Zhang <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    bozhang2820 authored and HyukjinKwon committed Apr 27, 2021
    Commit f738fe0
  2. [SPARK-35225][SQL] EXPLAIN command should handle empty output of analyzed plan
    
    ### What changes were proposed in this pull request?
    
    The EXPLAIN command emits an empty line when the analyzed plan has no output. For example,
    
    `sql("CREATE VIEW test AS SELECT 1").explain(true)` produces:
    ```
    == Parsed Logical Plan ==
    'CreateViewStatement [test], SELECT 1, false, false, PersistedView
    +- 'Project [unresolvedalias(1, None)]
       +- OneRowRelation
    
    == Analyzed Logical Plan ==
    
    CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
       +- Project [1 AS 1#7]
          +- OneRowRelation
    
    == Optimized Logical Plan ==
    CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
       +- Project [1 AS 1#7]
          +- OneRowRelation
    
    == Physical Plan ==
    Execute CreateViewCommand
       +- CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
             +- Project [1 AS 1#7]
                +- OneRowRelation
    ```
    
    ### Why are the changes needed?
    
    To handle empty output of analyzed plan and remove the unneeded empty line.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now the EXPLAIN command for the analyzed plan produces the following without the empty line:
    ```
    == Analyzed Logical Plan ==
    CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
       +- Project [1 AS 1#7]
          +- OneRowRelation
    ```
    
    ### How was this patch tested?
    
    Added a test.
    
    Closes #32342 from imback82/analyzed_plan_blank_line.
    
    Authored-by: Terry Kim <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    imback82 authored and HyukjinKwon committed Apr 27, 2021
    Commit 7779fce
  3. [SPARK-26164][SQL] Allow concurrent writers for writing dynamic partitions and bucket table
    
    ### What changes were proposed in this pull request?
    
    This is a re-proposal of #23163. Currently Spark always requires a [local sort](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L188) before writing to an output table with dynamic partition/bucket columns. The sort can be unnecessary if the cardinality of the partition/bucket values is small, and it can be avoided by keeping multiple output writers open concurrently.
    
    This PR introduces a config `spark.sql.maxConcurrentOutputFileWriters` (which disables this feature by default), where user can tune the maximal number of concurrent writers. The config is needed here as we cannot keep arbitrary number of writers in task memory which can cause OOM (especially for Parquet/ORC vectorization writer).
    
    The feature first uses concurrent writers to write rows. If the number of writers exceeds the limit specified by the above config, it sorts the rest of the rows and writes them one by one (see `DynamicPartitionDataConcurrentWriter.writeWithIterator()`).
    
    In addition, the interface `WriteTaskStatsTracker` and its implementation `BasicWriteTaskStatsTracker` are also changed, because previously they relied on the assumption that only one writer is active when writing dynamic partitions and bucketed tables.
    
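    A minimal usage sketch, assuming the config introduced above and some DataFrame `df` with a `date` column (the threshold and output path are illustrative):
    ```
    // Keep up to 16 open writers before falling back to the sort-based path.
    spark.conf.set("spark.sql.maxConcurrentOutputFileWriters", "16")

    // A dynamic-partition write that previously always required a local sort.
    df.write.partitionBy("date").parquet("/tmp/events")
    ```
    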
    ### Why are the changes needed?
    
    Avoid the sort before writing output for dynamic partitioned query and bucketed table.
    Help improve CPU and IO performance for these queries.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit test in `DataFrameReaderWriterSuite.scala`.
    
    Closes #32198 from c21/writer.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    c21 authored and cloud-fan committed Apr 27, 2021
    Commit 7f51106
  4. [SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors

    ### What changes were proposed in this pull request?
    Support YearMonthIntervalType and DayTimeIntervalType in `ArrowColumnVector`.
    
    ### Why are the changes needed?
    https://issues.apache.org/jira/browse/SPARK-35139
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    1. By checking coding style via:
        $ ./dev/scalastyle
        $ ./dev/lint-java
    2. Run the test "ArrowWriterSuite"
    
    Closes #32340 from Peng-Lei/SPARK-35139.
    
    Authored-by: PengLei <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    Peng-Lei authored and cloud-fan committed Apr 27, 2021
    Commit eb08b90
  5. [SPARK-35235][SQL][TEST] Add row-based hash map into aggregate benchmark

    ### What changes were proposed in this pull request?
    
    `AggregateBenchmark` only tests the performance of the vectorized fast hash map, not the row-based hash map (which is used by default). We should add the row-based hash map to the benchmark.
    
    java 8 benchmark run - https://github.com/c21/spark/actions/runs/787731549
    java 11 benchmark run - https://github.com/c21/spark/actions/runs/787742858
    
    ### Why are the changes needed?
    
    To establish and track a baseline benchmark of the different fast hash maps used in hash aggregation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit test, as this only touches benchmark code.
    
    Closes #32357 from c21/agg-benchmark.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    c21 authored and cloud-fan committed Apr 27, 2021
    Commit c4ad86f
  6. [SPARK-35169][SQL] Fix wrong result of min ANSI interval division by -1

    ### What changes were proposed in this pull request?
    Before this patch
    ```
    scala> Seq(java.time.Period.ofMonths(Int.MinValue)).toDF("i").select($"i" / -1).show(false)
    +-------------------------------------+
    |(i / -1)                             |
    +-------------------------------------+
    |INTERVAL '-178956970-8' YEAR TO MONTH|
    +-------------------------------------+
    scala> Seq(java.time.Duration.of(Long.MinValue, java.time.temporal.ChronoUnit.MICROS)).toDF("i").select($"i" / -1).show(false)
    +---------------------------------------------------+
    |(i / -1)                                           |
    +---------------------------------------------------+
    |INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND|
    +---------------------------------------------------+
    ```
    
    Dividing the minimum ANSI interval value by -1 returned a wrong result; this PR fixes it.
    
    ### Why are the changes needed?
    Fix bug
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32314 from AngersZhuuuu/SPARK-35169.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed Apr 27, 2021
    Commit 2d2f467
  7. [SPARK-34837][SQL][FOLLOWUP] Fix division by zero in the avg function over ANSI intervals
    
    ### What changes were proposed in this pull request?
    #32229 supported ANSI SQL intervals in the aggregate function `avg`,
    but did not handle zero input rows, which leads to:
    ```
    Caused by: java.lang.ArithmeticException: / by zero
    	at com.google.common.math.LongMath.divide(LongMath.java:367)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1864)
    	at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
    	at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
    	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2248)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    ```
    
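    A hypothetical zero-row repro shape (the filter just empties the input; presumably the result should now be NULL instead of the error above):
    ```
    import java.time.Duration
    import org.apache.spark.sql.functions.avg
    import spark.implicits._

    // avg over an ANSI day-time interval column with zero input rows.
    Seq(Duration.ofDays(1)).toDF("i").where("1 = 0").agg(avg($"i")).show()
    ```
    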
    ### Why are the changes needed?
    Fix a bug.
    
    ### Does this PR introduce _any_ user-facing change?
    No, this only affects the new feature.
    
    ### How was this patch tested?
    new tests.
    
    Closes #32358 from beliefer/SPARK-34837-followup.
    
    Authored-by: gengjiaan <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    beliefer authored and MaxGekk committed Apr 27, 2021
    Commit 55dea2d
  8. [SPARK-35239][SQL] Coalesce shuffle partition should handle empty input RDD
    
    ### What changes were proposed in this pull request?
    
    Create an empty partition for the custom shuffle reader if the input RDD is empty.
    
    ### Why are the changes needed?
    
    If an input RDD partition is empty, the map output statistics will be null, and if all of a shuffle stage's input RDD partitions are empty, we skip the stage and lose the chance to coalesce partitions.
    
    We can simply create an empty partition for these custom shuffle readers to reduce the partition number.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the number of shuffle partitions might change under AQE.
    
    ### How was this patch tested?
    
    add new test.
    
    Closes #32362 from ulysses-you/SPARK-35239.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    ulysses-you authored and cloud-fan committed Apr 27, 2021
    Commit 4ff9f1f
  9. [SPARK-35091][SPARK-35090][SQL] Support extract from ANSI Intervals

    ### What changes were proposed in this pull request?
    
    In this PR, we add extract/date_part support for ANSI Intervals
    
    `extract` is an ANSI expression, and `date_part` is non-ANSI but exists as an equivalent of `extract`.
    
    #### expression
    
    ```
    <extract expression> ::=
      EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
    ```
    
    #### <extract field> for interval source
    
    ```
    
    <primary datetime field> ::=
        <non-second primary datetime field>
    | SECOND
    <non-second primary datetime field> ::=
        YEAR
      | MONTH
      | DAY
      | HOUR
      | MINUTE
    ```
    
    #### dataType
    
    ```
    If <extract field> is a <primary datetime field> that does not specify SECOND or <extract field> is not a <primary datetime field>, then the declared type of the result is an implementation-defined exact numeric type with scale 0 (zero)
    
    Otherwise, the declared type of the result is an implementation-defined exact numeric type with scale not less than the specified or implied <time fractional seconds precision> or <interval fractional seconds precision>, as appropriate, of the SECOND <primary datetime field> of the <extract source>.
    ```
    
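    A hedged illustration (the interval literals and expected values are mine, based on the fields listed above):
    ```
    spark.sql("SELECT extract(YEAR FROM INTERVAL '2-1' YEAR TO MONTH)").show()        // expect 2
    spark.sql("SELECT date_part('MONTH', INTERVAL '2-1' YEAR TO MONTH)").show()       // expect 1
    spark.sql("SELECT extract(HOUR FROM INTERVAL '1 12:30:45' DAY TO SECOND)").show() // expect 12
    ```
    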
    ### Why are the changes needed?
    
    Subtask of ANSI Intervals Support
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes
    1. extract/date_part support ANSI intervals
    2. for non-ansi intervals, the return type is changed from long to byte when extracting hours
    
    ### How was this patch tested?
    
    new added tests
    
    Closes #32351 from yaooqinn/SPARK-35091.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    yaooqinn authored and cloud-fan committed Apr 27, 2021
    Commit 16d223e
  10. [MINOR][DOCS][ML] Explicit return type of array_to_vector utility function
    
    There are two types of dense vectors:
    * pyspark.ml.linalg.DenseVector
    * pyspark.mllib.linalg.DenseVector
    
    In Spark 3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
    The documentation is ambiguous and can lead to the false conclusion that instances of
    pyspark.mllib.linalg.DenseVector will be returned.
    Conversion from the mllib vector types to the ml ones can easily be achieved with
    the MLUtils.convertVectorColumnsToML helper.
    
    ### What changes were proposed in this pull request?
    Make documentation more explicit
    
    ### Why are the changes needed?
    The documentation is a bit misleading, and users can lose time investigating before realizing there are two DenseVector types.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    No test were run as only the documentation was changed
    
    Closes #32255 from jlafaye/master.
    
    Authored-by: Julien Lafaye <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    jlafaye authored and srowen committed Apr 27, 2021
    Commit 592230e
  11. [SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents

    ### What changes were proposed in this pull request?
    Add a link to the JindoFS SDK documentation in the cloud integration section of Spark's official documentation.
    
    ### Why are the changes needed?
    If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Tested the URL manually.
    
    Closes #32360 from adrian-wang/jindodoc.
    
    Authored-by: Daoyuan Wang <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    adrian-wang authored and srowen committed Apr 27, 2021
    Commit 26a8d2f
  12. [SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib

    ### What changes were proposed in this pull request?
    
    Following #30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package.
    
    The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focuses on accelerating the linear algebra routines used in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return exactly the same results as the reference implementation.
    
    Under the hood, it reimplements the necessary algorithms in pure autovectorization-friendly Java 8, as well as takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available.
    
    A table summarising which version gets loaded in which case:
    
    ```
    |                       | BLAS.nativeBLAS                                    | BLAS.javaBLAS                                      |
    | --------------------- | -------------------------------------------------- | -------------------------------------------------- |
    | with -Pnetlib-lgpl    | 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a     | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
    |                       |     wrapper for com.github.fommil:all              |    (JDK16+, relies on the Vector API, requires     |
    |                       | 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    |     `--add-modules=jdk.incubator.vector` on JDK16) |
    |                       |    relies on the Foreign Linker API, requires      | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
    |                       |    `--add-modules=jdk.incubator.foreign            | 3. dev.ludovic.netlib.blas.JavaBLAS                |
    |                       |     -Dforeign.restricted=warn`)                    | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
    |                       | 3. fails to load, falls back to BLAS.javaBLAS in   |     wrapper for com.github.fommil:core             |
    |                       |     org.apache.spark.ml.linalg.BLAS                |                                                    |
    | --------------------- | -------------------------------------------------- | -------------------------------------------------- |
    | without -Pnetlib-lgpl | 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
    |                       |    relies on the Foreign Linker API, requires      |    (JDK16+, relies on the Vector API, requires     |
    |                       |    `--add-modules=jdk.incubator.foreign            |     `--add-modules=jdk.incubator.vector` on JDK16) |
    |                       |     -Dforeign.restricted=warn`)                    | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
    |                       | 2. fails to load, falls back to BLAS.javaBLAS in   | 3. dev.ludovic.netlib.blas.JavaBLAS                |
    |                       |     org.apache.spark.ml.linalg.BLAS                | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
    |                       |                                                    |     wrapper for com.github.fommil:core             |
    | --------------------- | -------------------------------------------------- | -------------------------------------------------- |
    ```
    
    ### Why are the changes needed?
    
    Accelerates linear algebra operations when the pure-java fallback method is in use. Transparently falls back to native implementation (OpenBLAS, MKL) when available.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, all changes are transparent to the user.
    
    ### How was this patch tested?
    
    The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite.
    
    [1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`:
    #### JDK8:
    ```
    [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.Java8BLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS
    [info]
    [info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 223            232           8        448.0           2.2       1.0X
    [info] java                                                221            228           7        453.0           2.2       1.0X
    [info]
    [info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 122            128           4        821.2           1.2       1.0X
    [info] java                                                122            128           4        822.3           1.2       1.0X
    [info]
    [info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 109            112           2        921.4           1.1       1.0X
    [info] java                                                 70             74           3       1423.5           0.7       1.5X
    [info]
    [info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  96             98           2       1046.1           1.0       1.0X
    [info] java                                                 47             49           2       2121.7           0.5       2.0X
    [info]
    [info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 184            195           8        544.3           1.8       1.0X
    [info] java                                                185            196           7        539.5           1.9       1.0X
    [info]
    [info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  99            104           4       1011.9           1.0       1.0X
    [info] java                                                 99            104           4       1010.4           1.0       1.0X
    [info]
    [info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        947.2           1.1       1.0X
    [info] java                                                  0              0           0       1584.8           0.6       1.7X
    [info]
    [info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        867.4           1.2       1.0X
    [info] java                                                  1              1           0        865.0           1.2       1.0X
    [info]
    [info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        485.9           2.1       1.0X
    [info] java                                                  1              1           0        486.8           2.1       1.0X
    [info]
    [info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1843.0           0.5       1.0X
    [info] java                                                  0              0           0       2690.6           0.4       1.5X
    [info]
    [info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1214.7           0.8       1.0X
    [info] java                                                  0              0           0       2536.8           0.4       2.1X
    [info]
    [info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1895.9           0.5       1.0X
    [info] java                                                  0              0           0       2961.1           0.3       1.6X
    [info]
    [info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1223.4           0.8       1.0X
    [info] java                                                  0              0           0       3091.4           0.3       2.5X
    [info]
    [info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 560            575          20       1787.1           0.6       1.0X
    [info] java                                                226            232           5       4432.4           0.2       2.5X
    [info]
    [info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 570            586          23       1755.2           0.6       1.0X
    [info] java                                                227            232           4       4410.1           0.2       2.5X
    [info]
    [info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 863            879          17       1158.4           0.9       1.0X
    [info] java                                                227            231           3       4407.9           0.2       3.8X
    [info]
    [info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                1282           1305          23        780.0           1.3       1.0X
    [info] java                                                227            232           4       4413.4           0.2       5.7X
    [info]
    [info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 538            548           8       1858.6           0.5       1.0X
    [info] java                                                221            226           3       4521.1           0.2       2.4X
    [info]
    [info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 549            558          10       1819.9           0.5       1.0X
    [info] java                                                222            229           7       4503.5           0.2       2.5X
    [info]
    [info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 838            852          12       1193.0           0.8       1.0X
    [info] java                                                222            229           5       4500.5           0.2       3.8X
    [info]
    [info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 905            919          18       1104.8           0.9       1.0X
    [info] java                                                221            228           5       4521.3           0.2       4.1X
    ```
    
    #### JDK11:
    ```
    [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.Java11BLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS
    [info]
    [info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 195            204          10        512.7           2.0       1.0X
    [info] java                                                195            202           7        512.4           2.0       1.0X
    [info]
    [info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 108            113           4        923.3           1.1       1.0X
    [info] java                                                102            107           4        984.4           1.0       1.1X
    [info]
    [info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 107            110           3        938.1           1.1       1.0X
    [info] java                                                 69             72           3       1447.1           0.7       1.5X
    [info]
    [info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  96             98           2       1046.5           1.0       1.0X
    [info] java                                                 43             45           2       2317.1           0.4       2.2X
    [info]
    [info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 155            168           8        644.2           1.6       1.0X
    [info] java                                                158            169           8        632.8           1.6       1.0X
    [info]
    [info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  85             90           4       1178.1           0.8       1.0X
    [info] java                                                 86             90           4       1167.7           0.9       1.0X
    [info]
    [info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   0              0           0       1182.1           0.8       1.0X
    [info] java                                                  0              0           0       1432.1           0.7       1.2X
    [info]
    [info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        898.7           1.1       1.0X
    [info] java                                                  1              1           0        891.5           1.1       1.0X
    [info]
    [info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        495.4           2.0       1.0X
    [info] java                                                  1              1           0        495.7           2.0       1.0X
    [info]
    [info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   0              0           0       2271.6           0.4       1.0X
    [info] java                                                  0              0           0       3648.1           0.3       1.6X
    [info]
    [info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1229.3           0.8       1.0X
    [info] java                                                  0              0           0       2711.3           0.4       2.2X
    [info]
    [info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   0              0           0       2677.5           0.4       1.0X
    [info] java                                                  0              0           0       3288.2           0.3       1.2X
    [info]
    [info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1233.0           0.8       1.0X
    [info] java                                                  0              0           0       2766.3           0.4       2.2X
    [info]
    [info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 520            536          16       1923.6           0.5       1.0X
    [info] java                                                214            221           7       4669.5           0.2       2.4X
    [info]
    [info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 593            612          17       1686.5           0.6       1.0X
    [info] java                                                215            219           3       4643.3           0.2       2.8X
    [info]
    [info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 853            870          16       1172.8           0.9       1.0X
    [info] java                                                215            218           3       4659.7           0.2       4.0X
    [info]
    [info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                1350           1370          23        740.8           1.3       1.0X
    [info] java                                                215            219           4       4656.6           0.2       6.3X
    [info]
    [info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 460            468           6       2173.2           0.5       1.0X
    [info] java                                                210            213           2       4752.7           0.2       2.2X
    [info]
    [info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 535            544           8       1869.3           0.5       1.0X
    [info] java                                                210            215           5       4761.8           0.2       2.5X
    [info]
    [info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 843            853          11       1186.8           0.8       1.0X
    [info] java                                                209            214           4       4793.4           0.2       4.0X
    [info]
    [info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 891            904          15       1122.0           0.9       1.0X
    [info] java                                                209            214           4       4777.2           0.2       4.3X
    ```
    
    #### JDK16:
    ```
    [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.VectorizedBLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS
    [info]
    [info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 194            199           7        515.7           1.9       1.0X
    [info] java                                                181            186           3        551.1           1.8       1.1X
    [info]
    [info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 109            115           4        915.0           1.1       1.0X
    [info] java                                                 88             92           3       1138.8           0.9       1.2X
    [info]
    [info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 108            110           2        922.6           1.1       1.0X
    [info] java                                                 54             56           2       1839.2           0.5       2.0X
    [info]
    [info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  96             97           2       1046.1           1.0       1.0X
    [info] java                                                 29             30           1       3393.4           0.3       3.2X
    [info]
    [info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 156            165           5        643.0           1.6       1.0X
    [info] java                                                150            159           5        667.1           1.5       1.0X
    [info]
    [info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                  85             91           6       1171.0           0.9       1.0X
    [info] java                                                 75             79           3       1340.6           0.7       1.1X
    [info]
    [info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        917.0           1.1       1.0X
    [info] java                                                  0              0           0       8147.2           0.1       8.9X
    [info]
    [info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        859.3           1.2       1.0X
    [info] java                                                  1              1           0        859.3           1.2       1.0X
    [info]
    [info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0        482.1           2.1       1.0X
    [info] java                                                  1              1           0        482.6           2.1       1.0X
    [info]
    [info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   0              0           0       2214.2           0.5       1.0X
    [info] java                                                  0              0           0       7975.8           0.1       3.6X
    [info]
    [info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1231.4           0.8       1.0X
    [info] java                                                  0              0           0       8680.9           0.1       7.0X
    [info]
    [info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   0              0           0       2684.3           0.4       1.0X
    [info] java                                                  0              0           0      18527.1           0.1       6.9X
    [info]
    [info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                   1              1           0       1235.4           0.8       1.0X
    [info] java                                                  0              0           0      17347.9           0.1      14.0X
    [info]
    [info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 530            552          18       1887.5           0.5       1.0X
    [info] java                                                 58             64           3      17143.9           0.1       9.1X
    [info]
    [info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 598            620          17       1671.1           0.6       1.0X
    [info] java                                                 58             64           3      17196.6           0.1      10.3X
    [info]
    [info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 834            847          14       1199.4           0.8       1.0X
    [info] java                                                 57             63           4      17486.9           0.1      14.6X
    [info]
    [info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                1338           1366          22        747.3           1.3       1.0X
    [info] java                                                 58             63           3      17356.6           0.1      23.2X
    [info]
    [info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 489            501           9       2045.5           0.5       1.0X
    [info] java                                                 36             38           2      27721.9           0.0      13.6X
    [info]
    [info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 478            488           9       2094.0           0.5       1.0X
    [info] java                                                 36             38           2      27813.2           0.0      13.3X
    [info]
    [info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 825            837          10       1211.6           0.8       1.0X
    [info] java                                                 35             38           2      28433.1           0.0      23.5X
    [info]
    [info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] ------------------------------------------------------------------------------------------------------------------------
    [info] f2j                                                 900            918          15       1111.6           0.9       1.0X
    [info] java                                                 36             38           2      28073.0           0.0      25.3X
    ```
    
    [2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas
    
    Closes #32253 from luhenry/master.
    
    Authored-by: Ludovic Henry <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    luhenry authored and srowen committed Apr 27, 2021
    Configuration menu
    Copy the full SHA
    5b77ebb View commit details
    Browse the repository at this point in the history

Commits on Apr 28, 2021

  1. [SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark …

    …aarch64 user
    
    ### What changes were proposed in this pull request?
    
    This patch adds a note for aarch64 users to install pyarrow>=4.0.0 specifically.
    
    ### Why are the changes needed?
    
    The pyarrow aarch64 support was [introduced](apache/arrow#9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on 27 April 2021.
    
    See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, this doc helps users install PyArrow on aarch64.
    
    ### How was this patch tested?
    Doc test passed.
    
    Closes #32363 from Yikun/SPARK-34979.
    
    Authored-by: Yikun Jiang <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    Yikun authored and HyukjinKwon committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    0769049 View commit details
    Browse the repository at this point in the history
  2. [SPARK-35236][SQL] Support archive files as resources for CREATE FUNC…

    …TION USING syntax
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to let the `CREATE FUNCTION USING` syntax take archives as resources.
    
    ### Why are the changes needed?
    
    It would be useful. The `CREATE FUNCTION USING` syntax hasn't supported archives as resources because archives were previously unsupported in Spark SQL.
    Now that Spark SQL supports archives, we can support them for this syntax too.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Users can specify archives for `CREATE FUNCTION USING` syntax.
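
    A hedged sketch of what the extended syntax enables (the class name and paths below are placeholders, not from this PR):

    ```
    // Register a permanent function whose resources include an archive; Spark
    // distributes and unpacks the archive alongside the jar.
    spark.sql("""
      CREATE FUNCTION my_udf AS 'com.example.MyUDF'
      USING JAR '/path/to/my_udf.jar', ARCHIVE '/path/to/resources.zip'
    """)
    ```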
    
    ### How was this patch tested?
    
    New test.
    
    Closes #32359 from sarutak/load-function-using-archive.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    sarutak authored and HyukjinKwon committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    abb1f0c View commit details
    Browse the repository at this point in the history
  3. [SPARK-35244][SQL] Invoke should throw the original exception

    ### What changes were proposed in this pull request?
    
    This PR updates the interpreted code path of invoke expressions to unwrap the `InvocationTargetException` and re-throw the original cause.
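
    A minimal sketch of the unwrapping pattern (simplified; not the actual Spark code):

    ```
    import java.lang.reflect.{InvocationTargetException, Method}

    // Reflective calls wrap anything thrown by the target method in an
    // InvocationTargetException; re-throw the original cause instead.
    def invokeUnwrapped(method: Method, obj: AnyRef, args: AnyRef*): AnyRef = {
      try {
        method.invoke(obj, args: _*)
      } catch {
        case e: InvocationTargetException if e.getCause != null =>
          throw e.getCause
      }
    }
    ```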
    
    ### Why are the changes needed?
    
    Make the interpreted and codegen paths consistent for invoke expressions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    new UT
    
    Closes #32370 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    cloud-fan authored and HyukjinKwon committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    10c2b68 View commit details
    Browse the repository at this point in the history
  4. [SPARK-35246][SS] Don't allow streaming-batch intersects

    ### What changes were proposed in this pull request?
    The UnsupportedOperationChecker shouldn't allow streaming-batch intersects. As described in the ticket, they can't actually be planned correctly, and even simple cases like the one below will fail:
    
    ```
      test("intersect") {
        val input = MemoryStream[Long]
        val df = input.toDS().intersect(spark.range(10).as[Long])
        testStream(df) (
          AddData(input, 1L),
          CheckAnswer(1)
        )
      }
    ```
    
    ### Why are the changes needed?
    Users will be confused by the cryptic errors produced from trying to run an invalid query plan.
    
    ### Does this PR introduce _any_ user-facing change?
    Some queries which previously failed with a poor error will now fail with a better one.
    
    ### How was this patch tested?
    modified unit test
    
    Closes #32371 from jose-torres/ossthing.
    
    Authored-by: Jose Torres <[email protected]>
    Signed-off-by: hyukjinkwon <[email protected]>
    jose-torres authored and HyukjinKwon committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    253a1ae View commit details
    Browse the repository at this point in the history
  5. [SPARK-34878][SQL][TESTS] Check actual sizes of year-month and day-ti…

    …me intervals
    
    ### What changes were proposed in this pull request?
    Since we now support year-month and day-time intervals, this adds a test for the actual sizes of the year-month and day-time interval types.
    
    ### Why are the changes needed?
    Just adds a test.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Ran `./dev/scalastyle` and the tests in `ColumnTypeSuite`.
    
    Closes #32366 from Peng-Lei/SPARK-34878.
    
    Authored-by: PengLei <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    Peng-Lei authored and MaxGekk committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    046c8c3 View commit details
    Browse the repository at this point in the history
  6. [SPARK-35085][SQL] Get columns operation should handle ANSI interval …

    …column properly
    
    ### What changes were proposed in this pull request?
    This PR lets JDBC clients identify ANSI interval columns properly.
    
    ### Why are the changes needed?
    This PR is similar to #29539.
    JDBC users can query interval values through the Thrift server and create views with ANSI interval columns, e.g.
    `CREATE global temp view view1 as select interval '1-1' year to month as I;`
    but when they want to get the details of the columns of `view1`, they will fail with `Unrecognized type name: YEAR-MONTH INTERVAL`:
    ```
    Caused by: java.lang.IllegalArgumentException: Unrecognized type name: YEAR-MONTH INTERVAL
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:190)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:206)
    	at scala.collection.immutable.List.foreach(List.scala:392)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:198)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7(SparkGetColumnsOperation.scala:109)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7$adapted(SparkGetColumnsOperation.scala:109)
    	at scala.Option.foreach(Option.scala:407)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:109)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:107)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:107)
    	... 34 more
    ```
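
    A hedged sketch of how a JDBC client reaches this path (the connection URL is a placeholder; `getColumns` is the standard `java.sql.DatabaseMetaData` call served by `SparkGetColumnsOperation`):

    ```
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    conn.createStatement().execute(
      "CREATE GLOBAL TEMP VIEW view1 AS SELECT INTERVAL '1-1' YEAR TO MONTH AS I")

    // Before this fix, fetching column metadata threw on the interval column.
    val cols = conn.getMetaData.getColumns(null, "global_temp", "view1", null)
    while (cols.next()) {
      println(s"${cols.getString("COLUMN_NAME")} -> ${cols.getString("TYPE_NAME")}")
    }
    conn.close()
    ```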
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Hive JDBC clients can now recognize ANSI intervals.
    
    ### How was this patch tested?
    Jenkins test.
    
    Closes #32345 from beliefer/SPARK-35085.
    
    Lead-authored-by: gengjiaan <[email protected]>
    Co-authored-by: beliefer <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    2 people authored and MaxGekk committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    56bb815 View commit details
    Browse the repository at this point in the history
  7. [SPARK-33976][SQL][DOCS][FOLLOWUP] Fix syntax error in select doc page

    ### What changes were proposed in this pull request?
    Adds docs about `TRANSFORM` and related functions.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes #32257 from AngersZhuuuu/SPARK-33976-followup.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    AngersZhuuuu authored and maropu committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    26a5e33 View commit details
    Browse the repository at this point in the history
  8. [SPARK-35214][SQL] OptimizeSkewedJoin support ShuffledHashJoinExec

    ### What changes were proposed in this pull request?
    
    Add `ShuffledHashJoin` pattern check in `OptimizeSkewedJoin` so that we can optimize it.
    
    ### Why are the changes needed?
    
    Currently, we already support all types of joins through hints, which makes it easy to choose the join implementation.
    
    We would choose `ShuffledHashJoin` if one table is not big but is over the broadcast threshold. It's better if `OptimizeSkewedJoin` can optimize it as well; a usage sketch follows below.
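
    For illustration, a hedged sketch of forcing such a join via a hint so that AQE's skew handling can apply to it (table and column names are placeholders):

    ```
    // With AQE and skew-join handling enabled, OptimizeSkewedJoin can now also
    // split skewed partitions of a shuffled hash join.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    val joined = spark.sql("""
      SELECT /*+ SHUFFLE_HASH(t1) */ *
      FROM t1 JOIN t2 ON t1.key = t2.key
    """)
    joined.explain()  // skewed shuffle partitions may now be split
    ```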
    
    ### Does this PR introduce _any_ user-facing change?
    
    Probably yes; the execution plan in AQE mode may change.
    
    ### How was this patch tested?
    
    Improved an existing test in `AdaptiveQueryExecSuite`.
    
    Closes #32328 from ulysses-you/SPARK-35214.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    ulysses-you authored and maropu committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    8b62c29 View commit details
    Browse the repository at this point in the history
  9. [SPARK-34781][SQL][FOLLOWUP] Adjust the order of AQE optimizer rules

    ### What changes were proposed in this pull request?
    
    Reorder `DemoteBroadcastHashJoin` and `EliminateUnnecessaryJoin`.
    
    ### Why are the changes needed?
    
    Skip the unnecessary check in `DemoteBroadcastHashJoin` when `EliminateUnnecessaryJoin` takes effect first.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    No test results are affected.
    
    Closes #32380 from ulysses-you/SPARK-34781-FOLLOWUP.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    ulysses-you authored and cloud-fan committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    0bcf348 View commit details
    Browse the repository at this point in the history
  10. [SPARK-34981][SQL] Implement V2 function resolution and evaluation

    Co-Authored-By: Chao Sun <sunchaoapple.com>
    Co-Authored-By: Ryan Blue <rbluenetflix.com>
    
    ### What changes were proposed in this pull request?
    
    This implements function resolution and evaluation for functions registered through V2 FunctionCatalog [SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658). In particular:
    - Added documentation for how to define the "magic method" in `ScalarFunction`.
    - Added a new expression `ApplyFunctionExpression` which evaluates input by delegating to `ScalarFunction.produceResult` method.
    - Added a new expression `V2Aggregator`, which is a type of `TypedImperativeAggregate`. It wraps a V2 `AggregateFunction` and mostly delegates methods to the latter's implementation. It also uses plain Java serde for intermediate state.
    - Added function resolution logic for `ScalarFunction` and `AggregateFunction` in `Analyzer`.
      + For `ScalarFunction` this checks if the magic method is implemented through Java reflection, and creates an `Invoke` expression if so. Otherwise, it checks whether the default `produceResult` is overridden; if so, it creates an `ApplyFunctionExpression` which evaluates through `InternalRow`. Otherwise an analysis exception is thrown.
      + For `AggregateFunction`, this checks if the `update` method is overridden. If so, it converts it to `V2Aggregator`. Otherwise an analysis exception is thrown, similar to the case of `ScalarFunction`.
    - Extended existing `InMemoryTableCatalog` to add the function catalog capability. Also renamed it to `InMemoryCatalog` since it no longer only covers tables.
    
    **Note**: this currently can successfully detect whether a subclass overrides the default `produceResult` or `update` method from the parent interface **only for Java implementations**. In Scala it seems hard to differentiate whether a subclass overrides a default method from its parent interface, so in that case the error surfaces at runtime instead of at analysis time.
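
    To illustrate the magic-method contract described above, here is a hedged sketch of a V2 scalar function (interface shape taken from the `FunctionCatalog` API; details may differ by version):

    ```
    import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
    import org.apache.spark.sql.types.{DataType, IntegerType}

    class IntAdd extends ScalarFunction[Integer] {
      override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
      override def resultType(): DataType = IntegerType
      override def name(): String = "int_add"

      // "Magic method": discovered via reflection and bound with Invoke, which
      // avoids InternalRow boxing; without it, the analyzer falls back to an
      // overridden produceResult (or throws an analysis exception).
      def invoke(left: Int, right: Int): Int = left + right
    }
    ```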
    
    A few TODOs:
    - Extend `V2SessionCatalog` with function catalog. This seems a little tricky since APIs such as the V2 `FunctionCatalog`'s `loadFunction` differ from the V1 `SessionCatalog`'s `lookupFunction`.
    - Add magic method for `AggregateFunction`.
    - Type coercion when looking up functions
    
    ### Why are the changes needed?
    
    As V2 FunctionCatalog APIs are finalized, we should integrate it with function resolution and evaluation process so that they are actually useful.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now a function exposed through V2 FunctionCatalog can be analyzed and evaluated.
    
    ### How was this patch tested?
    
    Added new unit tests.
    
    Closes #32082 from sunchao/resolve-func-v2.
    
    Lead-authored-by: Chao Sun <[email protected]>
    Co-authored-by: Chao Sun <[email protected]>
    Co-authored-by: Chao Sun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    3 people authored and cloud-fan committed Apr 28, 2021
    Configuration menu
    Copy the full SHA
    86d3bb5 View commit details
    Browse the repository at this point in the history

Commits on Apr 29, 2021

  1. [SPARK-35244][SQL][FOLLOWUP] Add null check for the exception cause

    ### What changes were proposed in this pull request?
    
    Make sure we re-throw an exception that is not null.
    
    ### Why are the changes needed?
    
    to be super safe
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32387 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    cloud-fan authored and maropu committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    403e479 View commit details
    Browse the repository at this point in the history
  2. [SPARK-35135][CORE] Turn the WritablePartitionedIterator from a tra…

    …it into a default implementation class
    
    ### What changes were proposed in this pull request?
    `WritablePartitionedIterator` is defined in `WritablePartitionedPairCollection.scala`, and there were two implementations of this trait whose code was duplicated.
    
    The main change of this PR is to turn `WritablePartitionedIterator` from a trait into a default implementation class, because there is only one implementation now.
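
    A hedged, simplified sketch of the resulting class shape (a stand-in for the actual Spark code, which writes through a pair writer):

    ```
    // One concrete class replaces the trait and its duplicated anonymous
    // implementations: it walks ((partitionId, key), value) tuples in order.
    class WritablePartitionedIterator[K, V](it: Iterator[((Int, K), V)]) {
      private var cur: ((Int, K), V) = if (it.hasNext) it.next() else null

      def writeNext(write: (K, V) => Unit): Unit = {
        write(cur._1._2, cur._2)
        cur = if (it.hasNext) it.next() else null
      }

      def hasNext: Boolean = cur != null

      def nextPartition(): Int = cur._1._1
    }
    ```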
    
    ### Why are the changes needed?
    Cleanup duplicate code.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes #32232 from LuciferYang/writable-partitioned-iterator.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: yi.wu <[email protected]>
    LuciferYang authored and Ngone51 committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    74b9326 View commit details
    Browse the repository at this point in the history
  3. [SPARK-34786][SQL][FOLLOWUP] Explicitly declare DecimalType(20, 0) fo…

    …r Parquet UINT_64
    
    ### What changes were proposed in this pull request?
    
    Explicitly declare `DecimalType(20, 0)` for Parquet UINT_64, avoiding `DecimalType.LongDecimal`, which only happens to have a precision of 20.
    
    #31960 (comment)
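
    For context, a small check of why 20 digits are needed (the explicit mapping is what this PR declares):

    ```
    import org.apache.spark.sql.types.DecimalType

    // Max UINT_64 is 18446744073709551615, i.e. 20 decimal digits, so unsigned
    // 64-bit Parquet integers need DecimalType(20, 0) rather than a type that
    // merely happens to share that precision.
    val uint64Max = BigInt(2).pow(64) - 1
    assert(uint64Max.toString.length == 20)
    val parquetUint64Type = DecimalType(20, 0)
    ```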
    
    ### Why are the changes needed?
    
    fix ambiguity
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    Not needed; the current CI passes.
    
    Closes #32390 from yaooqinn/SPARK-34786-F.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    yaooqinn authored and cloud-fan committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    7713565 View commit details
    Browse the repository at this point in the history
  4. [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources

    ### What changes were proposed in this pull request?
    
    This PR proposes to introduce a new JDBC option, `refreshKrb5Config`, which allows changes to `krb5.conf` to be picked up.
    
    ### Why are the changes needed?
    
    In the current master, JDBC datasources don't accept `refreshKrb5Config`, which is defined in `Krb5LoginModule`.
    So even if we change `krb5.conf` after establishing a connection, the change will not be reflected.
    
    A similar issue happens when we run multiple `*KrbIntegrationSuites` at the same time: `MiniKDC` starts and stops for every KerberosIntegrationSuite, and a different port number is recorded in `krb5.conf` each time.
    Because `SecureConnectionProvider.JDBCConfiguration` doesn't set `refreshKrb5Config`, every KerberosIntegrationSuite except the first running one sees the wrong port, so those suites fail.
    You can easily confirm with the following command.
    ```
    build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
    ```
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Users can set `refreshKrb5Config` to refresh the Kerberos-relevant configuration.
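
    A hedged usage sketch (the URL, table, and Kerberos settings are placeholders; `keytab` and `principal` are pre-existing JDBC options):

    ```
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/mydb")
      .option("dbtable", "orders")
      .option("keytab", "/etc/security/keytabs/user.keytab")
      .option("principal", "user@EXAMPLE.COM")
      // New in this PR: re-read krb5.conf on login so that changes made after
      // the first connection are picked up.
      .option("refreshKrb5Config", "true")
      .load()
    ```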
    
    ### How was this patch tested?
    
    New test.
    
    Closes #32344 from sarutak/kerberos-refresh-issue.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    sarutak committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    529b875 View commit details
    Browse the repository at this point in the history
  5. [SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE co…

    …mmands
    
    ### What changes were proposed in this pull request?
    
    This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive.
    
    ### Why are the changes needed?
    
    To make those commands more useful.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. In the current implementation, those commands can take a path containing whitespace without enclosing it in either `'` or `"`, but after this change, users need to enclose such paths.
    I've noted this incompatibility in the migration guide.
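
    A hedged sketch of the extended commands (paths are placeholders); note the quoting now required for paths containing whitespace:

    ```
    // Multiple resources in a single command, Hive-style.
    spark.sql("ADD JAR '/libs/udfs.jar' '/libs/extra.jar'")
    spark.sql("ADD FILE '/data/lookup one.csv' '/data/lookup_two.csv'")
    spark.sql("ADD ARCHIVE '/deps/env.tar.gz' '/deps/models.zip'")
    ```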
    
    ### How was this patch tested?
    
    New tests.
    
    Closes #32205 from sarutak/add-multiple-files.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    sarutak committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    132cbf0 View commit details
    Browse the repository at this point in the history
  6. [SPARK-35234][CORE] Reserve the format of stage failureMessage

    ### What changes were proposed in this pull request?
    
    `failureMessage` is already formatted, but `replaceAll("\n", " ")` destroyed that formatting. This PR fixes it.
    
    ### Why are the changes needed?
    
    The formatted error message is easier to read and debug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users see the clear error message in the application log.
    
    (Note: I changed the test slightly to make it throw an exception intentionally. The test itself is good.)
    
    Before:
    ![2141619490903_ pic_hd](https://user-images.githubusercontent.com/16397174/116177970-5a092f00-a747-11eb-9a0f-017391e80c8b.jpg)
    
    After:
    
    ![2151619490955_ pic_hd](https://user-images.githubusercontent.com/16397174/116177981-5ecde300-a747-11eb-90ef-fd16e906beeb.jpg)
    
    ### How was this patch tested?
    
    Manually tested.
    
    Closes #32356 from Ngone51/format-stage-error-message.
    
    Authored-by: yi.wu <[email protected]>
    Signed-off-by: attilapiros <[email protected]>
    Ngone51 authored and attilapiros committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    068b6c8 View commit details
    Browse the repository at this point in the history
  7. [SPARK-35269][BUILD] Upgrade commons-lang3 to 3.12.0

    ### What changes were proposed in this pull request?
    
    This pr aims to upgrade Apache commons-lang3 to 3.12.0
    
    ### Why are the changes needed?
    This version brings the latest bug fixes; see:
    
    - https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes #32393 from LuciferYang/lang3-to-312.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    LuciferYang authored and dongjoon-hyun committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    7b78e34 View commit details
    Browse the repository at this point in the history
  8. [SPARK-35254][BUILD] Upgrade SBT to 1.5.1

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade SBT to 1.5.1.
    
    ### Why are the changes needed?
    
    https://github.com/sbt/sbt/releases/tag/v1.5.1
    
    ### Does this PR introduce _any_ user-facing change?
    
    NO.
    
    ### How was this patch tested?
    
    Pass the SBT CIs (Build/Test/Docs/Plugins).
    
    Closes #32382 from lipzhu/SPARK-35254.
    
    Authored-by: lipzhu <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    lipzhu authored and dongjoon-hyun committed Apr 29, 2021
    Configuration menu
    Copy the full SHA
    4e3daa5 View commit details
    Browse the repository at this point in the history
  9. [SPARK-35009][CORE] Avoid creating multiple python worker monitor thr…

    …eads for the same worker and same task context
    
    ### What changes were proposed in this pull request?
    
    With this PR, Spark avoids creating multiple monitor threads for the same worker and the same task context.
    
    ### Why are the changes needed?
    
    Without this change, unnecessary threads are created. It can even cause job failures, for example when a coalesce (without shuffle) goes from a high partition number to a very low one. The following exception comes from exactly such a run:
    
    ```
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.210 executor driver): java.lang.OutOfMemoryError: unable to create new native thread
    	at java.lang.Thread.start0(Native Method)
    	at java.lang.Thread.start(Thread.java:717)
    	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
    	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
    	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator.foreach(Iterator.scala:941)
    	at scala.collection.Iterator.foreach$(Iterator.scala:941)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
    	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
    	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2262)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2211)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2210)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2210)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1083)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1083)
    	at scala.Option.foreach(Option.scala:407)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1083)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2449)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2391)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2380)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:872)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2220)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2260)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
    	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:282)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:238)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.OutOfMemoryError: unable to create new native thread
    	at java.lang.Thread.start0(Native Method)
    	at java.lang.Thread.start(Thread.java:717)
    	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
    	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
    	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator.foreach(Iterator.scala:941)
    	at scala.collection.Iterator.foreach$(Iterator.scala:941)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
    	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
    	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	... 1 more
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Tested manually with the following Python script (`reproduce-SPARK-35009.py`):
    
    ```
    import pyspark
    
    conf = pyspark.SparkConf().setMaster("local[*]").setAppName("Test1")
    sc = pyspark.SparkContext.getOrCreate(conf)
    
    rows = 70000
    data = list(range(rows))
    rdd = sc.parallelize(data, rows)
    assert rdd.getNumPartitions() == rows
    rdd0 = rdd.filter(lambda x: False)
    data = rdd0.coalesce(1).collect()
    assert data == []
    ```
    
    Spark submit:
    ```
    $ ./bin/spark-submit reproduce-SPARK-35009.py
    ```
    
    #### With this change
    
    Checking the number of monitor threads with jcmd:
    ```
    $ jcmd
    85273 sun.tools.jcmd.JCmd
    85227 org.apache.spark.deploy.SparkSubmit reproduce-SPARK-35009.py
    41020 scala.tools.nsc.MainGenericRunner
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    ...
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    $ jcmd 85227 Thread.print | grep -c "Monitor for python"
    2
    ```
    <img width="859" alt="Screenshot 2021-04-14 at 16 06 51" src="https://user-images.githubusercontent.com/2017933/114731755-4969b980-9d42-11eb-8ec5-f60b217bdd96.png">
    
    #### Without this change
    
    ```
    ...
    $ jcmd 90052 Thread.print | grep -c "Monitor for python"
    5645
    ...
    ```
    
    <img width="856" alt="Screenshot 2021-04-14 at 16 30 18" src="https://user-images.githubusercontent.com/2017933/114731724-4373d880-9d42-11eb-9f9b-d976bf2530e2.png">
    
    Closes #32169 from attilapiros/SPARK-35009.
    
    Authored-by: attilapiros <[email protected]>
    Signed-off-by: attilapiros <[email protected]>
    attilapiros committed Apr 29, 2021
    Commit: 738cf7f
  10. [SPARK-35268][BUILD] Upgrade GenJavadoc to 0.17

    ### What changes were proposed in this pull request?
    
    This PR upgrades `GenJavadoc` to `0.17`.
    
    ### Why are the changes needed?
    
    This version seems to include a fix for an issue which can happen with Scala 2.13.5.
    https://github.com/lightbend/genjavadoc/releases/tag/v0.17
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed the build succeeds with the following commands.
    ```
    # For Scala 2.12
    $ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc
    
    # For Scala 2.13
    $ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
    ```
    
    Closes #32392 from sarutak/upgrade-genjavadoc-0.17.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sarutak authored and dongjoon-hyun committed Apr 29, 2021
    Commit: 8a5af37
  11. [SPARK-35047][SQL] Allow Json datasources to write non-ascii characters as codepoints
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to enable the JSON datasources to write non-ascii characters as codepoints.
    To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for JSON datasources.
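
    A hedged usage sketch in Scala (the option name comes from this PR; the path and data are illustrative, and a SparkSession named `spark` is assumed):
    
    ```
    import spark.implicits._
    
    // The value is written as "\u3042\u3044\u3046" instead of the raw characters.
    Seq("あいう").toDF("s")
      .write
      .option("writeNonAsciiCharacterAsCodePoint", "true")
      .json("/tmp/json-codepoints")
    ```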
    
    ### Why are the changes needed?
    
    The JSON specification allows codepoints as literals, but Spark SQL's JSON datasources provide no way to write them.
    It would be useful to write non-ascii characters as codepoints, which is a platform-neutral representation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Users can write non-ascii characters as codepoints with JSON datasources.
    
    ### How was this patch tested?
    
    New test.
    
    Closes #32147 from sarutak/json-unicode-write.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sarutak authored and dongjoon-hyun committed Apr 29, 2021
    Commit: e8bf8fe

Commits on Apr 30, 2021

  1. [SPARK-35255][BUILD] Automated formatting for Scala Code for Blank Lines

    ### What changes were proposed in this pull request?
    
    https://github.com/databricks/scala-style-guide#blanklines
    https://scalameta.org/scalafmt/docs/configuration.html#newlinestoplevelstatements
    
    ### How was this patch tested?
    
    Manually tested by modifying a few files and running ./dev/scalafmt then checking that ./dev/scalastyle still passed.
    
    Closes #32383 from lipzhu/SPARK-35255.
    
    Authored-by: lipzhu <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    lipzhu authored and HyukjinKwon committed Apr 30, 2021
    Commit: 77e9152
  2. [SPARK-35277][BUILD] Upgrade snappy to 1.1.8.4

    ### What changes were proposed in this pull request?
    This PR aims to upgrade snappy to version 1.1.8.4.
    
    ### Why are the changes needed?
    This will bring the latest bug fixes and improvements.
    - https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-1183-2021-01-20
    
        - Make pure-java Snappy thread-safe
        - Improved SnappyFramedInput/OutputStream performance by using java.util.zip.CRC32C
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    Pass the CIs.
    
    Closes #32402 from williamhyun/snappy1184.
    
    Authored-by: William Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    williamhyun authored and dongjoon-hyun committed Apr 30, 2021
    Commit: ac8813e
  3. [SPARK-35111][SQL] Support Cast string to year-month interval

    ### What changes were proposed in this pull request?
    Support Cast string to year-month interval
    The supported formats are as follows:
    ```
    ANSI_STYLE, like
    INTERVAL -'-10-1' YEAR TO MONTH
    HIVE_STYLE like
    10-1 or -10-1
    
    Rules from the SQL standard about ANSI_STYLE:
    
    <interval literal> ::=
      INTERVAL [ <sign> ] <interval string> <interval qualifier>
    <interval string> ::=
      <quote> <unquoted interval string> <quote>
    <unquoted interval string> ::=
      [ <sign> ] { <year-month literal> | <day-time literal> }
    <year-month literal> ::=
      <years value> [ <minus sign> <months value> ]
      | <months value>
    <years value> ::=
      <datetime value>
    <months value> ::=
      <datetime value>
    <datetime value> ::=
      <unsigned integer>
    <unsigned integer> ::= <digit>...
    ```
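
    A minimal Scala sketch of the new cast (assuming the ANSI interval type name that SPARK-35285 below makes parseable in SQL):
    
    ```
    // HIVE_STYLE strings cast to the new year-month interval type.
    spark.sql("SELECT CAST('10-1' AS INTERVAL YEAR TO MONTH)").show()
    spark.sql("SELECT CAST('-10-1' AS INTERVAL YEAR TO MONTH)").show()
    ```
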
    ### Why are the changes needed?
    Support Cast string to year-month interval
    
    ### Does this PR introduce _any_ user-facing change?
    Users can cast year-month interval strings to `YearMonthIntervalType`.
    
    ### How was this patch tested?
    Added UT
    
    Closes #32266 from AngersZhuuuu/SPARK-SPARK-35111.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed Apr 30, 2021
    Commit: 11ea255
  4. [SPARK-35264][SQL] Support AQE side broadcastJoin threshold

    ### What changes were proposed in this pull request?
    
    ~~This PR aims to add a new AQE optimizer rule `DynamicJoinSelection`. Like other AQE partition number configs, this rule adds a new broadcast threshold config `spark.sql.adaptive.autoBroadcastJoinThreshold`.~~
    This PR aims to add a flag in `Statistics` to distinguish AQE stats from normal stats, so that some SQL configs can be isolated between AQE and the normal planner.
    
    ### Why are the changes needed?
    
    The main idea here is to isolate the join configs between the normal planner and the AQE planner, which share the same code path.
    
    Static stats are not trustworthy enough to decide whether to build a broadcast hash join. In our experience it is very common for Spark to throw a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And since a broadcast join is not reversible, once we convert a join to a broadcast hash join the first time, AQE cannot optimize it again; so it makes sense to let the AQE side decide on broadcasting with a different SQL config.
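
    A hedged usage sketch (the AQE config name comes from this PR; the static threshold already exists):
    
    ```
    // Disable AQE-side broadcast conversion without touching the static threshold.
    spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "-1")
    
    // The pre-existing static threshold still governs the initial plan.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
    ```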
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a new config `spark.sql.adaptive.autoBroadcastJoinThreshold` is added.
    
    ### How was this patch tested?
    
    Add new test.
    
    Closes #32391 from ulysses-you/SPARK-35264.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    ulysses-you authored and cloud-fan committed Apr 30, 2021
    Commit: 39889df
  5. [SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi

    ### What changes were proposed in this pull request?
    
    Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0.
    
    ### Why are the changes needed?
    
    Apache Spark 3.1.1 made the `Kubernetes` module GA and provides an extensible external cluster manager framework. To have an `ExternalClusterManager` for the K8s environment, the `KubernetesUtils` class is crucial and needs to be stable. By promoting it to a subset of the K8s developer API, we can maintain it in a more sustainable way and give K8s users better, more stable functionality.
    
    In this PR, `Since` annotations denote the last function signature changes because these are going to become public at Apache Spark 3.2.0.
    
    | Version | Function Name |
    |-|-|
    | 2.3.0 | parsePrefixedKeyValuePairs |
    | 2.3.0 | requireNandDefined |
    | 2.3.0 | parsePrefixedKeyValuePairs |
    | 2.4.0 | parseMasterUrl |
    | 3.0.0 | requireBothOrNeitherDefined |
    | 3.0.0 | requireSecondIfFirstIsDefined |
    | 3.0.0 | selectSparkContainer |
    | 3.0.0 | formatPairsBundle |
    | 3.0.0 | formatPodState |
    | 3.0.0 | containersDescription |
    | 3.0.0 | containerStatusDescription |
    | 3.0.0 | formatTime |
    | 3.0.0 | uniqueID |
    | 3.0.0 | buildResourcesQuantities |
    | 3.0.0 | uploadAndTransformFileUris |
    | 3.0.0 | uploadFileUri |
    | 3.0.0 | requireBothOrNeitherDefined |
    | 3.0.0 | buildPodWithServiceAccount |
    | 3.0.0 | isLocalAndResolvable |
    | 3.1.1 | renameMainAppResource |
    | 3.1.1 | addOwnerReference |
    | 3.2.0 | loadPodFromTemplate |
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but these are new API additions.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes #32406 from dongjoon-hyun/SPARK-35280.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed Apr 30, 2021
    Commit: 4e8701a

Commits on May 1, 2021

  1. [SPARK-35273][SQL] CombineFilters support non-deterministic expressions

    ### What changes were proposed in this pull request?
    
    This PR makes `CombineFilters` support non-deterministic expressions. For example:
    ```sql
    spark.sql("CREATE TABLE t1(id INT, dt STRING) using parquet PARTITIONED BY (dt)")
    spark.sql("CREATE VIEW v1 AS SELECT * FROM t1 WHERE dt NOT IN ('2020-01-01', '2021-01-01')")
    spark.sql("SELECT * FROM v1 WHERE dt = '2021-05-01' AND rand() <= 0.01").explain()
    ```
    
    Before this pr:
    ```
    == Physical Plan ==
    *(1) Filter (isnotnull(dt#1) AND ((dt#1 = 2021-05-01) AND (rand(-6723800298719475098) <= 0.01)))
    +- *(1) ColumnarToRow
       +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [NOT dt#1 IN (2020-01-01,2021-01-01)], PushedFilters: [], ReadSchema: struct<id:int>
    ```
    
    After this pr:
    ```
    == Physical Plan ==
    *(1) Filter (rand(-2400509328955813273) <= 0.01)
    +- *(1) ColumnarToRow
       +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), NOT dt#1 IN (2020-01-01,2021-01-01), (dt#1 = 2021-05-01)], PushedFilters: [], ReadSchema: struct<id:int>
    ```
    
    ### Why are the changes needed?
    
    Improve query performance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #32405 from wangyum/SPARK-35273.
    
    Authored-by: Yuming Wang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    wangyum authored and cloud-fan committed May 1, 2021
    Commit: 72e238a
  2. [SPARK-35278][SQL] Invoke should find the method with correct number of parameters
    
    ### What changes were proposed in this pull request?
    
    This patch fixes the `Invoke` expression when the target object has more than one method with the given name.
    
    ### Why are the changes needed?
    
    `Invoke` looks up a method on the target object by the given method name. If there is more than one method with that name, it is currently non-deterministic which one will be used. We should also match on the number of parameters when finding the method.
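
    A Scala sketch of the ambiguity (illustrative class, not the actual Catalyst code):
    
    ```
    class Target {
      def lookup(key: String): String = key
      def lookup(key: String, default: String): String = default
    }
    
    // Looking up by name alone is ambiguous; matching on name and parameter
    // count, as this PR does, picks the intended overload deterministically.
    val m = classOf[Target].getMethods
      .filter(m => m.getName == "lookup" && m.getParameterCount == 1)
      .head
    println(m.invoke(new Target, "k")) // prints "k"
    ```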
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, fixed a bug when using `Invoke` on an object that has more than one method with the given name.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #32404 from viirya/verify-invoke-param-len.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed May 1, 2021
    Commit: 6ce1b16

Commits on May 2, 2021

  1. [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
    
    ### What changes were proposed in this pull request?
    This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.
    
    ### Why are the changes needed?
    If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex, then the `Optimizer` can optimize out grouping expressions from them, making the aggregate expressions invalid.
    
    Here is a simple example:
    ```
    SELECT not(t.id IS NULL) , count(*)
    FROM t
    GROUP BY t.id IS NULL
    ```
    In this case the `BooleanSimplification` rule does this:
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
    !Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
     +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
        +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
    ```
    where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.
    
    Before this PR:
    ```
    == Optimized Logical Plan ==
    Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
    +- Project [value#219 AS id#222]
       +- LocalRelation [value#219]
    ```
    and running the query throws an error:
    ```
    Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
    java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
    ```
    
    After this PR:
    ```
    == Optimized Logical Plan ==
    Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
    +- Project [isnull(value#219) AS _groupingexpression#233]
       +- LocalRelation [value#219]
    ```
    and the query works.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the query works.
    
    ### How was this patch tested?
    Added new UT.
    
    Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.
    
    Authored-by: Peter Toth <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    peter-toth authored and cloud-fan committed May 2, 2021
    Commit: cfc0495
  2. [SPARK-35112][SQL] Support Cast string to day-second interval

    ### What changes were proposed in this pull request?
    Support casting strings to day-second intervals.
    
    ### Why are the changes needed?
    So that users can cast day-second interval strings to `DayTimeIntervalType`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes #32271 from AngersZhuuuu/SPARK-35112.
    
    Lead-authored-by: Angerszhuuuu <[email protected]>
    Co-authored-by: AngersZhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed May 2, 2021
    Commit: caa46ce

Commits on May 3, 2021

  1. [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
    Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.
    
    The code authors of these classes are:
    juliuszsompolski
    npoggi
    wangyum
    
    ### Why are the changes needed?
    
    We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
     - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
     - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
    
    I think this causes some inconvenience, e.g., we need to update files in both repositories whenever we update the TPCDS schema (#32037). So it would be useful for the Spark codebase to generate the data by referring to a single schema definition.
    
    ### Does this PR introduce _any_ user-facing change?
    
    dev only.
    
    ### How was this patch tested?
    
    Manually checked and GA passed.
    
    Closes #32243 from maropu/tpcdsDatagen.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    maropu committed May 3, 2021
    Commit: cd689c9
  2. [SPARK-35285][SQL] Parse ANSI interval types in SQL schema

    ### What changes were proposed in this pull request?
    1. Extend Spark SQL parser to support parsing of:
        - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
        - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
    2. Assign new names to the ANSI interval types according to the SQL standard, so that Spark SQL's parser can parse the names back. Override `typeName()` of `YearMonthIntervalType`/`DayTimeIntervalType`.
    
    ### Why are the changes needed?
    To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
    ```
    <interval type> ::= INTERVAL <interval qualifier>
    <interval qualifier> ::= <start field> TO <end field> | <single datetime field>
    <start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
    <end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
    <primary datetime field> ::= <non-second primary datetime field | SECOND
    <non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
    <interval fractional seconds precision> ::= <unsigned integer>
    <interval leading field precision> ::= <unsigned integer>
    ```
    Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.
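
    A minimal Scala sketch of the new type names round-tripping through a DDL schema string (per this PR):
    
    ```
    import org.apache.spark.sql.types.StructType
    
    // Parses to YearMonthIntervalType and DayTimeIntervalType respectively.
    val schema = StructType.fromDDL(
      "ym INTERVAL YEAR TO MONTH, dt INTERVAL DAY TO SECOND")
    println(schema.toDDL)
    ```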
    
    ### Does this PR introduce _any_ user-facing change?
    Should not, since the types have not been released yet.
    
    ### How was this patch tested?
    By running the affected tests such as:
    ```
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
    $ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
    $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
    ```
    
    Closes #32409 from MaxGekk/parse-ansi-interval-types.
    
    Authored-by: Max Gekk <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    MaxGekk authored and HyukjinKwon committed May 3, 2021
    Commit: 335f00b
  3. [SPARK-35281][SQL] StaticInvoke should not apply boxing if return type is primitive
    
    ### What changes were proposed in this pull request?
    
    In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.
    
    ### Why are the changes needed?
    
    It is unnecessary to apply boxing when the method return value is of primitive type, and it would hurt performance a lot if the method is simple. The check is done in `Invoke` but not in `StaticInvoke`.
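
    Illustrative Scala only (not Spark's generated code), to show the cost being avoided:
    
    ```
    // Boxed return: may allocate an Integer wrapper on every call.
    def viaBoxed(a: Int, b: Int): java.lang.Integer = Integer.valueOf(a + b)
    
    // Primitive return: no allocation, which matters in per-row generated code.
    def viaPrimitive(a: Int, b: Int): Int = a + b
    ```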
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added a UT.
    
    Closes #32416 from sunchao/SPARK-35281.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    sunchao authored and HyukjinKwon committed May 3, 2021
    Commit: 2a8d7ed
  4. [SPARK-35176][PYTHON] Standardize input validation error type

    ### What changes were proposed in this pull request?
    This PR corrects the exception types raised when function input params fail to validate due to a type error.
    To make review convenient, there are 3 commits in this PR:
    - Standardize input validation error type on sql
    - Standardize input validation error type on ml
    - Standardize input validation error type on pandas
    
    ### Why are the changes needed?
    As the Python exception doc [1] suggests, TypeError is "Raised when an operation or function is applied to an object of inappropriate type." However, ValueError is raised instead in a number of places in the PySpark code; this patch fixes them.
    
    [1] https://docs.python.org/3/library/exceptions.html#TypeError
    
    Note: this patch only addresses the existing wrong raise types for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, code can raise the right TypeError instead of ValueError.
    
    ### How was this patch tested?
    Existing test case and UT
    
    Closes #32368 from Yikun/SPARK-35176.
    
    Authored-by: Yikun Jiang <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    Yikun authored and HyukjinKwon committed May 3, 2021
    Commit: 44b7931
  5. [SPARK-35266][TESTS] Fix error in BenchmarkBase.scala that occurs when creating benchmark files in non-existent directory
    
    ### What changes were proposed in this pull request?
    This PR fixes an error in `BenchmarkBase.scala` that occurs when creating a benchmark file in a non-existent directory.
    
    ### Why are the changes needed?
    When submitting a benchmark job using `org.apache.spark.benchmark.Benchmarks` class with `SPARK_GENERATE_BENCHMARK_FILES=1` option, an exception is raised if the directory where the benchmark file will be generated does not exist.
    For more information, please refer to [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266).
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    After building Spark, manually tested with the following command:
    ```
    SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
        org.apache.spark.benchmark.Benchmarks --jars \
        "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
        "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
        "org.apache.spark.ml.linalg.BLASBenchmark"
    ```
    It successfully generated the benchmark result files.
    
    **Why it is sufficient:**
    As illustrated in the comments in `Benchmarks.scala`, the command below runs all benchmarks and generates the results:
    ```
    SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
        org.apache.spark.benchmark.Benchmarks --jars \
        "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
        "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
        "*"
    ```
    Of all the benchmarks (55 in total), only `BLASBenchmark` fails on the current master branch due to this issue. Thus, testing `BLASBenchmark` is currently sufficient to validate this change.
    
    Closes #32394 from byungsoo-oh/SPARK-35266.
    
    Authored-by: byungsoo <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    byungsoo authored and HyukjinKwon committed May 3, 2021
    Commit: be6ecb6
  6. [MINOR][SS][DOCS] Fix a typo in the documentation of GroupState

    ### What changes were proposed in this pull request?
    
    Fixing some typos in the documenting comments.
    
    ### Why are the changes needed?
    
    To make reading the docs more pleasant.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, since the user sees the docs.
    
    ### How was this patch tested?
    
    It was not tested, because no code was changed.
    
    Closes #32400 from Dobiasd/patch-1.
    
    Authored-by: Tobias Hermann <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    Dobiasd authored and HyukjinKwon committed May 3, 2021
    Commit: 54e0aa1
  7. [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation
    
    ### What changes were proposed in this pull request?
    
    This is rather a followup of #30518 that should be ported back to `branch-3.1` too.
    `STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.
    
    ### Why are the changes needed?
    
    To correctly document.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it fixes the user-facing documentation.
    
    ### How was this patch tested?
    
    I checked them by running the linters.
    
    Closes #32423 from HyukjinKwon/SPARK-35250.
    
    Authored-by: HyukjinKwon <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    HyukjinKwon committed May 3, 2021
    Commit: 8aaa9e8

Commits on May 4, 2021

  1. [SPARK-35292][PYTHON] Delete redundant parameter in mypy configuration

    ### What changes were proposed in this pull request?
    
    The parameter **no_implicit_optional** is defined twice in the mypy configuration: [line 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and line 105.
    
    ### Why are the changes needed?
    
    We would like to keep the mypy configuration clean.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    This patch can be tested with `dev/lint-python`
    
    Closes #32418 from garawalid/feature/clean-mypy-config.
    
    Authored-by: garawalid <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    garawalid authored and HyukjinKwon committed May 4, 2021
    Configuration menu
    Copy the full SHA
    176218b View commit details
    Browse the repository at this point in the history
  2. [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark

    ### What changes were proposed in this pull request?
    
    Port Koalas dependencies appropriately to PySpark dependencies.
    
    ### Why are the changes needed?
    
    pandas-on-Spark has its own required and optional dependencies.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual test.
    
    Closes #32386 from xinrong-databricks/portDeps.
    
    Authored-by: Xinrong Meng <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    xinrong-meng authored and HyukjinKwon committed May 4, 2021
    Commit: 120c389
  3. [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst

    ### What changes were proposed in this pull request?
    
    Use full names of modules in `install.rst` when specifying dependencies.
    
    ### Why are the changes needed?
    
    Using full names makes the dependencies clearer.
    In addition, it helps `pandas APIs on Spark` become recognized as a new module by more people.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual verification.
    
    Closes #32427 from xinrong-databricks/nameDoc.
    
    Authored-by: Xinrong Meng <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    xinrong-meng authored and HyukjinKwon committed May 4, 2021
    Commit: 5ecb112
  4. [SPARK-35302][INFRA] Benchmark workflow should create new files for new benchmarks
    
    ### What changes were proposed in this pull request?
    
    Currently, it fails at `git diff --name-only` when new benchmarks are added, see https://github.com/HyukjinKwon/spark/actions/runs/808870999
    
    We should include untracked files (new benchmark result files) in the upload so developers can download the results.
    
    ### Why are the changes needed?
    
    So the new benchmark results can be added and uploaded.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only
    
    ### How was this patch tested?
    
    Tested at:
    
    https://github.com/HyukjinKwon/spark/actions/runs/808867285
    
    Closes #32428 from HyukjinKwon/include-new-benchmarks.
    
    Authored-by: HyukjinKwon <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    HyukjinKwon committed May 4, 2021
    Configuration menu
    Copy the full SHA
    a2927cb View commit details
    Browse the repository at this point in the history
  5. [SPARK-35308][TESTS] Fix bug in SPARK-35266 that creates benchmark files in invalid path with wrong name
    
    ### What changes were proposed in this pull request?
    This PR fixes a bug in [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266) that creates benchmark files in the invalid path with the wrong name.
    e.g. For `BLASBenchmark`,
    - AS-IS: Creates `benchmarksBLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/`
    - TO-BE: Creates `BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`
    
    ### Why are the changes needed?
    As you can see in the above example, new benchmark files cannot be created as intended due to this bug.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    After building Spark, manually tested with the following command:
    ```
    SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
        org.apache.spark.benchmark.Benchmarks --jars \
        "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
        "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
        "org.apache.spark.ml.linalg.BLASBenchmark"
    ```
    It successfully generated the benchmark files as intended (`BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`).
    
    Closes #32432 from byungsoo-oh/SPARK-35308.
    
    Lead-authored-by: byungsoo <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    byungsoo and HyukjinKwon committed May 4, 2021
    Configuration menu
    Copy the full SHA
    9b387a1 View commit details
    Browse the repository at this point in the history
  6. [SPARK-35294][SQL] Add tree traversal pruning in rules with dedicated files under optimizer
    
    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - CREATE_NAMED_STRUCT
    - EXTRACT_VALUE
    - JSON_TO_STRUCT
    - OUTER_REFERENCE
    - AGGREGATE
    - LOCAL_RELATION
    - EXCEPT
    - LIMIT
    - WINDOW
    
    Used them in the following rules:
    - DecorrelateInnerQuery
    - LimitPushDownThroughWindow
    - OptimizeCsvJsonExprs
    - PropagateEmptyRelation
    - PullOutGroupingExpressions
    - PushLeftSemiLeftAntiThroughJoin
    - ReplaceExceptWithFilter
    - RewriteDistinctAggregates
    - SimplifyConditionalsInPredicate
    - UnwrapCastInBinaryComparison
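
    A rough Scala sketch of the pruning idiom these rules adopt (simplified; compiles only against Spark's internal catalyst module):
    
    ```
    import org.apache.spark.sql.catalyst.plans.logical.{LocalLimit, LogicalPlan}
    import org.apache.spark.sql.catalyst.trees.TreePattern.LIMIT
    
    // Subtrees whose pattern bits show no LIMIT node are skipped entirely,
    // instead of being visited node by node.
    def rewrite(plan: LogicalPlan): LogicalPlan =
      plan.transformWithPruning(_.containsPattern(LIMIT)) {
        case l: LocalLimit => l // a real rule would rewrite the limit here
      }
    ```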
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32421 from sigmod/opt.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed May 4, 2021
    Configuration menu
    Copy the full SHA
    7fd3f8f View commit details
    Browse the repository at this point in the history

Commits on May 5, 2021

  1. [SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions
    
    ### What changes were proposed in this pull request?
    
    To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.
    
    This is the rework of #31887. Closes #31887.
    
    ### Why are the changes needed?
    
    This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable).
    
    For this query:
    ```
    val df = Seq(
        (Seq(1,2,3), Seq("a", "b", "c"))
    ).toDF("numbers", "letters")
    
    df.select(
        f.flatten(
            f.transform(
                $"numbers",
                (number: Column) => { f.transform(
                    $"letters",
                    (letter: Column) => { f.struct(
                        number.as("number"),
                        letter.as("letter")
                    ) }
                ) }
            )
        ).as("zipped")
    ).show(10, false)
    ```
    This is the current (incorrect) output:
    ```
    +------------------------------------------------------------------------+
    |zipped                                                                  |
    +------------------------------------------------------------------------+
    |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
    +------------------------------------------------------------------------+
    ```
    And this is the correct output after fix:
    ```
    +------------------------------------------------------------------------+
    |zipped                                                                  |
    +------------------------------------------------------------------------+
    |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
    +------------------------------------------------------------------------+
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added the new test in `DataFrameFunctionsSuite`.
    
    Closes #32424 from maropu/pr31887.
    
    Lead-authored-by: dsolow <[email protected]>
    Co-authored-by: Takeshi Yamamuro <[email protected]>
    Co-authored-by: dmsolow <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    3 people committed May 5, 2021
    Commit: f550e03
  2. [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay
    
    ### What changes were proposed in this pull request?
    This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case in which the Kafka micro-batch stream reports how many offsets the current offset falls behind the latest.
    
    A public interface is added.
    
    `metrics`: returns the metrics reported by the streaming source for the given offset.
    
    ### Why are the changes needed?
    The new API can expose any custom metrics for the "current" offset of a streaming source. Different from #31398, this PR makes the metrics available to users through the progress report, not through the Spark UI. A use case is that people want to know how far the current offset falls behind the latest offset.
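
    A hedged Scala sketch of reading the new metrics from the progress report (the `metrics` field comes from this PR; the query setup is illustrative):
    
    ```
    val query = df.writeStream.format("console").start()
    
    // Each source's progress now carries a source-reported metrics map,
    // e.g. how far the Kafka source's current offsets lag the latest ones.
    query.recentProgress.foreach { p =>
      p.sources.foreach(s => println(s.metrics))
    }
    ```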
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Unit tests for the Kafka micro-batch source v2 are added to test the Kafka use case.
    
    Closes #31944 from yijiacui-db/SPARK-34297.
    
    Authored-by: Yijia Cui <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    yijiacui-db authored and HeartSaVioR committed May 5, 2021
    Configuration menu
    Copy the full SHA
    bbdbe0f View commit details
    Browse the repository at this point in the history
  3. [SPARK-35315][TESTS] Keep benchmark result consistent between spark-submit and SBT
    
    ### What changes were proposed in this pull request?
    
    Set `IS_TESTING` to true in `BenchmarkBase`, before running benchmarks.
    
    ### Why are the changes needed?
    
    Currently benchmarks can be run in two ways: via `spark-submit` or via an SBT command. However, in the former Spark misses some properties such as `IS_TESTING`, which is necessary to turn on/off certain behaviors like codegen (`spark.sql.codegen.factoryMode`). Therefore, the results could differ between the two. In addition, the benchmark GitHub workflow uses the spark-submit approach.
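
    A sketch of the mechanism (assuming `IS_TESTING` corresponds to the `spark.testing` system property checked by Spark's `Utils.isTesting`):
    
    ```
    // Set before any benchmark runs so a spark-submit JVM behaves like the
    // SBT test JVM, which already sets the testing flag.
    System.setProperty("spark.testing", "true")
    ```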
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32440 from sunchao/SPARK-35315.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Yuming Wang <[email protected]>
    sunchao authored and wangyum committed May 5, 2021
    Configuration menu
    Copy the full SHA
    4fe4b65 View commit details
    Browse the repository at this point in the history

Commits on May 6, 2021

  1. [SPARK-35155][SQL] Add rule id pruning to Analyzer rules

    ### What changes were proposed in this pull request?
    
    Added rule id based pruning to Analyzer rules in fixed point batches:
    
    - org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions
    - org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveBinaryArithmetic
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveEncodersInUDF
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveInsertInto
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast
    - org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUserSpecifiedColumns
    - org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution
    - org.apache.spark.sql.catalyst.analysis.DeduplicateRelations
    - org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
    - org.apache.spark.sql.catalyst.analysis.EliminateUnions
    - org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct
    - org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints
    - org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveJoinStrategyHints
    - org.apache.spark.sql.catalyst.analysis.ResolveInlineTables
    - org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables
    - org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
    - org.apache.spark.sql.catalyst.analysis.ResolveUnion
    - org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals
    - org.apache.spark.sql.catalyst.analysis.TimeWindowing
    
    Subsequent PRs will add tree-bits-based pruning to those rules. This splits a big PR to reduce review load.
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32425 from sigmod/analyzer.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed May 6, 2021
    Configuration menu
    Copy the full SHA
    7970318 View commit details
    Browse the repository at this point in the history
  2. [SPARK-35323][BUILD] Remove unused libraries from LICENSE-binary

    ### What changes were proposed in this pull request?
    
    This PR removes unused libraries from `LICENSE-binary` file.
    
    ### Why are the changes needed?
    
    SPARK-33212 removed many `Hadoop 3`-only transitive libraries like `dnsjava-2.1.7.jar`. We can simplify the Apache Spark LICENSE file by removing them.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but this is only a LICENSE file change.
    
    ### How was this patch tested?
    
    Manual.
    
    Closes #32445 from dongjoon-hyun/SPARK-35323.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 6, 2021
    Configuration menu
    Copy the full SHA
    0126924 View commit details
    Browse the repository at this point in the history
  3. [SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade K8s client to 5.3.1.
    
    ### Why are the changes needed?
    
    This will bring the latest bug fixes.
    - https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    The K8s integration tests were run manually as follows.
    
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    - Launcher client dependencies
    - SPARK-33615: Launcher client archives
    - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
    - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
    - Launcher python client dependencies using a zip file
    - Test basic decommissioning
    - Test basic decommissioning with shuffle cleanup
    - Test decommissioning with dynamic allocation & shuffle cleanups
    - Test decommissioning timeouts
    - Run SparkR on simple dataframe.R example
    Run completed in 18 minutes, 33 seconds.
    Total number of tests run: 27
    Suites: completed 2, aborted 0
    Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
    [INFO]
    [INFO] Spark Project Parent POM ........................... SUCCESS [  3.959 s]
    [INFO] Spark Project Tags ................................. SUCCESS [  7.830 s]
    [INFO] Spark Project Local DB ............................. SUCCESS [  3.457 s]
    [INFO] Spark Project Networking ........................... SUCCESS [  5.496 s]
    [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.239 s]
    [INFO] Spark Project Unsafe ............................... SUCCESS [  9.006 s]
    [INFO] Spark Project Launcher ............................. SUCCESS [  2.422 s]
    [INFO] Spark Project Core ................................. SUCCESS [02:17 min]
    [INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  23:59 min
    [INFO] Finished at: 2021-05-05T11:59:19-07:00
    [INFO] ------------------------------------------------------------------------
    ```
    
    Closes #32443 from dongjoon-hyun/SPARK-35319.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 6, 2021
    Commit a0c76a8
  4. [SPARK-35325][SQL][TESTS] Add nested column ORC encryption test case

    ### What changes were proposed in this pull request?
    
    This PR aims to enrich ORC encryption test coverage for nested columns.
    
    ### Why are the changes needed?
    
    This provides test coverage for the feature.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test case.
    
    Closes #32449 from dongjoon-hyun/SPARK-35325.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 6, 2021
    Commit 19661f6
  5. [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite

    ### What changes were proposed in this pull request?
    
    This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the golden files in `tpcds-query-results`.
    
    ### Why are the changes needed?
    
    For better testing.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA passed.
    
    Closes #32420 from maropu/UseTpcdsKit.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    maropu committed May 6, 2021
    Commit 5c67d0c
  6. [SPARK-35318][SQL] Hide internal view properties for describe table cmd

    ### What changes were proposed in this pull request?
    Hide internal view properties from the DESCRIBE TABLE command, because those
    properties are generated by Spark and should be invisible to the end user.
    
    ### Why are the changes needed?
    Avoid confusing users with internal properties.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    Before this change, the user sees the output below for `describe formatted test_view`:
    ```
    ....
    Table Properties       [view.catalogAndNamespace.numParts=2, view.catalogAndNamespace.part.0=spark_catalog, view.catalogAndNamespace.part.1=default, view.query.out.col.0=c, view.query.out.col.1=v, view.query.out.numCols=2, view.referredTempFunctionsNames=[], view.referredTempViewNames=[]]
    ...
    ```
    After this change, the internal properties are hidden in the output of `describe formatted test_view`:
    ```
    ...
    Table Properties        []
    ...
    ```
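
    As a rough illustration of the filtering idea (a hypothetical sketch, not Spark's actual implementation; it assumes the internal keys share the `view.` prefix seen above):
    ```scala
    object HideViewProps {
      // Assumption: internal view properties share the "view." key prefix,
      // as in the output above.
      def userVisible(props: Map[String, String]): Map[String, String] =
        props.filterNot { case (key, _) => key.startsWith("view.") }

      def main(args: Array[String]): Unit = {
        val props = Map("view.query.out.numCols" -> "2", "owner" -> "alice")
        println(userVisible(props)) // Map(owner -> alice)
      }
    }
    ```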
    
    ### How was this patch tested?
    existing UT
    
    Closes #32441 from linhongliu-db/hide-properties.
    
    Authored-by: Linhong Liu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    linhongliu-db authored and cloud-fan committed May 6, 2021
    Commit 3f5a209
  7. [SPARK-35240][SS] Use CheckpointFileManager for checkpoint file manipulation
    
    ### What changes were proposed in this pull request?
    
    This patch changes a few places using `FileSystem` API to manipulate checkpoint file to `CheckpointFileManager`.
    
    ### Why are the changes needed?
    
    `CheckpointFileManager` is designed to handle checkpoint file manipulation. However, a few places expose `FileSystem` from checkpoint files/paths. We should use `CheckpointFileManager` to manipulate checkpoint files. For example, we may want to use a dedicated storage system for checkpoint files. If all checkpoint file manipulation goes through `CheckpointFileManager`, we only need to implement `CheckpointFileManager` for that storage system, not the full `FileSystem` API.
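
    A minimal usage sketch of the intended pattern, assuming the internal `CheckpointFileManager.create` API is reachable from the calling code (the path below is illustrative):
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.execution.streaming.CheckpointFileManager

    // Resolve a manager for the checkpoint path instead of using FileSystem directly.
    val checkpointPath = new Path("/tmp/checkpoints/query1")
    val fm = CheckpointFileManager.create(checkpointPath, new Configuration())
    if (!fm.exists(checkpointPath)) {
      fm.mkdirs(checkpointPath) // all manipulation goes through the manager
    }
    ```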
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing unit tests.
    
    Closes #32361 from viirya/checkpoint-manager.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed May 6, 2021
    Commit c6d3f37
  8. [SPARK-35215][SQL] Update custom metric per certain rows and at the end of the task
    
    ### What changes were proposed in this pull request?
    
    This patch changes custom metric updating to happen once every certain number of rows (currently 100) instead of on every row.
    
    ### Why are the changes needed?
    
    Based on the previous discussion in #31451 (comment), we should only update custom metrics every certain number of rows (e.g. 100) and also at the end of the task. Updating on every row brings little benefit.
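
    A minimal sketch of the batching idea (the class and method names are illustrative, not Spark's internals):
    ```scala
    // Updates an underlying metric every `batchSize` rows and once at task end.
    class BatchedMetricUpdater(update: Long => Unit, batchSize: Int = 100) {
      private var rows = 0L
      def onRow(): Unit = {
        rows += 1
        if (rows % batchSize == 0) update(rows) // per-batch update
      }
      def onTaskEnd(): Unit = update(rows)      // final update at task end
    }
    ```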
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing unit test.
    
    Closes #32330 from viirya/metric-update.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    viirya authored and cloud-fan committed May 6, 2021
    Commit 6cd5cf5
  9. [SPARK-34526][SS] Ignore the error when checking the path in FileStreamSink.hasMetadata
    
    ### What changes were proposed in this pull request?
    When checking the path in `FileStreamSink.hasMetadata`, we should ignore the error and assume the user wants to read a batch output.
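
    A minimal sketch of the ignore-and-fall-back behavior (the helper name is illustrative, not the actual `FileStreamSink` code):
    ```scala
    import scala.util.control.NonFatal

    // If the metadata check itself fails (e.g. the path is inaccessible),
    // treat the path as regular batch output instead of failing the read.
    def hasMetadataSafely(check: => Boolean): Boolean =
      try check catch { case NonFatal(_) => false }
    ```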
    
    ### Why are the changes needed?
    Keep the original behavior of ignoring the error.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    Path checking no longer throws an exception when checking the file sink format.
    
    ### How was this patch tested?
    New UT added.
    
    Closes #31638 from xuanyuanking/SPARK-34526.
    
    Authored-by: Yuanjian Li <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    xuanyuanking authored and HeartSaVioR committed May 6, 2021
    Commit dfb3343
  10. [SPARK-35326][BUILD] Upgrade Jersey to 2.34

    ### What changes were proposed in this pull request?
    
    This PR upgrades Jersey to 2.34.
    
    ### Why are the changes needed?
    
    CVE-2021-28168, a local information disclosure vulnerability, is reported (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168).
    Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
    
    ### Does this PR introduce _any_ user-facing change?
    
    It's not clear how large the impact is, but Spark uses an affected version of Jersey, so it's better to upgrade just in case.
    
    ### How was this patch tested?
    
    CI.
    
    Closes #32453 from sarutak/upgrade-jersey.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sarutak authored and dongjoon-hyun committed May 6, 2021
    Commit bb93547
  11. [SPARK-35326][BUILD][FOLLOWUP] Update dependency manifest files

    ### What changes were proposed in this pull request?
    
    This is a followup of #32453.
    
    ### Why are the changes needed?
    
    Jenkins doesn't check dependency manifest files.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action or manually.
    
    Closes #32458 from dongjoon-hyun/SPARK-35326.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 6, 2021
    Commit 482b43d
  12. [SPARK-35293][SQL][TESTS][FOLLOWUP] Update the hash key to refresh TPC-DS cache data in forked GA jobs
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up PR of #32420; it updates the hash key to refresh the TPC-DS cache data in forked GA jobs.
    
    ### Why are the changes needed?
    
    To recover GA jobs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA passed.
    
    Closes #32460 from maropu/SPARK-35293-FOLLOWUP.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    maropu authored and dongjoon-hyun committed May 6, 2021
    Commit e834ef7

Commits on May 7, 2021

  1. [SPARK-35306][MLLIB][TESTS] Add benchmark results for BLASBenchmark created by GitHub Actions machines
    
    ### What changes were proposed in this pull request?
    This PR adds benchmark results for `BLASBenchmark` created by GitHub Actions machines.
    Benchmark result files are added for both JDK 8 (`BLASBenchmark-result.txt`) and 11 (`BLASBenchmark-jdk11-result.txt`) in `{SPARK_HOME}/mllib-local/benchmarks/`.
    
    ### Why are the changes needed?
    In [SPARK-34950](https://issues.apache.org/jira/browse/SPARK-34950), benchmark results were updated to the ones created by GitHub Actions machines.
    As benchmark results for `BLASBenchmark` (added in [SPARK-33882](https://issues.apache.org/jira/browse/SPARK-33882) and [SPARK-35150](https://issues.apache.org/jira/browse/SPARK-35150)) are not currently available in the repository, this PR adds them.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    The benchmark results were obtained by running tests with GitHub Actions workflow in my forked repository.
    You can refer to the test results and output files from the link below.
    - https://github.com/byungsoo-oh/spark/actions/runs/809900377
    - https://github.com/byungsoo-oh/spark/actions/runs/810084610
    
    Closes #32435 from byungsoo-oh/SPARK-35306.
    
    Authored-by: byungsoo <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    byungsoo authored and HyukjinKwon committed May 7, 2021
    Commit 94bbca3
  2. [SPARK-35133][SQL] Explain codegen works with AQE

    ### What changes were proposed in this pull request?
    
    `EXPLAIN CODEGEN <query>` (and `Dataset.explain("codegen")`) prints out the generated code for each stage of the plan. The current implementation matches the `WholeStageCodegenExec` operator in the query plan and prints out its generated code (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118). This does not work with AQE, as we wrap the whole query plan inside `AdaptiveSparkPlanExec` and do not run the whole-stage code-gen physical plan rule (`CollapseCodegenStages`) eagerly. This introduces an unexpected behavior change for the EXPLAIN query (and `Dataset.explain`), as AQE is now enabled by default.
    
    The change is to explain the code-gen for the currently executed plan under AQE.
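
    For reference, both entry points below should print the generated code under AQE (assuming a running `spark` session):
    ```scala
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    // SQL entry point.
    spark.sql("EXPLAIN CODEGEN SELECT id * 2 FROM range(10)").show(truncate = false)

    // Dataset entry point.
    spark.range(10).selectExpr("id * 2").explain("codegen")
    ```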
    
    ### Why are the changes needed?
    
    Make `EXPLAIN CODEGEN` work the same as before.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No (when comparing with latest Spark release 3.1.1).
    
    ### How was this patch tested?
    
    Added unit test in `ExplainSuite.scala`.
    
    Closes #32430 from c21/explain-aqe.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    c21 authored and dongjoon-hyun committed May 7, 2021
    Commit 42f59ca
  3. [SPARK-34701][SQL][FOLLOW-UP] Children/innerChildren should be mutually exclusive for AnalysisOnlyCommand
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up to #32032 (comment). Basically, `children`/`innerChildren` should be mutually exclusive for `AlterViewAsCommand` and `CreateViewCommand`, which extend `AnalysisOnlyCommand`. Otherwise, there could be an issue with the `EXPLAIN` command. Currently, this is not an issue, because these commands will already be analyzed (children will always be empty) when the `EXPLAIN` command is run.
    
    ### Why are the changes needed?
    
    To be future-proof in case these commands are used directly.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added new tests.
    
    Closes #32447 from imback82/SPARK-34701-followup.
    
    Authored-by: Terry Kim <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    imback82 authored and cloud-fan committed May 7, 2021
    Commit 33c1034
  4. [SPARK-26164][SQL][FOLLOWUP] WriteTaskStatsTracker should know which file the row is written to
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up of #32198
    
    Before #32198, in `WriteTaskStatsTracker.newRow`, we know that the row is written to the current file. After #32198 , we no longer know this connection.
    
    This PR adds the file path parameter in `WriteTaskStatsTracker.newRow` to bring back the connection.
    
    ### Why are the changes needed?
    
    To not break some custom `WriteTaskStatsTracker` implementations.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32459 from cloud-fan/minor.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan committed May 7, 2021
    Commit e83910f
  5. [SPARK-35020][SQL] Group exception messages in catalyst/util

    ### What changes were proposed in this pull request?
    This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util`.
    
    ### Why are the changes needed?
    It will largely help with the standardization of error messages and their maintenance.
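
    A hypothetical sketch of the grouping pattern (object and method names are illustrative): error construction moves into one central object so each message is defined exactly once:
    ```scala
    object CatalystUtilErrorsSketch {
      def invalidPatternError(pattern: String): Throwable =
        new IllegalArgumentException(s"Illegal pattern: $pattern")
    }

    // A call site throws the centrally defined error instead of building
    // the message inline:
    // throw CatalystUtilErrorsSketch.invalidPatternError("abc")
    ```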
    
    ### Does this PR introduce _any_ user-facing change?
    No. Error messages remain unchanged.
    
    ### How was this patch tested?
    No new tests; all original tests pass, confirming that no existing behavior is broken.
    
    Closes #32367 from beliefer/SPARK-35020.
    
    Lead-authored-by: gengjiaan <[email protected]>
    Co-authored-by: beliefer <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and cloud-fan committed May 7, 2021
    Commit cf2c4ba
  6. [SPARK-35333][SQL] Skip object null check in Invoke if possible

    ### What changes were proposed in this pull request?
    
    If `targetObject` is not nullable, we don't need the object null check in `Invoke`.
    
    ### Why are the changes needed?
    
    small perf improvement
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    existing tests
    
    Closes #32466 from cloud-fan/invoke.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan committed May 7, 2021
    Commit 9aa18df
  7. [SPARK-35144][SQL] Migrate to transformWithPruning for object rules

    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - APPEND_COLUMNS
    - DESERIALIZE_TO_OBJECT
    - LAMBDA_VARIABLE
    - MAP_OBJECTS
    - SERIALIZE_FROM_OBJECT
    - PROJECT
    - TYPED_FILTER
    
    Added tree traversal pruning to the following rules dealing with objects:
    - EliminateSerialization
    - CombineTypedFilters
    - EliminateMapObjects
    - ObjectSerializerPruning
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
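
    As a sketch, a pruned rule only descends into subtrees whose pattern bits contain the relevant enum (simplified; a real rule rewrites the matched node):
    ```scala
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.catalyst.trees.TreePattern.PROJECT

    def applyRule(plan: LogicalPlan): LogicalPlan =
      plan.transformWithPruning(_.containsPattern(PROJECT)) {
        // Only reached for subtrees that contain a Project node.
        case p: Project => p
      }
    ```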
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32451 from sigmod/object.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed May 7, 2021
    Commit 72d3266
  8. [SPARK-35021][SQL] Group exception messages in connector/catalog

    ### What changes were proposed in this pull request?
    This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog`.
    
    ### Why are the changes needed?
    It will largely help with the standardization of error messages and their maintenance.
    
    ### Does this PR introduce _any_ user-facing change?
    No. Error messages remain unchanged.
    
    ### How was this patch tested?
    No new tests; all original tests pass, confirming that no existing behavior is broken.
    
    Closes #32377 from beliefer/SPARK-35021.
    
    Lead-authored-by: beliefer <[email protected]>
    Co-authored-by: gengjiaan <[email protected]>
    Co-authored-by: Jiaan Geng <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and cloud-fan committed May 7, 2021
    Commit d3b92ee
  9. [SPARK-35175][BUILD] Add linter for JavaScript source files

    ### What changes were proposed in this pull request?
    
    This PR proposes to add linter for JavaScript source files.
    [ESLint](https://eslint.org/) seems to be a popular linter for JavaScript so I choose it.
    
    ### Why are the changes needed?
    
    A linter enables us to check style and keep the code clean.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually run `dev/lint-js` (Node.js and npm are required).
    
    In this PR, the indentation style is also fixed so that the linter passes.
    
    Closes #32274 from sarutak/introduce-eslint.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    sarutak committed May 7, 2021
    Commit 2634dba
  10. [SPARK-35297][CORE][DOC][MINOR] Modify the comment about the executor

    ### What changes were proposed in this pull request?
    Spark executors can now be used with the Kubernetes scheduler, so we should update the comment in Executor.scala.
    
    ### Why are the changes needed?
    This is a comment-only change.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    no
    
    Closes #32426 from jerqi/master.
    
    Authored-by: RoryQi <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    RoryQi authored and maropu committed May 7, 2021
    Commit 6f0ef93
  11. [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to make `StaticInvoke` able to find a method with the given name even when the parameter types do not exactly match the argument classes.
    
    ### Why are the changes needed?
    
    Unlike `Invoke`, `StaticInvoke` only tries to get the method whose parameter types exactly match the argument classes. If they do not match exactly, `StaticInvoke` cannot find the method.
    
    `StaticInvoke` should be able to find the method in these cases too.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. `StaticInvoke` can find a method even when the argument classes do not match exactly.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes #32413 from viirya/static-invoke.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed May 7, 2021
    Commit 33fbf56
  12. [SPARK-35321][SQL] Don't register Hive permanent functions when creating Hive client
    
    ### What changes were proposed in this pull request?
    
    Instantiate a new Hive client through `Hive.getWithFastCheck(conf, false)` instead of `Hive.get(conf)`.
    
    ### Why are the changes needed?
    
    [HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As a result, when Spark 3.x talks to an HMS service of version 1.2 or lower, the following error will occur:
    ```
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
            at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897)
            at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248)
            at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
            ... 96 more
    Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions'
            at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
            at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845)
            at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833)
    ```
    
    The `get_all_functions` is called only when `doRegisterAllFns` is set to true:
    ```java
      private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException {
        conf = c;
        if (doRegisterAllFns) {
          registerAllFunctionsOnce();
        }
      }
    ```
    
    What this does is register all Hive permanent functions defined in the HMS in Hive's `FunctionRegistry` class, by iterating through the results of `get_all_functions`. For Spark, this seems unnecessary, as it loads Hive permanent (not built-in) UDFs by directly calling the HMS API `get_function`. The `FunctionRegistry` is only used for loading Hive built-in functions that Spark does not support; at this time, that only applies to `histogram_numeric`.
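
    A minimal sketch of the instantiation change in Scala, using the two Hive calls named above:
    ```scala
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.ql.metadata.Hive

    val conf = new HiveConf()
    // doRegisterAllFns = false: skip eager registration of all HMS
    // permanent functions (avoids the get_all_functions call).
    val client: Hive = Hive.getWithFastCheck(conf, false)
    // Instead of: Hive.get(conf)
    ```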
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. With this fix, Spark should now be able to talk to an HMS server running Hive 1.2.x or lower (together with HIVE-24608).
    
    ### How was this patch tested?
    
    Manually started an HMS server of Hive version 1.2.2, with Hive 2.3.8 patched using HIVE-24608. Without the PR it failed with the above exception; with the PR the error disappeared and I could successfully perform common operations such as creating tables, creating databases, and listing tables.
    
    Closes #32446 from sunchao/SPARK-35321.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sunchao authored and dongjoon-hyun committed May 7, 2021
    Commit b4ec9e2

Commits on May 8, 2021

  1. [SPARK-35261][SQL] Support static magic method for stateless Java ScalarFunction
    
    ### What changes were proposed in this pull request?
    
    This allows a `ScalarFunction` implemented in Java to optionally declare the magic method `invoke` as static, which can be used if the UDF is stateless. Compared to the non-static method, it can potentially give better performance due to the elimination of dynamic dispatch, etc.
    
    Also added a benchmark to measure performance of: the default `produceResult`, non-static magic method and static magic method.
    
    ### Why are the changes needed?
    
    For UDFs that are stateless (e.g., no need to maintain intermediate state between function calls), it's better to allow users to implement the UDF as a static method, which could potentially give better performance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Spark users now have the choice to define a static magic method for `ScalarFunction` when it is written in Java and the UDF is stateless.
    
    ### How was this patch tested?
    
    Added new UT.
    
    Closes #32407 from sunchao/SPARK-35261.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sunchao authored and dongjoon-hyun committed May 8, 2021
    Commit f47e0f8
  2. [SPARK-35232][SQL] Nested column pruning should retain column metadata

    ### What changes were proposed in this pull request?
    
    Retain column metadata during the process of nested column pruning, when constructing `StructField`.
    
    To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail.
    
    ### Why are the changes needed?
    
    The column metadata is used in a few places, such as re-constructing CHAR/VARCHAR information in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning.
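
    A minimal sketch of the retention idea; the metadata key below is an assumption for illustration:
    ```scala
    import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

    val original = StructField("name", StringType, nullable = true,
      new MetadataBuilder()
        .putString("__CHAR_VARCHAR_TYPE_STRING", "char(10)") // assumed key name
        .build())

    // The bug: reconstructing the field without the fourth argument drops metadata.
    val dropped = StructField(original.name, original.dataType, original.nullable)

    // The fix's idea: pass the original metadata through when rebuilding.
    val retained = StructField(original.name, original.dataType, original.nullable,
      original.metadata)
    ```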
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32354 from sunchao/SPARK-35232.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    sunchao authored and viirya committed May 8, 2021
    Commit 323a6e8
  3. [SPARK-35331][SQL] Support resolving missing attrs for distribute/cluster by/repartition hint
    
    ### What changes were proposed in this pull request?
    
    This PR makes the case below work.
    
    ```sql
    select a b from values(1) t(a) distribute by a;
    ```
    
    ```
    == Parsed Logical Plan ==
    'RepartitionByExpression ['a]
    +- 'Project ['a AS b#42]
       +- 'SubqueryAlias t
          +- 'UnresolvedInlineTable [a], [List(1)]
    
    == Analyzed Logical Plan ==
    org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62;
    'RepartitionByExpression ['a]
    +- Project [a#48 AS b#42]
       +- SubqueryAlias t
          +- LocalRelation [a#48]
    ```
    ### Why are the changes needed?
    
    bugfix
    
    ### Does this PR introduce _any_ user-facing change?
    
    yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */`
    
    ### How was this patch tested?
    
    new tests
    
    Closes #32465 from yaooqinn/SPARK-35331.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and dongjoon-hyun committed May 8, 2021
    Commit b025780
  4. [SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to filter out TPCDS v1.4 q6 and q75 in `TPCDSQueryTestSuite`.
    
    I saw `TPCDSQueryTestSuite` fail nondeterministically because output row orders differed from those in the golden files. For example, the failure in the GA job, https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:
    
    https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
    Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the only difference is that `tpcds-v2.7.0/q6.sql` sorts both `cnt` and `a.ca_state`:
    https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
    So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has the same issue).
    
    ### Why are the changes needed?
    
    For stable testing.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GA passed.
    
    Closes #32454 from maropu/CleanUpTpcdsQueries.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    maropu committed May 8, 2021
    Commit 06c4009
  5. Revert "[SPARK-35321][SQL] Don't register Hive permanent functions wh…

    …en creating Hive client"
    
    This reverts commit b4ec9e2.
    dongjoon-hyun committed May 8, 2021
    Commit e31bef1
  6. [SPARK-35347][SQL] Use MethodUtils for looking up methods in Invoke and StaticInvoke
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to use `MethodUtils` for looking up methods in the `Invoke` and `StaticInvoke` expressions.
    
    ### Why are the changes needed?
    
    Currently we hand-roll the method-lookup logic in the `Invoke` and `StaticInvoke` expressions. It is tricky to cover all the cases there, and an existing utility package already serves this purpose, so we should reuse it.
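
    For example, Commons Lang3's `MethodUtils` already handles boxing and widening during lookup:
    ```scala
    import org.apache.commons.lang3.reflect.MethodUtils

    // Finds java.lang.Math.max(int, int) even though we pass the boxed
    // argument class Integer rather than the exact primitive int.
    val m = MethodUtils.getMatchingAccessibleMethod(
      classOf[java.lang.Math], "max",
      classOf[java.lang.Integer], classOf[java.lang.Integer])
    println(m) // public static int java.lang.Math.max(int,int)
    ```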
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, internal change only.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32474 from viirya/invoke-util.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    viirya authored and dongjoon-hyun committed May 8, 2021
    Commit 5b65d8a

Commits on May 9, 2021

  1. [SPARK-35231][SQL] logical.Range override maxRowsPerPartition

    ### What changes were proposed in this pull request?
    When `numSlices` is available, `logical.Range` should compute an exact `maxRowsPerPartition`.
    
    ### Why are the changes needed?
    `maxRowsPerPartition` is used in the optimizer, so we should provide an exact value when possible.
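
    Illustrative arithmetic only (not the actual override): with `numSlices` known, an exact bound is the ceiling of `numElements / numSlices`:
    ```scala
    def maxRowsPerPartition(numElements: BigInt, numSlices: Int): BigInt =
      (numElements + numSlices - 1) / numSlices

    // e.g. 10 rows over 3 slices => at most 4 rows in any partition
    assert(maxRowsPerPartition(BigInt(10), 3) == BigInt(4))
    ```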
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing test suites.
    
    Closes #32350 from zhengruifeng/range_maxRowsPerPartition.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    zhengruifeng authored and maropu committed May 9, 2021
    Commit 620f072

Commits on May 10, 2021

  1. [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin
    
    ### What changes were proposed in this pull request?
    
    As the title says: we should use the more restrictive interface `ShuffledJoin` rather than `BaseJoinExec` in `CoalesceBucketsInJoin`, as the rule only applies to sort merge join and shuffled hash join (i.e. `ShuffledJoin`).
    
    ### Why are the changes needed?
    
    Code cleanup.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit test in `CoalesceBucketsInJoinSuite`.
    
    Closes #32480 from c21/minor-cleanup.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    c21 authored and maropu committed May 10, 2021
    Commit 38eb5a6
  2. [SPARK-35111][SPARK-35112][SQL][FOLLOWUP] Rename ANSI interval patterns and regexps
    
    ### What changes were proposed in this pull request?
    Rename pattern strings and regexps of year-month and day-time intervals.
    
    ### Why are the changes needed?
    To improve code maintainability.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By existing test suites.
    
    Closes #32444 from AngersZhuuuu/SPARK-35111-followup.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed May 10, 2021
    Commit 2c8ced9
  3. [SPARK-35261][SQL][TESTS][FOLLOW-UP] Change failOnError to false for NativeAdd in V2FunctionBenchmark
    
    ### What changes were proposed in this pull request?
    
    Change `failOnError` to false for `NativeAdd` in `V2FunctionBenchmark`.
    
    ### Why are the changes needed?
    
    Since `NativeAdd` simply performs addition on longs, it's better to set `failOnError` to false so that it uses native long addition instead of `Math.addExact`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32481 from sunchao/SPARK-35261-follow-up.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    sunchao authored and cloud-fan committed May 10, 2021
    Commit 245dce1
  4. [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to increase the maximum heap memory setting for release build.
    
    ### Why are the changes needed?
    
    When I was cutting RCs for 2.4.8, I frequently encountered OOMs while building with mvn. They kept happening until I increased the heap memory setting.
    
    I am not sure whether other release managers hit the same issue, so I propose increasing the heap memory setting to see if it looks good to others.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev only.
    
    ### How was this patch tested?
    
    Manually used it during cutting RCs of 2.4.8.
    
    Closes #32487 from viirya/release-mvn-oom.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    viirya authored and dongjoon-hyun committed May 10, 2021
    Commit 20d3224
  5. [MINOR][INFRA] Add python/.idea into git ignore

    ### What changes were proposed in this pull request?
    
    This PR adds `python/.idea` to the Git ignore list. PyCharm is supposed to be opened against the `python` directory, which contains the `pyspark` package as its root package.
    
    This was caused by #32337.
    
    ### Why are the changes needed?
    
    To ignore `.idea` file for PyCharm.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested with the `git` command.
    
    Closes #32490 from HyukjinKwon/minor-python-gitignore.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon committed May 10, 2021
    Commit d808956
  6. [SPARK-35360][SQL] RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too
    
    ### What changes were proposed in this pull request?
    RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too
    
    ### Why are the changes needed?
    Make the partition batch size used by `RepairTableCommand` configurable.
    
    ### Does this PR introduce _any_ user-facing change?
    Users can use `spark.sql.addPartitionInBatch.size` to change the batch size when repairing a table.
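
    A usage sketch (the table name is illustrative; assumes a running `spark` session):
    ```scala
    // Add partitions in smaller batches while repairing a large table.
    spark.conf.set("spark.sql.addPartitionInBatch.size", "50")
    spark.sql("MSCK REPAIR TABLE my_partitioned_table")
    ```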
    
    ### How was this patch tested?
    Not needed.
    
    Closes #32489 from AngersZhuuuu/SPARK-35360.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Max Gekk <[email protected]>
    AngersZhuuuu authored and MaxGekk committed May 10, 2021
    Commit 7182f8c
  7. [SPARK-34246][FOLLOWUP] Change the definition of `findTightestCommonType` for backward compatibility
    
    ### What changes were proposed in this pull request?
    
    Change the definition of `findTightestCommonType` from
    ```
    def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
    ```
    to
    ```
    val findTightestCommonType: (DataType, DataType) => Option[DataType]
    ```
    
    ### Why are the changes needed?
    
    For backward compatibility.
    When running the MongoDB connector (built with Spark 3.1.1) against the latest master, the following error occurs:
    ```
    java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
    ```
    from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150
    
    In the previous release, the function was
    ```
    static public  scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
    ```
    After #31349, the function becomes:
    ```
    static public  scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
    ```
    
    This PR avoids this unnecessary API change.
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with the previous release again.
    
    ### How was this patch tested?
    
    Existing unit tests
    
    Closes #32493 from gengliangwang/typecoercion.
    
    Authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    gengliangwang committed May 10, 2021
    Commit d2a535f
  8. [SPARK-34736][K8S][TESTS] Kubernetes and Minikube version upgrade for integration tests
    
    ### What changes were proposed in this pull request?
    
    This PR upgrades Kubernetes and Minikube version for integration tests and removes/updates the old code for this new version.
    
    Details of this changes:
    
    - As [discussed in the mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html): updating Minikube version from v0.34.1 to v1.7.3 and kubernetes version from v1.15.12 to v1.17.3.
    - checking the Minikube version and failing with an explanation when the tests are started on a version < v1.7.3
    - removing minikube status checking code related to old Minikube versions
    - in the Minikube backend using fabric8's `Config.autoConfigure()` method to configure the kubernetes client to use the `minikube` k8s context (like it was in [one of the Minikube's example](https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/kubectl/equivalents/ConfigUseContext.java#L36))
    - introducing a `persistentVolume` test tag: this is a temporary change to skip PVC tests in the Kubernetes integration tests, as the PVC tests are currently blocking the move to Docker as Minikube's driver (for details please check https://issues.apache.org/jira/browse/SPARK-34738)
    
    ### Why are the changes needed?
    
    With the currently suggested versions, one can run into several problems without noticing that the Minikube/Kubernetes version is the cause.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It was tested on Mac with [this script](https://gist.github.com/attilapiros/cd58a16bdde833c80c5803c337fffa94#file-check_minikube_versions-zsh), which installs each Minikube version from v1.7.2 onward (including that version to test the negative case of the version check) and runs the integration tests.
    
    It was started with:
    ```
    ./check_minikube_versions.zsh > test_log 2>&1
    ```
    
    There was only one build failure; the rest were successful:
    
    ```
    $ grep "BUILD SUCCESS" test_log | wc -l
          26
    $ grep "BUILD FAILURE" test_log | wc -l
           1
    ```
    
    It was for Minikube v1.7.2, and the log is:
    
    ```
    KubernetesSuite:
    *** RUN ABORTED ***
      java.lang.AssertionError: assertion failed: Unsupported Minikube version is detected: minikube version: v1.7.2.For integration testing Minikube version 1.7.3 or greater is expected.
      at scala.Predef$.assert(Predef.scala:223)
      at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:52)
      at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
      at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
      at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
      at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
      at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
      at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:43)
      at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273)
      at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271)
      ...
    ```
    
    Moreover, I also tested with multiple k8s cluster contexts.
    
    Closes #31829 from attilapiros/SPARK-34736.
    
    Lead-authored-by: “attilapiros” <[email protected]>
    Co-authored-by: attilapiros <[email protected]>
    Signed-off-by: attilapiros <[email protected]>
    attilapiros committed May 10, 2021
    Commit 8b94eff

Commits on May 11, 2021

  1. [SPARK-35088][SQL][FOLLOWUP] Improve the error message for Sequence expression
    
    ### What changes were proposed in this pull request?
    The Sequence expression outputs a confusing error message.
    This PR fixes the issue.
    
    ### Why are the changes needed?
    Improve the error message for Sequence expression
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, this PR updates the error message of the Sequence expression.
    
    ### How was this patch tested?
    Tests updated.
    
    Closes #32492 from beliefer/SPARK-35088-followup.
    
    Authored-by: gengjiaan <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    beliefer authored and HyukjinKwon committed May 11, 2021
    Commit 44bd0a8
  2. [SPARK-35363][SQL] Refactor sort merge join code-gen to be agnostic to join type
    
    ### What changes were proposed in this pull request?
    
    This is a prerequisite of #32476, as discussed in #32476 (comment). It refactors the sort merge join code-gen to use streamed/buffered terminology, which makes the code-gen agnostic to the join type and extensible to join types other than inner join.
    
    ### Why are the changes needed?
    
    Prerequisite of #32476.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit test in `InnerJoinSuite.scala` for inner join code-gen.
    
    Closes #32495 from c21/smj-refactor.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    c21 authored and maropu committed May 11, 2021
    Commit c4ca232
  3. [SPARK-35146][SQL] Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis.scala
    
    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - BOOL_AGG
    - COUNT_IF
    - CURRENT_LIKE
    - RUNTIME_REPLACEABLE
    
    Added tree traversal pruning to the following rules:
    - ReplaceExpressions
    - RewriteNonCorrelatedExists
    - ComputeCurrentTime
    - GetCurrentDatabaseAndCatalog
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
    
    Performance improvement (org.apache.spark.sql.TPCDSQuerySuite):
    Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
    --- | --- | --- | ---
    ReplaceExpressions | 27546369 | 19753804 | 0.72
    RewriteNonCorrelatedExists | 17304883 | 2086194 | 0.12
    ComputeCurrentTime | 35751301 | 19984477 | 0.56
    GetCurrentDatabaseAndCatalog | 37230787 | 18874013 | 0.51
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32461 from sigmod/finish_analysis.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed May 11, 2021
    Commit 7c9a9ec
  4. [SPARK-35229][WEBUI] Limit the maximum number of items on the timeline view
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to introduces three new configurations to limit the maximum number of jobs/stages/executors on the timeline view.
    
    ### Why are the changes needed?
    
    If the number of items on the timeline view grows beyond about 1,000, rendering can be significantly slow.
    https://issues.apache.org/jira/browse/SPARK-35229
    
    The maximum number of tasks on the timeline is already limited by `spark.ui.timeline.tasks.maximum`, so I propose to mitigate this issue in the same manner.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. The maximum number of items shown on the timeline view is limited.
    I propose a default value of 500 for jobs and stages, and 250 for executors.
    Since an executor has at most 2 items (added and removed), 250 is chosen.
    
    ### How was this patch tested?
    
    I manually confirmed this change works with the following procedure.
    ```
    # launch a cluster
    $ bin/spark-shell --conf spark.ui.retainedDeadExecutors=300 --master "local-cluster[4, 1, 1024]"
    
    // Confirm the maximum number of jobs
    (1 to 1000).foreach { _ => sc.parallelize(List(1)).collect }
    
    // Confirm the maximum number of stages
    var df = sc.parallelize(1 to 2)
    (1 to 1000).foreach { i =>  df = df.repartition(i % 5 + 1) }
    df.collect
    
    // Confirm the maximum number of executors
    (1 to 300).foreach { _ => try sc.parallelize(List(1)).foreach { _ => System.exit(0) } catch { case e => }}
    ```
    
    Screenshots here.
    ![jobs_limited](https://user-images.githubusercontent.com/4736016/116386937-3e8c4a00-a855-11eb-8f4c-151cf7ddd3b8.png)
    ![stages_limited](https://user-images.githubusercontent.com/4736016/116386990-49df7580-a855-11eb-9f71-8e129e3336ab.png)
    ![executors_limited](https://user-images.githubusercontent.com/4736016/116387009-4f3cc000-a855-11eb-8697-a2eb4c9c99e6.png)
    
    Closes #32381 from sarutak/mitigate-timeline-issue.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sarutak authored and gengliangwang committed May 11, 2021
    Commit 2b6640a
  5. [SPARK-35372][BUILD] Increase stack size for Scala compilation in Maven build
    
    ### What changes were proposed in this pull request?
    
    This PR increases the stack size for Scala compilation in Maven build to fix the error:
    
    ```
    java.lang.StackOverflowError
    scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1741)
    scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
    scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
    scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
    scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
    scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
    scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
    scala.reflect.internal.Trees.itransform(Trees.scala:1404)
    scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
    scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
    scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
    scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
    scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
    scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.scala$reflect$internal$Trees$UnderConstructionTransformer$$super$transform(ExplicitOuter.scala:212)
    scala.reflect.internal.Trees$UnderConstructionTransformer.transform(Trees.scala:1745)
    scala.reflect.internal.Trees$UnderConstructionTransformer.transform$(Trees.scala:1740)
    scala.tools.nsc.transform.ExplicitOuter$OuterPathTransformer.transform(ExplicitOuter.scala:289)
    scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:477)
    scala.tools.nsc.transform.ExplicitOuter$ExplicitOuterTransformer.transform(ExplicitOuter.scala:330)
    scala.reflect.internal.Trees.itransform(Trees.scala:1383)
    ```
    
    See https://github.com/apache/spark/runs/2554067779
    
    ### Why are the changes needed?
    
    To recover JDK 11 compilation
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    CI in this PR will test it out.
    
    Closes #32502 from HyukjinKwon/SPARK-35372.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    HyukjinKwon authored and sarutak committed May 11, 2021
    Commit b59d5ab

Commits on May 12, 2021

  1. [SPARK-35375][INFRA] Use Jinja2 < 3.0.0 for Python linter dependency in GA
    
    ### What changes were proposed in this pull request?
    
    As of a few hours ago, the Python linter fails in GA.
    The latest Jinja 3.0.0 seems to cause this failure.
    https://pypi.org/project/Jinja2/
    
    ```
    Run ./dev/lint-python
    starting python compilation test...
    python compilation succeeded.
    
    starting pycodestyle test...
    pycodestyle checks passed.
    
    starting flake8 test...
    flake8 checks passed.
    
    starting mypy test...
    mypy checks passed.
    
    starting sphinx-build tests...
    sphinx-build checks failed:
    Running Sphinx v3.0.4
    making output directory... done
    [autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst
    
    Exception occurred:
      File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
        {% if '__init__' in methods %}
    jinja2.exceptions.UndefinedError: 'methods' is undefined
    The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want to report the issue to the developers.
    Please also report this if it was a user error, so that a better error message can be provided next time.
    A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
    make: *** [Makefile:20: html] Error 2
    
    re-running make html to print full warning list:
    Running Sphinx v3.0.4
    making output directory... done
    [autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst
    
    Exception occurred:
      File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code
        {% if '__init__' in methods %}
    jinja2.exceptions.UndefinedError: 'methods' is undefined
    The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want to report the issue to the developers.
    Please also report this if it was a user error, so that a better error message can be provided next time.
    A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
    make: *** [Makefile:20: html] Error 2
    Error: Process completed with exit code 2.
    ```
    
    ### Why are the changes needed?
    
    To recover GA build.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA.
    
    Closes #32509 from sarutak/fix-python-lint-error.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    sarutak authored and HyukjinKwon committed May 12, 2021
    af0d99c
  2. [SPARK-35361][SQL] Improve performance for ApplyFunctionExpression

    ### What changes were proposed in this pull request?
    
    In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row.
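
    For illustration, a minimal self-contained sketch of the hoisting pattern (simplified names; this is not the actual `ApplyFunctionExpression` code):

    ```
    // Minimal sketch: hoist zipWithIndex out of the per-row hot path.
    object ZipWithIndexHoisting {
      val inputs: Seq[String] = Seq("x", "y", "z")

      // Before: allocates a fresh Seq[(String, Int)] for every input row.
      def evalPerRow(row: Array[Int]): Int =
        inputs.zipWithIndex.map { case (_, i) => row(i) }.sum

      // After: the (input, index) pairs are built once and reused per row.
      private val inputsWithIndex = inputs.zipWithIndex
      def evalHoisted(row: Array[Int]): Int =
        inputsWithIndex.map { case (_, i) => row(i) }.sum
    }
    ```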
    
    ### Why are the changes needed?
    
    When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below:
    
    <img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png">
    
    By moving it out of the loop, I sometimes see a 2x speedup in `V2FunctionBenchmark`. For instance:
    
    Before:
    ```
    scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    native_long_add                                                                         32437          32896         434         15.4          64.9       1.0X
    java_long_add_default                                                                   85675          97045         NaN          5.8         171.3       0.4X
    ```
    
    After:
    ```
    scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    native_long_add                                                                         30182          30387         279         16.6          60.4       1.0X
    java_long_add_default                                                                   42862          43009         209         11.7          85.7       0.7X
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests
    
    Closes #32507 from sunchao/SPARK-35361.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    sunchao authored and HyukjinKwon committed May 12, 2021
    78221bd
  3. [MINOR][DOCS] Avoid some python docs where first sentence has "e.g." …

    …or similar
    
    ### What changes were proposed in this pull request?
    
    Rework some Python docs whose first sentence contains "e.g." or similar, as the period causes the generated docs to show only half of the first sentence as the summary.
    
    ### Why are the changes needed?
    
    See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Only changes docs.
    
    ### How was this patch tested?
    
    Manual testing of docs.
    
    Closes #32508 from srowen/TruncatedPythonDesc.
    
    Authored-by: Sean Owen <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    srowen authored and HyukjinKwon committed May 12, 2021
    a189be8
  4. [SPARK-35377][INFRA] Add JS linter to GA

    ### What changes were proposed in this pull request?
    
    SPARK-35175 (#32274) added a linter for JS so let's add it to GA.
    
    ### Why are the changes needed?
    
    To keep the JS code clean.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA
    
    Closes #32512 from sarutak/ga-lintjs.
    
    Authored-by: Kousuke Saruta <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    sarutak authored and HyukjinKwon committed May 12, 2021
    7e3446a
  5. [SPARK-35381][R] Fix lambda variable name issues in nested higher ord…

    …er functions at R APIs
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the same issue as #32424
    
    ```r
    df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
    collect(select(
      df,
      array_transform("numbers", function(number) {
        array_transform("letters", function(latter) {
          struct(alias(number, "n"), alias(latter, "l"))
        })
      })
    ))
    ```
    
    **Before:**
    
    ```
    ... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
    ```
    
    **After:**
    
    ```
    ... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
    ```
    
    ### Why are the changes needed?
    
    To produce the correct results.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it fixes the results to be correct as mentioned above.
    
    ### How was this patch tested?
    
    Manually tested as above, and unit test was added.
    
    Closes #32517 from HyukjinKwon/SPARK-35381.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon committed May 12, 2021
    ecb48cc
  6. [SPARK-35243][SQL] Support columnar execution on ANSI interval types

    ### What changes were proposed in this pull request?
    Add columnar execution support for the ANSI interval types, i.e. YearMonthIntervalType and DayTimeIntervalType.
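
    As a usage sketch (assuming an active `SparkSession` named `spark` and the ANSI interval literal syntax), caching such a table now goes through the in-memory columnar path:

    ```
    // Sketch: cache a table whose columns use ANSI interval types.
    // Assumes an active SparkSession `spark`.
    val df = spark.sql(
      """SELECT INTERVAL '1-2' YEAR TO MONTH AS ym,
        |       INTERVAL '1 10:30:40' DAY TO SECOND AS dt""".stripMargin)
    df.cache()   // stored via the in-memory columnar cache
    df.count()   // materializes the cached columnar batches
    ```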
    
    ### Why are the changes needed?
    To support caching tables with ANSI interval types.
    
    ### Does this PR introduce _any_ user-facing change?

    ### How was this patch tested?
    run ./dev/lint-java
    run ./dev/scalastyle
    run test: CachedTableSuite
    run test: ColumnTypeSuite
    
    Closes #32452 from Peng-Lei/SPARK-35243.
    
    Lead-authored-by: PengLei <[email protected]>
    Co-authored-by: Lei Peng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    2 people authored and HyukjinKwon committed May 12, 2021
    82c520a
  7. [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optim…

    …izer.scala
    
    ### What changes were proposed in this pull request?
    
    Added the following TreePattern enums:
    - ALIAS
    - AND_OR
    - AVERAGE
    - GENERATE
    - INTERSECT
    - SORT
    - SUM
    - DISTINCT_LIKE
    - PROJECT
    - REPARTITION_OPERATION
    - UNION
    
    Added tree traversal pruning to the following rules in Optimizer.scala:
    - EliminateAggregateFilter
    - RemoveRedundantAggregates
    - RemoveNoopOperators
    - RemoveNoopUnion
    - LimitPushDown
    - ColumnPruning
    - CollapseRepartition
    - OptimizeRepartition
    - OptimizeWindowFunctions
    - CollapseWindow
    - TransposeWindow
    - InferFiltersFromGenerate
    - InferFiltersFromConstraints
    - CombineUnions
    - CombineFilters
    - EliminateSorts
    - PruneFilters
    - EliminateLimits
    - DecimalAggregates
    - ConvertToLocalRelation
    - ReplaceDistinctWithAggregate
    - ReplaceIntersectWithSemiJoin
    - ReplaceExceptWithAntiJoin
    - RewriteExceptAll
    - RewriteIntersectAll
    - RemoveLiteralFromGroupExpressions
    - RemoveRepetitionFromGroupExpressions
    - OptimizeLimitZero
    
    ### Why are the changes needed?
    
    Reduce the number of tree traversals and hence improve the query compilation latency.
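
    For intuition, a self-contained toy analogue of the idea (the real rules use Catalyst's `TreePattern` bit sets and `transformWithPruning`): each node caches whether a pattern occurs anywhere in its subtree, and a rule returns pattern-free subtrees untouched without descending into them.

    ```
    object PruningSketch {
      // Toy analogue of TreePattern-based pruning: precompute, per subtree,
      // whether the pattern of interest (here: Alias nodes) occurs in it.
      sealed trait Expr { val containsAlias: Boolean }
      case class Lit(v: Int) extends Expr { val containsAlias = false }
      case class Alias(child: Expr, name: String) extends Expr { val containsAlias = true }
      case class Add(l: Expr, r: Expr) extends Expr {
        val containsAlias: Boolean = l.containsAlias || r.containsAlias
      }

      // An "eliminate aliases" rewrite that prunes its traversal: subtrees
      // whose cached bit is false are returned as-is, with no recursion.
      def stripAliases(e: Expr): Expr =
        if (!e.containsAlias) e // pruned: the pattern cannot occur down here
        else e match {
          case Alias(c, _) => stripAliases(c)
          case Add(l, r)   => Add(stripAliases(l), stripAliases(r))
          case other       => other
        }
    }
    ```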
    
    perf diff:
    Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
    RemoveRedundantAggregates | 51290766 | 67070477 | 1.31
    RemoveNoopOperators | 192371141 | 196631275 | 1.02
    RemoveNoopUnion | 49222561 | 43266681 | 0.88
    LimitPushDown | 40885185 | 21672646 | 0.53
    ColumnPruning | 2003406120 | 1285562149 | 0.64
    CollapseRepartition | 40648048 | 72646515 | 1.79
    OptimizeRepartition | 37813850 | 20600803 | 0.54
    OptimizeWindowFunctions | 174426904 | 46741409 | 0.27
    CollapseWindow | 38959957 | 24542426 | 0.63
    TransposeWindow | 33533191 | 20414930 | 0.61
    InferFiltersFromGenerate | 21758688 | 15597344 | 0.72
    InferFiltersFromConstraints | 518009794 | 493282321 | 0.95
    CombineUnions | 67694022 | 70550382 | 1.04
    CombineFilters | 35265060 | 29005424 | 0.82
    EliminateSorts | 57025509 | 19795776 | 0.35
    PruneFilters | 433964815 | 465579200 | 1.07
    EliminateLimits | 44275393 | 24476859 | 0.55
    DecimalAggregates | 83143172 | 28816090 | 0.35
    ReplaceDistinctWithAggregate | 21783760 | 18287489 | 0.84
    ReplaceIntersectWithSemiJoin | 22311271 | 16566393 | 0.74
    ReplaceExceptWithAntiJoin | 23838520 | 16588808 | 0.70
    RewriteExceptAll | 32750296 | 29421957 | 0.90
    RewriteIntersectAll | 29760454 | 21243599 | 0.71
    RemoveLiteralFromGroupExpressions | 28151861 | 25270947 | 0.90
    RemoveRepetitionFromGroupExpressions | 29587030 | 23447041 | 0.79
    OptimizeLimitZero | 18081943 | 15597344 | 0.86
    **Accumulated | 4129959311 | 3112676285 | 0.75**
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32439 from sigmod/optimizer.
    
    Authored-by: Yingyi Bu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    sigmod authored and gengliangwang committed May 12, 2021
    d92018e
  8. [SPARK-29145][SQL][FOLLOWUP] Clean up code about support sub-queries …

    …in join conditions
    
    ### What changes were proposed in this pull request?
    Clean up the code as discussed in #25854 (comment).
    
    ### Why are the changes needed?
    Clean code
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UTs.
    
    Closes #32499 from AngersZhuuuu/SPARK-29145-fix.
    
    Authored-by: Angerszhuuuu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    AngersZhuuuu authored and cloud-fan committed May 12, 2021
    ed05954
  9. [SPARK-35357][GRAPHX] Allow to turn off the normalization applied by …

    …static PageRank utilities
    
    ### What changes were proposed in this pull request?
    
    Overload the methods `PageRank.runWithOptions` and `PageRank.runWithOptionsWithPreviousPageRank` (so as not to break any user-facing signature) with a `normalized` parameter that describes "whether or not to normalize the rank sum".
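
    For intuition, the normalization being made optional here rescales the rank vector so that it sums to 1; a toy sketch of that step (illustrative only, not the GraphX code):

    ```
    // Toy sketch of rank-sum normalization. With sinks in the graph the raw
    // rank sum drifts below 1; chained runs can skip this rescaling on
    // intermediate results and apply it only once, at the very end.
    val ranks: Map[Long, Double] = Map(1L -> 0.3, 2L -> 0.5, 3L -> 0.1) // sums to 0.9
    val total = ranks.values.sum
    val normalizedRanks = ranks.map { case (id, r) => id -> r / total }  // sums to 1.0
    ```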
    
    ### Why are the changes needed?
    
    https://issues.apache.org/jira/browse/SPARK-35357
    
    When dealing with a non-negligible proportion of sinks in a graph, algorithms based on incremental updates of ranks can get a **precision gain for free** if they are allowed to manipulate non-normalized ranks.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    By adding a unit test that verifies that (even when dealing with a graph containing a sink) we end up with the same result for both these scenarios:
    a)
      - Run **6 iterations** of pagerank in a row using `PageRank.runWithOptions` with **normalization enabled**
    
    b)
      - Run **2 iterations** using `PageRank.runWithOptions` with **normalization disabled**
      - Resume from the `preRankGraph1` and run **2 more iterations** using `PageRank.runWithOptionsWithPreviousPageRank` with **normalization disabled**
      - Finally resume from the `preRankGraph2` and run **2 more iterations** using `PageRank.runWithOptionsWithPreviousPageRank` with **normalization enabled**
    
    Closes #32485 from bonnal-enzo/make-pagerank-normalization-optional.
    
    Authored-by: Enzo Bonnal <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    ebonnal authored and srowen committed May 12, 2021
    402375b
  10. [SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4

    ### What changes were proposed in this pull request?
    
    This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
    The major changes of this upgrade are as follows:
     - Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
     - Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
     - Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
     - Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).
    
    For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html
    
    NOTE1: I've checked that there is no obvious performance regression. For all the data, see this link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing
    
    NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit was reverted in #29495 because of a correctness issue. Recently, #32374 checked whether Spark could land on v3.1.3, but a new bug was found there. These known issues have been fixed in v3.1.4 by the following PRs:
     - janino-compiler/janino#145
     - janino-compiler/janino#146
    
    ### Why are the changes needed?
    
    janino v3.0.X is no longer maintained.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA passed.
    
    Closes #32455 from maropu/janino_v3.1.4.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    maropu authored and srowen committed May 12, 2021
    101b0cc
  11. [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludov…

    …ic.netlib:2.0
    
    ### What changes were proposed in this pull request?
    
    Bump to `dev.ludovic.netlib:2.0`, which provides JNI-based wrappers for BLAS, ARPACK, and LAPACK. These do not take dependencies on GPL or LGPL libraries, which allows providing out-of-the-box support for hardware acceleration when a native library is present (it is still up to the end user to install such a library on their system, e.g. OpenBLAS, Intel MKL, or libarpack2).
    
    ### Why are the changes needed?
    
    Great performance improvements for ML-related workloads on vanilla distributions of Spark.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users can now take advantage of hardware acceleration as long as a native library is installed (like OpenBLAS, Intel MKL, or libarpack2).
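
    A hedged sketch of checking which backend gets picked at runtime (package and class names follow the `dev.ludovic.netlib` 2.0 layout visible in the benchmark output below; the `getInstance`-style factory is assumed here for illustration):

    ```
    // Illustrative only: inspect the chosen BLAS backend. The getInstance
    // factory is assumed to fall back from the JNI-backed implementation to
    // pure Java when no native library (OpenBLAS, Intel MKL, ...) is found.
    import dev.ludovic.netlib.blas.BLAS

    val blas = BLAS.getInstance()
    println(s"BLAS backend: ${blas.getClass.getName}")
    // e.g. dev.ludovic.netlib.blas.JNIBLAS with a native library installed,
    // or dev.ludovic.netlib.blas.Java11BLAS / F2jBLAS otherwise.
    ```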
    
    ### How was this patch tested?
    
    Spark test-suite + dev.ludovic.netlib testsuite.
    
    #### JDK8:
    ```
    [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.Java8BLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
    [info]
    [info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        220            226           6        454.9           2.2       1.0X
    [info] java                       221            228           5        451.9           2.2       1.0X
    [info] native                     209            215           5        478.7           2.1       1.1X
    [info]
    [info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        121            125           3        823.3           1.2       1.0X
    [info] java                       121            125           3        824.3           1.2       1.0X
    [info] native                     101            105           3        988.4           1.0       1.2X
    [info]
    [info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        212            219           6        470.9           2.1       1.0X
    [info] java                       208            212           4        481.0           2.1       1.0X
    [info] native                     209            215           5        478.5           2.1       1.0X
    [info]
    [info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        114            119           3        878.9           1.1       1.0X
    [info] java                        99            105           3       1011.4           1.0       1.2X
    [info] native                      97            103           3       1026.7           1.0       1.2X
    [info]
    [info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        108            111           2        925.9           1.1       1.0X
    [info] java                        71             73           2       1414.9           0.7       1.5X
    [info] native                      54             56           2       1847.0           0.5       2.0X
    [info]
    [info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         96             97           2       1046.8           1.0       1.0X
    [info] java                        47             48           1       2129.8           0.5       2.0X
    [info] native                      29             30           1       3404.7           0.3       3.3X
    [info]
    [info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        139            143           2        718.2           1.4       1.0X
    [info] java                        46             47           1       2171.2           0.5       3.0X
    [info] native                      44             46           2       2261.8           0.4       3.1X
    [info]
    [info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        154            157           4        651.0           1.5       1.0X
    [info] java                        40             42           1       2469.3           0.4       3.8X
    [info] native                      26             27           1       3787.6           0.3       5.8X
    [info]
    [info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        185            195           8        541.0           1.8       1.0X
    [info] java                       186            196           7        538.5           1.9       1.0X
    [info] native                     177            187           7        564.1           1.8       1.0X
    [info]
    [info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         98            102           3       1016.2           1.0       1.0X
    [info] java                        98            102           3       1017.8           1.0       1.0X
    [info] native                      87             91           3       1143.2           0.9       1.1X
    [info]
    [info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         68             70           1       1474.7           0.7       1.0X
    [info] java                        51             52           1       1973.0           0.5       1.3X
    [info] native                      30             32           1       3298.8           0.3       2.2X
    [info]
    [info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         96             99           2       1037.9           1.0       1.0X
    [info] java                        50             51           1       1999.6           0.5       1.9X
    [info] native                      30             31           1       3368.1           0.3       3.2X
    [info]
    [info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         59             61           1       1688.7           0.6       1.0X
    [info] java                        41             42           1       2461.9           0.4       1.5X
    [info] native                      15             16           1       6593.0           0.2       3.9X
    [info]
    [info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         90             92           1       1116.2           0.9       1.0X
    [info] java                        39             40           1       2565.8           0.4       2.3X
    [info] native                      15             16           1       6594.2           0.2       5.9X
    [info]
    [info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        192            202           7        520.5           1.9       1.0X
    [info] java                       203            214           7        491.9           2.0       0.9X
    [info] native                     176            187           7        568.8           1.8       1.1X
    [info]
    [info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         59             61           1        846.1           1.2       1.0X
    [info] java                        38             39           1       1313.5           0.8       1.6X
    [info] native                      24             27           1       2047.8           0.5       2.4X
    [info]
    [info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         97            101           3        515.4           1.9       1.0X
    [info] java                        97            101           2        515.1           1.9       1.0X
    [info] native                      88             91           3        569.1           1.8       1.1X
    [info]
    [info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        169            174           3        295.4           3.4       1.0X
    [info] java                       169            174           3        295.4           3.4       1.0X
    [info] native                     160            165           4        312.2           3.2       1.1X
    [info]
    [info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        561            577          13       1782.3           0.6       1.0X
    [info] java                       225            231           4       4446.2           0.2       2.5X
    [info] native                      31             32           3      32473.1           0.0      18.2X
    [info]
    [info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        570            584           9       1754.8           0.6       1.0X
    [info] java                       224            230           4       4457.3           0.2       2.5X
    [info] native                      31             32           1      32493.4           0.0      18.5X
    [info]
    [info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        855            866           6       1169.2           0.9       1.0X
    [info] java                       224            228           3       4466.9           0.2       3.8X
    [info] native                      31             32           1      32395.5           0.0      27.7X
    [info]
    [info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                       1328           1344           8        752.8           1.3       1.0X
    [info] java                       224            230           4       4458.9           0.2       5.9X
    [info] native                      31             32           1      32201.8           0.0      42.8X
    [info]
    [info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        534            541           5       1873.0           0.5       1.0X
    [info] java                       220            224           3       4542.8           0.2       2.4X
    [info] native                      15             16           1      66803.1           0.0      35.7X
    [info]
    [info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        544            551           6       1839.6           0.5       1.0X
    [info] java                       220            224           4       4538.2           0.2       2.5X
    [info] native                      15             16           1      65589.9           0.0      35.7X
    [info]
    [info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        833            845          21       1201.0           0.8       1.0X
    [info] java                       220            224           3       4548.7           0.2       3.8X
    [info] native                      15             16           1      66603.2           0.0      55.5X
    [info]
    [info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        899            907           5       1112.9           0.9       1.0X
    [info] java                       221            224           2       4531.6           0.2       4.1X
    [info] native                      15             16           1      65944.9           0.0      59.3X
    ```
    
    #### JDK11:
    ```
    [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.Java11BLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
    [info]
    [info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        195            200           3        512.2           2.0       1.0X
    [info] java                       197            202           3        507.0           2.0       1.0X
    [info] native                     184            189           4        543.0           1.8       1.1X
    [info]
    [info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        108            112           3        921.8           1.1       1.0X
    [info] java                       101            105           3        989.4           1.0       1.1X
    [info] native                      87             91           3       1147.1           0.9       1.2X
    [info]
    [info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        187            191           3        535.1           1.9       1.0X
    [info] java                       182            188           3        548.8           1.8       1.0X
    [info] native                     178            182           3        562.2           1.8       1.1X
    [info]
    [info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        110            114           3        909.3           1.1       1.0X
    [info] java                        86             93           4       1159.3           0.9       1.3X
    [info] native                      86             90           3       1162.4           0.9       1.3X
    [info]
    [info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        106            108           2        943.6           1.1       1.0X
    [info] java                        70             71           2       1426.8           0.7       1.5X
    [info] native                      54             56           2       1835.4           0.5       1.9X
    [info]
    [info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         96             97           1       1047.1           1.0       1.0X
    [info] java                        43             44           1       2331.9           0.4       2.2X
    [info] native                      29             30           1       3392.1           0.3       3.2X
    [info]
    [info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        114            115           2        880.7           1.1       1.0X
    [info] java                        42             43           1       2398.1           0.4       2.7X
    [info] native                      45             46           1       2233.3           0.4       2.5X
    [info]
    [info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        140            143           2        714.6           1.4       1.0X
    [info] java                        28             29           1       3531.0           0.3       4.9X
    [info] native                      26             27           1       3820.0           0.3       5.3X
    [info]
    [info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        156            166           7        641.3           1.6       1.0X
    [info] java                       158            167           6        633.2           1.6       1.0X
    [info] native                     150            160           7        664.8           1.5       1.0X
    [info]
    [info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         85             88           2       1181.7           0.8       1.0X
    [info] java                        85             88           2       1176.0           0.9       1.0X
    [info] native                      75             78           2       1333.2           0.8       1.1X
    [info]
    [info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         58             59           1       1731.1           0.6       1.0X
    [info] java                        41             43           1       2415.5           0.4       1.4X
    [info] native                      30             31           1       3293.9           0.3       1.9X
    [info]
    [info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         94             96           1       1063.4           0.9       1.0X
    [info] java                        41             42           1       2435.8           0.4       2.3X
    [info] native                      30             30           1       3379.8           0.3       3.2X
    [info]
    [info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         44             45           1       2278.9           0.4       1.0X
    [info] java                        37             38           0       2686.8           0.4       1.2X
    [info] native                      15             16           1       6555.4           0.2       2.9X
    [info]
    [info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         88             89           1       1142.1           0.9       1.0X
    [info] java                        33             34           1       3010.7           0.3       2.6X
    [info] native                      15             16           1       6553.9           0.2       5.7X
    [info]
    [info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        164            172           4        609.4           1.6       1.0X
    [info] java                       163            172           5        612.6           1.6       1.0X
    [info] native                     150            159           4        667.0           1.5       1.1X
    [info]
    [info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         49             50           1       1029.4           1.0       1.0X
    [info] java                        41             42           1       1209.4           0.8       1.2X
    [info] native                      25             27           1       2029.2           0.5       2.0X
    [info]
    [info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         80             85           3        622.2           1.6       1.0X
    [info] java                        80             85           3        622.4           1.6       1.0X
    [info] native                      75             79           3        668.7           1.5       1.1X
    [info]
    [info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        137            142           3        364.1           2.7       1.0X
    [info] java                       139            142           2        360.4           2.8       1.0X
    [info] native                     131            135           3        380.4           2.6       1.0X
    [info]
    [info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        517            525           5       1935.5           0.5       1.0X
    [info] java                       213            216           3       4704.8           0.2       2.4X
    [info] native                      31             31           1      32705.6           0.0      16.9X
    [info]
    [info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        589            601           6       1698.6           0.6       1.0X
    [info] java                       213            217           3       4693.3           0.2       2.8X
    [info] native                      31             32           1      32498.9           0.0      19.1X
    [info]
    [info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        851            865           6       1175.3           0.9       1.0X
    [info] java                       212            216           3       4717.0           0.2       4.0X
    [info] native                      30             32           1      32903.0           0.0      28.0X
    [info]
    [info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                       1301           1316           6        768.4           1.3       1.0X
    [info] java                       212            216           2       4717.4           0.2       6.1X
    [info] native                      31             32           1      32606.0           0.0      42.4X
    [info]
    [info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        454            460           2       2203.0           0.5       1.0X
    [info] java                       208            212           3       4803.8           0.2       2.2X
    [info] native                      15             16           0      66586.0           0.0      30.2X
    [info]
    [info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        529            536           4       1889.7           0.5       1.0X
    [info] java                       208            212           3       4798.6           0.2       2.5X
    [info] native                      15             16           1      66751.4           0.0      35.3X
    [info]
    [info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        830            840           5       1205.1           0.8       1.0X
    [info] java                       208            211           2       4814.1           0.2       4.0X
    [info] native                      15             15           1      67676.4           0.0      56.2X
    [info]
    [info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        894            907           7       1118.7           0.9       1.0X
    [info] java                       208            211           3       4809.6           0.2       4.3X
    [info] native                      15             16           1      66675.2           0.0      59.6X
    ```
    
    #### JDK16:
    ```
    [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic
    [info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
    [info]
    [info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
    [info] javaBLAS   = dev.ludovic.netlib.blas.VectorBLAS
    [info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
    [info]
    [info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        193            199           3        517.5           1.9       1.0X
    [info] java                       181            186           4        553.2           1.8       1.1X
    [info] native                     181            185           5        553.6           1.8       1.1X
    [info]
    [info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        108            112           2        925.1           1.1       1.0X
    [info] java                        88             91           3       1138.6           0.9       1.2X
    [info] native                      87             91           3       1144.2           0.9       1.2X
    [info]
    [info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        184            189           3        542.5           1.8       1.0X
    [info] java                       181            185           3        552.8           1.8       1.0X
    [info] native                     179            183           2        558.0           1.8       1.0X
    [info]
    [info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         97            101           3       1031.6           1.0       1.0X
    [info] java                        86             90           2       1163.7           0.9       1.1X
    [info] native                      85             88           2       1182.9           0.8       1.1X
    [info]
    [info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        107            109           2        932.4           1.1       1.0X
    [info] java                        54             56           2       1846.7           0.5       2.0X
    [info] native                      54             56           2       1846.7           0.5       2.0X
    [info]
    [info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         96             97           1       1043.6           1.0       1.0X
    [info] java                        29             30           1       3439.3           0.3       3.3X
    [info] native                      29             30           1       3423.9           0.3       3.3X
    [info]
    [info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        121            123           2        829.8           1.2       1.0X
    [info] java                        32             32           1       3171.3           0.3       3.8X
    [info] native                      45             46           1       2246.2           0.4       2.7X
    [info]
    [info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        142            144           2        705.9           1.4       1.0X
    [info] java                        15             16           1       6585.8           0.2       9.3X
    [info] native                      26             27           1       3839.5           0.3       5.4X
    [info]
    [info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        157            165           5        635.6           1.6       1.0X
    [info] java                       151            159           5        664.0           1.5       1.0X
    [info] native                     151            160           5        663.6           1.5       1.0X
    [info]
    [info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         85             89           2       1172.3           0.9       1.0X
    [info] java                        75             79           3       1337.3           0.7       1.1X
    [info] native                      75             79           2       1335.5           0.7       1.1X
    [info]
    [info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         58             59           1       1731.5           0.6       1.0X
    [info] java                        28             29           1       3544.2           0.3       2.0X
    [info] native                      30             31           1       3306.2           0.3       1.9X
    [info]
    [info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         90             92           1       1108.3           0.9       1.0X
    [info] java                        28             28           1       3622.5           0.3       3.3X
    [info] native                      30             31           1       3381.3           0.3       3.1X
    [info]
    [info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         44             45           1       2284.7           0.4       1.0X
    [info] java                        14             15           1       7034.0           0.1       3.1X
    [info] native                      15             16           1       6643.7           0.2       2.9X
    [info]
    [info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         85             86           1       1177.4           0.8       1.0X
    [info] java                        15             15           1       6886.1           0.1       5.8X
    [info] native                      15             16           1       6560.1           0.2       5.6X
    [info]
    [info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        164            173           6        608.1           1.6       1.0X
    [info] java                       148            157           5        675.2           1.5       1.1X
    [info] native                     152            160           5        659.9           1.5       1.1X
    [info]
    [info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         61             63           1        815.4           1.2       1.0X
    [info] java                        16             17           1       3104.3           0.3       3.8X
    [info] native                      24             27           1       2071.9           0.5       2.5X
    [info]
    [info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                         81             85           2        616.4           1.6       1.0X
    [info] java                        81             85           2        614.7           1.6       1.0X
    [info] native                      75             78           2        669.5           1.5       1.1X
    [info]
    [info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        138            141           3        362.7           2.8       1.0X
    [info] java                       137            140           2        365.3           2.7       1.0X
    [info] native                     131            134           2        382.9           2.6       1.1X
    [info]
    [info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        525            544           8       1906.2           0.5       1.0X
    [info] java                        61             68           3      16358.1           0.1       8.6X
    [info] native                      31             32           1      32623.7           0.0      17.1X
    [info]
    [info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        580            598          12       1724.5           0.6       1.0X
    [info] java                        61             68           4      16302.5           0.1       9.5X
    [info] native                      30             32           1      32962.8           0.0      19.1X
    [info]
    [info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        829            838           4       1206.2           0.8       1.0X
    [info] java                        61             69           3      16339.7           0.1      13.5X
    [info] native                      30             31           1      33231.9           0.0      27.6X
    [info]
    [info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                       1352           1363           5        739.6           1.4       1.0X
    [info] java                        61             69           3      16347.0           0.1      22.1X
    [info] native                      31             32           1      32740.3           0.0      44.3X
    [info]
    [info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        482            493           7       2073.1           0.5       1.0X
    [info] java                        35             38           2      28315.3           0.0      13.7X
    [info] native                      15             15           1      67579.7           0.0      32.6X
    [info]
    [info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        472            482           4       2119.0           0.5       1.0X
    [info] java                        36             38           2      28138.1           0.0      13.3X
    [info] native                      15             16           1      66616.5           0.0      31.4X
    [info]
    [info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        823            830           5       1215.2           0.8       1.0X
    [info] java                        35             38           2      28681.4           0.0      23.6X
    [info] native                      15             15           1      67908.4           0.0      55.9X
    [info]
    [info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    [info] -----------------------------------------------------------------------------------------------
    [info] f2j                        896            908           7       1115.8           0.9       1.0X
    [info] java                        35             38           2      28402.0           0.0      25.5X
    [info] native                      15             16           0      66691.2           0.0      59.8X
    ```
    
    TODO:
    - [x] update documentation in `docs/` and `docs/ml-linalg-guide.md` referring to `com.github.fommil.netlib`
    - [ ] merge luhenry/netlib#1 with all feedback from this PR + remove references to snapshot repositories in `pom.xml` and `project/SparkBuild.scala`.
    
    Closes #32415 from luhenry/master.
    
    Authored-by: Ludovic Henry <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    luhenry authored and srowen committed May 12, 2021
    Commit: b52d47a
  12. [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join

    ### What changes were proposed in this pull request?
    
    This PR is to add code-gen support for LEFT OUTER / RIGHT OUTER sort merge join. Currently sort merge join only supports inner join type (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L374). There's no fundamental reason why we cannot support code-gen for other join types. Here we add code-gen for LEFT OUTER / RIGHT OUTER join. Will submit followup PRs to add LEFT SEMI, LEFT ANTI and FULL OUTER code-gen separately.
    
    The change extends the current sort merge join logic to work with LEFT OUTER and RIGHT OUTER (it should work with LEFT SEMI/ANTI as well, but FULL OUTER join needs more code changes). Left/right is replaced with streamed/buffered to make the code extendable to other join types besides inner join.
    
    Example query:
    
    ```
    val df1 = spark.range(10).select($"id".as("k1"), $"id".as("k3"))
    val df2 = spark.range(4).select($"id".as("k2"), $"id".as("k4"))
    df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2" && $"k3" + 1 < $"k4", "left_outer").explain("codegen")
    ```
    
    Example generated code:
    
    ```
    == Subtree 5 / 5 (maxMethodCodeSize:396; maxConstantPoolSize:159(0.24% used); numInnerClasses:0) ==
    *(5) SortMergeJoin [k1#2L], [k2#8L], LeftOuter, ((k3#3L + 1) < k4#9L)
    :- *(2) Sort [k1#2L ASC NULLS FIRST], false, 0
    :  +- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#26]
    :     +- *(1) Project [id#0L AS k1#2L, id#0L AS k3#3L]
    :        +- *(1) Range (0, 10, step=1, splits=2)
    +- *(4) Sort [k2#8L ASC NULLS FIRST], false, 0
       +- Exchange hashpartitioning(k2#8L, 5), ENSURE_REQUIREMENTS, [id=#32]
          +- *(3) Project [id#6L AS k2#8L, id#6L AS k4#9L]
             +- *(3) Range (0, 4, step=1, splits=2)
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage5(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=5
    /* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private scala.collection.Iterator smj_streamedInput_0;
    /* 010 */   private scala.collection.Iterator smj_bufferedInput_0;
    /* 011 */   private InternalRow smj_streamedRow_0;
    /* 012 */   private InternalRow smj_bufferedRow_0;
    /* 013 */   private long smj_value_2;
    /* 014 */   private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0;
    /* 015 */   private long smj_value_3;
    /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
    /* 017 */
    /* 018 */   public GeneratedIteratorForCodegenStage5(Object[] references) {
    /* 019 */     this.references = references;
    /* 020 */   }
    /* 021 */
    /* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 023 */     partitionIndex = index;
    /* 024 */     this.inputs = inputs;
    /* 025 */     smj_streamedInput_0 = inputs[0];
    /* 026 */     smj_bufferedInput_0 = inputs[1];
    /* 027 */
    /* 028 */     smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 2147483647);
    /* 029 */     smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(4, 0);
    /* 030 */
    /* 031 */   }
    /* 032 */
    /* 033 */   private boolean findNextJoinRows(
    /* 034 */     scala.collection.Iterator streamedIter,
    /* 035 */     scala.collection.Iterator bufferedIter) {
    /* 036 */     smj_streamedRow_0 = null;
    /* 037 */     int comp = 0;
    /* 038 */     while (smj_streamedRow_0 == null) {
    /* 039 */       if (!streamedIter.hasNext()) return false;
    /* 040 */       smj_streamedRow_0 = (InternalRow) streamedIter.next();
    /* 041 */       long smj_value_0 = smj_streamedRow_0.getLong(0);
    /* 042 */       if (false) {
    /* 043 */         if (!smj_matches_0.isEmpty()) {
    /* 044 */           smj_matches_0.clear();
    /* 045 */         }
    /* 046 */         return false;
    /* 047 */
    /* 048 */       }
    /* 049 */       if (!smj_matches_0.isEmpty()) {
    /* 050 */         comp = 0;
    /* 051 */         if (comp == 0) {
    /* 052 */           comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? -1 : 0);
    /* 053 */         }
    /* 054 */
    /* 055 */         if (comp == 0) {
    /* 056 */           return true;
    /* 057 */         }
    /* 058 */         smj_matches_0.clear();
    /* 059 */       }
    /* 060 */
    /* 061 */       do {
    /* 062 */         if (smj_bufferedRow_0 == null) {
    /* 063 */           if (!bufferedIter.hasNext()) {
    /* 064 */             smj_value_3 = smj_value_0;
    /* 065 */             return !smj_matches_0.isEmpty();
    /* 066 */           }
    /* 067 */           smj_bufferedRow_0 = (InternalRow) bufferedIter.next();
    /* 068 */           long smj_value_1 = smj_bufferedRow_0.getLong(0);
    /* 069 */           if (false) {
    /* 070 */             smj_bufferedRow_0 = null;
    /* 071 */             continue;
    /* 072 */           }
    /* 073 */           smj_value_2 = smj_value_1;
    /* 074 */         }
    /* 075 */
    /* 076 */         comp = 0;
    /* 077 */         if (comp == 0) {
    /* 078 */           comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0);
    /* 079 */         }
    /* 080 */
    /* 081 */         if (comp > 0) {
    /* 082 */           smj_bufferedRow_0 = null;
    /* 083 */         } else if (comp < 0) {
    /* 084 */           if (!smj_matches_0.isEmpty()) {
    /* 085 */             smj_value_3 = smj_value_0;
    /* 086 */             return true;
    /* 087 */           } else {
    /* 088 */             return false;
    /* 089 */           }
    /* 090 */         } else {
    /* 091 */           smj_matches_0.add((UnsafeRow) smj_bufferedRow_0);
    /* 092 */           smj_bufferedRow_0 = null;
    /* 093 */         }
    /* 094 */       } while (smj_streamedRow_0 != null);
    /* 095 */     }
    /* 096 */     return false; // unreachable
    /* 097 */   }
    /* 098 */
    /* 099 */   protected void processNext() throws java.io.IOException {
    /* 100 */     while (smj_streamedInput_0.hasNext()) {
    /* 101 */       findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0);
    /* 102 */       long smj_value_4 = -1L;
    /* 103 */       long smj_value_5 = -1L;
    /* 104 */       boolean smj_loaded_0 = false;
    /* 105 */       smj_value_5 = smj_streamedRow_0.getLong(1);
    /* 106 */       scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator();
    /* 107 */       boolean smj_foundMatch_0 = false;
    /* 108 */
    /* 109 */       // the last iteration of this loop is to emit an empty row if there is no matched rows.
    /* 110 */       while (smj_iterator_0.hasNext() || !smj_foundMatch_0) {
    /* 111 */         InternalRow smj_bufferedRow_1 = smj_iterator_0.hasNext() ?
    /* 112 */         (InternalRow) smj_iterator_0.next() : null;
    /* 113 */         boolean smj_isNull_5 = true;
    /* 114 */         long smj_value_9 = -1L;
    /* 115 */         if (smj_bufferedRow_1 != null) {
    /* 116 */           long smj_value_8 = smj_bufferedRow_1.getLong(1);
    /* 117 */           smj_isNull_5 = false;
    /* 118 */           smj_value_9 = smj_value_8;
    /* 119 */         }
    /* 120 */         if (smj_bufferedRow_1 != null) {
    /* 121 */           boolean smj_isNull_6 = true;
    /* 122 */           boolean smj_value_10 = false;
    /* 123 */           long smj_value_11 = -1L;
    /* 124 */
    /* 125 */           smj_value_11 = smj_value_5 + 1L;
    /* 126 */
    /* 127 */           if (!smj_isNull_5) {
    /* 128 */             smj_isNull_6 = false; // resultCode could change nullability.
    /* 129 */             smj_value_10 = smj_value_11 < smj_value_9;
    /* 130 */
    /* 131 */           }
    /* 132 */           if (smj_isNull_6 || !smj_value_10) {
    /* 133 */             continue;
    /* 134 */           }
    /* 135 */         }
    /* 136 */         if (!smj_loaded_0) {
    /* 137 */           smj_loaded_0 = true;
    /* 138 */           smj_value_4 = smj_streamedRow_0.getLong(0);
    /* 139 */         }
    /* 140 */         boolean smj_isNull_3 = true;
    /* 141 */         long smj_value_7 = -1L;
    /* 142 */         if (smj_bufferedRow_1 != null) {
    /* 143 */           long smj_value_6 = smj_bufferedRow_1.getLong(0);
    /* 144 */           smj_isNull_3 = false;
    /* 145 */           smj_value_7 = smj_value_6;
    /* 146 */         }
    /* 147 */         smj_foundMatch_0 = true;
    /* 148 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
    /* 149 */
    /* 150 */         smj_mutableStateArray_0[0].reset();
    /* 151 */
    /* 152 */         smj_mutableStateArray_0[0].zeroOutNullBytes();
    /* 153 */
    /* 154 */         smj_mutableStateArray_0[0].write(0, smj_value_4);
    /* 155 */
    /* 156 */         smj_mutableStateArray_0[0].write(1, smj_value_5);
    /* 157 */
    /* 158 */         if (smj_isNull_3) {
    /* 159 */           smj_mutableStateArray_0[0].setNullAt(2);
    /* 160 */         } else {
    /* 161 */           smj_mutableStateArray_0[0].write(2, smj_value_7);
    /* 162 */         }
    /* 163 */
    /* 164 */         if (smj_isNull_5) {
    /* 165 */           smj_mutableStateArray_0[0].setNullAt(3);
    /* 166 */         } else {
    /* 167 */           smj_mutableStateArray_0[0].write(3, smj_value_9);
    /* 168 */         }
    /* 169 */         append((smj_mutableStateArray_0[0].getRow()).copy());
    /* 170 */
    /* 171 */       }
    /* 172 */       if (shouldStop()) return;
    /* 173 */     }
    /* 174 */     ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] /* plan */).cleanupResources();
    /* 175 */   }
    /* 176 */
    /* 177 */ }
    ```
    
    ### Why are the changes needed?
    
    Improve query CPU performance. The example micro-benchmark below showed a 10% run-time improvement.
    
    ```
    def sortMergeJoinWithDuplicates(): Unit = {
        val N = 2 << 20
        codegenBenchmark("sort merge join with duplicates", N) {
          val df1 = spark.range(N)
            .selectExpr(s"(id * 15485863) % ${N*10} as k1", "id as k3")
          val df2 = spark.range(N)
            .selectExpr(s"(id * 15485867) % ${N*10} as k2", "id as k4")
          val df = df1.join(df2, col("k1") === col("k2") && col("k3") * 3 < col("k4"), "left_outer")
          assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
          df.noop()
        }
     }
    ```
    
    ```
    Running benchmark: sort merge join with duplicates
      Running case: sort merge join with duplicates outer-smj-codegen off
      Stopped after 2 iterations, 2696 ms
      Running case: sort merge join with duplicates outer-smj-codegen on
      Stopped after 5 iterations, 6058 ms
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
    Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
    sort merge join with duplicates:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------------------------------------------------
    sort merge join with duplicates outer-smj-codegen off           1333           1348          21          1.6         635.7       1.0X
    sort merge join with duplicates outer-smj-codegen on            1169           1212          47          1.8         557.4       1.1X
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit tests in `WholeStageCodegenSuite.scala`.
    
    Closes #32476 from c21/smj-outer-codegen.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    c21 authored and cloud-fan committed May 12, 2021
    Commit: 7bcaded
  13. [SPARK-35347][SQL][FOLLOWUP] Throw exception with an explicit excepti…

    …on type when cannot find the method instead of sys.error
    
    ### What changes were proposed in this pull request?
    
    A simple follow-up of #32474 to throw an exception instead of calling sys.error.
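    
    A minimal sketch of the pattern (the lookup method and exception type below are illustrative, not the exact call sites changed here):
    
    ```scala
    import java.lang.reflect.Method
    
    // Fail with a typed exception instead of sys.error when a method cannot be found.
    def findMethod(cls: Class[_], name: String): Method = {
      cls.getMethods.find(_.getName == name).getOrElse {
        // before: sys.error(s"Couldn't find method $name in ${cls.getName}")
        throw new NoSuchMethodException(s"Couldn't find method $name in ${cls.getName}")
      }
    }
    ```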
    
    ### Why are the changes needed?
    
    Throwing a typed exception only fails the current query and is clearer than the generic `sys.error`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. If `Invoke` or `StaticInvoke` cannot find the method, an exception with an explicit type is now thrown instead of the original `sys.error`.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32519 from viirya/SPARK-35347-followup.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    viirya committed May 12, 2021
    Commit: f156a95
  14. [SPARK-35387][INFRA] Increase the JVM stack size for Java 11 build test

    ### What changes were proposed in this pull request?
    
    After merging #32439, there is a flaky error from the GitHub Actions job "Java 11 build with Maven":
    
    ```
    Error:  ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
    java.lang.StackOverflowError
    scala.reflect.internal.Trees.itransform(Trees.scala:1376)
    scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
    scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
    scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
    scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
    scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
    ```
    We can resolve it by increasing the JVM stack size to 256MB. The container for GitHub Actions jobs has 7GB of memory, so this should be fine.
    
    ### Why are the changes needed?
    
    Fix flaky test failure in Java 11 build test
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    GitHub Actions test
    
    Closes #32521 from gengliangwang/increaseStackSize.
    
    Authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    gengliangwang authored and dongjoon-hyun committed May 12, 2021
    Commit: dac6f17
  15. [SPARK-35383][CORE] Improve s3a magic committer support by inferring …

    …missing configs
    
    ### What changes were proposed in this pull request?
    
    This PR aims to improve S3A magic committer support by inferring all missing configs from a single minimum configuration, `spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true`.
    
    Given that AWS S3 has provided [strong read-after-write consistency](https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/) since December 2020, we can ignore DynamoDB-related configurations. As a result, the minimum set of configurations is the following:
    
    ```
    spark.hadoop.fs.s3a.committer.magic.enabled=true
    spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true
    spark.hadoop.fs.s3a.committer.name=magic
    spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
    spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
    spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    ```
    
    ### Why are the changes needed?
    
    To use the S3A magic committer in Apache Spark, users need to set up a set of configurations. If anything is missing, the job ends up with error messages like the following.
    ```
    Exception in thread "main" org.apache.hadoop.fs.s3a.commit.PathCommitException:
    `s3a://my-spark-bucket`: Filesystem does not have support for 'magic' committer enabled in configuration option fs.s3a.committer.magic.enabled
    	at org.apache.hadoop.fs.s3a.commit.CommitUtils.verifyIsMagicCommitFS(CommitUtils.java:74)
    	at org.apache.hadoop.fs.s3a.commit.CommitUtils.getS3AFileSystem(CommitUtils.java:109)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. After this improvement, Spark users can enable the S3A magic committer with a single configuration.
    ```
    spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true
    ```
    
    This PR only infers the missing configurations, so there is no side effect for existing users who already have all the configurations set.
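    
    A minimal usage sketch (the bucket name `mybucket` is hypothetical):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // With the inference added here, the remaining magic-committer configs are
    // derived from this single per-bucket flag.
    val spark = SparkSession.builder()
      .appName("s3a-magic-committer-example")
      .config("spark.hadoop.fs.s3a.bucket.mybucket.committer.magic.enabled", "true")
      .getOrCreate()
    ```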
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test cases.
    
    Closes #32518 from dongjoon-hyun/SPARK-35383.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 12, 2021
    Commit: 77b7fe1
  16. [SPARK-35361][SQL][FOLLOWUP] Switch to use while loop

    ### What changes were proposed in this pull request?
    
    Switch to a plain `while` loop following the Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
    
    ### Why are the changes needed?
    
    A plain `while` loop may yield better performance compared to `foreach`.
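    
    For illustration, a minimal sketch of this kind of rewrite (a generic example, not the actual code changed here):
    
    ```scala
    // Summing an array: closure-based traversal vs. a plain while loop.
    val values = Array(1, 2, 3, 4)
    var sum = 0
    // values.foreach(v => sum += v)  // allocates a closure; each element goes through Function1
    var i = 0
    while (i < values.length) {       // no closure allocation on the hot path
      sum += values(i)
      i += 1
    }
    ```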
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes #32522 from sunchao/SPARK-35361-follow-up.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sunchao authored and dongjoon-hyun committed May 12, 2021
    Commit: bc95c3a
  17. [SPARK-35013][CORE] Don't allow to set spark.driver.cores=0

    ### What changes were proposed in this pull request?
    Currently Spark does not allow setting `spark.driver.memory`, `spark.executor.cores`, or `spark.executor.memory` to 0, but it does allow `spark.driver.cores` to be set to 0. This PR adds the same check for driver cores. Thanks to Oleg Lypkan for finding this.
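    
    A minimal sketch of the kind of check added (the exact validation site in Spark's config handling may differ):
    
    ```scala
    // Reject a non-positive driver core count, consistent with the existing checks
    // for spark.driver.memory, spark.executor.cores and spark.executor.memory.
    def validateDriverCores(conf: Map[String, String]): Unit = {
      val driverCores = conf.getOrElse("spark.driver.cores", "1").toInt
      require(driverCores > 0, "spark.driver.cores must be a positive number")
    }
    ```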
    
    ### Why are the changes needed?
    To make the configuration check consistent.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manual testing
    
    Closes #32504 from shahidki31/shahid/drivercore.
    
    Lead-authored-by: shahid <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Co-authored-by: Shahid <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    2 people authored and dongjoon-hyun committed May 12, 2021
    Commit: b3c916e
  18. [SPARK-35369][DOC] Document ExecutorAllocationManager metrics

    ### What changes were proposed in this pull request?
    This proposes to document the available metrics for ExecutorAllocationManager in the Spark monitoring documentation.
    
    ### Why are the changes needed?
    The ExecutorAllocationManager is instrumented with metrics using the Spark metrics system.
    The relevant work is in SPARK-7007 and SPARK-33763.
    ExecutorAllocationManager metrics are currently undocumented.
    
    ### Does this PR introduce _any_ user-facing change?
    This PR adds documentation only.
    
    ### How was this patch tested?
    N/A
    
    Closes #32500 from LucaCanali/followupMetricsDocSPARK33763.
    
    Authored-by: Luca Canali <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    LucaCanali authored and dongjoon-hyun committed May 12, 2021
    Commit: ae0579a

Commits on May 13, 2021

  1. [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related…

    … tests
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to skip the "q6", "q34", "q64", "q74", "q75", "q78" queries in the TPCDS-related tests because the TPCDS v2.7 queries have almost the same ones; the only differences in these queries are ORDER BY columns.
    
    ### Why are the changes needed?
    
    To improve test performance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev only.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32520 from maropu/SkipDupQueries.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Takeshi Yamamuro <[email protected]>
    maropu committed May 13, 2021
    Commit: 3241aeb
  2. [SPARK-35388][INFRA] Allow the PR source branch to include slashes

    ### What changes were proposed in this pull request?
    
    This PR allows the PR source branch to include slashes.
    
    ### Why are the changes needed?
    
    There are PRs whose source branches include slashes, like `issues/SPARK-35119/gha` here or #32523.
    
    Before the fix, the PR build fails in the `Sync the current branch with the latest in Apache Spark` phase.
    For example, at #32523, the source branch is `issues/SPARK-35382/nested_higher_order_functions`:
    
    ```
    ...
    fatal: couldn't find remote ref nested_higher_order_functions
    Error: Process completed with exit code 128.
    ```
    
    (https://github.com/ueshin/apache-spark/runs/2569356241)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this is a dev-only change.
    
    ### How was this patch tested?
    
    This PR source branch includes slashes and #32525 doesn't.
    
    Closes #32524 from ueshin/issues/SPARK-35119/gha.
    
    Authored-by: Takuya UESHIN <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and HyukjinKwon committed May 13, 2021
    Commit: c0b52da
  3. [SPARK-35384][SQL] Improve performance for InvokeLike.invoke

    ### What changes were proposed in this pull request?
    
    Change the `map` in `InvokeLike.invoke` to a while loop to improve performance, following the Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
    
    ### Why are the changes needed?
    
    `InvokeLike.invoke`, which is used in the non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments:
    ```scala
    val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
    if (needNullCheck && args.exists(_ == null)) {
      // return null if one of arguments is null
      null
    } else {
      ...
    ```
    which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.
    
    <img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png">
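    
    A self-contained sketch of the while-loop version (`evalAll` and its parameter types are illustrative stand-ins, not the actual `InvokeLike` code):
    
    ```scala
    // Evaluate all arguments in a single pass, tracking nulls without any
    // intermediate collection or closure allocation.
    def evalAll(exprs: Array[Int => AnyRef], input: Int, needNullCheck: Boolean): Array[AnyRef] = {
      val args = new Array[AnyRef](exprs.length)
      var i = 0
      var hasNull = false
      while (i < exprs.length) {
        args(i) = exprs(i)(input)
        if (args(i) == null) hasNull = true
        i += 1
      }
      if (needNullCheck && hasNull) null else args  // return null if any argument is null
    }
    ```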
    
    Benchmark results from `V2FunctionBenchmark` show this can improve performance by as much as 3x:
    
    Before
    ```
     OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
     Intel(R) Xeon(R) CPU E5-2673 v3  2.40GHz
     scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
     --------------------------------------------------------------------------------------------------------------------------------------------------------------
     native_long_add                                                                         36506          36656         251         13.7          73.0       1.0X
     java_long_add_default                                                                   47151          47540         370         10.6          94.3       0.8X
     java_long_add_magic                                                                    178691         182457        1327          2.8         357.4       0.2X
     java_long_add_static_magic                                                             177151         178258        1151          2.8         354.3       0.2X
    ```
    
    After
    ```
     OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
     Intel(R) Xeon(R) CPU E5-2673 v3  2.40GHz
     scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
     --------------------------------------------------------------------------------------------------------------------------------------------------------------
     native_long_add                                                                         29897          30342         568         16.7          59.8       1.0X
     java_long_add_default                                                                   40628          41075         664         12.3          81.3       0.7X
     java_long_add_magic                                                                     54553          54755         182          9.2         109.1       0.5X
     java_long_add_static_magic                                                              55410          55532         127          9.0         110.8       0.5X
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32527 from sunchao/SPARK-35384.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sunchao authored and dongjoon-hyun committed May 13, 2021
    Commit: 0ab9bd7
  4. [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataF…

    …rame functions in Python APIs
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the same issue as #32424.
    
    ```py
    from pyspark.sql.functions import flatten, struct, transform
    df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
    df.select(flatten(
        transform(
            "numbers",
            lambda number: transform(
                "letters",
                lambda letter: struct(number.alias("n"), letter.alias("l"))
            )
        )
    ).alias("zipped")).show(truncate=False)
    ```
    
    **Before:**
    
    ```
    +------------------------------------------------------------------------+
    |zipped                                                                  |
    +------------------------------------------------------------------------+
    |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
    +------------------------------------------------------------------------+
    ```
    
    **After:**
    
    ```
    +------------------------------------------------------------------------+
    |zipped                                                                  |
    +------------------------------------------------------------------------+
    |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
    +------------------------------------------------------------------------+
    ```
    
    ### Why are the changes needed?
    
    To produce the correct results.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it fixes the results to be correct as mentioned above.
    
    ### How was this patch tested?
    
    Added a unit test as well as manually.
    
    Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.
    
    Authored-by: Takuya UESHIN <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and HyukjinKwon committed May 13, 2021
    Commit: 17b59a9
  5. [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom …

    …file
    
    ### What changes were proposed in this pull request?
    
    This PR aims to unify two K8s version variables in two `pom.xml`s into one. `kubernetes-client.version` is correct because the artifact ID is `kubernetes-client`.
    
    ```
    kubernetes.client.version (kubernetes/core module)
    kubernetes-client.version (kubernetes/integration-test module)
    ```
    
    ### Why are the changes needed?
    
    Having two variables for the same value is confusing and inconvenient when we upgrade K8s versions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs. (Passing compilation is enough to verify this change.)
    
    Closes #32531 from dongjoon-hyun/SPARK-35394.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun committed May 13, 2021
    Commit: dd54649
  6. [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader

    ### What changes were proposed in this pull request?
    
    In yaooqinn/itachi#8, we had a discussion about the current extension injection for the Spark session. We agreed that the current way is not convenient for either third-party developers or end-users.
    
    It's much simpler if third-party developers can provide a resource file that contains default extensions for Spark to load ahead of time.
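    
    A sketch of what a third-party provider could look like under this change (assuming the `SparkSessionExtensionsProvider` interface introduced here; details may differ):
    
    ```scala
    import org.apache.spark.sql.{SparkSessionExtensions, SparkSessionExtensionsProvider}
    
    // Registered by listing the class name in a resource file such as
    // META-INF/services/org.apache.spark.sql.SparkSessionExtensionsProvider,
    // so Spark can discover it via ServiceLoader without any user configuration.
    class MyExtensionsProvider extends SparkSessionExtensionsProvider {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        // inject parser rules, optimizer rules, custom functions, etc.
      }
    }
    ```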
    
    ### Why are the changes needed?
    
    Better user experience.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only change.
    
    ### How was this patch tested?
    
    new tests
    
    Closes #32515 from yaooqinn/SPARK-35380.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn committed May 13, 2021
    Commit: 5181543
  7. [SPARK-35350][SQL] Add code-gen for left semi sort merge join

    ### What changes were proposed in this pull request?
    
    As title. This PR is to add code-gen support for LEFT SEMI sort merge join. The main change is to add a `semiJoin` code path in `SortMergeJoinExec.doProduce()` and introduce `onlyBufferFirstMatchedRow` in `SortMergeJoinExec.genScanner()`. The latter is for left semi sort merge join without a condition. For this kind of query, we don't need to buffer all matched rows, only the first one (this is the same as the non-code-gen code path).
    
    Example query:
    
    ```
    val df1 = spark.range(10).select($"id".as("k1"))
    val df2 = spark.range(4).select($"id".as("k2"))
    val oneJoinDF = df1.join(df2.hint("SHUFFLE_MERGE"), $"k1" === $"k2", "left_semi")
    ```
    
    Example of generated code for the query:
    
    ```
    == Subtree 5 / 5 (maxMethodCodeSize:302; maxConstantPoolSize:156(0.24% used); numInnerClasses:0) ==
    *(5) Project [id#0L AS k1#2L]
    +- *(5) SortMergeJoin [id#0L], [k2#6L], LeftSemi
       :- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(id#0L, 5), ENSURE_REQUIREMENTS, [id=#27]
       :     +- *(1) Range (0, 10, step=1, splits=2)
       +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0
          +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#33]
             +- *(3) Project [id#4L AS k2#6L]
                +- *(3) Range (0, 4, step=1, splits=2)
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage5(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=5
    /* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private scala.collection.Iterator smj_streamedInput_0;
    /* 010 */   private scala.collection.Iterator smj_bufferedInput_0;
    /* 011 */   private InternalRow smj_streamedRow_0;
    /* 012 */   private InternalRow smj_bufferedRow_0;
    /* 013 */   private long smj_value_2;
    /* 014 */   private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0;
    /* 015 */   private long smj_value_3;
    /* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
    /* 017 */
    /* 018 */   public GeneratedIteratorForCodegenStage5(Object[] references) {
    /* 019 */     this.references = references;
    /* 020 */   }
    /* 021 */
    /* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 023 */     partitionIndex = index;
    /* 024 */     this.inputs = inputs;
    /* 025 */     smj_streamedInput_0 = inputs[0];
    /* 026 */     smj_bufferedInput_0 = inputs[1];
    /* 027 */
    /* 028 */     smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(1, 2147483647);
    /* 029 */     smj_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 030 */     smj_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 031 */
    /* 032 */   }
    /* 033 */
    /* 034 */   private boolean findNextJoinRows(
    /* 035 */     scala.collection.Iterator streamedIter,
    /* 036 */     scala.collection.Iterator bufferedIter) {
    /* 037 */     smj_streamedRow_0 = null;
    /* 038 */     int comp = 0;
    /* 039 */     while (smj_streamedRow_0 == null) {
    /* 040 */       if (!streamedIter.hasNext()) return false;
    /* 041 */       smj_streamedRow_0 = (InternalRow) streamedIter.next();
    /* 042 */       long smj_value_0 = smj_streamedRow_0.getLong(0);
    /* 043 */       if (false) {
    /* 044 */         smj_streamedRow_0 = null;
    /* 045 */         continue;
    /* 046 */
    /* 047 */       }
    /* 048 */       if (!smj_matches_0.isEmpty()) {
    /* 049 */         comp = 0;
    /* 050 */         if (comp == 0) {
    /* 051 */           comp = (smj_value_0 > smj_value_3 ? 1 : smj_value_0 < smj_value_3 ? -1 : 0);
    /* 052 */         }
    /* 053 */
    /* 054 */         if (comp == 0) {
    /* 055 */           return true;
    /* 056 */         }
    /* 057 */         smj_matches_0.clear();
    /* 058 */       }
    /* 059 */
    /* 060 */       do {
    /* 061 */         if (smj_bufferedRow_0 == null) {
    /* 062 */           if (!bufferedIter.hasNext()) {
    /* 063 */             smj_value_3 = smj_value_0;
    /* 064 */             return !smj_matches_0.isEmpty();
    /* 065 */           }
    /* 066 */           smj_bufferedRow_0 = (InternalRow) bufferedIter.next();
    /* 067 */           long smj_value_1 = smj_bufferedRow_0.getLong(0);
    /* 068 */           if (false) {
    /* 069 */             smj_bufferedRow_0 = null;
    /* 070 */             continue;
    /* 071 */           }
    /* 072 */           smj_value_2 = smj_value_1;
    /* 073 */         }
    /* 074 */
    /* 075 */         comp = 0;
    /* 076 */         if (comp == 0) {
    /* 077 */           comp = (smj_value_0 > smj_value_2 ? 1 : smj_value_0 < smj_value_2 ? -1 : 0);
    /* 078 */         }
    /* 079 */
    /* 080 */         if (comp > 0) {
    /* 081 */           smj_bufferedRow_0 = null;
    /* 082 */         } else if (comp < 0) {
    /* 083 */           if (!smj_matches_0.isEmpty()) {
    /* 084 */             smj_value_3 = smj_value_0;
    /* 085 */             return true;
    /* 086 */           } else {
    /* 087 */             smj_streamedRow_0 = null;
    /* 088 */           }
    /* 089 */         } else {
    /* 090 */           if (smj_matches_0.isEmpty()) {
    /* 091 */             smj_matches_0.add((UnsafeRow) smj_bufferedRow_0);
    /* 092 */           }
    /* 093 */
    /* 094 */           smj_bufferedRow_0 = null;
    /* 095 */         }
    /* 096 */       } while (smj_streamedRow_0 != null);
    /* 097 */     }
    /* 098 */     return false; // unreachable
    /* 099 */   }
    /* 100 */
    /* 101 */   protected void processNext() throws java.io.IOException {
    /* 102 */     while (findNextJoinRows(smj_streamedInput_0, smj_bufferedInput_0)) {
    /* 103 */       long smj_value_4 = -1L;
    /* 104 */       smj_value_4 = smj_streamedRow_0.getLong(0);
    /* 105 */       scala.collection.Iterator<UnsafeRow> smj_iterator_0 = smj_matches_0.generateIterator();
    /* 106 */       boolean smj_hasOutputRow_0 = false;
    /* 107 */
    /* 108 */       while (!smj_hasOutputRow_0 && smj_iterator_0.hasNext()) {
    /* 109 */         InternalRow smj_bufferedRow_1 = (InternalRow) smj_iterator_0.next();
    /* 110 */
    /* 111 */         smj_hasOutputRow_0 = true;
    /* 112 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
    /* 113 */
    /* 114 */         // common sub-expressions
    /* 115 */
    /* 116 */         smj_mutableStateArray_0[1].reset();
    /* 117 */
    /* 118 */         smj_mutableStateArray_0[1].write(0, smj_value_4);
    /* 119 */         append((smj_mutableStateArray_0[1].getRow()).copy());
    /* 120 */
    /* 121 */       }
    /* 122 */       if (shouldStop()) return;
    /* 123 */     }
    /* 124 */     ((org.apache.spark.sql.execution.joins.SortMergeJoinExec) references[1] /* plan */).cleanupResources();
    /* 125 */   }
    /* 126 */
    /* 127 */ }
    ```
    
    ### Why are the changes needed?
    
    Improve query CPU performance. Test with one query:
    
    ```
     def sortMergeJoin(): Unit = {
        val N = 2 << 20
        codegenBenchmark("left semi sort merge join", N) {
          val df1 = spark.range(N).selectExpr(s"id * 2 as k1")
          val df2 = spark.range(N).selectExpr(s"id * 3 as k2")
          val df = df1.join(df2, col("k1") === col("k2"), "left_semi")
          assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
          df.noop()
        }
      }
    ```
    
    Seeing a 30% run-time improvement:
    
    ```
    Running benchmark: left semi sort merge join
      Running case: left semi sort merge join code-gen off
      Stopped after 2 iterations, 1369 ms
      Running case: left semi sort merge join code-gen on
      Stopped after 5 iterations, 2743 ms
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
    Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
    left semi sort merge join:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------------------------------
    left semi sort merge join code-gen off              676            685          13          3.1         322.2       1.0X
    left semi sort merge join code-gen on               524            549          32          4.0         249.7       1.3X
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit test in `WholeStageCodegenSuite.scala` and `ExistenceJoinSuite.scala`.
    
    Closes #32528 from c21/smj-left-semi.
    
    Authored-by: Cheng Su <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    c21 authored and cloud-fan committed May 13, 2021
    Commit: c1e995a
  8. [SPARK-34720][SQL] MERGE ... UPDATE/INSERT * should do by-name resolu…

    …tion
    
    ### What changes were proposed in this pull request?
    
    In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from the ANSI standard or any other mainstream database, so we need to define the behavior ourselves.
    
    The behavior today is very weird: assume the source table has `n1` columns and the target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from the source and target tables and pairing them by ordinal.
    
    This PR proposes a more reasonable behavior: take all the columns from the target table as keys, and find the corresponding columns from the source table by name as values.
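    
    For illustration, suppose the target table has columns `(a, b, c)` and the source table has the same columns in a different order, `(c, b, a)` (hypothetical tables):
    
    ```scala
    // Hypothetical example, run against a MERGE-capable data source.
    spark.sql("""
      MERGE INTO target t
      USING source s
      ON t.a = s.a
      WHEN MATCHED THEN UPDATE SET *
    """)
    // Old behavior: pair the first min(n1, n2) columns by ordinal:
    //   t.a := s.c, t.b := s.b, t.c := s.a
    // New behavior: resolve the source columns by target column name:
    //   t.a := s.a, t.b := s.b, t.c := s.c
    ```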
    
    ### Why are the changes needed?
    
    Fix MERGE INSERT/UPDATE * to be more user-friendly and to make schema evolution easier.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but MERGE is only supported by very few data sources.
    
    ### How was this patch tested?
    
    new tests
    
    Closes #32192 from cloud-fan/merge.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan committed May 13, 2021
    Commit: d1b8bd7
  9. [SPARK-34637][SQL] Support DPP + AQE when the broadcast exchange can …

    …be reused
    
    ### What changes were proposed in this pull request?
    We have supported DPP in AQE when the join is a broadcast hash join, before applying the AQE rules, in [SPARK-34168](https://issues.apache.org/jira/browse/SPARK-34168), but with some limitations: DPP is only applied when the small table side executes first, so that the big table side can reuse the small table side's broadcast exchange. This PR addresses the above limitation and applies DPP whenever the broadcast exchange can be reused.
    
    ### Why are the changes needed?
    Resolve the limitations when both DPP and AQE are enabled.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added a new unit test.
    
    Closes #31756 from JkSelf/supportDPP2.
    
    Authored-by: jiake <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    JkSelf authored and cloud-fan committed May 13, 2021
    Commit: b6d57b6
  10. [SPARK-35392][ML][PYTHON] Fix flaky tests in ml/clustering.py and ml/…

    …feature.py
    
    ### What changes were proposed in this pull request?
    
    This PR removes the check of `summary.logLikelihood` in ml/clustering.py since this GMM test is quite flaky. It fails easily, e.g., if we:
    - change number of partitions;
    - just change the way to compute the sum of weights;
    - change the underlying BLAS impl
    
    It also uses a more permissive precision in the `Word2Vec` test case.
    
    ### Why are the changes needed?
    
    To recover the build and tests.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing test cases.
    
    Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.
    
    Lead-authored-by: Ruifeng Zheng <[email protected]>
    Co-authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng and HyukjinKwon committed May 13, 2021
    Commit: f7704ec
  11. [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

    ### What changes were proposed in this pull request?
    
    `./build/mvn` now fetches the .sha512 checksum for each Maven artifact it downloads, and verifies the checksum after the download completes.
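    
    The script itself is bash, but the check it performs is straightforward; a hedged Scala equivalent of the logic, for illustration:
    
    ```scala
    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest
    
    // Hash the downloaded artifact and compare it with the published .sha512 value.
    def sha512Matches(artifactPath: String, expectedHex: String): Boolean = {
      val bytes = Files.readAllBytes(Paths.get(artifactPath))
      val digest = MessageDigest.getInstance("SHA-512").digest(bytes)
      val actualHex = digest.map("%02x".format(_)).mkString
      actualHex == expectedHex.trim.toLowerCase
    }
    ```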
    
    ### Why are the changes needed?
    
    This ensures the integrity of the Maven artifact, which may come from one of several non-ASF mirrors, during a user's build.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Should not affect anything about Spark per se, just the build.
    
    ### How was this patch tested?
    
    Manual testing wherein I forced the Maven/Scala download, verified that checksums are downloaded and checked, and verified that it fails with an error on a corrupted checksum.
    
    Closes #32505 from srowen/SPARK-35373.
    
    Authored-by: Sean Owen <[email protected]>
    Signed-off-by: Sean Owen <[email protected]>
    srowen committed May 13, 2021
    Commit: 6c5fcac
  12. [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE

    ### What changes were proposed in this pull request?
    
    Add New SQL functions:
    * TRY_ADD
    * TRY_DIVIDE
    
    These expressions are identical to the following expressions under ANSI mode, except that they return null if an error occurs:
    * ADD
    * DIVIDE
    
    Note: it is easy to add other expressions like `TRY_SUBTRACT`/`TRY_MULTIPLY` but let's control the number of these new expressions and just add `TRY_ADD` and `TRY_DIVIDE` for now.
    
    ### Why are the changes needed?
    
    1. Users can finish queries without interruptions in ANSI mode.
    2. Users can get NULLs instead of unreasonable results if overflow occurs when ANSI mode is off.
    For example, the behavior of the following SQL operation is unreasonable:
    ```
    2147483647 + 2 => -2147483647
    ```
    
    With the new safe version SQL functions:
    ```
    TRY_ADD(2147483647, 2) => null
    ```
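    
    A quick usage sketch from the Scala side (behavior as described above):
    
    ```scala
    // Returns null instead of the wrapped-around result (or an ANSI-mode error).
    spark.sql("SELECT TRY_ADD(2147483647, 2)").show()
    // Division by zero likewise yields null rather than failing the query.
    spark.sql("SELECT TRY_DIVIDE(1, 0)").show()
    ```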
    
    Note: **We should only add new expressions to important operators, instead of adding new safe expressions for all the expressions that can throw errors.**
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, new SQL functions: TRY_ADD/TRY_DIVIDE
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #32292 from gengliangwang/try_add.
    
    Authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    gengliangwang committed May 13, 2021
    Commit: 02c99f1
  13. [SPARK-35332][SQL] Make cache plan disable configs configurable

    ### What changes were proposed in this pull request?
    
    Add a new config to make cache plan disable configs configurable.
    
    ### Why are the changes needed?
    
    The configs disabled when caching a plan exist to avoid performance regressions, but not every query runs slower than before with AQE or bucketed scan enabled. It's useful to add a new config so that users can decide which configs should be disabled when caching a plan.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a new config.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes #32482 from ulysses-you/SPARK-35332.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    ulysses-you authored and cloud-fan committed May 13, 2021
    Commit: 6f63057
  14. [SPARK-35062][SQL] Group exception messages in sql/streaming

    ### What changes were proposed in this pull request?
    This PR groups exception messages in `sql/core/src/main/scala/org/apache/spark/sql/streaming`.
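    
    The grouping pattern, sketched with illustrative object and method names (not the exact ones in this PR):
    
    ```scala
    // Centralize error construction in one object instead of building exceptions
    // inline at each call site, so wording stays consistent and easy to maintain.
    object StreamingQueryErrors {
      def sourceNotSupportedError(source: String): Throwable =
        new UnsupportedOperationException(s"Data source $source does not support streamed reading")
    }
    ```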
    
    ### Why are the changes needed?
    It will largely help with the standardization of error messages and their maintenance.
    
    ### Does this PR introduce _any_ user-facing change?
    No. Error messages remain unchanged.
    
    ### How was this patch tested?
    No new tests - pass all original tests to make sure it doesn't break any existing behavior.
    
    Closes #32464 from beliefer/SPARK-35062.
    
    Lead-authored-by: gengjiaan <[email protected]>
    Co-authored-by: Jiaan Geng <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and cloud-fan committed May 13, 2021
    Commit: c2e15cc
  15. [SPARK-35366][SQL] Avoid using deprecated buildForBatch and `buildF…

    …orStreaming`
    
    ### What changes were proposed in this pull request?
    Currently, in DSv2, we are still using the deprecated `buildForBatch` and `buildForStreaming`.
    This PR implements the `build`, `toBatch`, and `toStreaming` interfaces to replace the deprecated ones.
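    
    A hedged sketch of the non-deprecated write path (assuming the DSv2 interfaces in `org.apache.spark.sql.connector.write`; simplified):
    
    ```scala
    import org.apache.spark.sql.connector.write.{BatchWrite, Write, WriteBuilder}
    import org.apache.spark.sql.connector.write.streaming.StreamingWrite
    
    // Instead of overriding the deprecated buildForBatch/buildForStreaming on
    // WriteBuilder, a source returns a Write and exposes the two modes from it.
    class MyWriteBuilder(batch: BatchWrite, streaming: StreamingWrite) extends WriteBuilder {
      override def build(): Write = new Write {
        override def toBatch: BatchWrite = batch
        override def toStreaming: StreamingWrite = streaming
      }
    }
    ```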
    
    ### Why are the changes needed?
    Code refactor
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UTs
    
    Closes #32497 from linhongliu-db/dsv2-writer.
    
    Lead-authored-by: Linhong Liu <[email protected]>
    Co-authored-by: Linhong Liu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and cloud-fan committed May 13, 2021
    Commit: 6aa2594
  16. [SPARK-35393][PYTHON][INFRA][TESTS] Recover pip packaging test in Git…

    …hub Actions
    
    ### What changes were proposed in this pull request?
    
    Currently, the pip packaging test is being skipped:
    
    ```
    ========================================================================
    Running PySpark packaging tests
    ========================================================================
    Constructing virtual env for testing
    Missing virtualenv & conda, skipping pip installability tests
    Cleaning up temporary directory - /tmp/tmp.iILYWISPXW
    ```
    
    See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true
    
    GitHub Actions' image has its default Conda installed at `/usr/share/miniconda`, but it seems the image we're using for PySpark does not have it (which is legitimate).
    
    This PR proposes to install Conda for use in the pip packaging tests in GitHub Actions.
    
    ### Why are the changes needed?
    
    To recover the test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    It was tested in my fork: https://github.com/HyukjinKwon/spark/runs/2575126882?check_suite_focus=true
    
    ```
    ========================================================================
    Running PySpark packaging tests
    ========================================================================
    Constructing virtual env for testing
    Using conda virtual environments
    Testing pip installation with python 3.6
    Using /tmp/tmp.qPjTenqfGn for virtualenv
    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    Collecting package metadata (repodata.json): ...working... done
    Solving environment: ...working... done
    
    ## Package Plan ##
    
      environment location: /tmp/tmp.qPjTenqfGn/3.6
    
      added / updated specs:
        - numpy
        - pandas
        - pip
        - python=3.6
        - setuptools
    
    ...
    
    Successfully ran pip sanity check
    ```
    
    Closes #32537 from HyukjinKwon/SPARK-35393.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and dongjoon-hyun committed May 13, 2021
    Commit: 7d371d2
  17. [SPARK-35397][SQL] Replace sys.err usage with explicit exception type

    ### What changes were proposed in this pull request?
    
    This patch replaces `sys.err` usages with explicit exception types.
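
    For illustration, a minimal sketch of the pattern (example function, not from the PR's diff); Scala's actual method is `sys.error`:

    ```scala
    // Before: sys.error throws a bare RuntimeException with no type information.
    def widthOf(dataType: String): Int = dataType match {
      case "int"  => 4
      case "long" => 8
      case other  => sys.error(s"Unsupported data type: $other")
    }

    // After: an explicit exception type documents the failure mode at the call site.
    def widthOfExplicit(dataType: String): Int = dataType match {
      case "int"  => 4
      case "long" => 8
      case other  => throw new IllegalArgumentException(s"Unsupported data type: $other")
    }
    ```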
    
    ### Why are the changes needed?
    
    Motivated by the previous comment #32519 (comment), it sounds better to replace `sys.err` usages with explicit exception types.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32535 from viirya/replace-sys-err.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    viirya authored and dongjoon-hyun committed May 13, 2021
    Commit: 6a949d1
  18. [SPARK-34764][CORE][K8S][UI] Propagate reason for exec loss to Web UI

    ### What changes were proposed in this pull request?
    
    Adds the exec loss reason to the Spark web UI and, in doing so, also fixes the Kubernetes integration to pass the exec loss reason into core.
    
    UI change:
    
    ![image](https://user-images.githubusercontent.com/59893/117045762-b975ba80-acc4-11eb-9679-8edab3cfadc2.png)
    
    ### Why are the changes needed?
    
    Debugging Spark jobs is *hard*; making it clearer why executors have exited could help.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a new column on the executors page.
    
    ### How was this patch tested?
    
    The K8s unit test was updated to validate that exec loss reasons are passed through regardless of exec alive state, plus manual testing to validate the UI.
    
    Closes #32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss.
    
    Lead-authored-by: Holden Karau <[email protected]>
    Co-authored-by: Holden Karau <[email protected]>
    Signed-off-by: Holden Karau <[email protected]>
    holdenk and holdenk committed May 13, 2021
    Commit: 160b3be

Commits on May 14, 2021

  1. [SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec

    ### What changes were proposed in this pull request?
    
    This PR intends to split the generated switch code into smaller methods in `ExpandExec`. In the current master, even a simple query like the one below generates a large method whose size (`maxMethodCodeSize:7448`) is close to `8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`):
    ```
    scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
    scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
    scala> sql("SET spark.sql.adaptive.enabled=false")
    scala> import org.apache.spark.sql.execution.debug._
    scala> rdf.debugCodegen
    
    Found 2 WholeStageCodegen subtrees.
    == Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==
                                        ^^^^
    *(1) Project [window#34.start AS _gen_alias_39#39, value#11]
    +- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
       +- *(1) Expand [List(named_struct(start, precisetimestampcon...
    
    /* 028 */   private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException {
    /* 029 */     boolean expand_isNull_0 = true;
    /* 030 */     InternalRow expand_value_0 =
    /* 031 */     null;
    /* 032 */     for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) {
    /* 033 */       switch (expand_i_0) {
    /* 034 */       case 0:
                      (too many code lines)
    /* 517 */         break;
    /* 518 */
    /* 519 */       case 1:
                      (too many code lines)
    /* 1002 */         break;
    /* 1003 */
    /* 1004 */       case 2:
                      (too many code lines)
    /* 1487 */         break;
    /* 1488 */
    /* 1489 */       case 3:
                      (too many code lines)
    /* 1972 */         break;
    /* 1973 */       }
    /* 1974 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] /* numOutputRows */).add(1);
    /* 1975 */
    /* 1976 */       do {
    /* 1977 */         boolean filter_value_2 = !expand_isNull_0;
    /* 1978 */         if (!filter_value_2) continue;
    ```
    The fix in this PR makes the method smaller, as follows:
    ```
    Found 2 WholeStageCodegen subtrees.
    == Subtree 1 / 2 (maxMethodCodeSize:1713; maxConstantPoolSize:210(0.32% used); numInnerClasses:0) ==
                                        ^^^^
    *(1) Project [window#17.start AS _gen_alias_32#32, value#11]
    +- *(1) Filter ((isnotnull(window#17) AND (cast(time#10 as timestamp) >= window#17.start)) AND (cast(time#10 as timestamp) < window#17.end))
       +- *(1) Expand [List(named_struct(start, precisetimestampcon...
    
    /* 032 */   private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException {
    /* 033 */     for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) {
    /* 034 */       switch (expand_i_0) {
    /* 035 */       case 0:
    /* 036 */         expand_switchCaseCode_0(expand_exprIsNull_0_0, expand_expr_0_0);
    /* 037 */         break;
    /* 038 */
    /* 039 */       case 1:
    /* 040 */         expand_switchCaseCode_1(expand_exprIsNull_0_0, expand_expr_0_0);
    /* 041 */         break;
    /* 042 */
    /* 043 */       case 2:
    /* 044 */         expand_switchCaseCode_2(expand_exprIsNull_0_0, expand_expr_0_0);
    /* 045 */         break;
    /* 046 */
    /* 047 */       case 3:
    /* 048 */         expand_switchCaseCode_3(expand_exprIsNull_0_0, expand_expr_0_0);
    /* 049 */         break;
    /* 050 */       }
    /* 051 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] /* numOutputRows */).add(1);
    /* 052 */
    /* 053 */       do {
    /* 054 */         boolean filter_value_2 = !expand_resultIsNull_0;
    /* 055 */         if (!filter_value_2) continue;
    /* 056 */
    ...
    ```
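
    The splitting itself can be sketched with `CodegenContext.addNewFunction` (a minimal, illustrative sketch of the technique with placeholder case bodies, not the exact `ExpandExec` diff):

    ```scala
    import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext

    // Each oversized case body is registered as its own private method, and
    // the switch statement only dispatches to those methods.
    val ctx = new CodegenContext
    val caseBodies = Seq("/* case 0 body */", "/* case 1 body */") // placeholders
    val dispatchCases = caseBodies.zipWithIndex.map { case (body, i) =>
      val fn = ctx.addNewFunction(
        s"expand_switchCaseCode_$i",
        s"private void expand_switchCaseCode_$i() { $body }")
      s"case $i: $fn(); break;"
    }.mkString("\n")
    ```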
    
    ### Why are the changes needed?
    
    For better generated code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    GA passed.
    
    Closes #32457 from maropu/splitSwitchCode.
    
    Authored-by: Takeshi Yamamuro <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    maropu authored and viirya committed May 14, 2021
    Commit: 8fa739f
  2. [SPARK-35311][SS][UI][DOCS] Structured Streaming Web UI state informa…

    …tion documentation
    
    ### What changes were proposed in this pull request?
    In this PR I'm adding Structured Streaming Web UI state information documentation.
    
    ### Why are the changes needed?
    Missing documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    ```
    cd docs/
    SKIP_API=1 bundle exec jekyll build
    ```
    Manual webpage check.
    
    Closes #32433 from gaborgsomogyi/SPARK-35311.
    
    Authored-by: Gabor Somogyi <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    gaborgsomogyi authored and HeartSaVioR committed May 14, 2021
    Commit: b6a0a7e
  3. [SPARK-34764][UI][FOLLOW-UP] Fix indentation and missing arguments fo…

    …r JavaScript linter
    
    ### What changes were proposed in this pull request?
    
    This PR is a followup of #32436, which broke the JavaScript linter. There was a logical conflict: the linter was added after the last successful test run in that PR.
    
    ```
    added 118 packages in 1.482s
    
    /__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
       34:41  error  'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u  no-unused-vars
       34:47  error  'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u   no-unused-vars
       35:1   error  Expected indentation of 2 spaces but found 4                                      indent
       36:1   error  Expected indentation of 4 spaces but found 7                                      indent
       37:1   error  Expected indentation of 2 spaces but found 4                                      indent
       38:1   error  Expected indentation of 4 spaces but found 7                                      indent
       39:1   error  Expected indentation of 2 spaces but found 4                                      indent
      556:1   error  Expected indentation of 14 spaces but found 16                                    indent
      557:1   error  Expected indentation of 14 spaces but found 16                                    indent
    ```
    
    ### Why are the changes needed?
    
    To recover the build
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested:
    
    ```bash
     ./dev/lint-js
    lint-js checks passed.
    ```
    
    Closes #32541 from HyukjinKwon/SPARK-34764-followup.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Kousuke Saruta <[email protected]>
    HyukjinKwon authored and sarutak committed May 14, 2021
    Commit: f7af9ab
  4. [SPARK-35207][SQL] Normalize hash function behavior with negative zer…

    …o (floating point types)
    
    ### What changes were proposed in this pull request?
    
    Generally, we would expect that x = y => hash(x) = hash(y). However, +0.0 and -0.0 hash to different values for floating point types.
    ```
    scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
    +-------------------------+--------------------------+
    |hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
    +-------------------------+--------------------------+
    |              -1670924195|                -853646085|
    +-------------------------+--------------------------+
    scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
    +--------------------------------------------+
    |(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
    +--------------------------------------------+
    |                                        true|
    +--------------------------------------------+
    ```
    Here is an extract from IEEE 754:
    
    > The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases
    
    From this, I deduce that the hash function must produce the same result for 0 and -0.
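
    A minimal sketch of the normalization idea (illustrative, not Spark's exact code):

    ```scala
    // Map -0.0 to +0.0 before hashing so that equal values hash equally.
    // Note 0.0 == -0.0 under IEEE 754, so the branch also matches +0.0,
    // which is harmless: both zeros normalize to +0.0.
    def normalize(d: Double): Double = if (d == -0.0d) 0.0d else d
    def normalize(f: Float): Float = if (f == -0.0f) 0.0f else f
    ```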
    
    ### Why are the changes needed?
    
    It is a correctness issue
    
    ### Does this PR introduce _any_ user-facing change?
    
    This change only affects the hash function applied to the -0.0 value in float and double types
    
    ### How was this patch tested?
    
    Unit testing and manual testing
    
    Closes #32496 from planga82/feature/spark35207_hashnegativezero.
    
    Authored-by: Pablo Langa <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    planga82 authored and cloud-fan committed May 14, 2021
    Commit: 9ea55fe
  5. [MINOR][DOC] ADD toc for monitoring page

    ### What changes were proposed in this pull request?
    
    Add a toc tag to monitoring.md
    
    ### Why are the changes needed?
    
    fix doc
    
    ### Does this PR introduce _any_ user-facing change?
    
    yes, the table of contents of the monitoring page will be shown on the official doc site.
    
    ### How was this patch tested?
    
    pass GA doc build
    
    Closes #32545 from yaooqinn/minor.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn committed May 14, 2021
    Commit: d424771
  6. [SPARK-35332][SQL][FOLLOWUP] Refine wrong comment

    ### What changes were proposed in this pull request?
    
    Refine comment in `CacheManager`.
    
    ### Why are the changes needed?
    
    Avoid misleading developers.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Not needed.
    
    Closes #32543 from ulysses-you/SPARK-35332-FOLLOWUP.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    ulysses-you authored and yaooqinn committed May 14, 2021
    Commit: 6218bc5
  7. [SPARK-35404][CORE] Name the timers in TaskSchedulerImpl

    ### What changes were proposed in this pull request?
    
    make these threads easier to identify in thread dumps
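
    A minimal sketch of the idea (the timer name is illustrative):

    ```scala
    import java.util.Timer

    // An unnamed Timer's thread shows up in a thread dump as "Timer-0";
    // passing a name makes it identifiable.
    val speculationTimer = new Timer("task-scheduler-speculation", true)
    ```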
    
    ### Why are the changes needed?
    
    make these threads easier to identify in thread dumps
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Driver thread dumps will show the timers with pretty names
    
    ### How was this patch tested?
    
    verified locally
    
    Closes #32549 from yaooqinn/SPARK-35404.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    yaooqinn authored and HyukjinKwon committed May 14, 2021
    Commit: 68239d1
  8. [SPARK-35206][TESTS][SQL] Extract common used get project path into a…

    … function in SparkFunctionSuite
    
    ### What changes were proposed in this pull request?
    
    Add a common function `getWorkspaceFilePath` (which resolves paths relative to the Spark home) to `SparkFunctionSuite`, and apply the function to the places it was extracted from.
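
    A minimal sketch of such a helper (the signature and property names are assumed from the description, not copied from the PR):

    ```scala
    import java.nio.file.{Path, Paths}

    // Resolve a path under the Spark source tree, preferring the test-time
    // system property and falling back to the SPARK_HOME environment variable.
    def getWorkspaceFilePath(first: String, more: String*): Path = {
      val sparkHome = sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
      Paths.get(sparkHome, (first +: more): _*)
    }

    // e.g. getWorkspaceFilePath("sql", "core", "src", "test", "resources")
    ```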
    
    ### Why are the changes needed?
    
    Spark SQL has test suites that read resources when running tests. The way of getting the resource path is common across different suites, so we can extract it into a function to ease code maintenance.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass existing tests.
    
    Closes #32315 from Ngone51/extract-common-file-path.
    
    Authored-by: yi.wu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    Ngone51 authored and cloud-fan committed May 14, 2021
    Commit: 94bd480
  9. [SPARK-35405][DOC] Submitting Applications documentation has outdated…

    … information about K8s client mode support
    
    ### What changes were proposed in this pull request?
    [Submitting Applications doc](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) has outdated information about K8s client mode support.
    It still says "Client mode is currently unsupported and will be supported in future releases".
    ![image](https://user-images.githubusercontent.com/31073930/118268920-b5b51580-b4c6-11eb-8eed-975be8d37964.png)
    
    Whereas it's already supported, and the [Running Spark on Kubernetes doc](https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode) says that it has been supported since 2.4.0 and has all the needed information.
    ![image](https://user-images.githubusercontent.com/31073930/118268947-bd74ba00-b4c6-11eb-98d5-37961327642f.png)
    
    Changes:
    ![image](https://user-images.githubusercontent.com/31073930/118269179-12b0cb80-b4c7-11eb-8a37-d9d301bbda53.png)
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-35405
    
    ### Why are the changes needed?
    Outdated information in the doc is misleading
    
    ### Does this PR introduce _any_ user-facing change?
    Documentation changes
    
    ### How was this patch tested?
    Documentation changes
    
    Closes #32551 from o-shevchenko/SPARK-35405.
    
    Authored-by: Oleksandr Shevchenko <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    o-shevchenko authored and dongjoon-hyun committed May 14, 2021
    Commit: d2fbf0d
  10. [SPARK-35384][SQL][FOLLOWUP] Move HashMap.get out of `InvokeLike.in…

    …voke`
    
    ### What changes were proposed in this pull request?
    
    Move hash map lookup operation out of `InvokeLike.invoke` since it doesn't depend on the input.
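
    A minimal sketch of the hoisting pattern (names illustrative; the real change concerns a method-lookup map in `InvokeLike`):

    ```scala
    import java.lang.reflect.Method

    // Resolve the reflective method once at construction time, instead of
    // looking it up inside invoke() for every input row.
    class InvokeLikeSketch(cls: Class[_], methodName: String) {
      private val method: Method = cls.getMethods.find(_.getName == methodName).get

      def invoke(obj: AnyRef, args: Array[AnyRef]): AnyRef =
        method.invoke(obj, args: _*) // hot path: no lookup work per row
    }
    ```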
    
    ### Why are the changes needed?
    
    We shouldn't need to look up the hash map for every input row evaluated by `InvokeLike.invoke`, since the lookup doesn't depend on the input. This could speed up performance a bit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests.
    
    Closes #32532 from sunchao/SPARK-35384-follow-up.
    
    Authored-by: Chao Sun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    sunchao authored and dongjoon-hyun committed May 14, 2021
    Commit: a8032e7