
Spark3 diff on top of tag 3.0.1 #735

Closed
wants to merge 17 commits

Conversation

jdcasale

@jdcasale jdcasale commented Feb 19, 2021

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

This is our entire diff post-spark3-upgrade, chunked into logical components where each commit builds successfully.
A diff this big is really hard to review, but reviewing the diff against the current pt/master is even less tractable, so I believe the way to go here is to review each commit individually.

The first commit (8f1e51e) sets up our palantir-hadoop and palantir-parquet dependencies and contains many of the eccentricities associated with getting Spark to work with Foundry. This could perhaps be broken up further; if you have any opinions on that, please let me know.

The second commit (8f1e51e) is basically #381 and related updates. It's giant, but so was the original PR, and I didn't see a great way to split it up further.

What changes were proposed in this pull request?

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.



@rshkv

rshkv commented Feb 19, 2021

Link to the survey we did on diff in our fork and what we're keeping and dropping: https://pl.ntr/1Ut

@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 9 times, most recently from cd67edb to c592ed2 on February 23, 2021 at 23:57
dev/check-license (outdated; resolved)
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 12 times, most recently from 32a1129 to dea0fd8 on February 26, 2021 at 15:56

@rshkv rshkv left a comment


Hope those comments don't get dropped when we rebase.

.circleci/config.yml (outdated; resolved)
Comment on lines +138 to +140
# (And causes compilation failures.)
#- restore_cache:
# keys:
# - build-maven-{{ .Branch }}-{{ .BuildNum }}
# - build-maven-{{ .Branch }}-
# - build-maven-master-

We disabled this cache for this branch. It was causing puzzling compile errors and we couldn't get to the bottom of it.

- build-binaries-{{ checksum "build/mvn" }}-{{ checksum "build/sbt" }}
- build-binaries-
- run:
command: ./build/mvn -DskipTests -Psparkr -Phadoop-palantir install

Added the -Phadoop-palantir profile here. Our pom previously declared hadoop-cloud and dists/palantir-pom/bom as default modules. I've now moved those under the hadoop-palantir profile for a more orderly diff that makes it clear we're building those modules as dependencies of our "Palantir build".

* Add pre-installed conda configuration and use to find rlib directory [(#700)](https://github.com/palantir/spark/pull/700)
* Support Arrow-serialization of Python 2 strings [(#678)](https://github.com/palantir/spark/pull/678)

# Reverted

Can remove this whole section.

Comment on lines +6 to +8
* [SPARK-17059](https://issues.apache.org/jira/browse/SPARK-17059) - Allow FileFormat to specify partition pruning strategy via splits
* [SPARK-24345](https://issues.apache.org/jira/browse/SPARK-24345) - Improve ParseError stop location when offending symbol is a token
* [SPARK-23795](https://issues.apache.org/jira/browse/SPARK-23795) - Make AbstractLauncher#self() protected

Remove.

Comment on lines -411 to +412
val newLocationPart1 = newUriForDatabase()
val newLocationPart2 = newUriForDatabase()
val newLocationPart1 = newUriForPartition(Seq("p1=1", "p2=2"))
val newLocationPart2 = newUriForPartition(Seq("p1=3", "p2=4"))

We were repeatedly getting build failures, which this cherry-pick fixed: apache#30756

Comment on lines +125 to +126
// TODO(rshkv): Change to match default once upstream picks it up
conf.set(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, "false")

Matching what we have on master here. Not sure if we want this, but IIRC we had failing tests without it.

Comment on lines +189 to +192
// TODO(palantir): Assertion below differs from upstream
// palantir/parquet-mr always returns statistics
assert(oneFooter.getFileMetaData.getCreatedBy.contains("impala") ^
oneBlockColumnMeta.getStatistics.hasNonNullValue)

Matching palantir/spark's master here.

Comment on lines -996 to +998
val file1 = new File(dir1 + "/data")
val partDir1 = new File(new File(dir1, "ds=2008-04-09"), "hr=11")
val file1 = new File(partDir1, "data")
file1.getParentFile.mkdirs()

This is from that cherry-pick to fix our build failures: apache#30756

Comment on lines -1555 to +1559
assert(partStats.sizeInBytes == 601)
assert(partStats.sizeInBytes == 650)

Again, matching palantir/spark's master.

rshkv and others added 4 commits February 26, 2021 18:08
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…ITION` tests that delete files out of partition path

### What changes were proposed in this pull request?
Modify the tests that add partitions with `LOCATION` where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) folders outside the "base" path in `LOCATION`.

The problem belongs to Hive's MetaStore method `drop_partition_common`:
https://github.com/apache/hive/blob/8696c82d07d303b6dbb69b4d443ab6f2b241b251/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4876
which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. When the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside the base path.
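The walk described above can be sketched as follows. This is illustrative only, not Hive's actual code: the deletion logic visits the partition folder plus one ancestor per partition column, so a `LOCATION` with fewer nested folders than partition columns sends the walk above the base path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PartitionDropSketch {
    // Returns the folders a recursive "delete empty parents" walk would visit:
    // the partition folder itself plus one ancestor per partition column.
    static List<String> foldersConsideredForDeletion(String partitionPath, int numPartCols) {
        List<String> folders = new ArrayList<>();
        Path current = Paths.get(partitionPath);
        for (int i = 0; i <= numPartCols && current != null; i++) {
            folders.add(current.toString());
            current = current.getParent();
        }
        return folders;
    }
}
```

With two partition columns but only one nested folder (e.g. `LOCATION '/base/tbl'`), the walk visits `/base` and `/`, both outside the base path; with properly nested folders like `/base/ds=1/hr=2` it stops at the base.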

### Why are the changes needed?
To fix test failures like apache#30643 (comment):
```
org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
	at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
...
Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
	at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
```

The issue can be reproduced by the following steps:
1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
2. Create a sub-folder in the base folder and drop permissions for it:
```
$ mkdir /Users/maximgekk/tmp/part-location/aaa
$ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
$ ls -al /Users/maximgekk/tmp/part-location
total 0
drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
```
3. Create a table with a partition folder in the base folder:
```sql
spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
```
4. Try to drop this partition:
```
spark-sql> alter table tbl drop partition (part0=1,part1=2);
20/12/13 18:46:07 ERROR HiveClientImpl:
======================
Attempt to drop the partition specs in table 'tbl' database 'default':
Map(part0 -> 1, part1 -> 2)
In this attempt, the following partitions have been dropped successfully:

The remaining partitions have not been dropped:
[1, 2]
======================

Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
```
The command fails because it tries to access the sub-folder `aaa`, which is outside the partition path `/Users/maximgekk/tmp/part-location/tbl`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected tests from local IDEA which does not have access to folders out of partition paths.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 9160d59)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes apache#30756 from MaxGekk/fix-drop-partition-location-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
We use the InProcessLauncher internally [1] to launch applications on
different YARN clusters. The clusters might need different configuration
files, which we can't keep apart if the InProcessLauncher discovers
config folders from the same HADOOP_CONF_DIR environment variable.

This change allows us to specify different config directories using
Spark config.

See upstream PR [2] and ticket [3].

[1] https://pl.ntr/1UK
[2] apache#22289
[3] https://issues.apache.org/jira/browse/SPARK-25200
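A minimal sketch of the lookup order this commit is about. The key name `spark.hadoop.conf.dir` is illustrative, not necessarily the one used in the actual change: an explicit per-launch Spark conf value wins over the process-wide environment variable, so two in-process launches can target clusters with different config directories.

```java
import java.util.Map;
import java.util.Optional;

public class HadoopConfDirResolver {
    // Prefer a per-launch Spark config value over the process-wide
    // HADOOP_CONF_DIR environment variable.
    static Optional<String> resolve(Map<String, String> sparkConf, Map<String, String> env) {
        String fromConf = sparkConf.get("spark.hadoop.conf.dir"); // illustrative key
        if (fromConf != null) {
            return Optional.of(fromConf);
        }
        return Optional.ofNullable(env.get("HADOOP_CONF_DIR"));
    }
}
```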

Co-authored-by: Adam Balogh <[email protected]>
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
… SQL expressions

We have internal applications (BS and C) prone to OOMs with repeated use of
aliases. See ticket [1] and upstream PR [2].

[1] https://issues.apache.org/jira/browse/SPARK-26626
[2] apache#23556

Co-authored-by: j-esse <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
rshkv and others added 6 commits February 26, 2021 18:08
…kjoinpool

See ticket [1] and the PR [2] @robert3005 opened on apache/spark. It
seems upstream wasn't convinced by our use-case.

We need this change because Spark reads parquet footers on a different
thread. Without this change, that thread doesn't inherit the
thread-local that stores the TaskContext, meaning we don't have access
to the properties stored inside that TaskContext.

Our internal filesystem needs those properties in the task context and
fails to read footers without them.

[1] https://issues.apache.org/jira/browse/SPARK-20952
[2] apache#18176
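The problem and the usual fix can be sketched with stand-in names (this is not Spark's actual change, which concerns the thread pool used for footer reading): a plain ThreadLocal is invisible on pool worker threads, so its value is captured on the submitting thread and restored inside the task body.

```java
import java.util.function.Supplier;

public class ContextPropagation {
    // Stand-in for the thread-local that stores Spark's TaskContext.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Capture the context on the submitting thread and restore it on the
    // worker thread before running the body.
    static <T> Supplier<T> withCapturedContext(Supplier<T> body) {
        String captured = CONTEXT.get(); // runs on the submitting thread
        return () -> {
            CONTEXT.set(captured);       // runs on the pool worker thread
            try {
                return body.get();
            } finally {
                CONTEXT.remove();        // don't leak into the next task on this thread
            }
        };
    }
}
```

Submitting `withCapturedContext(...)` to a pool (e.g. `CompletableFuture.supplyAsync`) makes the context visible on the worker, whereas an unwrapped task sees null.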

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
… per-partition limits

We have an internal product that needs this. See upstream PR [1].

[1] apache#15614

Co-authored-by: Patrick Woody <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…n task start and end events

Proposing a new set of APIs for ExecutorPlugins, to provide callbacks invoked at the start and end of each task of a job. Not very opinionated on the shape of the API, tried to be as minimal as possible for now.

Changes described in detail on [SPARK-33088](https://issues.apache.org/jira/browse/SPARK-33088), but mostly this boils down to:

1. This feature was considered when the ExecutorPlugin API was initially introduced in apache#21923, but never implemented.
2. The use-case which **requires** this feature is to propagate tracing information from the driver to the executor, such that calls from the same job can all be traced.
  a. Tracing frameworks are usually set up in thread-locals, therefore it's important for the setup to happen in the same thread which runs the tasks.
  b. Executors can run tasks for multiple jobs, therefore it's not sufficient to set tracing information at executor startup time -- it needs to happen every time a task starts or ends.
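The thread-local requirement in (a) and (b) can be sketched like this. The shape is illustrative only; the real hooks live on Spark's ExecutorPlugin and are invoked by the executor on the task's own thread.

```java
import java.util.UUID;

public class TracingTaskHooks {
    // Per-thread trace id, matching how tracing frameworks keep state.
    private final ThreadLocal<String> traceId = new ThreadLocal<>();

    // Called on the task's thread when the task begins: open a span.
    public void onTaskStart() {
        traceId.set(UUID.randomUUID().toString());
    }

    // Called on the same thread when the task ends: close the span so the
    // pooled thread can be reused cleanly by the next task.
    public void onTaskEnd() {
        traceId.remove();
    }

    public String currentTrace() {
        return traceId.get();
    }
}
```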

No. This PR introduces new features for future developers to use.

Unit tests on `PluginContainerSuite`.

Closes apache#29977 from fsamuel-bs/SPARK-33088.

Authored-by: Samuel Souza <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
This adds config that allows us to inject a custom session builder.
Internally we use it to build SparkSessions that are highly configured
beyond what Spark's built-in configs allow. Most importantly, that
includes building and registering our own session catalog (v1)
implementation with the SparkSession.

You can find how we use this config here [1] and our own
SessionStateBuilder here [2].

[1]: https://pl.ntr/1UU
[2]: https://pl.ntr/1UT
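A sketch of config-driven builder injection; the key `spark.sql.session.builderClass` and the interface are illustrative, not the names used in the fork. The idea is to instantiate a user-supplied builder class reflectively and fall back to the default when the config is absent.

```java
import java.util.Map;

public class SessionBuilderLoader {
    public interface SessionBuilder {
        String build();
    }

    public static class DefaultBuilder implements SessionBuilder {
        public String build() { return "default-session"; }
    }

    // Instantiate the configured builder class, or the default one.
    static SessionBuilder load(Map<String, String> conf) {
        String className = conf.getOrDefault("spark.sql.session.builderClass",
                DefaultBuilder.class.getName());
        try {
            return (SessionBuilder) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate session builder: " + className, e);
        }
    }
}
```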

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
Authored-by: Robert Kruszewski <[email protected]>
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch from 45d5084 to adf557c on February 26, 2021 at 18:09
jdcasale and others added 7 commits March 2, 2021 14:53
This adds Palantir's mechanism to mount local files into pods using
secret-populated volumes. It's used internally to provide pods with
config files.

Co-authored-by: mccheah <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Willi Raschkowski <[email protected]>
Authored-by: Onur Satici <[email protected]>
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
Co-authored-by: Dan Sanduleac <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements, including one notable bug fix (py4j/py4j@4152353).

To leverage fixes from the upstream in Py4J.

No.

Jenkins build and GitHub Actions will test it out.

Closes apache#31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…d metrics in sources

commit 9704d3a47bc6ec0d6e5ac08575808a182ad503b7
Author: Robert Kruszewski <[email protected]>
Date:   Fri Oct 9 00:00:41 2020 +0100

    better variable  name

commit 89f863270d11a8fd583b02b3467c49cb403bbf5e
Author: Robert Kruszewski <[email protected]>
Date:   Thu Oct 8 23:59:17 2020 +0100

    variable name

commit d20847a8d0d91a6d5e436a5d5c47df1f875d84a1
Author: Robert Kruszewski <[email protected]>
Date:   Thu Jun 22 16:53:30 2017 +0100

    Merge existing registry with default one or configure default metric registry
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 2 times, most recently from e21fea6 to 5bdd474 on March 2, 2021 at 14:54
@rshkv

rshkv commented Mar 10, 2021

Closing in favour of #737

@rshkv rshkv closed this Mar 10, 2021