
Spark3 diff on top of tag 3.0.1 #735

Closed
wants to merge 17 commits

Conversation

jdcasale

@jdcasale jdcasale commented Feb 19, 2021

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

This is our entire diff post-spark3-upgrade, chunked into logical components where each commit builds successfully.
A diff this big is really hard to review, but reviewing the diff against the current pt/master is even less tractable, so I believe the way to go here is to review each commit individually.

The first commit (8f1e51e) sets up our palantir-hadoop and palantir-parquet dependencies and contains many of the eccentricities associated with getting Spark to work with Foundry. This could perhaps be broken up further; if you have any opinions on that, please let me know.

The second commit (8f1e51e) is basically #381 and related updates. It's giant, but so was the original PR, and I didn't see a great way to split it up further.

What changes were proposed in this pull request?

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.



@rshkv

rshkv commented Feb 19, 2021

Link to the survey we did on diff in our fork and what we're keeping and dropping: https://pl.ntr/1Ut

@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 9 times, most recently from cd67edb to c592ed2 on February 23, 2021 at 23:57
dev/check-license (outdated; resolved)
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 12 times, most recently from 32a1129 to dea0fd8 on February 26, 2021 at 15:56

@rshkv rshkv left a comment


Hope those comments don't get dropped when we rebase.

.circleci/config.yml (outdated; resolved)
Comment on lines +138 to +140
# (And causes compilation failures.)
#- restore_cache:
# keys:
# - build-maven-{{ .Branch }}-{{ .BuildNum }}
# - build-maven-{{ .Branch }}-
# - build-maven-master-

We disabled this cache for this branch. It was causing puzzling compile errors and we couldn't get to the bottom of it.

- build-binaries-{{ checksum "build/mvn" }}-{{ checksum "build/sbt" }}
- build-binaries-
- run:
command: ./build/mvn -DskipTests -Psparkr -Phadoop-palantir install

Added the -Phadoop-palantir profile here. Our pom previously declared hadoop-cloud and dists/palantir-pom/bom as default modules. I've now moved those under the hadoop-palantir profile for a more orderly diff that makes it clear we're building those modules as dependencies of our "Palantir build".

* Add pre-installed conda configuration and use to find rlib directory [(#700)](https://github.com/palantir/spark/pull/700)
* Support Arrow-serialization of Python 2 strings [(#678)](https://github.com/palantir/spark/pull/678)

# Reverted

Can remove this whole section.

Comment on lines +6 to +8
* [SPARK-17059](https://issues.apache.org/jira/browse/SPARK-17059) - Allow FileFormat to specify partition pruning strategy via splits
* [SPARK-24345](https://issues.apache.org/jira/browse/SPARK-24345) - Improve ParseError stop location when offending symbol is a token
* [SPARK-23795](https://issues.apache.org/jira/browse/SPARK-23795) - Make AbstractLauncher#self() protected

Remove.

Comment on lines -411 to +412
val newLocationPart1 = newUriForDatabase()
val newLocationPart2 = newUriForDatabase()
val newLocationPart1 = newUriForPartition(Seq("p1=1", "p2=2"))
val newLocationPart2 = newUriForPartition(Seq("p1=3", "p2=4"))

We were repeatedly getting build failures, which this cherry-pick fixed: apache#30756

Comment on lines +125 to +126
// TODO(rshkv): Change to match default once upstream picks it up
conf.set(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, "false")

Matching what we have on master here. Not sure if we want this, but IIRC we had failing tests without it.

Comment on lines +189 to +192
// TODO(palantir): Assertion below differs from upstream
// palantir/parquet-mr always returns statistics
assert(oneFooter.getFileMetaData.getCreatedBy.contains("impala") ^
oneBlockColumnMeta.getStatistics.hasNonNullValue)

Matching palantir/spark's master here.

Comment on lines -996 to +998
val file1 = new File(dir1 + "/data")
val partDir1 = new File(new File(dir1, "ds=2008-04-09"), "hr=11")
val file1 = new File(partDir1, "data")
file1.getParentFile.mkdirs()

This is from that cherry-pick to fix our build failures: apache#30756

Comment on lines -1555 to +1559
assert(partStats.sizeInBytes == 601)
assert(partStats.sizeInBytes == 650)

Again, matching palantir/spark's master.

rshkv and others added 4 commits February 26, 2021 18:08
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…ITION` tests that delete files out of partition path

### What changes were proposed in this pull request?
Modify the tests that add partitions with `LOCATION` where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) folders outside the "base" path in `LOCATION`.

The problem belongs to Hive's MetaStore method `drop_partition_common`:
https://github.com/apache/hive/blob/8696c82d07d303b6dbb69b4d443ab6f2b241b251/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4876
which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. When the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside the base path.
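The walk described above can be sketched as follows. This is illustrative only, not Hive's actual code: the deletion logic visits the partition folder plus one ancestor per partition column, so a `LOCATION` with fewer nested folders than partition columns sends the walk above the base path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PartitionDropSketch {
    // Returns the folders a recursive "delete empty parents" walk would visit:
    // the partition folder itself plus one ancestor per partition column.
    static List<String> foldersConsideredForDeletion(String partitionPath, int numPartCols) {
        List<String> folders = new ArrayList<>();
        Path current = Paths.get(partitionPath);
        for (int i = 0; i <= numPartCols && current != null; i++) {
            folders.add(current.toString());
            current = current.getParent();
        }
        return folders;
    }
}
```

With two partition columns but only one nested folder (e.g. `LOCATION '/base/tbl'`), the walk visits `/base` and `/`, both outside the base path; with properly nested folders like `/base/ds=1/hr=2` it stops at the base.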

### Why are the changes needed?
To fix test failures like apache#30643 (comment):
```
org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
	at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
...
Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
	at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
```

The issue can be reproduced by the following steps:
1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
2. Create a sub-folder in the base folder and drop permissions for it:
```
$ mkdir /Users/maximgekk/tmp/part-location/aaa
$ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
$ ls -al /Users/maximgekk/tmp/part-location
total 0
drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
```
3. Create a table with a partition folder in the base folder:
```sql
spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
```
4. Try to drop this partition:
```
spark-sql> alter table tbl drop partition (part0=1,part1=2);
20/12/13 18:46:07 ERROR HiveClientImpl:
======================
Attempt to drop the partition specs in table 'tbl' database 'default':
Map(part0 -> 1, part1 -> 2)
In this attempt, the following partitions have been dropped successfully:

The remaining partitions have not been dropped:
[1, 2]
======================

Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
```
The command fails because it tries to access the sub-folder `aaa`, which is outside the partition path `/Users/maximgekk/tmp/part-location/tbl`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected tests from local IDEA which does not have access to folders out of partition paths.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 9160d59)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes apache#30756 from MaxGekk/fix-drop-partition-location-3.1.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
We use the InProcessLauncher internally [1] to launch applications on
different YARN clusters. The clusters might need different configuration
files, which we can't keep apart if the InProcessLauncher discovers
config folders from the same HADOOP_CONF_DIR environment variable.

This change allows us to specify different config directories using
Spark config.

See upstream PR [2] and ticket [3].

[1] https://pl.ntr/1UK
[2] apache#22289
[3] https://issues.apache.org/jira/browse/SPARK-25200
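A minimal sketch of the lookup order this commit is about. The key name `spark.hadoop.conf.dir` is illustrative, not necessarily the one used in the actual change: an explicit per-launch Spark conf value wins over the process-wide environment variable, so two in-process launches can target clusters with different config directories.

```java
import java.util.Map;
import java.util.Optional;

public class HadoopConfDirResolver {
    // Prefer a per-launch Spark config value over the process-wide
    // HADOOP_CONF_DIR environment variable.
    static Optional<String> resolve(Map<String, String> sparkConf, Map<String, String> env) {
        String fromConf = sparkConf.get("spark.hadoop.conf.dir"); // illustrative key
        if (fromConf != null) {
            return Optional.of(fromConf);
        }
        return Optional.ofNullable(env.get("HADOOP_CONF_DIR"));
    }
}
```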

Co-authored-by: Adam Balogh <[email protected]>
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
… SQL expressions

We have internal applications (BS and C) prone to OOMs with repeated use of
aliases. See ticket [1] and upstream PR [2].

[1] https://issues.apache.org/jira/browse/SPARK-26626
[2] apache#23556

Co-authored-by: j-esse <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
rshkv and others added 6 commits February 26, 2021 18:08
…kjoinpool

See ticket [1] and the PR [2] @robert3005 opened on apache/spark. It
seems upstream wasn't convinced by our use-case.

We need this change because Spark reads parquet footers on a different
thread. Without this change, that thread doesn't inherit the
thread-local that stores the TaskContext, meaning we don't have access
to the properties stored inside that TaskContext.

Our internal filesystem needs those properties in the task context and
fails to read footers without them.

[1] https://issues.apache.org/jira/browse/SPARK-20952
[2] apache#18176
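The problem and the usual fix can be sketched with stand-in names (this is not Spark's actual change, which concerns the thread pool used for footer reading): a plain ThreadLocal is invisible on pool worker threads, so its value is captured on the submitting thread and restored inside the task body.

```java
import java.util.function.Supplier;

public class ContextPropagation {
    // Stand-in for the thread-local that stores Spark's TaskContext.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Capture the context on the submitting thread and restore it on the
    // worker thread before running the body.
    static <T> Supplier<T> withCapturedContext(Supplier<T> body) {
        String captured = CONTEXT.get(); // runs on the submitting thread
        return () -> {
            CONTEXT.set(captured);       // runs on the pool worker thread
            try {
                return body.get();
            } finally {
                CONTEXT.remove();        // don't leak into the next task on this thread
            }
        };
    }
}
```

Submitting `withCapturedContext(...)` to a pool (e.g. `CompletableFuture.supplyAsync`) makes the context visible on the worker, whereas an unwrapped task sees null.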

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
… per-partition limits

We have an internal product that needs this. See upstream PR [1].

[1] apache#15614

Co-authored-by: Patrick Woody <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
…n task start and end events

Proposing a new set of APIs for ExecutorPlugins, to provide callbacks invoked at the start and end of each task of a job. Not very opinionated on the shape of the API, tried to be as minimal as possible for now.

Changes described in detail on [SPARK-33088](https://issues.apache.org/jira/browse/SPARK-33088), but mostly this boils down to:

1. This feature was considered when the ExecutorPlugin API was initially introduced in apache#21923, but never implemented.
2. The use-case which **requires** this feature is to propagate tracing information from the driver to the executor, such that calls from the same job can all be traced.
  a. Tracing frameworks are usually set up in thread-locals, therefore it's important for the setup to happen in the same thread which runs the tasks.
  b. Executors can run tasks for multiple jobs, therefore it's not sufficient to set tracing information at executor startup time -- it needs to happen every time a task starts or ends.
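The thread-local requirement in (a) and (b) can be sketched like this. The shape is illustrative only; the real hooks live on Spark's ExecutorPlugin and are invoked by the executor on the task's own thread.

```java
import java.util.UUID;

public class TracingTaskHooks {
    // Per-thread trace id, matching how tracing frameworks keep state.
    private final ThreadLocal<String> traceId = new ThreadLocal<>();

    // Called on the task's thread when the task begins: open a span.
    public void onTaskStart() {
        traceId.set(UUID.randomUUID().toString());
    }

    // Called on the same thread when the task ends: close the span so the
    // pooled thread can be reused cleanly by the next task.
    public void onTaskEnd() {
        traceId.remove();
    }

    public String currentTrace() {
        return traceId.get();
    }
}
```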

No. This PR introduces new features for future developers to use.

Unit tests on `PluginContainerSuite`.

Closes apache#29977 from fsamuel-bs/SPARK-33088.

Authored-by: Samuel Souza <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
This adds config that allows us to inject a custom session builder.
Internally we use it to build SparkSessions that are highly configured
beyond what Spark's built-in configs allow. Most importantly, that
includes building and registering our own session catalog (v1)
implementation with the SparkSession.

You can find how we use this config here [1] and our own
SessionStateBuilder here [2].

[1]: https://pl.ntr/1UU
[2]: https://pl.ntr/1UT
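A sketch of config-driven builder injection; the key `spark.sql.session.builderClass` and the interface are illustrative, not the names used in the fork. The idea is to instantiate a user-supplied builder class reflectively and fall back to the default when the config is absent.

```java
import java.util.Map;

public class SessionBuilderLoader {
    public interface SessionBuilder {
        String build();
    }

    public static class DefaultBuilder implements SessionBuilder {
        public String build() { return "default-session"; }
    }

    // Instantiate the configured builder class, or the default one.
    static SessionBuilder load(Map<String, String> conf) {
        String className = conf.getOrDefault("spark.sql.session.builderClass",
                DefaultBuilder.class.getName());
        try {
            return (SessionBuilder) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate session builder: " + className, e);
        }
    }
}
```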

Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
Authored-by: Robert Kruszewski <[email protected]>
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch from 45d5084 to adf557c on February 26, 2021 at 18:09
jdcasale and others added 7 commits March 2, 2021 14:53
This adds Palantir's mechanism to mount local files into pods using
secret-populated volumes. It's used internally to provide pods with
config files.

Co-authored-by: mccheah <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Willi Raschkowski <[email protected]>
Authored-by: Onur Satici <[email protected]>
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
Co-authored-by: Dan Sanduleac <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements, including one notable bug fix (py4j/py4j@4152353).

To leverage fixes from the upstream in Py4J.

No.

Jenkins build and GitHub Actions will test it out.

Closes apache#31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…d metrics in sources

commit 9704d3a47bc6ec0d6e5ac08575808a182ad503b7
Author: Robert Kruszewski <[email protected]>
Date:   Fri Oct 9 00:00:41 2020 +0100

    better variable  name

commit 89f863270d11a8fd583b02b3467c49cb403bbf5e
Author: Robert Kruszewski <[email protected]>
Date:   Thu Oct 8 23:59:17 2020 +0100

    variable name

commit d20847a8d0d91a6d5e436a5d5c47df1f875d84a1
Author: Robert Kruszewski <[email protected]>
Date:   Thu Jun 22 16:53:30 2017 +0100

    Merge existing registry with default one or configure default metric registry
@rshkv rshkv force-pushed the jc/pt-diff-in-logical-chunks branch 2 times, most recently from e21fea6 to 5bdd474 on March 2, 2021 at 14:54
@rshkv

rshkv commented Mar 10, 2021

Closing in favour of #737

@rshkv rshkv closed this Mar 10, 2021