
[SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core #29471

Status: Closed · 30 commits

Conversation

@sunchao (Member) commented Aug 18, 2020

What changes were proposed in this pull request?

This moves and refactors the parallel listing utilities from InMemoryFileIndex to Spark core so they can be reused by modules besides SQL. Along the way, this also includes some cleanups/refactorings:

  • Created a HadoopFSUtils class under core
  • Moved InMemoryFileIndex.bulkListLeafFiles into HadoopFSUtils.parallelListLeafFiles. It now depends on a SparkContext instead of the SQL SparkSession. Also added a few parameters which used to be read from SparkSession.conf: ignoreMissingFiles, ignoreLocality, parallelismThreshold, parallelismMax and filterFun (for additional filtering support, though we may be able to merge this with the filter parameter in the future).
  • Moved InMemoryFileIndex.listLeafFiles into HadoopFSUtils.listLeafFiles with similar changes above.
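For orientation, the listing flow the bullets above describe can be sketched as a small local stand-in. This is illustrative only: the real parallelListLeafFiles takes a SparkContext plus Hadoop Path/FileStatus types and runs the parallel branch as a distributed Spark job, whereas ListingSketch, the future-based fan-out, and the timeout below are assumptions made for the sketch.

```scala
import java.nio.file.{Files, Path}
import java.util.concurrent.TimeUnit
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import scala.jdk.CollectionConverters._

object ListingSketch {
  // Recursively collect leaf (non-directory) entries under a path.
  def listLeafFiles(path: Path): Seq[Path] = {
    val stream = Files.list(path)
    try {
      val entries = stream.iterator.asScala.toSeq
      val (dirs, files) = entries.partition(p => Files.isDirectory(p))
      files ++ dirs.flatMap(listLeafFiles)
    } finally stream.close()
  }

  // Below the threshold, list serially on the caller's thread; above it,
  // fan out (locally via futures here; in Spark, via a distributed job).
  def parallelListLeafFiles(paths: Seq[Path], parallelismThreshold: Int): Seq[Path] =
    if (paths.length < parallelismThreshold) {
      paths.flatMap(listLeafFiles)
    } else {
      implicit val ec: ExecutionContext = ExecutionContext.global
      val all = Future.sequence(paths.map(p => Future(listLeafFiles(p))))
      Await.result(all, Duration(60, TimeUnit.SECONDS)).flatten
    }
}
```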

Why are the changes needed?

Currently the locality-aware parallel listing mechanism only applies to InMemoryFileIndex. By moving this to core, we can potentially reuse the same mechanism for other code paths as well.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Since this is mostly a refactoring, it relies on existing unit tests such as those for InMemoryFileIndex.

@dongjoon-hyun (Member):

ok to test

@dongjoon-hyun (Member):

cc @holdenk

@viirya (Member) commented Aug 20, 2020

BTW, we should use Spark's PR template. But this is still a WIP, so I suppose the template will be filled in later.

@holdenk (Contributor) left a comment:

Thanks for picking up this PR. Let me know when you want me to do a review pass (I see it's still marked as draft).

@SparkQA

SparkQA commented Aug 20, 2020

Test build #127705 has finished for PR 29471 at commit 5529047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao marked this pull request as ready for review August 20, 2020 23:43
/**
* Utility functions to simplify and speed-up file listing.
*/
@Private
Contributor:
Private maybe seems a bit too restrictive in scope, what about DeveloperAPI?

Member (Author):

@zsxwing suggested to make this private in the meanwhile and change it after the follow-up PR is done.

Member:

Yep, I suggested making this private. I think these APIs are not ready to be exposed yet. For example, 10 parameters in a method is not user-friendly. It's better to design a better API for this.
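As an aside on the ten-parameter concern, one conventional way to shrink such a signature is a parameter object with defaults. This is purely illustrative and not the API this PR ships; ListingOptions, its default values, and the stub body are made up.

```scala
// Hypothetical parameter object: call sites name only what they override.
case class ListingOptions(
    ignoreMissingFiles: Boolean = false,
    ignoreLocality: Boolean = false,
    parallelismThreshold: Int = 32,
    filter: String => Boolean = _ => true)

object OptionsSketch {
  // Stub entry point: applies only the filter, to keep the sketch
  // self-contained; a real implementation would walk the file system.
  def listLeafFiles(paths: Seq[String], opts: ListingOptions = ListingOptions()): Seq[String] =
    paths.filter(opts.filter)
}
```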

filter: PathFilter, areSQLRootPaths: Boolean, ignoreMissingFiles: Boolean,
ignoreLocality: Boolean, maxParallelism: Int,
filterFun: Option[String => Boolean] = None): Seq[(Path, Seq[FileStatus])] = {
HiveCatalogMetrics.incrementParallelListingJobCount(1)
Contributor:

We could pass in a callback for listing metrics?
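The callback idea could look roughly like this (hypothetical names; the point is only that core takes a function value instead of referencing the SQL-side HiveCatalogMetrics class directly):

```scala
object MetricsCallbackSketch {
  // Core-side listing entry point with a caller-supplied metrics hook.
  // SQL would pass HiveCatalogMetrics.incrementParallelListingJobCount
  // here; core itself stays decoupled from SQL.
  def parallelListLeafFiles(
      paths: Seq[String],
      incrementParallelListingJobCount: Int => Unit = _ => ()): Seq[String] = {
    incrementParallelListingJobCount(1) // one parallel listing job launched
    paths // stub: a real implementation would go list the paths
  }
}
```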

filter = filter,
ignoreMissingFiles = ignoreMissingFiles,
ignoreLocality = ignoreLocality,
isSQLRootPath = areSQLRootPaths,
Contributor:

So the is/are is because listLeafFiles takes a single name and this method takes in a list of files.

fs match {
// DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
// to retrieve the file status with the file block location. The reason to still fallback
// to listStatus is because the default implementation would potentially throw a
Contributor:

I think changing this in a follow-up PR sounds fine to me. I'd like us to use the faster method by default and fall back on exception, but that could be a follow-on. Want to file a JIRA for broadening the scope of using the faster method?

length: Long)

/** A serializable variant of HDFS's FileStatus. */
private case class SerializableFileStatus(
Contributor:

Yeah, a separate JIRA is best.

@sunchao (Member, Author) commented Aug 25, 2020

@HyukjinKwon @gengliangwang @viirya @zsxwing this PR is mostly a refactoring now. Could you take another look? Thanks!

@SparkQA

SparkQA commented Aug 28, 2020

Test build #127973 has finished for PR 29471 at commit 86c2013.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param paths Input paths to list
* @param hadoopConf Hadoop configuration
* @param filter Path filter used to exclude leaf files from result
* @param areSQLRootPaths Whether the input paths are SQL root paths
Member:

Can we add a few more words for areSQLRootPaths? Seeing SQL in core is already a bit strange, so it's nicer to let developers quickly get a better idea just from reading the doc.

Member (Author):

Yes this is unfortunate. I think this parameter doesn't have to be visible to the callers though as it is set to true on the initial call and false on subsequent recursive calls. We can potentially add another overloaded method without this parameter and make this one private.
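That overloading idea might look like the following sketch, with deliberately simplified types (a Map stands in for the file system, and all names are hypothetical):

```scala
object OverloadSketch {
  // Public overload: callers never see the root-level flag; it is
  // always true on the initial call.
  def listFiles(children: Map[String, Seq[String]], roots: Seq[String]): Seq[(String, Boolean)] =
    listFiles(children, roots, isRootLevel = true)

  // Private worker: records the flag it received, then recurses into
  // child paths with isRootLevel = false.
  private def listFiles(
      children: Map[String, Seq[String]],
      paths: Seq[String],
      isRootLevel: Boolean): Seq[(String, Boolean)] =
    paths.flatMap { p =>
      (p, isRootLevel) +: listFiles(children, children.getOrElse(p, Nil), isRootLevel = false)
    }
}
```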

Contributor:

I like that idea @sunchao

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128051 has finished for PR 29471 at commit bfa37cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor):

BTW, wrote something up on listing.
https://github.com/steveloughran/engineering-proposals/blob/trunk/listing-performance.md

Anywhere you do listStatus(path): List[FileStatus], switch to listStatusIterator, but if the returned iterator is Closeable, make sure you close it afterwards. Then I or someone else will not only add the s3a and abfs speedups (alongside today's HDFS), but also do the same for the local FS.
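The close-if-Closeable pattern described here can be modelled with plain Scala types. An ordinary Iterator stands in for Hadoop's RemoteIterator; this is a sketch of the pattern, not Hadoop API code.

```scala
import java.io.Closeable

object IteratorSketch {
  // Consume a listing iterator fully, then close it if and only if the
  // concrete class also implements Closeable.
  def drainAndMaybeClose[T](it: Iterator[T]): Seq[T] =
    try it.toSeq
    finally it match {
      case c: Closeable => c.close()
      case _            => ()
    }
}
```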

@holdenk (Contributor) left a comment:

LGTM. If no one objects I'll merge this on Monday so we can unblock the follow-on work that folks seem interested in.

filter: PathFilter, areSQLRootPaths: Boolean, ignoreMissingFiles: Boolean,
ignoreLocality: Boolean, maxParallelism: Int,
filterFun: Option[String => Boolean] = None): Seq[(Path, Seq[FileStatus])] = {
HiveCatalogMetrics.incrementParallelListingJobCount(1)
Contributor:

Sounds good to me, @sunchao want to file a JIRA for switching this to a callback?

ignoreLocality: Boolean,
isRootPath: Boolean,
filterFun: Option[String => Boolean],
parallelismThreshold: Int,
Member:

This seems to be a new parameter that did not exist before. Why do we need it? If people want parallelized listing, they can invoke parallelListLeafFiles above.

Member (Author):

This is because listLeafFiles also recursively calls parallelListLeafFiles inside, so we need a way to pass down the arguments. These used to be read from the SparkSession conf but are now made explicit as parameters, since we no longer have a session object.

@HyukjinKwon (Member):

@sunchao Let's update the PR title as well. This PR doesn't "expose" something but just moves the code within the codebase. Also:

Along the process this also did some cleanups/refactorings.

Do you mind clarifying the additional diffs this PR introduces? I took a cursory look; it seems there are some diffs above.

@HyukjinKwon (Member):

Also, let's make sure to file a JIRA at #29471 (comment) as @holdenk suggested. I agree with @viirya's comment there.

@sunchao sunchao changed the title [SPARK-32381][CORE][SQL] Explore allowing parallel listing & non-location sensitive listing in core [SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core Sep 14, 2020
@sunchao (Member, Author) commented Sep 14, 2020

Thanks @HyukjinKwon and @holdenk for the review! I updated the PR title as well as description. Also created SPARK-32880 for the follow-up work.

@SparkQA

SparkQA commented Sep 14, 2020

Test build #128666 has finished for PR 29471 at commit 2d8e64d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao (Member, Author) commented Sep 21, 2020

ping @HyukjinKwon @holdenk: any more comments?

@holdenk (Contributor) commented Sep 23, 2020

Looks like we've reached a lazy consensus here. I'll merge this today :)

@holdenk (Contributor) commented Sep 24, 2020

Ok I meant to merge this yesterday but I got distracted with the K8s stuff.

@asfgit closed this in 8ccfbc1 on Sep 24, 2020
@holdenk (Contributor) commented Sep 24, 2020

Merged to the current development branch :)

@sunchao (Member, Author) commented Sep 24, 2020

Thanks @holdenk for the merge, and all for the review comments!

asfgit pushed a commit that referenced this pull request Nov 18, 2020
### What changes were proposed in this pull request?

This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`:
1. Removes the extra `filterFun` from the listing API and combines it with the `filter`.
2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable.
3. Hides the `isRootLevel` flag from the top-level API.
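Item 1 above (folding filterFun into the single filter) amounts to composing two predicates into one. A simplified sketch, where PathFilter is a local stand-in for org.apache.hadoop.fs.PathFilter and the String path type is an assumption:

```scala
// Simplified stand-in for org.apache.hadoop.fs.PathFilter.
trait PathFilter { def accept(path: String): Boolean }

object FilterSketch {
  // Combine the base filter with an optional extra predicate: a path is
  // accepted only if both agree (a missing predicate accepts everything).
  def combine(filter: PathFilter, filterFun: Option[String => Boolean]): PathFilter =
    new PathFilter {
      def accept(path: String): Boolean =
        filter.accept(path) && filterFun.forall(f => f(path))
    }
}
```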

### Why are the changes needed?

Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests (e.g., `FileIndexSuite`)

Closes #29959 from sunchao/hadoop-fs-utils-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Holden Karau <[email protected]>