
[SPARK-28089][SQL] File source v2: support reading output of file streaming Sink #24900

Closed
gengliangwang wants to merge 3 commits into apache:master from gengliangwang/FileStreamV2

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

File source V1 supports reading the output of FileStreamSink as a batch query (#11897).
We should support this in file source V2 as well. When reading with paths, we first check whether there is a metadata log of FileStreamSink. If one exists, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.
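The detection step described above can be sketched as follows. FileStreamSink records its transaction log under a `_spark_metadata` subdirectory of the output path, so the check boils down to looking for that directory. This is a minimal, Spark-free illustration in Python; the function name and the string return values are stand-ins for the actual Scala dispatch in the file source, not a real API.

```python
from pathlib import Path

# FileStreamSink writes its transaction log under "<output>/_spark_metadata".
METADATA_DIR = "_spark_metadata"

def choose_file_index(output_path: str) -> str:
    """Illustrative stand-in for the dispatch described above: if a
    FileStreamSink metadata log is present, a MetadataLogFileIndex-style
    listing is used; otherwise a plain InMemoryFileIndex-style listing."""
    if (Path(output_path) / METADATA_DIR).is_dir():
        return "MetadataLogFileIndex"
    return "InMemoryFileIndex"
```

In other words, the same `spark.read.load(path)` call transparently switches to log-based file listing when the path was produced by a streaming file sink, so only committed files are read.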

How was this patch tested?

Unit test

@gengliangwang
Member Author

gengliangwang commented Jun 18, 2019

This PR also resolves several TODO testing items in #24830.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106609 has finished for PR 24900 at commit f70fbe8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106610 has finished for PR 24900 at commit e666203.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@gengliangwang
Member Author

} finally {
  q.stop()
}
withTempDir { output =>
Member Author

@gengliangwang gengliangwang Jun 18, 2019


This is just removing withSQLConf

val e = intercept[SparkException] {
  spark.read.load(outputDir.getCanonicalPath).as[Int]
}
assertMigrationError(e.getMessage, sparkMetadataDir, legacySparkMetadataDir)
Member Author


This is just removing withSQLConf and one TODO comment.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106622 has finished for PR 24900 at commit e666203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.set(SQLConf.USE_V1_SOURCE_READER_LIST, "csv,json,orc,text,parquet")
.set(SQLConf.USE_V1_SOURCE_WRITER_LIST, "csv,json,orc,text,parquet")

test("partitioned writing and batch reading") {
Contributor


Can you highlight the difference between this test case and the one in FileStreamSinkV2Suite?

Member Author


The V1 suite uses Parquet V1 and matches HadoopFsRelation / FileScanRDD to check that the plan is as expected; it also tests partition pruning.

The V2 suite uses Parquet V2 and matches FileTable / BatchScanExec to check that the plan is as expected. Since partition pruning is not supported in V2 yet, we can't test it there.

I can abstract the code if the two cases look duplicated to you.

Member Author


Or, do you mean changing the test case name?

Contributor


Let's abstract the code; this test case is too long to duplicate.

@sundarcse1216

sundarcse1216 commented Jun 19, 2019 via email

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106658 has finished for PR 24900 at commit 1ed79cf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106678 has finished for PR 24900 at commit 1ed79cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106683 has finished for PR 24900 at commit 1ed79cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f510761 Jun 20, 2019
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…eaming Sink


Closes apache#24900 from gengliangwang/FileStreamV2.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>