
[SPARK-28089][SQL] File source v2: support reading output of file streaming Sink #24900

Closed
gengliangwang wants to merge 3 commits into apache:master from gengliangwang/FileStreamV2

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

File source V1 supports reading the output of FileStreamSink as a batch query (#11897).
We should support this in file source V2 as well. When reading with paths, we first check whether there is a metadata log of FileStreamSink. If one exists, we use `MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.
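The detection step described above can be sketched as follows. FileStreamSink records its transaction log under a `_spark_metadata` subdirectory of the output path, so the check boils down to looking for that directory. This is a minimal, Spark-free illustration in Python; the function name and the string return values are stand-ins for the actual Scala dispatch in the file source, not a real API.

```python
from pathlib import Path

# FileStreamSink writes its transaction log under "<output>/_spark_metadata".
METADATA_DIR = "_spark_metadata"

def choose_file_index(output_path: str) -> str:
    """Illustrative stand-in for the dispatch described above: if a
    FileStreamSink metadata log is present, a MetadataLogFileIndex-style
    listing is used; otherwise a plain InMemoryFileIndex-style listing."""
    if (Path(output_path) / METADATA_DIR).is_dir():
        return "MetadataLogFileIndex"
    return "InMemoryFileIndex"
```

In other words, the same `spark.read.load(path)` call transparently switches to log-based file listing when the path was produced by a streaming file sink, so only committed files are read.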

How was this patch tested?

Unit test

@gengliangwang
Member Author

gengliangwang commented Jun 18, 2019

This PR also resolves several TODO testing items in #24830.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106609 has finished for PR 24900 at commit f70fbe8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106610 has finished for PR 24900 at commit e666203.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@gengliangwang
Member Author

} finally {
  q.stop()
}
withTempDir { output =>
Member Author

@gengliangwang gengliangwang Jun 18, 2019


This is just removing withSQLConf

val e = intercept[SparkException] {
  spark.read.load(outputDir.getCanonicalPath).as[Int]
}
assertMigrationError(e.getMessage, sparkMetadataDir, legacySparkMetadataDir)
Member Author


This is just removing withSQLConf and one TODO comment.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106622 has finished for PR 24900 at commit e666203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.set(SQLConf.USE_V1_SOURCE_READER_LIST, "csv,json,orc,text,parquet")
.set(SQLConf.USE_V1_SOURCE_WRITER_LIST, "csv,json,orc,text,parquet")

test("partitioned writing and batch reading") {
Contributor


Can you highlight the difference between this test case and the one in FileStreamSinkV2Suite?

Member Author


The V1 suite uses Parquet V1 and matches HadoopFsRelation / FileScanRDD to check that the plan is as expected; it also tests partition pruning.

The V2 suite uses Parquet V2 and matches FileTable / BatchScanExec to check that the plan is as expected. Since partition pruning is not supported in V2 yet, we can't test it there.

I can abstract the code if the two cases look duplicated to you.

Member Author


Or, do you mean changing the test case name?

Contributor


Let's abstract the code; this test case is too long to duplicate.

@sundarcse1216

sundarcse1216 commented Jun 19, 2019 via email

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106658 has finished for PR 24900 at commit 1ed79cf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106678 has finished for PR 24900 at commit 1ed79cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106683 has finished for PR 24900 at commit 1ed79cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f510761 Jun 20, 2019
kiku-jw pushed a commit to kiku-jw/spark that referenced this pull request Jun 26, 2019
…eaming Sink


Closes apache#24900 from gengliangwang/FileStreamV2.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>