[SPARK-17101][SQL] Provide consistent format identifiers for TextFileFormat and ParquetFileFormat #14680
Can you show the before/after comparison in the PR description?
@@ -40,6 +40,8 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {

  override def shortName(): String = "text"

  override def toString: String = shortName.toUpperCase
As you might already know, CSVFileFormat.scala#L43 and JsonFileFormat#L144 use the string values "CSV" and "JSON" rather than shortName.toUpperCase for this case. It would be great if they all matched, one way or the other.
I did see it, but I'd do the opposite and rather fix JSON and CSV than repeat what's in shortName. I even thought about defining the toString in a supertype, but could find nothing that would be acceptable.
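To make the supertype idea concrete, here is a minimal sketch, assuming a stand-in trait rather than Spark's real DataSourceRegister. The names Registered, UpperCaseName, TextLikeFormat, CsvLikeFormat, PlainFormat and FormatNames are hypothetical and exist only for illustration. It shows toString derived once from shortName, and what a format prints when nothing overrides toString.

```scala
import java.util.Locale

// Hypothetical stand-in for Spark's DataSourceRegister; not the real trait.
trait Registered {
  def shortName(): String
}

// Hypothetical mixin: derive the display name from shortName in one place,
// so individual formats do not repeat the string.
trait UpperCaseName extends Registered {
  // Locale.ROOT keeps the result independent of the JVM's default locale.
  override def toString: String = shortName().toUpperCase(Locale.ROOT)
}

class TextLikeFormat extends UpperCaseName {
  override def shortName(): String = "text"
}

class CsvLikeFormat extends UpperCaseName {
  override def shortName(): String = "csv"
}

// No toString override: falls back to the JVM default, ClassName@hash.
class PlainFormat extends Registered {
  override def shortName(): String = "plain"
}

object FormatNames {
  // Mimics how a plan string might embed the format via toString.
  def describe(format: Registered): String = s"Format: $format"
}
```

With this shape, FormatNames.describe(new TextLikeFormat) yields "Format: TEXT", while a format without the mixin prints the class name followed by "@" and a hash. Whether such a mixin would be acceptable in Spark's own hierarchy is exactly the open question in the comment above.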
Yup, I just wanted to say it'd be nicer if they matched anyway. BTW, it seems it's "ParquetFormat" in ParquetFileFormat alone, whereas it's "ORC" in OrcFileFormat. It seems it's "LibSVM" in LibSVMFileFormat, here.
FYI, here for ORC and here for Parquet.
Output is as below:
- ORC
== Physical Plan ==
*FileScan orc [id#7L] Batched: false, Format: ORC, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
- Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: ParquetFormat, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
Let me propose a change for Parquet. Thanks for spotting it and your review!
Test build #63904 has finished for PR 14680 at commit
133e5de to 52f5ba5
Test build #63935 has finished for PR 14680 at commit
@rxin @HyukjinKwon Mind reviewing it again and letting me know what you think? I know it's minor but would greatly appreciate having it merged at your earliest convenience. Thanks.
@jaceklaskowski It seems the test here is related to this change. It seems it will pass the test if we change …
BTW, how about changing them to … I mean, if my understanding is correct, the proper name might be …
If the purpose of this change is only to show the plan information to humans via …
This is just my personal opinion. I think we need @rxin's sign-off here.
…Format and ParquetFileFormat
52f5ba5 to e780208
How about now, @HyukjinKwon? The more I look at it, the more I think it should be calculated automatically from the class name when the constructor is called. It's of little to no value to a FileFormat developer.
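As a rough illustration of deriving the identifier from the class name, here is a sketch under stated assumptions: NamedFormat is a hypothetical base class, and the three subclasses are empty stand-ins that only share names with Spark's classes. Spark's FileFormat trait is not known to work this way.

```scala
// Hypothetical base class for illustration only.
abstract class NamedFormat {
  // Computed once at construction time: take the last segment of the runtime
  // class name (splitting on '.' and '$') and drop a trailing "FileFormat".
  private val displayName: String =
    getClass.getName.split("[.$]").last.stripSuffix("FileFormat")

  override def toString: String = displayName
}

// Empty stand-ins named after the Spark classes discussed in this thread.
class ParquetFileFormat extends NamedFormat
class OrcFileFormat extends NamedFormat
class TextFileFormat extends NamedFormat
```

One consequence of this design is that the identifier follows the casing of the class name, so OrcFileFormat would report "Orc" rather than the "ORC" shown in the plan output earlier in this thread. A scheme like this would therefore still need a per-format override wherever the existing spelling has to be preserved.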
Thanks for bearing with me. That was just my personal opinion. As you already know, I can't decide what should be added into Spark. BTW, we should fix spark/sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (lines 715 to 724 in e50efd5).
Test build #64105 has finished for PR 14680 at commit
Thanks, @HyukjinKwon. You're helping me a lot! I'll work on the unit test.
## What changes were proposed in this pull request?

This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest):

Before:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

After:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

Also closes #14680.

## How was this patch tested?

Verified in spark-shell.

Author: Reynold Xin

Closes #16187 from rxin/SPARK-18760.

(cherry picked from commit 5f894d2)
Signed-off-by: Reynold Xin
What changes were proposed in this pull request?
Define the format identifier that is used in Optimized Logical Plan in explain for text and parquet file formats (following CSV and JSON formats).
Before
After
How was this patch tested?
Local build.