
[SPARK-17101][SQL] Provide consistent format identifiers for TextFileFormat and ParquetFileFormat #14680

Closed
wants to merge 2 commits into from

Conversation

jaceklaskowski
Contributor

@jaceklaskowski jaceklaskowski commented Aug 17, 2016

What changes were proposed in this pull request?

Define the format identifier that is used in Optimized Logical Plan in explain for text and parquet file formats (following CSV and JSON formats).

Before

  • Text
== Physical Plan ==
InMemoryTableScan [value#24]
   +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: ParquetFormat, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

After

  • Text
== Physical Plan ==
InMemoryTableScan [value#0]
   +- InMemoryRelation [value#0], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#0] Batched: false, Format: Text, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: Parquet, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
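
The change itself is small: each format overrides toString with a stable label instead of inheriting the default Object.toString (or a hard-coded odd one). A minimal self-contained sketch; the DataSourceRegister trait here is a stand-in for Spark's real one, so this is illustrative rather than the exact patch:

```scala
// Stand-in for org.apache.spark.sql.sources.DataSourceRegister.
trait DataSourceRegister { def shortName(): String }

class TextFileFormat extends DataSourceRegister {
  override def shortName(): String = "text"
  // Before this patch, explain rendered the default toString,
  // e.g. "TextFileFormat@262e2c8c".
  override def toString: String = "Text"
}

class ParquetFileFormat extends DataSourceRegister {
  override def shortName(): String = "parquet"
  // Replaces the previous hard-coded "ParquetFormat".
  override def toString: String = "Parquet"
}
```

FileScan picks up these labels when it prints `Format: ...` in the physical plan.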

How was this patch tested?

Local build.

@rxin
Contributor

rxin commented Aug 17, 2016

Can you show the before/after comparison in pr description?

@@ -40,6 +40,8 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {

override def shortName(): String = "text"

override def toString: String = shortName.toUpperCase
Member

@HyukjinKwon HyukjinKwon Aug 17, 2016


As you might already know, CSVFileFormat.scala#L43 and JsonFileFormat#L144 use the string values "CSV" and "JSON" rather than shortName.toUpperCase for this case. It would be great if they all matched up in some way.

Contributor Author


I did see it, but I'd do the opposite and fix JSON and CSV rather than repeat what's in shortName. I even thought about defining toString in a supertype, but could find nothing that would be acceptable.
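
The supertype idea could look roughly like this (purely illustrative; the NamedFileFormat trait is hypothetical and Spark's actual FileFormat hierarchy is more involved):

```scala
trait DataSourceRegister { def shortName(): String }

// Hypothetical default: any file format renders as its upper-cased
// short name unless it overrides toString itself.
trait NamedFileFormat extends DataSourceRegister {
  override def toString: String = shortName().toUpperCase
}

// Each concrete format then only declares its short name once.
class CSVFileFormat extends NamedFileFormat {
  override def shortName(): String = "csv"
}
```

With this, `new CSVFileFormat().toString` yields "CSV" without every format repeating the override.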

Member

@HyukjinKwon HyukjinKwon Aug 17, 2016


Yup, I just wanted to say it'd be nicer if they matched up anyway. BTW, it seems it's "ParquetFormat" alone in ParquetFileFormat, whereas it's "ORC" in OrcFileFormat and "LibSVM" in LibSVMFileFormat, here

FYI, here for ORC and here for Parquet.

Output is as below:

  • ORC
== Physical Plan ==
*FileScan orc [id#7L] Batched: false, Format: ORC, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: ParquetFormat, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

Contributor Author


Let me propose a change for Parquet. Thanks for spotting it and your review!

@SparkQA

SparkQA commented Aug 17, 2016

Test build #63904 has finished for PR 14680 at commit 133e5de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski jaceklaskowski changed the title [SPARK-17101][SQL] Provide format identifier for TextFileFormat [SPARK-17101][SQL] Provide consistent format identifiers for TextFileFormat and ParquetFileFormat Aug 17, 2016
@SparkQA

SparkQA commented Aug 17, 2016

Test build #63935 has finished for PR 14680 at commit 52f5ba5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski
Contributor Author

@rxin @HyukjinKwon Mind reviewing it again and letting me know what you think? I know it's minor but would greatly appreciate having it merged at your earliest convenience. Thanks.

@HyukjinKwon
Member

HyukjinKwon commented Aug 19, 2016

@jaceklaskowski It seems the test here is related to this change. It seems it will pass if we change TextFileFormat to TEXT.

BTW, how about changing them to Parquet and Text, maybe? I believe this might be a matter of personal taste, though... I feel like shortName.toUpperCase is not always the correct string representation of each data source.

I mean, if my understanding is correct, the proper name might be Parquet rather than PARQUET, at least. ORC, JSON and CSV seem to be correct names because they are abbreviations, but I feel it is questionable for PARQUET and TEXT.

If the purpose of this change is only to show plan information to humans via explain(...), it might be better to use human-readable, correct names as the string representation.

This is just my personal opinion. I think we need @rxin 's sign off here.

@jaceklaskowski
Contributor Author

How about now, @HyukjinKwon? The more I look at it, the more I think it should be calculated automatically from the class name when the constructor is called. It's of little to no value to a FileFormat developer.
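
Deriving the label automatically from the class name, as suggested above, could be sketched like this (hypothetical helper, not part of Spark):

```scala
// Hypothetical: drop the "FileFormat" suffix from the simple class name,
// so ParquetFileFormat renders as "Parquet" with no per-format override.
trait AutoNamedFormat {
  override def toString: String =
    getClass.getSimpleName.stripSuffix("FileFormat")
}

class ParquetFileFormat extends AutoNamedFormat
class TextFileFormat extends AutoNamedFormat
```

One caveat with this approach: getClass.getSimpleName can behave surprisingly for anonymous or deeply nested subclasses, which may be one reason to prefer explicit labels.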

@HyukjinKwon
Member

HyukjinKwon commented Aug 19, 2016

Thanks for bearing with me. That was just my personal opinion. As you already know, I can't decide what should be added to Spark. BTW, we should fix

```scala
val explainWithoutExtended = q.explainInternal(false)
// `extended = false` only displays the physical plan.
assert("Relation.*text".r.findAllMatchIn(explainWithoutExtended).size === 0)
assert("TextFileFormat".r.findAllMatchIn(explainWithoutExtended).size === 1)
val explainWithExtended = q.explainInternal(true)
// `extended = true` displays 3 logical plans (Parsed/Analyzed/Optimized) and 1 physical
// plan.
assert("Relation.*text".r.findAllMatchIn(explainWithExtended).size === 3)
assert("TextFileFormat".r.findAllMatchIn(explainWithExtended).size === 1)
```

This test is failing.
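
The fix for that test is mechanical: the assertions that grep for the old class-name string need to match the new label instead. A sketch against a hypothetical captured explain string, not the real test harness:

```scala
// Hypothetical explain output after the rename; the real test captures
// this via q.explainInternal(false).
val explainWithoutExtended =
  "*FileScan text [value#0] Batched: false, Format: Text, ReadSchema: struct<value:string>"

// The old class-name string must no longer appear...
assert("TextFileFormat".r.findAllMatchIn(explainWithoutExtended).isEmpty)
// ...and the new label must appear exactly once.
assert("Format: Text".r.findAllMatchIn(explainWithoutExtended).size == 1)
```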

@SparkQA

SparkQA commented Aug 19, 2016

Test build #64105 has finished for PR 14680 at commit e780208.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski
Contributor Author

Thanks @HyukjinKwon! You're helping me a lot. I'll work on the unit test.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
## What changes were proposed in this pull request?
This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest):

Before:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

After:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

Also closes #14680.

## How was this patch tested?
Verified in spark-shell.

Author: Reynold Xin <[email protected]>

Closes #16187 from rxin/SPARK-18760.

(cherry picked from commit 5f894d2)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 5f894d2 Dec 8, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017