
[SPARK-17101][SQL] Provide consistent format identifiers for TextFileFormat and ParquetFileFormat #14680

Closed
wants to merge 2 commits into from

Conversation

jaceklaskowski
Contributor

@jaceklaskowski jaceklaskowski commented Aug 17, 2016

What changes were proposed in this pull request?

Define the format identifier that is used in Optimized Logical Plan in explain for text and parquet file formats (following CSV and JSON formats).

Before

  • Text
== Physical Plan ==
InMemoryTableScan [value#24]
   +- InMemoryRelation [value#24], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#24] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: ParquetFormat, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

After

  • Text
== Physical Plan ==
InMemoryTableScan [value#0]
   +- InMemoryRelation [value#0], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan text [value#0] Batched: false, Format: Text, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: Parquet, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
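
The change itself is small: each format overrides toString with a stable label instead of inheriting the default Object.toString (or a hard-coded odd one). A minimal self-contained sketch; the DataSourceRegister trait here is a stand-in for Spark's real one, so this is illustrative rather than the exact patch:

```scala
// Stand-in for org.apache.spark.sql.sources.DataSourceRegister.
trait DataSourceRegister { def shortName(): String }

class TextFileFormat extends DataSourceRegister {
  override def shortName(): String = "text"
  // Before this patch, explain rendered the default toString,
  // e.g. "TextFileFormat@262e2c8c".
  override def toString: String = "Text"
}

class ParquetFileFormat extends DataSourceRegister {
  override def shortName(): String = "parquet"
  // Replaces the previous hard-coded "ParquetFormat".
  override def toString: String = "Parquet"
}
```

FileScan picks up these labels when it prints `Format: ...` in the physical plan.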

How was this patch tested?

Local build.

@rxin
Contributor

rxin commented Aug 17, 2016

Can you show the before/after comparison in pr description?

@@ -40,6 +40,8 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {

override def shortName(): String = "text"

override def toString: String = shortName.toUpperCase
Member

@HyukjinKwon HyukjinKwon Aug 17, 2016


As you might already know, CSVFileFormat.scala#L43 and JsonFileFormat#L144 use the string values "CSV" and "JSON" rather than shortName.toUpperCase for this case. It would be great if they all matched up in some way.

Contributor Author


I did see it, but I'd do the opposite and fix JSON and CSV rather than repeat what's in shortName. I even thought about defining toString in a supertype, but could find nothing that would be acceptable.
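
The supertype idea could look roughly like this (purely illustrative; the NamedFileFormat trait is hypothetical and Spark's actual FileFormat hierarchy is more involved):

```scala
trait DataSourceRegister { def shortName(): String }

// Hypothetical default: any file format renders as its upper-cased
// short name unless it overrides toString itself.
trait NamedFileFormat extends DataSourceRegister {
  override def toString: String = shortName().toUpperCase
}

// Each concrete format then only declares its short name once.
class CSVFileFormat extends NamedFileFormat {
  override def shortName(): String = "csv"
}
```

With this, `new CSVFileFormat().toString` yields "CSV" without every format repeating the override.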

Member

@HyukjinKwon HyukjinKwon Aug 17, 2016


Yup, I just wanted to say it'd be nicer if they matched up anyway. BTW, it seems it's "ParquetFormat" alone in ParquetFileFormat, whereas it's "ORC" in OrcFileFormat and "LibSVM" in LibSVMFileFormat, here

FYI, here for ORC and here for Parquet.

Output is as below:

  • ORC
== Physical Plan ==
*FileScan orc [id#7L] Batched: false, Format: ORC, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
  • Parquet
== Physical Plan ==
*FileScan parquet [id#7L] Batched: true, Format: ParquetFormat, InputPaths: file:/tmp/test, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

Contributor Author


Let me propose a change for Parquet. Thanks for spotting it and your review!

@SparkQA

SparkQA commented Aug 17, 2016

Test build #63904 has finished for PR 14680 at commit 133e5de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski jaceklaskowski changed the title [SPARK-17101][SQL] Provide format identifier for TextFileFormat [SPARK-17101][SQL] Provide consistent format identifiers for TextFileFormat and ParquetFileFormat Aug 17, 2016
@SparkQA

SparkQA commented Aug 17, 2016

Test build #63935 has finished for PR 14680 at commit 52f5ba5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski
Contributor Author

@rxin @HyukjinKwon Mind reviewing it again and letting me know what you think? I know it's minor but would greatly appreciate having it merged at your earliest convenience. Thanks.

@HyukjinKwon
Member

HyukjinKwon commented Aug 19, 2016

@jaceklaskowski It seems the test here is related to this change. It seems it will pass if we change TextFileFormat to TEXT.

BTW, how about changing them to Parquet and Text, maybe? I believe this might be a matter of personal taste, though... I feel like shortName.toUpperCase is not always the correct string representation of each data source.

I mean, if my understanding is correct, the proper name might be Parquet rather than PARQUET, at least. ORC, JSON and CSV seem to be correct names because they are abbreviations, but I feel it is questionable for PARQUET and TEXT.

If the purpose of this change is only to show plan information to humans via explain(...), it might be better to use human-readable, correct names as the string representation.

This is just my personal opinion. I think we need @rxin 's sign off here.

@jaceklaskowski
Contributor Author

How about now, @HyukjinKwon? The more I look at it, the more I think it should be calculated automatically from the class name when the constructor is called. It's of little to no value to a FileFormat developer.
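
Deriving the label automatically from the class name, as suggested above, could be sketched like this (hypothetical helper, not part of Spark):

```scala
// Hypothetical: drop the "FileFormat" suffix from the simple class name,
// so ParquetFileFormat renders as "Parquet" with no per-format override.
trait AutoNamedFormat {
  override def toString: String =
    getClass.getSimpleName.stripSuffix("FileFormat")
}

class ParquetFileFormat extends AutoNamedFormat
class TextFileFormat extends AutoNamedFormat
```

One caveat with this approach: getClass.getSimpleName can behave surprisingly for anonymous or deeply nested subclasses, which may be one reason to prefer explicit labels.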

@HyukjinKwon
Member

HyukjinKwon commented Aug 19, 2016

Thanks for bearing with me. That was just my personal opinion. As you already know, I can't decide what should be added to Spark. BTW, we should fix

```scala
val explainWithoutExtended = q.explainInternal(false)
// `extended = false` only displays the physical plan.
assert("Relation.*text".r.findAllMatchIn(explainWithoutExtended).size === 0)
assert("TextFileFormat".r.findAllMatchIn(explainWithoutExtended).size === 1)
val explainWithExtended = q.explainInternal(true)
// `extended = true` displays 3 logical plans (Parsed/Analyzed/Optimized) and 1 physical
// plan.
assert("Relation.*text".r.findAllMatchIn(explainWithExtended).size === 3)
assert("TextFileFormat".r.findAllMatchIn(explainWithExtended).size === 1)
```

This test is failing.
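
The fix for that test is mechanical: the assertions that grep for the old class-name string need to match the new label instead. A sketch against a hypothetical captured explain string, not the real test harness:

```scala
// Hypothetical explain output after the rename; the real test captures
// this via q.explainInternal(false).
val explainWithoutExtended =
  "*FileScan text [value#0] Batched: false, Format: Text, ReadSchema: struct<value:string>"

// The old class-name string must no longer appear...
assert("TextFileFormat".r.findAllMatchIn(explainWithoutExtended).isEmpty)
// ...and the new label must appear exactly once.
assert("Format: Text".r.findAllMatchIn(explainWithoutExtended).size == 1)
```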

@SparkQA

SparkQA commented Aug 19, 2016

Test build #64105 has finished for PR 14680 at commit e780208.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jaceklaskowski
Contributor Author

Thanks @HyukjinKwon! You're helping me a lot. I'll work on the unit test.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
## What changes were proposed in this pull request?
This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest):

Before:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

After:
```
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
```

Also closes #14680.

## How was this patch tested?
Verified in spark-shell.

Author: Reynold Xin <[email protected]>

Closes #16187 from rxin/SPARK-18760.

(cherry picked from commit 5f894d2)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 5f894d2 Dec 8, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017