[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns #1855
Conversation
QA tests have started for PR 1855. This patch merges cleanly.
/**
 * When set to true, we always treat byte arrays in Parquet files as strings.
 */
private[spark] def isParquetBinaryAsString: Boolean =
  if (getConf(PARQUET_BINARY_AS_STRING, "false") == "true") true else false
Nit: the `if` here is redundant.
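For reference, a minimal sketch of the simplification the nit points at, matching the "Use .toBoolean" commit in the log below (assumes the surrounding SQLConf definitions):

```scala
// The string comparison already yields a Boolean, so the `if` is unnecessary:
private[spark] def isParquetBinaryAsString: Boolean =
  getConf(PARQUET_BINARY_AS_STRING, "false") == "true"

// Equivalently, as the follow-up commit "Use .toBoolean" does:
// private[spark] def isParquetBinaryAsString: Boolean =
//   getConf(PARQUET_BINARY_AS_STRING, "false").toBoolean
```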
QA results for PR 1855:
@yhuai can you maybe fix the `if` and then remove WIP? Is this ready to go then?
Actually, it needs a unit test. Let me take a look at how to add one.
QA tests have started for PR 1855. This patch merges cleanly.
QA results for PR 1855:
@marmbrus Can you take a look at the unit test? If it is ok, I think this PR is good to go.
Nice test. Will merge once jenkins is happy.
Jenkins, test this please.
QA tests have started for PR 1855. This patch merges cleanly.
QA results for PR 1855:
Thanks! I've merged this to master and 1.1.
[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns

This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927

Author: Yin Huai <[email protected]>

Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:

689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.

(cherry picked from commit add75d4)
Signed-off-by: Michael Armbrust <[email protected]>
def readSchemaFromFile(
    origPath: Path,
    conf: Option[Configuration],
    isBinaryAsString: Boolean): Seq[Attribute] = {
  val keyValueMetadata: java.util.Map[String, String] =
    readMetaData(origPath, conf)
      .getFileMetaData
      .getKeyValueMetaData
  // ... (rest of the method elided in this review excerpt)
this patch will be great for impala users like us :) thanks! Moreover, there is a `getCreatedBy` method on `readMetaData(origPath, conf).getFileMetaData`, and Impala always creates parquet files with its own CreatedBy information (it always contains the string "impala"), so maybe we can do some auto-detection like (https://github.com/apache/spark/pull/1599/files):

if (fileMetaData.getCreatedBy.contains("impala")) {
  isBinaryAsString = true
  log.info(s"Impala parquet file found, blabla...")
}

does this auto-detection make sense?
My only concern with auto-detection like this is: what happens when Impala starts adding the correct annotation and supporting byte arrays?
Good question. Such auto-detection brings confusion; this is a problem of Impala, not Spark SQL. We are not going to make an Impala file format corrector :)
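For reference only, since the thread above concludes against shipping auto-detection, a minimal sketch of what the check could look like without mutating the method parameter; `fileMetaData` is assumed to come from `readMetaData(origPath, conf).getFileMetaData` as in the excerpt above:

```scala
// Hypothetical sketch: derive an effective flag rather than reassigning the parameter.
// getCreatedBy is parquet's FileMetaData writer tag; matching "impala" is the
// heuristic proposed (and rejected) in this thread.
val createdBy = Option(fileMetaData.getCreatedBy).getOrElse("")
val effectiveBinaryAsString = isBinaryAsString || createdBy.toLowerCase.contains("impala")
```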
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927
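A minimal usage sketch against the Spark 1.1-era API (the input path is hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "binaryAsStringDemo")
val sqlContext = new SQLContext(sc)

// Treat un-annotated binary columns as strings, e.g. for Impala-written Parquet files.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Binary columns without UTF8 annotation now surface as StringType in the schema.
val rows = sqlContext.parquetFile("/path/to/impala_generated.parquet")
rows.printSchema()
```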