
[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns #1855

Closed · wants to merge 5 commits

Conversation

@yhuai (Contributor) commented Aug 8, 2014

This PR adds a new conf flag spark.sql.parquet.binaryAsString. When it is true, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927
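To illustrate the flag's intended semantics, here is a small plain-Scala sketch (independent of Spark's actual Parquet record converter; `binaryAsString` stands in for the value of `spark.sql.parquet.binaryAsString`):

```scala
import java.nio.charset.StandardCharsets

object BinaryAsStringDemo {
  // When the flag is on, decode a Parquet BINARY value as a UTF-8 string;
  // otherwise keep the raw byte array.
  def convert(bytes: Array[Byte], binaryAsString: Boolean): Any =
    if (binaryAsString) new String(bytes, StandardCharsets.UTF_8)
    else bytes

  def main(args: Array[String]): Unit = {
    val raw = "hello".getBytes(StandardCharsets.UTF_8)
    assert(convert(raw, binaryAsString = true) == "hello")
    assert(convert(raw, binaryAsString = false).isInstanceOf[Array[Byte]])
    println("ok")
  }
}
```

This mirrors the described behavior: the bytes are unchanged either way; only the reported type differs.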

@SparkQA commented Aug 8, 2014

QA tests have started for PR 1855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18208/consoleFull

/**
 * When set to true, we always treat byte arrays in Parquet files as strings.
 */
private[spark] def isParquetBinaryAsString: Boolean =
  if (getConf(PARQUET_BINARY_AS_STRING, "false") == "true") true else false
Contributor (review comment):

Nit: the if here is redundant.
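To spell the nit out: `if (cond) true else false` can be replaced by the condition itself, or by `String.toBoolean` as the later commit ("Use .toBoolean") does. A standalone sketch, with a hypothetical `getConf` standing in for SQLConf's lookup:

```scala
object ConfParsingSketch {
  // Hypothetical stand-in for SQLConf.getConf(key, default).
  def getConf(settings: Map[String, String], key: String, default: String): String =
    settings.getOrElse(key, default)

  val key = "spark.sql.parquet.binaryAsString"

  // The redundant form from the diff:
  def verbose(settings: Map[String, String]): Boolean =
    if (getConf(settings, key, "false") == "true") true else false

  // Equivalent, as adopted in commit 1765ca4 ("Use .toBoolean"):
  def concise(settings: Map[String, String]): Boolean =
    getConf(settings, key, "false").toBoolean

  def main(args: Array[String]): Unit = {
    val on = Map(key -> "true")
    assert(verbose(on) == concise(on))
    assert(!concise(Map.empty))
    println("ok")
  }
}
```

One minor behavioral difference: `.toBoolean` throws on values other than "true"/"false" (case-insensitively), whereas `== "true"` silently yields false for anything else.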

@SparkQA commented Aug 8, 2014

QA results for PR 1855:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18208/consoleFull

@marmbrus (Contributor) commented:

@yhuai can you maybe fix the if and then remove WIP? Is this ready to go then?

@yhuai yhuai changed the title [WIP][SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns Aug 13, 2014
@yhuai (Contributor, Author) commented Aug 13, 2014

Actually, it needs a unit test. Let me take a look at how to add one.

@SparkQA commented Aug 13, 2014

QA tests have started for PR 1855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18491/consoleFull

@SparkQA commented Aug 14, 2014

QA results for PR 1855:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18491/consoleFull

@yhuai (Contributor, Author) commented Aug 14, 2014

@marmbrus Can you take a look at the unit test? If it is ok, I think this PR is good to go.

@marmbrus (Contributor) commented:

Nice test. Will merge once jenkins is happy.

@marmbrus (Contributor) commented:

Jenkins, test this please.

@SparkQA commented Aug 14, 2014

QA tests have started for PR 1855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18503/consoleFull

@SparkQA commented Aug 14, 2014

QA tests have started for PR 1855. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18504/consoleFull

@SparkQA commented Aug 14, 2014

QA results for PR 1855:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18503/consoleFull

@SparkQA commented Aug 14, 2014

QA results for PR 1855:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18504/consoleFull

@marmbrus (Contributor) commented:

Thanks! I've merged this to master and 1.1.

asfgit pushed a commit that referenced this pull request Aug 14, 2014
[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns

This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927

Author: Yin Huai <[email protected]>

Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:

689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.

(cherry picked from commit add75d4)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in add75d4 Aug 14, 2014
def readSchemaFromFile(
    origPath: Path,
    conf: Option[Configuration],
    isBinaryAsString: Boolean): Seq[Attribute] = {
  val keyValueMetadata: java.util.Map[String, String] =
    readMetaData(origPath, conf)
      .getFileMetaData
Contributor (review comment):

this patch will be great for Impala users like us :) thanks. Moreover, there is a getCreatedBy method on readMetaData(origPath, conf).getFileMetaData, and Impala always writes parquet files with its own createdBy information (it always contains the string "impala"), so maybe we can do some auto-detection like (https://github.com/apache/spark/pull/1599/files):

if (fileMetaData.getCreatedBy.contains("impala")) {
  isBinaryAsString = true
  log.info(s"Impala parquet file found, blabla...")
}

does this auto-detection make sense?

Contributor (reply):

My only concern with auto-detection like this is: what happens when Impala starts adding the correct annotation and supporting byte arrays?

Contributor (reply):

Good question. Such auto-detection would bring confusion, and this is a problem with Impala, not Spark SQL; we are not going to build an Impala file-format corrector :)

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@yhuai yhuai deleted the parquetBinaryAsString branch October 6, 2014 16:24
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
### What changes were proposed in this pull request?
Bump the `spark-call-home-listener` to 0.2.31.

### Why are the changes needed?
This version addresses API differences that are specific to Spark 3.4.0; these differences caused failures.

During testing, it was also identified that the circular dependency between Spark and `spark-call-home` introduced problems. The builds were updated to exclude `spark-call-home-listener` from Spark.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The gradle build was updated to add new Spark 3.4 profiles, and the build-specific profiles were used for all builds; this demonstrated the API issue by causing the build to fail. New PR builds were introduced to perform this testing as part of each PR, supplemented by builds and tests of Spark against the PR-built `spark-call-home-listener` to ensure that Spark was unaffected.

The latest Spark release image was also patched to replace the `spark-call-home-listener` jar, and this image was used to run `spark-perf` benchmarks.

### Was this patch authored or co-authored using generative AI tooling?
No

### Patch
https://github.pie.apple.com/pie-spark/spark-call-home/compare/f2eb2d0b2b...a93f3d08bb

- feat: Apply spark and scala profile (apache#115)
- fix: Handle default Spark for macros and exclude circular dependency (apache#116)
- build: Replace gala-platform-runtime with ubi9-minimal/java11-runtime (apache#118)
- build: Update base image for docker deploy (apache#119)