[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns #1855
Conversation
QA tests have started for PR 1855. This patch merges cleanly.
/**
 * When set to true, we always treat byte arrays in Parquet files as strings.
 */
private[spark] def isParquetBinaryAsString: Boolean =
  if (getConf(PARQUET_BINARY_AS_STRING, "false") == "true") true else false
Nit: the `if` here is redundant.
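For reference, a minimal sketch of the simplification the nit points at, matching the "Use .toBoolean" commit in the log below (assumes the surrounding SQLConf definitions):

```scala
// The string comparison already yields a Boolean, so the `if` is unnecessary:
private[spark] def isParquetBinaryAsString: Boolean =
  getConf(PARQUET_BINARY_AS_STRING, "false") == "true"

// Equivalently, as the follow-up commit "Use .toBoolean" does:
// private[spark] def isParquetBinaryAsString: Boolean =
//   getConf(PARQUET_BINARY_AS_STRING, "false").toBoolean
```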
QA results for PR 1855:
@yhuai can you maybe fix the `if` and then remove WIP? Is this ready to go then?
Actually, it needs a unit test. Let me take a look at how to add one.
QA tests have started for PR 1855. This patch merges cleanly.
QA results for PR 1855:
@marmbrus Can you take a look at the unit test? If it is ok, I think this PR is good to go.
Nice test. Will merge once jenkins is happy.
Jenkins, test this please.
QA tests have started for PR 1855. This patch merges cleanly.
QA results for PR 1855:
Thanks! I've merged this to master and 1.1.
[SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns

This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927

Author: Yin Huai <[email protected]>

Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:

689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.

(cherry picked from commit add75d4)
Signed-off-by: Michael Armbrust <[email protected]>
def readSchemaFromFile(
    origPath: Path,
    conf: Option[Configuration],
    isBinaryAsString: Boolean): Seq[Attribute] = {
  val keyValueMetadata: java.util.Map[String, String] =
    readMetaData(origPath, conf)
      .getFileMetaData
      .getKeyValueMetaData
  // ... (rest of the method elided in this review excerpt)
this patch will be great for impala users like us :) thanks! Moreover, there is a `getCreatedBy` method on `readMetaData(origPath, conf).getFileMetaData`, and Impala always creates parquet files with its own CreatedBy information (it always contains the string "impala"), so maybe we can do some auto-detection like (https://github.com/apache/spark/pull/1599/files):

if (fileMetaData.getCreatedBy.contains("impala")) {
  isBinaryAsString = true
  log.info(s"Impala parquet file found, blabla...")
}

does this auto-detection make sense?
My only concern with auto-detection like this is: what happens when Impala starts adding the correct annotation and supporting byte arrays?
Good question. Such auto-detection brings confusion; this is a problem of Impala, not Spark SQL. We are not going to make an Impala file format corrector :)
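For reference only, since the thread above concludes against shipping auto-detection, a minimal sketch of what the check could look like without mutating the method parameter; `fileMetaData` is assumed to come from `readMetaData(origPath, conf).getFileMetaData` as in the excerpt above:

```scala
// Hypothetical sketch: derive an effective flag rather than reassigning the parameter.
// getCreatedBy is parquet's FileMetaData writer tag; matching "impala" is the
// heuristic proposed (and rejected) in this thread.
val createdBy = Option(fileMetaData.getCreatedBy).getOrElse("")
val effectiveBinaryAsString = isBinaryAsString || createdBy.toLowerCase.contains("impala")
```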
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.

JIRA: https://issues.apache.org/jira/browse/SPARK-2927
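A minimal usage sketch against the Spark 1.1-era API (the input path is hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "binaryAsStringDemo")
val sqlContext = new SQLContext(sc)

// Treat un-annotated binary columns as strings, e.g. for Impala-written Parquet files.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Binary columns without UTF8 annotation now surface as StringType in the schema.
val rows = sqlContext.parquetFile("/path/to/impala_generated.parquet")
rows.printSchema()
```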