Parquet with different schemas fail in Databricks loader (close #1085)
We have an issue when reading data from multiple Parquet files with different schemas (an optional column exists in only some of the files).
It generates the following exception in Databricks:
`com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: [MISSING_COLUMN] Column 'unstruct_event_com_lego_3dcatalogue_like_product_1' does not exist. Did you mean one of the following?`

Recreating the issue in a Databricks notebook and testing different options showed that we had to add FORMAT_OPTIONS with mergeSchema to fix it.
drphrozen authored and istreeter committed Sep 28, 2022
1 parent 1cc45f9 commit 94e3509
Showing 1 changed file with 1 addition and 0 deletions.
@@ -111,6 +111,7 @@ object Databricks {
SELECT $frSelectColumns from '$frPath' $frAuth
)
FILEFORMAT = PARQUET
+ FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
COPY_OPTIONS('MERGESCHEMA' = 'TRUE')""";
case _: Statement.ShreddedCopy =>
throw new IllegalStateException("Databricks Loader does not support migrations")
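
For illustration, the statement this change produces can be sketched as a small standalone builder. This is a minimal sketch, not the loader's actual code: the `CopyIntoSketch` object, the `copyInto` helper, its parameters, and the example table, columns, and path are all hypothetical. It assumes that `FORMAT_OPTIONS` controls how the Parquet files are read (merging the schemas of the individual files) while `COPY_OPTIONS` controls whether the target table's schema may evolve, which is why both carry the `MERGESCHEMA` setting.

```scala
// Hypothetical sketch: names and values are illustrative, not taken from the loader.
object CopyIntoSketch {

  /** Build a COPY INTO statement for Parquet files whose schemas may differ. */
  def copyInto(table: String, columns: Seq[String], path: String): String = {
    val select = columns.mkString(", ")
    s"""COPY INTO $table
       |FROM (
       |  SELECT $select FROM '$path'
       |)
       |FILEFORMAT = PARQUET
       |FORMAT_OPTIONS('MERGESCHEMA' = 'TRUE')
       |COPY_OPTIONS('MERGESCHEMA' = 'TRUE')""".stripMargin
    // FORMAT_OPTIONS: merge the schemas of the source files when reading them.
    // COPY_OPTIONS:   allow the target table schema to evolve with incoming data.
  }

  def main(args: Array[String]): Unit =
    println(copyInto("events", Seq("app_id", "collector_tstamp"), "s3://example-bucket/run=1/"))
}
```

Without the `FORMAT_OPTIONS` line, a column that is present in only some files may be missing from the inferred source schema, which matches the `MISSING_COLUMN` error described in the commit message.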