SQL Endpoint only supports merge incremental strategy [and still doesn't yet] #138

Closed

Conversation

jtcohen6 (Contributor) commented Jan 11, 2021

resolves #133

Disable insert_overwrite for endpoint

In order for the insert_overwrite incremental strategy to work as expected, we set two properties (see the sketch after this list):

  • If the table is partitioned, we set spark.sql.sources.partitionOverwriteMode = DYNAMIC so that Spark dynamically determines which partitions have new data and replaces only those, leaving in place any without new data. If the table is not partitioned, Spark replaces the entire table, equivalent to an atomic truncate + insert.
  • Regardless, we set spark.sql.hive.convertMetastoreParquet = false (docs). I honestly don't remember the exact reasons, but it has been in place since the earliest days of incremental models in dbt-spark (c13b20a).
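
For context, both configs are issued as plain session-level set statements before the insert overwrite runs. A rough sketch of what gets sent (the exact statements come from the incremental materialization macros and may differ slightly):

    -- session-level configs for the insert_overwrite strategy (sketch)
    set spark.sql.sources.partitionOverwriteMode = DYNAMIC;
    set spark.sql.hive.convertMetastoreParquet = false;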

Several weeks ago, we found that the new SQL Endpoints started returning errors when dbt tried to run set statements. After discussing with our contacts at Databricks, we learned that this support was never intended:

  • The SQL analytics [endpoints] especially do not support setting the spark config properties (they do have a very minimal set of whitelisted properties that one can set).

This PR therefore:

  • Conditions those set statements to run IFF the incremental strategy is insert_overwrite
  • Raises a compilation error if the user tries to run an incremental model with insert_overwrite strategy on the SQL Endpoint:
 Compilation Error in model incremental (models/incremental.sql)
   Invalid incremental strategy provided: insert_overwrite
       You cannot use this strategy when connecting to a SQL Endpoint
       Use `merge` with a `unique_key` and file_format = `delta` instead

   > in macro dbt_spark_validate_get_incremental_strategy (macros/materializations/incremental.sql)
   > called by macro materialization_incremental_spark (macros/materializations/incremental.sql)
   > called by model incremental (models/incremental.sql)
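
For reference, here is a hypothetical model config that follows the error's advice (model, column, and ref names are made up for illustration):

    -- models/incremental.sql (illustrative only)
    {{
        config(
            materialized = 'incremental',
            incremental_strategy = 'merge',
            unique_key = 'id',
            file_format = 'delta'
        )
    }}

    select * from {{ ref('upstream_events') }}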

This may feel a bit silly, given that insert_overwrite is the default strategy and merge requires two additional configs. Should we change the defaults depending on the connection type? I.e., default to incremental_strategy: merge and file_format: delta (instead of parquet) when the user connects to Databricks over ODBC.

Whereas merge should work... soon

There's one other issue that currently prevents incremental models, even merge-strategy ones, from running on the SQL Analytics endpoint: create temp view is not yet supported. Crucially, Databricks intends to support it eventually:

We are still working on getting the temporary view support for SQL analytics and will provide an update shortly.
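
To illustrate why that matters: the merge strategy stages the new records in a temporary view and then merges them into the target table. A rough sketch (table, column, and alias names here are hypothetical; the SQL dbt actually generates differs):

    -- stage the new rows; this "create temp view" step is what the endpoint rejects today
    create temporary view my_model__dbt_tmp as
        select * from new_events;

    -- merge the staged rows into the existing table on the configured unique_key
    merge into analytics.my_model as tgt
    using my_model__dbt_tmp as src
        on tgt.id = src.id
    when matched then update set *
    when not matched then insert *;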

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@jtcohen6 jtcohen6 requested a review from kwigley January 11, 2021 11:26
@cla-bot cla-bot bot added the cla:yes label Jan 11, 2021
jtcohen6 (Contributor, Author) commented:

I'm investigating the possibility (suggested to me by someone at Databricks) that these set statements are never needed when running on Delta, even if the command is insert overwrite.

jtcohen6 (Contributor, Author) commented Jan 11, 2021

Dynamic partition replacement is actually... not supported for Delta tables! Or at least not yet (delta-io/delta#348, delta-io/delta#371).

This isn't something I was previously aware of:

Table dbt_jcohen.my_model does not support dynamic overwrite in batch mode.;;

If we run insert overwrite into a partitioned Delta table without dynamic partition overwrite, however, we just... replace the entire contents of the table. That is a feature (#117), but less of one now that we support create or replace table for Delta in the table materialization (#125).
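
Concretely, without dynamic partition overwrite (which Delta doesn't support), a statement like the one below replaces the whole table's contents rather than just the partitions present in the new data (names are illustrative):

    insert overwrite table dbt_jcohen.my_model
    select * from my_model__dbt_tmp;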

I'd like to rationalize what's supported for which file formats + incremental strategies. Going to close this PR for the time being while I figure it out.

@jtcohen6 jtcohen6 closed this Jan 11, 2021
@mikealfare mikealfare deleted the disable/endpoint-merge-only branch March 1, 2023 00:49