Rationalize file formats + incremental strategies #140

Closed
jtcohen6 opened this issue Jan 12, 2021 · 1 comment · Fixed by #141
Labels: enhancement (New feature or request)

Comments


jtcohen6 commented Jan 12, 2021

Current behavior

  • All incremental strategies run two set statements:
    • set spark.sql.sources.partitionOverwriteMode = DYNAMIC
    • set spark.sql.hive.convertMetastoreParquet = false
  • incremental_strategy: merge requires file_format: delta and unique_key
  • incremental_strategy: insert_overwrite does not work with file_format: delta + partition_by, because Delta does not support dynamic partition overwrite (delta-io/delta#348, delta-io/delta#371)
  • incremental_strategy: insert_overwrite without partition_by just atomically replaces the entire table. This possibility was introduced by #117 (replace partitionOverwriteMode inside merge strategy). A rough sketch of what the strategy compiles to today follows this list.
    • However, atomic replacement with file_format: delta is now possible in the table materialization via create or replace table (#125, Enable create or replace sql syntax)
    • It doesn't make conceptual sense for the incremental materialization to replace an entire table when the --full-refresh flag is not passed. If anything, an incremental model without partition_by or unique_key should instead be append-only (insert into), which is closer to how the materialization works on other databases.
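For context, here is a minimal sketch of roughly what the insert_overwrite strategy compiles to today for a partitioned parquet model. The relation, column, and temp-view names are placeholders, and the exact compiled SQL may differ:

```sql
-- Session-level settings currently emitted for every incremental strategy
set spark.sql.sources.partitionOverwriteMode = DYNAMIC;
set spark.sql.hive.convertMetastoreParquet = false;

-- Dynamic partition overwrite: only partitions present in the new data are replaced.
-- Assumes date_day is the table's partition column and the last column of the select.
insert overwrite table analytics.my_incremental_model
partition (date_day)
select * from my_incremental_model__dbt_tmp;
```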

Questions

  1. Do we still need these set statements?
    • Yes, only for incremental_strategy: insert_overwrite + partition_by
    • I don't know whether we still need set spark.sql.hive.convertMetastoreParquet = false, or whether we ever did; it's not clear to me what it was doing.
    • Given that the SQL Endpoint does not support these set statements (#133: Databricks SQL Analytics endpoint does not support set statements), should we let this fail with an unhelpful error from Databricks, or raise our own compilation error?
  2. Should incremental_strategy: insert_overwrite be supported at all for Delta tables?
    • If no: Should dbt raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta? Or should we defer to Delta to raise an error (Table ... does not support dynamic overwrite in batch mode.;;), until such time as they add support for dynamic partition replacement?
  3. Does incremental_strategy: insert_overwrite ever make sense without partition_by?
    • If no: We should raise a compilation error, as we did before #117 (replace partitionOverwriteMode inside merge strategy)
    • If yes: Should it be a full table replacement? Or should we create an "append-only" version that just runs insert (instead of insert overwrite) when no partitions are specified? (Both options are sketched after this list.)
  4. Should we add an "append-only" version of the merge strategy that runs if a unique_key is not specified, rather than raising a compilation error?
  5. Default values: Across the board, dbt-spark has incremental_strategy: insert_overwrite + file_format: parquet as its defaults. Should we change those defaults to incremental_strategy: merge + file_format: delta if a user is connecting via target.method = 'odbc' (i.e. to Databricks)?
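To make question 3 concrete, here is a hedged sketch of the two options in plain Spark SQL, with placeholder relation names:

```sql
-- Option A, append-only: new rows are added and existing data is untouched
insert into table analytics.my_incremental_model
select * from my_incremental_model__dbt_tmp;

-- Option B, full replacement: with no partition spec and no dynamic overwrite,
-- insert overwrite rewrites the entire table
insert overwrite table analytics.my_incremental_model
select * from my_incremental_model__dbt_tmp;
```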

My current thinking

  1. Yes, but we should run the set statement only when incremental_strategy: insert_overwrite + partition_by. We should raise a compilation error if incremental_strategy: insert_overwrite + target.endpoint.
  2. No, we should raise a compilation error if incremental_strategy: insert_overwrite + file_format: delta.
  3. Initially: yes, the insert_overwrite strategy should be append-only (a simple insert) if no partition_by is specified. Updated: the insert_overwrite strategy should continue replacing the entire table, since this is the standard behavior of insert overwrite on Spark; instead, a new append strategy will perform append-only inserts, partitions or no partitions, and it will be the default.
  4. Yes, the merge strategy should work without unique_key, changing its merge condition to on false: merge into [model] using [temp view] on false when not matched insert * (see the sketch after this list).
  5. Initially: yes, I think that would make a lot of sense, though we'd need to do a good job of documenting it. Updated: this isn't possible in a straightforward way today; instead, we'll change the default strategy to append.
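As a sketch of point 4, here is roughly what the merge strategy could emit with and without a unique_key; table, view, and key names are placeholders, and the on false variant is the append-only behavior described above:

```sql
-- With unique_key: update rows that match, insert the rest (Delta file format)
merge into analytics.my_incremental_model as target
using my_incremental_model__dbt_tmp as source
on target.id = source.id
when matched then update set *
when not matched then insert *;

-- Without unique_key: "on false" never matches, so every source row is inserted
merge into analytics.my_incremental_model as target
using my_incremental_model__dbt_tmp as source
on false
when not matched then insert *;
```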
jtcohen6 added the enhancement label on Jan 12, 2021
jtcohen6 commented:

@franloza made some great points in a related Slack thread:

  • The default behavior of insert overwrite on Spark is to replace, not append. It's confusing to have a strategy called insert_overwrite that actually appends via insert into.
  • However, the default behavior of dbt incremental models (if no unique_key/partition_by specified) should be to append only, rather than replace/overwrite.

Therefore, I think we should create a new incremental strategy that only appends, by running insert into. We could call it something like append, append_only, insert, or insert_into. This strategy should be the default, since it works on all file formats, platforms, and connection methods, and it is consistent with standard dbt behavior. Users are then encouraged to switch to one of two main strategies (example configs are sketched after the list below):

  1. file_format: parquet + incremental_strategy: insert_overwrite + partition_by. Not supported on the Databricks SQL Endpoint, because it requires set statements. If no partition_by is specified, it replaces the entire table (i.e. it still runs insert overwrite).
  2. file_format: delta + incremental_strategy: merge + unique_key. Only supported on the Databricks runtime. (Not yet working on the Databricks SQL Endpoint, because it requires create temp view, which is coming soon.) If no unique_key is specified, it appends only via merge, which should be functionally equivalent to the default append strategy.
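For illustration only, hypothetical model configs for the proposed default and the two main strategies might look like the following. The append strategy name is the proposal above, not something dbt-spark supports today, and the partition/key column names are placeholders:

```sql
-- 1. Proposed default: append-only, works on all file formats and connection methods
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

-- 2. Parquet + dynamic partition overwrite (requires set statements, so not on the SQL Endpoint)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    file_format='parquet',
    partition_by='date_day'
) }}

-- 3. Delta + merge on a unique key (Databricks runtime only)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='id'
) }}
```

In practice each config block would sit at the top of its own model file.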
