
Rationalize incremental materialization #141

Merged (16 commits) on Feb 19, 2021

Conversation

@jtcohen6 (Contributor) commented Jan 12, 2021

resolves #140
resolves #133

Description

This PR implements, with alterations, several of the recommendations from #140 (much fuller description there) to rationalize the 'incremental' materialization on dbt-spark.

In particular:

  • New default strategy, append, which provides better consistency with other adapters' default behavior. This strategy works across all file formats, connection types, and Spark platforms. (See the config sketch after this list.)
  • Raise clearer errors to enforce which strategies are actually supported, and under which file formats and connection methods.
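
As a minimal sketch of the new default (model and source names here are hypothetical, not taken from this PR):

{{ config(materialized = 'incremental') }}

-- With no incremental_strategy configured, the new default is 'append':
-- each incremental run simply inserts new rows into the existing table,
-- regardless of file format, connection type, or platform.
select * from {{ source('raw', 'events') }}  -- hypothetical source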

Notes

  • Fix failing test_dbt_incremental: I think this PR represents a genuine change in behavior! It brings dbt-spark in line with the default adapter tests, which can only be a step in the right direction :)

  • I'm not sure how to implement recommendation (5), which proposed setting file_format: delta + incremental_strategy: merge as defaults if method == 'odbc' (replacing parquet + insert_overwrite, respectively). A bit more on this below:

(I'm leaving these notes below, even though they're not material to the changes in this PR.)

We could add this logic just within the Jinja macros (e.g. within file_format_clause, dbt_spark_validate_get_file_format, dbt_spark_validate_get_incremental_strategy), but then it wouldn't be consistent with the values in manifest.json, since the file format default is set in Python:

https://github.com/fishtown-analytics/dbt-spark/blob/2f7c2dcb419b66d463bb83212287bd29280d6964/dbt/adapters/spark/impl.py#L32-L33

@kwigley Any chance you could advise on whether it would be possible to set that config default dynamically, based on the connection method? Or if that's a terrible idea? :)

Even if we can't get that last piece working, I still think we should move forward with recommendations (4) and (5), and advise Databricks users to simply set this in dbt_project.yml:

models:
  +file_format: delta
  +incremental_strategy: merge

This is what we advise today, anyway.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot added the cla:yes label on Jan 12, 2021
@jtcohen6 force-pushed the feature/rationalize-incremental branch from 50092a0 to 6f7e1f2 on January 12, 2021 15:01
@kwigley commented Jan 13, 2021

> Any chance you could advise on whether it would be possible to set that config default dynamically

Unfortunately, I don't think it is possible with the current impl. The adapter config does not know anything about the connection when it is created :(

@jtcohen6 (Contributor, Author):

Thanks for looking into it @kwigley!

I suppose this would be a compelling thing we could do if we split off the Databricks pieces of dbt-spark into a new adapter plugin, dbt-databricks, that inherited from the current plugin, added certain functionality, and overrode certain defaults.

@Fokko (Contributor) left a comment


Thanks for unraveling this @jtcohen6. Looking good. I think the most important thing is that we make sure that we test the edge cases. Would love to dive into this a bit deeper, but I'm currently a bit swamped with work :)

{% endif %}

{% call statement() %}
set spark.sql.hive.convertMetastoreParquet = false
@Fokko (Contributor) commented:

Good to see this gone :)

- {%- else -%}
- {#-- merge all columns with databricks delta - schema changes are handled for us #}
+ {%- elif strategy == 'merge' -%}
+ {#-- merge all columns with databricks delta - schema changes are handled for us #}
  {{ get_merge_sql(target, source, unique_key, dest_columns=none, predicates=none) }}
  {%- endif -%}
@Fokko (Contributor) commented:

Maybe raise an error if it doesn't match any of the cases?

@jtcohen6 (Contributor, Author) replied:

Good call. It's already raised earlier on:
https://github.com/fishtown-analytics/dbt-spark/blob/c8e3770e077e8c54026156b14e61133ef59fa7ff/dbt/include/spark/macros/materializations/incremental.sql#L65-L66

Just the same, I'll add another explicit exception here, in case users override some macros but not others. (A sketch of that fallback follows.)
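
A minimal sketch of what that extra guard could look like (the exact error wording is an assumption, not taken from this PR):

{%- else -%}
  {#-- Defensive fallback: users may override one macro but not the other --#}
  {% do exceptions.raise_compiler_error('Invalid incremental strategy provided: ' ~ strategy) %}
{%- endif -%}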

{% if unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% else %}
on false
@Fokko (Contributor) commented:

This feels awkward; why not just use an INSERT INTO statement?

@jtcohen6 (Contributor, Author) replied:

This is in line with dbt's default behavior here.

As I see it, the only reason we implement spark__get_merge_sql at all is to benefit from the improved wildcard update * / insert * syntax.

I agree it feels a bit silly: without a unique_key, this is identical to the append strategy's INSERT INTO statement (see the sketch below). The alternative is to keep requiring a unique_key with merge. In any case, I'd rather err closer to the side of default behavior.
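
For illustration, the merge statement generated without a unique_key looks roughly like this (relation names are hypothetical):

merge into analytics.my_model as DBT_INTERNAL_DEST
using my_model__dbt_tmp as DBT_INTERNAL_SOURCE
-- with no unique_key, the join condition is literally false, so no row ever matches ...
on false
-- ... and every source row falls through to the wildcard insert,
-- which is equivalent to a plain INSERT INTO
when not matched then insert *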

@jtcohen6 (Contributor, Author):

Thanks for the review @Fokko! I just reorganized these macro files a bit as well.

> I think the most important thing is that we make sure that we test the edge cases.

I completely agree. I have a "suite" of "test cases" in a local project that I used to work through all the edge cases:

Each case below was run against three targets (Apache Spark, Databricks Cluster, Databricks Endpoint) and marked with one of the keys listed afterward:

  • bad_file_format
  • bad_strategy
  • bad_merge_not_delta
  • bad_insert_overwrite_delta
  • default_append
  • insert_overwrite_no_partitions
  • insert_overwrite_partitions
  • delta_append
  • delta_merge_no_key
  • delta_merge_unique_key

Keys:

  • ✅ Works
  • ❌ Raises explicit, clear compilation error
  • ❔ dbt is unopinionated: no explicit error; it may work or may return a database error (e.g. using delta on Apache Spark, or forthcoming support for create temp view via an endpoint)

I'm struggling a bit with how to implement this in a more automated fashion:

  • It's slightly more involved than a unit test. I tried extending unit/test_macros.py without success; these macros are more interlocking and require dbt-specific Jinja features like return.
  • It's slightly more granular than our suite of adapter integration tests. Ideally we'd have a way to iterate through a sequence of cases, changing one config each time, and catch exceptions where we hope for them.

FWIW, we don't have automated tests for incremental strategies today either, so I think this still constitutes a step in the right direction.

@jtcohen6 marked this pull request as ready for review on January 14, 2021 12:58
@jtcohen6 (Contributor, Author):

As discussed with @kwigley yesterday, the ideal type of test to write for this kind of functionality is a "true" integration test, like the ones we have in Core (dbt/test/integration).

To that end, it would be great to include and expose base.py as an available module in pytest-dbt-adapter. It's relatively stable boilerplate code that provides the foundation for writing integration or functional tests on any adapter. Breaking it out into its own module is also a prerequisite to someday achieving true modularity for the dbt-core/-postgres/-redshift/-snowflake/-bigquery tests.

@jtcohen6 (Contributor, Author) commented Feb 3, 2021

@kwigley I gave a go at writing custom integration tests for the changes in this PR, using the work in dbt-labs/dbt-adapter-tests#13. I'd love to get your help here—perhaps a bit of pair-programming? :)

Once we manage to get these tests running, I'll feel really good about merging and releasing the additions to the adapter testing suite, even if it's missing some of the bells and whistles. (E.g. I've decided that having use_profile leverage the .dbtspec target, while a very cool idea, isn't strictly necessary for this first version; it's the same environment variables at the end of the day.)

@kwigley force-pushed the feature/rationalize-incremental branch 3 times, most recently from 8e3bd9d to b4e05a0 on February 17, 2021 16:35
@kwigley force-pushed the feature/rationalize-incremental branch from b4e05a0 to 9bbc61b on February 17, 2021 16:47
@kwigley commented Feb 18, 2021

@jtcohen6 can you take a peek at why test_insert_overwrite_apache_spark is failing?

@jtcohen6 (Contributor, Author):

Ugh. Here's where we're at: ODBC connections to Databricks clusters aren't respecting the set spark.sql.sources.partitionOverwriteMode = DYNAMIC parameter. I have no idea why. Does pyodbc generate a new connection for each executed SQL statement, with new default parameters each time?

(Those set statements aren't supported at all by the new SQL Analytics endpoints. This issue is exclusive to method: odbc + cluster.)

One thing we could do: tell folks to set spark.sql.sources.partitionOverwriteMode = DYNAMIC within their Databricks clusters (docs). I'll try this in a few minutes. If it works, it may be a better idea anyway, rather than having the incremental materialization try to set this parameter on every run. (A sketch of the flow this setting enables follows.)
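
For context, a sketch of the dynamic insert-overwrite flow the materialization relies on (relation names are hypothetical):

-- Without this session (or cluster-level) setting, INSERT OVERWRITE on a
-- partitioned table replaces the entire table, not just the partitions
-- present in the incoming data
set spark.sql.sources.partitionOverwriteMode = DYNAMIC;

-- With the setting active, only the partitions selected below are overwritten
insert overwrite table analytics.my_model
select * from my_model__dbt_tmp;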

self.assertTablesEqual(
"insert_overwrite_no_partitions", "expected_overwrite")
self.assertTablesEqual(
"insert_overwrite_partitions", "expected_upsert")
@kwigley commented:

🤦

@jtcohen6 (Contributor, Author) replied:

You're good!!! For a very dumb reason, this worked in Databricks when it really shouldn't have. I fixed via:
[screenshot of the fix, Feb 19, 2021]

jtcohen6 added a commit to dbt-labs/docs.getdbt.com that referenced this pull request Feb 19, 2021
@kwigley left a comment

Thanks for working through the tests with me! I'm happy to ship this, with the intention of spending more time on adapter integration tests.
