
Truncate relation names when appending a suffix #4921

Merged
merged 14 commits into main from dev/epapineau on May 19, 2022

Conversation

@epapineau (Contributor) commented Mar 22, 2022

Truncate relation names when appending a suffix would result in a name longer than 63 characters, using the make_temp_relation and make_backup_relation macros

resolves #2869

Description

Suffixes are appended to temp and backup relation names; the resulting names may exceed Postgres's 63-character identifier limit, which currently raises a compiler error. This PR leverages the existing make_temp_relation macro and adds make_backup_relation and make_intermediate_relation macros to truncate the base relation name whenever the generated relation name would exceed that limit.
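For illustration, here is a minimal sketch of the truncation idea as a dbt/Jinja macro. The macro name is hypothetical, the 63-character Postgres limit is hardcoded, and the suffix is assumed to already be in its final form; the macros actually added in this PR may differ in naming and details.

  {% macro make_relation_with_suffix_sketch(base_relation, suffix) %}
      {#- Postgres truncates identifiers longer than 63 characters, so budget for the suffix -#}
      {% set max_length = 63 %}
      {% set suffix_length = suffix | length %}
      {#- Trim the base identifier so identifier + suffix fits within the limit -#}
      {% set identifier = base_relation.identifier[: max_length - suffix_length] ~ suffix %}
      {#- Return a copy of the relation that points at the truncated name -#}
      {{ return(base_relation.incorporate(path={"identifier": identifier})) }}
  {% endmacro %}

For example, with the 9-character suffix '__dbt_tmp', a 60-character model name would be trimmed to 54 characters before the suffix is appended, keeping the full identifier at 63.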

Checklist

  • I have signed the CLA
  • I have added information about my change to be included in the CHANGELOG.
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR

@cla-bot (bot) commented Mar 22, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Elize Papineau.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@github-actions (bot)

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@jtcohen6 added the Team:Adapters label (Issues designated for the adapter area of the code) on Mar 22, 2022
@epapineau (Contributor, Author)

Hi, I'm running into an issue pushing further commits to this branch:

remote: Resolving deltas: 100% (11/11), completed with 10 local objects.
remote: error: GH006: Protected branch update failed for refs/heads/dev/epapineau.
remote: error: You're not authorized to push to this branch. Visit https://docs.github.com/articles/about-protected-branches/ for more information.
To github.com:dbt-labs/dbt-core.git
 ! [remote rejected]     dev/epapineau -> dev/epapineau (protected branch hook declined)
error: failed to push some refs to 'github.com:dbt-labs/dbt-core.git'

Did I configure something incorrectly?

@jtcohen6 (Contributor) left a comment

(putting on my "Team: Adapters" hat)

@epapineau Thank you so much for the contribution! I've left a few comments on the organization of the code, and on the failing (no longer needed) test case.

GitHub logistics: I've added you to the CLA list. We do have branch protection rules set up for branches that start with dev/, because the base branch used to be called dev/<version>. Now it's just called main, and we can probably update that rule! I think you're not the last person who will instinctively create a branch prefixed dev.

@@ -155,6 +155,22 @@
})) -%}
{% endmacro %}

{% macro postgres__make_backup_relation(base_relation, suffix) %}
Contributor:

This is duplicated logic from postgres__make_temp_relation, right? What do you think about consolidating this logic in the following way (rough sketch after the list):

  • In the global project, make_temp_relation and make_backup_relation actually call the same third macro, make_relation_with_suffix(base_relation, suffix). They pass in '__dbt_tmp' and '__dbt_backup' as their suffixes, respectively
  • The make_relation_with_suffix macro is dispatched, so that each adapter plugin can implement it in its own way. (The other two macros are currently dispatched, and could still be, but I don't think it's strictly necessary)
  • Within the Postgres plugin, we rename postgres__make_temp_relation to postgres__make_relation_with_suffix, and it continues to work exactly the same.
  • (To avoid the remotest possibility of a breaking change, we could keep around a macro named postgres__make_temp_relation and just have it be a "redirect" to postgres__make_relation_with_suffix. Just in case someone was in the habit of calling it directly. Given that this is totally internal and undocumented, I think it's okay without.)
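A rough sketch of the consolidation proposed above, assuming dbt's adapter.dispatch pattern; names and signatures here are illustrative, and extra arguments the real macros take (such as a relation type) are omitted:

  {% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
      {#- Both entry points delegate to the same dispatched macro, passing their own suffix -#}
      {{ return(adapter.dispatch('make_relation_with_suffix', 'dbt')(base_relation, suffix)) }}
  {% endmacro %}

  {% macro make_backup_relation(base_relation, suffix='__dbt_backup') %}
      {{ return(adapter.dispatch('make_relation_with_suffix', 'dbt')(base_relation, suffix)) }}
  {% endmacro %}

  {#- Adapters override this by defining <adapter>__make_relation_with_suffix -#}
  {% macro default__make_relation_with_suffix(base_relation, suffix) %}
      {{ return(base_relation.incorporate(path={"identifier": base_relation.identifier ~ suffix})) }}
  {% endmacro %}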

Contributor Author:

Using your feedback as a jumping off point, this is what came together:

  • make_temp_relation is used in snapshots and incremental materializations, which require the returned relation to have no schema or database defined (specifically on Postgres, to avoid "cannot create temporary relation in non-temporary schema"). To accomplish your other suggestions in the view and table materializations, I created a make_intermediate_relation macro that returns a relation with its schema and database path keys intact.
  • postgres__make_temp_relation, postgres__make_intermediate_relation, and postgres__make_backup_relation all call a postgres__make_relation_with_suffix macro to avoid repeated code, and then return the relation for their respective uses (see the sketch after this list).
  • Added bonus: postgres__make_temp_relation is not removed, which alleviates your breaking-change consideration.
  • In the global project, default__make_intermediate_relation calls default__make_temp_relation since the non-temporary schema consideration does not apply on other data warehouses (as far as I understand at least - please let me know if this is incorrect).
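A hypothetical sketch of the Postgres-side wrappers described in the list above (error handling and the backup relation type argument are left out; this is not necessarily the exact code in the PR):

  {% macro postgres__make_temp_relation(base_relation, suffix) %}
      {% set relation = postgres__make_relation_with_suffix(base_relation, suffix) %}
      {#- Drop schema and database so Postgres can create a true temporary relation -#}
      {{ return(relation.incorporate(path={"schema": none, "database": none})) }}
  {% endmacro %}

  {% macro postgres__make_intermediate_relation(base_relation, suffix) %}
      {#- Keep schema and database intact: this is a permanent intermediate object -#}
      {{ return(postgres__make_relation_with_suffix(base_relation, suffix)) }}
  {% endmacro %}

  {% macro postgres__make_backup_relation(base_relation, suffix) %}
      {{ return(postgres__make_relation_with_suffix(base_relation, suffix)) }}
  {% endmacro %}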

Finally, I have a question that may affect the utility of postgres__make_relation_with_suffix, but it's more appropriate for the testing thread.

@@ -56,23 +55,23 @@ def test_postgres_full_refresh(self):
@use_profile('postgres')
def test_postgres_delete__dbt_tmp_relation(self):
# This creates a __dbt_tmp view - make sure it doesn't interfere with the dbt run
self.run_sql_file("create_view__dbt_tmp.sql")
#self.run_sql_file("create_view__dbt_tmp.sql")
Contributor:

Ok, I think I get what's going on here:

  • During its materializations, dbt creates these temp + backup objects
  • This test ensures that dbt first drops any temp/backup objects by the same name, if they already exist
  • Previously, those temp + backup objects had predictable names/suffixes (view__dbt_tmp, view__dbt_backup)
  • Thanks to the changes in this PR, those temp + backup objects have nondeterministic names like view__dbt_tmp095351072872 and view__dbt_backup095351077674

That's better! It's much, much less likely for these objects to accidentally collide with a preexisting object (almost impossible). That makes this test much trickier to define; at the same time, it makes the test significantly less important. I think we can delete this test case (test_postgres_delete__dbt_tmp_relation) and explain why.

Contributor:

That was my thinking precisely when Elize and I talked. Glad you concur :)

Contributor Author:

Triple-checking that it's appropriate to have backup objects with nondeterministic names, as proposed by this PR. I had originally removed that behavior in the second commit based on this comment and the failing test, but after the suffix macro recommendations I added it back. Happy to go in either direction; just let me know what y'all think is best, @jtcohen6 @VersusFacit, and I'll either remove test_postgres_delete__dbt_tmp_relation or differentiate the behavior between backup and temp. Once that decision is made, I think the PR is ready for review 🥳

Contributor:

@epapineau Appreciate the triple check! After a bit more thought, I think it might be very slightly preferable to keep relation names deterministic if they're creating permanent objects (views + non-temp tables). That way, we avoid cluttering the schema if the materialization fails during the alter-rename-swap-drop step. Deterministically named temp/backup objects get dropped the next time dbt runs that model/etc; actual temp tables get dropped automatically at the end of the session.

That said, failure in that step is far less likely than in the create table/view as step, and it's a problem that exists today anyway, whenever users rename a model, and the "old" name of that model doesn't get dropped / cleaned up in the database.

If we did want to keep that distinction—between the temp identifiers used for true temp tables, and our very not-temporary "temporary"/"backup" relations—I think it would just look like one extra argument to these macros that disables the inclusion of dtstring in the suffix. That wouldn't be so hard, but it would require a little extra plumbing. It means we could keep this test in place.

What do you think?
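If that distinction is worth keeping, here is a hypothetical sketch of the extra argument; the argument name, default, and timestamp format are illustrative, not the code that was merged:

  {% macro postgres__make_relation_with_suffix(base_relation, suffix, dtstring=true) %}
      {#- Optionally append a datetime string so the generated name is unique per invocation -#}
      {% if dtstring %}
          {% set suffix = suffix ~ modules.datetime.datetime.now().strftime('%H%M%S%f') %}
      {% endif %}
      {#- Truncate the base identifier so identifier + suffix stays within Postgres's 63-character limit -#}
      {% set identifier = base_relation.identifier[: 63 - (suffix | length)] ~ suffix %}
      {{ return(base_relation.incorporate(path={"identifier": identifier})) }}
  {% endmacro %}

Callers that create true temp tables could leave dtstring enabled, while the backup/intermediate wrappers could pass dtstring=false so that deterministic names, and the existing test, keep working.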

@epapineau (Contributor, Author) commented Apr 26, 2022

Alright, done~ ✨ The PR is marked as ready for review. I'm not 100% sure on the answer to this checklist item: "This PR includes tests, or tests are not required/relevant for this PR".

@cla-bot (bot) added the cla:yes label Mar 25, 2022
@epapineau marked this pull request as ready for review April 26, 2022 21:20
@epapineau requested a review from a team April 26, 2022 21:20
@epapineau requested review from a team as code owners April 26, 2022 21:20
@epapineau (Contributor, Author)

Hello again~ Following up to see if this needs anything else for review? 🥲

@jtcohen6 (Contributor) commented May 6, 2022

@epapineau Thanks for the updated work here!

I started digging in, and found a few places to clarify + confirm this logic. I don't want to expand scope any more than we already have—just want to make sure we're doing the right things. I need a little more time to grok the deleted test, and to clean up my comments.

In the meantime, we just converted the 017_runtime_materialization_tests to our new testing framework. The test you're removing is now here:

# Again, but against the incremental materialization
project.run_sql(create_incremental__dbt_tmp_sql)
results = run_dbt(["run", "--model", "incremental", "--full-refresh"])
assert len(results) == 1
check_table_does_not_exist(project.adapter, "incremental__dbt_tmp")
check_relations_equal(project.adapter, ["seed", "incremental"])

@jtcohen6 (Contributor) left a comment

Great work @epapineau, and really great timing! This serves as a wonderful entrée to the work that the Core team is planning to take on for v1.2, around more ergonomic materializations—cleaning up and consolidating the logic in core, so that we can better streamline it across adapters.

Two actions:

  • I believe that the deleted test should actually be passing, since it's checking for "intermediate" rather than "temp" behavior in the incremental materialization
  • I opened a new PR against your branch that's slightly more aggressive about renaming things for consistency, and using more concise logic where possible. Take a look and let me know what you think! I know this goes far beyond the original remit, which was just around truncating table names for Postgres... but we might just manage to leave the room much tidier than we found it.

{% set target_relation = this.incorporate(type='table') %}
{% set existing_relation = load_relation(this) %}
{% set tmp_relation = make_temp_relation(target_relation) %}
{%- set temp_relation = make_temp_relation(target_relation)-%}
{%- set backup_identifier = make_backup_relation(target_relation, backup_relation_type=none) -%}
Contributor:

Here and in all other materializations, I'm realizing that we can simplify the relation-creation logic quite a bit, by naming everything X_relation and using the load_relation() macro as a more convenient wrapper for get_relation:

  {%- set temp_relation = make_temp_relation(target_relation)-%}
  {%- set backup_relation = make_backup_relation(existing_relation) -%}
  {%- set preexisting_temp_relation = load_relation(temp_relation)-%}
  {%- set preexisting_backup_relation = load_relation(backup_relation) -%}

I can open a new PR to take that on.

Contributor:

Update: took a swing at this in #5221

Comment on lines 74 to 81
# Again, but against the incremental materialization
self.run_sql_file("create_incremental__dbt_tmp.sql")
results = self.run_dbt(['run', '--model', 'incremental', '--full-refresh'])
self.assertEqual(len(results), 1)

self.assertTableDoesNotExist('incremental__dbt_tmp')
self.assertTablesEqual("seed", "incremental")

Contributor:

I believe this test could still be useful to us, to check for the following condition: If the incremental model already exists, and we pass in --full-refresh, dbt should create the new table in the intermediate_relation location (not temporary + deterministic suffix), then swap the old and the new. In that case, the incremental materialization should really just replicate the behavior of the table materialization.

One alteration: I think we want to actually run the model first, so that dbt has to create the intermediate <database>.<schema>.incremental__dbt_tmp, and not just create it directly as <database>.<schema>.incremental.

        # Again, but against the incremental materialization
        self.run_dbt(['run', '--model', 'incremental'])   # <-- this line is new
        self.run_sql_file("create_incremental__dbt_tmp.sql")
        results = self.run_dbt(['run', '--model', 'incremental', '--full-refresh'])
        self.assertEqual(len(results), 1)
        ...

I believe that, with the change proposed above (preexisting_temp_relation → preexisting_intermediate_relation), this test should start passing.

Once we sort that out, we'll want to git pull origin main and update the test in its new location

Contributor Author:

Test has been restored, updated, and migrated to the new location.

@jtcohen6 (Contributor) left a comment

@epapineau This looks great to me! Thanks for rolling with the many changes and suggestions :)

Just one tiny thing: What is tests/functional/simple_seed/data/tmp.csv? Do we need that? I'm guessing it might be a holdover from a git pull/merge. As soon as that's gone, this is good to merge.

@epapineau (Contributor, Author)

@jtcohen6 Great question! I do not know where that came from. It's been removed.

@jtcohen6 (Contributor)

@epapineau I think you may have deleted tests/functional/materializations/test_runtime_materialization.py instead of tests/functional/simple_seed/data/tmp.csv! Just switch those around and we're good to go :)

@epapineau (Contributor, Author)

@jtcohen6 Okay, resolved now. Thanks for catching that!

@jtcohen6 (Contributor)

Going to rebase this PR against main to pull in the fix we merged yesterday for the flaky failing test

@jtcohen6 (Contributor) left a comment

Amazing work @epapineau! Thanks so much for contributing, and for seeing this all the way through :)

@jtcohen6 merged commit e7218d3 into main May 19, 2022
@jtcohen6 deleted the dev/epapineau branch May 19, 2022 11:57
agoblet pushed a commit to BigDataRepublic/dbt-core that referenced this pull request May 20, 2022
* Truncate relation names when appending a suffix that will result in len > 63 characters using make_temp_relation and make_backup_relation macros

* Remove timestamp from suffix appended to backup relation

* Add changelog entry

* Implement make_relation_with_suffix macro

* Add make_intermediate_relation macro that controls _tmp relation creation in table and view materializations to delineate from database- and schema-less behavior of relation returned from make_temp_relation

* Create backup_relation at top of materialization to use for identifier

* cleanup

* Add dstring arg to make_relation_with_suffix macro

* Only reference dstring in conditional of make_relation_with_suffix macro

* Create both a temp and intermediate relation, update preexisting_temp_relation to preexisting_intermediate_relation

* Migrate test updates to new test location

* Remove restored tmp.csv

* Revert "Remove restored tmp.csv"

This reverts commit 900c9db.

* Actually remove restored tmp.csv
Labels
cla:yes, Team:Adapters (Issues designated for the adapter area of the code)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relation name '*__dbt_tmp' is longer than 63 characters
3 participants