Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt Constraints / model contracts #6271

Merged
merged 117 commits into from
Feb 15, 2023
Merged

dbt Constraints / model contracts #6271

merged 117 commits into from
Feb 15, 2023

Conversation

sungchun12
Copy link
Contributor

@sungchun12 sungchun12 commented Nov 16, 2022

resolves #6079

Description

This is a building block to empower multi-project experiences in the future.

Brings defining data types for columns within a SQL select statement as native to dbt. This PR focuses on injecting new config defaults within the manifest, and introduces compilation defense code to ensure configs follow a clear standard. Other adapter PRs will pair with this one to ensure configs cascade appropriately.

Note: It is the responsibility of the adapter to provide extra guardrails for configs like check that do not work universally across databases.

Related Adapter Pull Requests

Must be reviewed with passing tests

dbt Core Test Cases

  • Fix failing test regressions (Thanks @MichelleArk for helping me across the finish line 💯 )
  • Test: Verify compilation errors show expected dynamic messages
  • Test: Verify config hierarchy overrides work as expected(ex: constraints enabled at the model level override constraints enabled at the project_yml level with True or False values)
  • Test: Verify only SQL table materializations work
  • Test: Verify all or nothing data_type configs are enforced

dbt-postgres Adapter Test Cases

  • Test: Verify global macro is dispatched and can be overridden elegantly in the postgres adapter
  • Test: Verify DDL matches an expectation
  • Test: rollback the full DDL transaction if it fails with an ad hoc query verifying the table is in its previous state

Checklist

@cla-bot cla-bot bot added the cla:yes label Nov 16, 2022
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

1 similar comment
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@sungchun12 sungchun12 self-assigned this Nov 16, 2022
@sungchun12 sungchun12 mentioned this pull request Nov 16, 2022
6 tasks
@sungchun12
Copy link
Contributor Author

Progress Update 11/17/2022

  • defaults for data_type, constraint, check appear in the manifest.json
  • Add a constraints validator within parsed.py before a node can be written to compiled SQL code
    image
  • only SQL table materializations are allowed
  • performance is fast
    image

Next Steps

  • Verify if my approach is an accepted pattern
  • Write down test cases

@@ -17,6 +17,7 @@

from dbt.dataclass_schema import dbtClassMixin, ExtensibleDbtClassMixin

from dbt.exceptions import CompilationException
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed no exceptions exist in this file except for the one I introduce. Is that intentional?

@@ -93,6 +93,8 @@ class HasDocs(AdditionalPropertiesMixin, ExtensibleDbtClassMixin, Replaceable):
description: str = ""
meta: Dict[str, Any] = field(default_factory=dict)
data_type: Optional[str] = None
constraint: Optional[str] = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only needs to be inserted here as external tables and sources are not in scope for this feature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there ever be multiple constraints? Should this be a list (or either a string or a list)?

Copy link
Contributor Author

@sungchun12 sungchun12 Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooooo great question. For developer UX, I'd bias towards a list even though it may be more "efficient" to allow arbitrary strings, developer mental load better reasons about multiple constraints as a list.

And yes, postgres allows multiple constraints: https://www.postgresql.org/docs/current/ddl-constraints.html#id-1.5.4.6.6

I'll make the change for the constraints config to be an optional list

String Format:
Harder to reason about separate constraints

# models/my_custom_model_config.yml
models:
  - name: my_model
    columns:
      - name: email
        data_type: string
        constraints: not null unique 

List Format:
Easier to reason about separate constraints

# models/my_custom_model_config.yml
models:
  - name: my_model
    columns:
      - name: email
        data_type: string
        constraints: ['not null','unique']

@@ -1000,6 +1000,12 @@ def warn(msg, node=None):
return ""


# TODO: remove this later as Sung created this to avoid errors with running dbt-snowflake==1.3.0
def warn_or_error(msg, node=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this function was removed from the main branch while 1.3.latest still has it? https://github.com/dbt-labs/dbt-core/blob/1.3.latest/core/dbt/exceptions.py#L1090

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also notice that there looks to be a regression error that any selected models give an incorrect warning on the main branch.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was related to work that Emily has been doing on more structured logging/exceptions. Just opened a ticket for it: #6297

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the selector error message was fixed, but is there a reason warn_or_error doesn't exist in the main branch but in 1.3.latest? @jtcohen6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sungchun12 It still exists, it just moved:

- from dbt.exceptions import warn_or_error
+ from dbt.events.functions import warn_or_error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh I know what happened, I needed to update my snowflake adapter to the main branch vs. relying on 1.3.latest Thanks Jerco! I'm unblocked!

@@ -180,6 +180,7 @@ def from_test_block(
)


# TODO: create a ConstraintsBuilder equivalent
class TestBuilder(Generic[Testable]):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should NOT need to build a ConstraintsBuilder because the configs extend column_info within the schemas.py file.

@@ -177,6 +177,21 @@ def verify_python_model_code(node):
raise ParsingException("No jinja in python model code is allowed", node=node)


# TODO: Do we need to do this?
# def verify_model_constraints(node):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to verify within this model because that happens parsed.py

@@ -585,6 +585,10 @@
"type": "boolean",
"default": true
},
"constraints_enabled": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you test the manifest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quis verificat ipsos verificantia? quis manifestat ipsos manifesta?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

google translate:

Who verifies the verifiers? Who makes them manifest?

@jtcohen6
Copy link
Contributor

I've reviewed + refactored tests on the dbt-core PR, with the specific goal of having a clearer separation between:

  1. What can be tested in dbt-core only: validation, parsing exceptions, basic end-to-end functionality
  2. What needs testing on each adapter: dbt's "preflight" checks within materialization; platform-enforced runtime checks (via DDL); successful rollback if contract isn't met

There are now only two "adapter zone" test classes that need inheriting/modifying. I'll take a pass over the adapter PRs tomorrow or Wednesday.

Ideally, I'd like to get all of this merged by Wednesday (Feb 1), but I'm going to wait until feature/click-cli is merged (#6631), so as to avoid creating additional merge conflicts for that crew.

cast('2019-01-01' as date) as date_day
"""

my_model_error_sql = """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks the same as my_model_sql -- what's the expected error? @jtcohen6

Copy link
Contributor

@jtcohen6 jtcohen6 Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-connors-3 Good catch! Ended up implementing this by other means below:

# Make a contract-breaking change to the model
my_model_with_nulls = my_model_sql.replace("1 as id", "null as id")
write_file(my_model_with_nulls, "models", "my_model.sql")

Let's cut my_model_error_sql entirely


# Verify the previous table still exists
old_model_exists_sql = """
select id from dbt.{0}.my_model where id = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 I think this will also need the templated database!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dbeatty10 dbeatty10 changed the title dbt Constraints dbt Constraints / model contracts Feb 14, 2023
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @dbeatty10 @McKnight-42 @MichelleArk for doing some live PR walkthrough & review earlier!

I'll work on fixing up the merge conflicts. Let's make sure all the TODOs are captured in our tracking issues, then let's get this baby merged 😎

{%- set col_name = col['name'] -%}
{%- set column_names_config_only = column_names_config_only.append(col_name) -%}
{%- endfor -%}
{%- set sql_file_provided_columns = get_columns_in_query(sql) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making an assumption here that column ordering is consistent (and deterministic) in all our data platforms.

Follow-on TODO: We prefer option 2 in https://github.com/dbt-labs/dbt-core/pull/6271/files#r1069332715, let's make sure dbt always ensures order itself, and doesn't make yaml-ordering matter so much

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MichelleArk for opening #6975!

Comment on lines +28 to +31
{% if config.get('constraints_enabled', False) %}
{{ get_assert_columns_equivalent(sql) }}
{{ get_columns_spec_ddl() }}
{% endif %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up (related to ongoing discussion in #6750): We might want to distinguish between the "contract" and the "constraints." That distinction would make it possible for users to define (e.g.) check constraints on Postgres/Databricks, without providing the full column specification in a yaml file. (This is already the case in dbt-databricks.)

What ought to be part of the contract (for detecting breaking changes, #6869)? "Data shape" for sure (column names, types, nullability). Additional constraints? Tests? We need to decide

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigning myself to #6750 for further refinement, before we pick this up for implementation

@@ -173,7 +173,7 @@
}
},
"additionalProperties": false,
"description": "WritableManifest(metadata: dbt.contracts.graph.manifest.ManifestMetadata, nodes: Mapping[str, Union[dbt.contracts.graph.nodes.AnalysisNode, dbt.contracts.graph.nodes.SingularTestNode, dbt.contracts.graph.nodes.HookNode, dbt.contracts.graph.nodes.ModelNode, dbt.contracts.graph.nodes.RPCNode, dbt.contracts.graph.nodes.SqlNode, dbt.contracts.graph.nodes.GenericTestNode, dbt.contracts.graph.nodes.SnapshotNode, dbt.contracts.graph.nodes.SeedNode]], sources: Mapping[str, dbt.contracts.graph.nodes.SourceDefinition], macros: Mapping[str, dbt.contracts.graph.nodes.Macro], docs: Mapping[str, dbt.contracts.graph.nodes.Documentation], exposures: Mapping[str, dbt.contracts.graph.nodes.Exposure], metrics: Mapping[str, dbt.contracts.graph.nodes.Metric], selectors: Mapping[str, Any], disabled: Optional[Mapping[str, List[Union[dbt.contracts.graph.nodes.AnalysisNode, dbt.contracts.graph.nodes.SingularTestNode, dbt.contracts.graph.nodes.HookNode, dbt.contracts.graph.nodes.ModelNode, dbt.contracts.graph.nodes.RPCNode, dbt.contracts.graph.nodes.SqlNode, dbt.contracts.graph.nodes.GenericTestNode, dbt.contracts.graph.nodes.SnapshotNode, dbt.contracts.graph.nodes.SeedNode, dbt.contracts.graph.nodes.SourceDefinition]]]], parent_map: Optional[Dict[str, List[str]]], child_map: Optional[Dict[str, List[str]]])",
"description": "WritableManifest(metadata: dbt.contracts.graph.manifest.ManifestMetadata, nodes: Mapping[str, Union[dbt.contracts.graph.nodes.AnalysisNode, dbt.contracts.graph.nodes.SingularTestNode, dbt.contracts.graph.nodes.HookNode, dbt.contracts.graph.nodes.ModelNode, dbt.contracts.graph.nodes.RPCNode, dbt.contracts.graph.nodes.SqlNode, dbt.contracts.graph.nodes.GenericTestNode, dbt.contracts.graph.nodes.SnapshotNode, dbt.contracts.graph.nodes.SeedNode]], sources: Mapping[str, dbt.contracts.graph.nodes.SourceDefinition], macros: Mapping[str, dbt.contracts.graph.nodes.Macro], docs: Mapping[str, dbt.contracts.graph.nodes.Documentation], exposures: Mapping[str, dbt.contracts.graph.nodes.Exposure], metrics: Mapping[str, dbt.contracts.graph.nodes.Metric], selectors: Mapping[str, Any], disabled: Union[Mapping[str, List[Union[dbt.contracts.graph.nodes.AnalysisNode, dbt.contracts.graph.nodes.SingularTestNode, dbt.contracts.graph.nodes.HookNode, dbt.contracts.graph.nodes.ModelNode, dbt.contracts.graph.nodes.RPCNode, dbt.contracts.graph.nodes.SqlNode, dbt.contracts.graph.nodes.GenericTestNode, dbt.contracts.graph.nodes.SnapshotNode, dbt.contracts.graph.nodes.SeedNode, dbt.contracts.graph.nodes.SourceDefinition]]], NoneType], parent_map: Union[Dict[str, List[str]], NoneType], child_map: Union[Dict[str, List[str]], NoneType])",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO before merging: Regenerate this given latest changes.

FOR LATER: Bump the manifest schema version to v9 (for v1.5) — until/unless we decide to change our approach to manifest/artifact versioning :) cc @MichelleArk let's talk more about this with @dbt-labs/core-language

@jtcohen6 jtcohen6 dismissed gshank’s stale review February 15, 2023 15:50

Lots of code written & reviewed since!

@jtcohen6 jtcohen6 merged commit c647706 into main Feb 15, 2023
@jtcohen6 jtcohen6 deleted the dbt-constraints branch February 15, 2023 15:50
@sungchun12
Copy link
Contributor Author

Thanks team for getting this done. I just shouted out loud, "FINALLY!" With the utmost glee.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

[CT-1358] [Feature] Add column type constraints as dbt native configs