Add support for model contracts in v1.5 #148
Conversation
```python
# DuckDB doesn't support 'foreign key' as an alias
return f"references {constraint.expression}"
```
Actually, `references` seems like the more standard syntax. And we're not really doing a good job of supporting FK constraints anyway:
- [CT-2519] [CT-2442] [Bug] Foreign key constraint templating dbt-labs/dbt-core#7417
- [CT-2441] [Feature] Support `expression` for all constraint types dbt-labs/dbt-core#7416
The majority of columnar DBs don't enforce them, so it hasn't felt like a priority. It's neat that DuckDB actually does.
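For what it's worth, a quick demonstration of DuckDB enforcing a `references` constraint (a sketch, not from the PR; assumes only the `duckdb` Python package):

```python
# A sketch of DuckDB enforcing a `references` (foreign key) constraint.
import duckdb

con = duckdb.connect()
con.execute("create table parents (id int primary key)")
con.execute("create table children (id int, parent_id int references parents (id))")

con.execute("insert into parents values (1)")
con.execute("insert into children values (10, 1)")  # fine: parent 1 exists

try:
    con.execute("insert into children values (11, 2)")  # no parent with id = 2
except duckdb.Error as e:
    print(e)  # Constraint Error: Violates foreign key constraint ...
```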
```python
# DuckDB's cursor doesn't seem to distinguish between:
# INT and NUMERIC/DECIMAL -- both just return as 'NUMBER'
# TIMESTAMP and TIMESTAMPTZ -- both just return as 'DATETIME'
# [1,2,3] and ['a','b','c'] -- both just return as 'list'
```
What would happen in these cases?
Context: dbt is running an "empty" version of the model query (`where false limit 0`), and extracting column names + data types from the cursor description, before it actually materializes the model. If it detects a diff, it will raise an error like:
08:58:44 Compilation Error in model my_model (models/my_model.sql)
08:58:44 This model has an enforced contract that failed.
08:58:44 Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
08:58:44
08:58:44 | column_name | definition_type | contract_type | mismatch_reason |
08:58:44 | ----------- | --------------- | ------------- | ------------------ |
08:58:44 | my_col | NUMBER | TEXT | data type mismatch |
If no mismatch is detected, and it's a table, dbt will now materialize it via:
create table {model_name} (my_col int);
insert into {model_name} (select {model_sql});
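A rough sketch of that pre-materialization check (hypothetical names; dbt-core's actual implementation lives in its adapter framework, and only the `duckdb` package is assumed):

```python
# Run the "empty" version of the model query and compare the cursor's
# description against the contract before materializing anything.
import duckdb

contract = {"my_col": "NUMBER"}       # hypothetical contract: name -> type
model_sql = "select 'abc' as my_col"  # hypothetical model body that violates it

con = duckdb.connect()
cur = con.execute(f"select * from ({model_sql}) as t where false limit 0")

for name, type_code, *_ in cur.description:
    if contract.get(name) != str(type_code):
        print(f"| {name} | {type_code} | {contract.get(name)} | data type mismatch |")
```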
The database's type system is flexible by design, and is willing to do a lot of creative coercion:
D create table my_tbl (id int);
D insert into my_tbl (select 2.1 as id);
D select * from my_tbl;
┌───────┐
│ id │
│ int32 │
├───────┤
│ 2 │
└───────┘
It will still catch the more egregious mismatches, including for structs and lists:
D create or replace table other_tbl (struct_col struct(num int));
D insert into other_tbl (select {'num': 'text'} as struct_col);
Error: Conversion Error: Could not convert string 'text' to INT32
D create or replace table other_tbl (list_of_int int[]);
D insert into other_tbl (select ['some', 'strings'] as list_of_int);
Error: Conversion Error: Could not convert string 'some' to INT32
Hey Jeremy, thanks for taking a pass at this! ngl that when I told Sung that I was capacity constrained atm I didn't think you were going to be the one to attempt this! So you've outlined the macro problem here very well-- DuckDB did a very literal interpretation of the DB API spec and does not provide the more detailed type information that ~ every other major database does. The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways. There is something of a way around it though, although it's some work to do, and given some other stuff I have going on right now I didn't want to start in on it under a deadline of having it ready in a week. I originally ran into this problem (I needed a more detailed …
I woke up a bit too early this morning, caught the thread, and had some fun. Always good for me to duck around a bit & feel out what it's like to extend our new stuff.
Right on - I'm glad we're on the same page. I didn't realize you'd already had a chance to look into this, given the related work over in BV. I won't be offended if we need to scrap a lot of / all of the code that's here.
This looks cool!! Where there's a Wills, there's a way. Ok - I'm happy to poke around your BV implementation, or let you take it from here - let me know!
Yeah I will eventually get to it, it's really just a question of when, and it's not realistically going to be in the next couple of weeks. I feel bad about this, as I know folks are starting to actually rely on me to do things with this project (which remember, started as a fun little toy project for me to learn about how databases worked!) in a semi-timely fashion. But for the moment, it's gonna have to do.
Heard - and totally agree with your call to ship minimal v1.5.0 compat next week, without model contracts! I still may poke around, for my own fun.
@Mause apologies for pulling you in here, please feel free to respond next week when you're back at work, but I was wondering if y'all had given any thought to revising the level of detailed type information that the cursor `description` field provides?
dbt/adapters/duckdb/impl.py
```python
@available.parse(lambda *a, **k: [])
def get_column_schema_from_query(self, sql: str) -> List[Column]:
    """Get a list of the Columns with names and data types from the given sql."""
    _, cursor = self.connections.add_select_query(sql)
```
Going to talk out loud here a bit -- I've been reading through this code and the code in dbt-core to understand how this is intended to work and how I'm going to square it with DuckDB and with my future plan for environments to enable both local and remote execution of models in dbt-duckdb.
So I'm thinking that I'm going to override both the `add_select_query` and the `data_type_code_to_name` methods on the `DuckDBConnectionManager` class, and they will in turn delegate their work to the (singleton) `Environment` instance that is associated with the run. The consequence of making that choice is that, by overriding those two methods, we should not have to override the impl of `get_column_schema_from_query` in the adapter impl as we are here.
Right now, I have two environment impls in the repo: a local one (the default), and a Buena Vista one that uses the Postgres protocol to talk to a Python server running DuckDB. As we discussed, Buena Vista already does (some of) the work required to translate the data types returned from DuckDB into Postgres-compatible type descriptions using the Arrow interface, which provides more detailed type information than the DBAPI implementation in DuckDB. I will need to flesh out those conversions a bit more so as to support all of the conversions specified in the `data_types` fixture in the adapter functional tests, and then copy an edited version of that code from the Buena Vista repo into the dbt-duckdb repo for the `LocalEnvironment` implementation to use. I'm not wild about doing this kind of horrible copy-pasta, but it's the only way I can see to have both 1) a proper implementation of model contracts in dbt-duckdb and 2) the environments construct that I have totally fallen in love with kept in place. The right longer-term solution is going to be better/richer type information included in the `description` field returned by DuckDB's DBAPI implementation, but that's going to be months away in the best case, and patience is not one of my virtues.
Additionally, there is some more source-side plugin work I need to complete (I'm slowly coming to terms with the fact that the beautiful dream of write-side plugins in #143 isn't going to happen until at least 1.5.1) and some other stuff I have going on this week that is going to make completing that body of work, and having everything tested to my satisfaction by Friday, difficult to pull off.
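A minimal sketch of that delegation pattern (hypothetical class bodies; only the method and class names above come from the actual plan):

```python
# Hypothetical shape of the delegation: both overridden methods forward to
# the singleton Environment for the run, so the base adapter's
# get_column_schema_from_query needs no override.
from typing import Any, Tuple


class Environment:
    """Local or remote (e.g. Buena Vista) execution environment."""

    def add_select_query(self, sql: str) -> Tuple[Any, Any]:
        raise NotImplementedError

    def data_type_code_to_name(self, type_code: str) -> str:
        raise NotImplementedError


class DuckDBConnectionManager:
    _ENV: Environment  # assumed: set once, at the start of the run

    def add_select_query(self, sql: str) -> Tuple[Any, Any]:
        # Delegate query execution to the active environment
        return self._ENV.add_select_query(sql)

    @classmethod
    def data_type_code_to_name(cls, type_code: str) -> str:
        # Delegate type-name translation the same way
        return cls._ENV.data_type_code_to_name(type_code)
```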
There might be another (more naive) approach, if need be. Rather than trying to pull this off with a single (empty) query and pulling the data types off the cursor `description`, we could opt for two queries instead:
- create a temp table from the empty query
- `adapter.get_columns_in_relation` (a.k.a. `information_schema.columns`) - which should return more precise data types
Something like:
{% macro duckdb__get_column_schema_from_query(select_sql) -%}
{% set tmp_relation = ... %}
{% call statement('create_tmp_relation') %}
{{ create_table_as(temporary=True, relation=tmp_relation, compiled_code=get_empty_subquery_sql(select_sql)) }}
{% endcall %}
{% set column_schema = adapter.get_columns_in_relation(tmp_relation) %}
{{ return(column_schema) }}
{% endmacro %}
(It looks like we didn't `dispatch` that macro - an oversight that we could quickly correct!)
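And, purely for illustration, a rough Python analogue of that two-query approach (hypothetical table/column names; assumes only the `duckdb` package):

```python
# Two-query fallback: materialize an empty temp table from the model query,
# then read the precise column types back from information_schema.columns.
import duckdb

model_sql = "select 1 as id, now() as created_at"  # hypothetical model body

con = duckdb.connect()
con.execute(
    "create temp table tmp_contract_check as "
    f"select * from ({model_sql}) as t where false limit 0"
)
rows = con.execute(
    "select column_name, data_type from information_schema.columns "
    "where table_name = 'tmp_contract_check' order by ordinal_position"
).fetchall()
for column_name, data_type in rows:
    print(column_name, data_type)  # e.g. id INTEGER / created_at TIMESTAMP WITH TIME ZONE
```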
@jtcohen6 I tremendously appreciate the head start here, and the constraints stuff as-specified looks 💯. If you don't mind, I'd be happy to hack on the stuff I spec'd out in the comments on your branch, or if you prefer I can grab the patch of this PR and take things in the direction I want them to go in on a net-new branch.
Mr Wills, you have the helm
Okay, I think this is ready to go; I may make the 1.5.0 release day after all!
```python
# Taking advantage of yet another amazing DuckDB SQL feature right here: the
# ability to DESCRIBE a query instead of a relation
describe_sql = f"DESCRIBE ({sql})"
```
@jtcohen6 after a truly circuitous journey, this ended up being simple and delightful to implement (and in an environment-independent way to boot!)
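For the curious, a quick illustration of the feature (assumes only the `duckdb` package):

```python
# DESCRIBE accepts an arbitrary query, not just a named relation, and
# reports fully detailed column types.
import duckdb

con = duckdb.connect()
rows = con.execute(
    "DESCRIBE (select 1 as id, ['a', 'b'] as tags, now() as loaded_at)"
).fetchall()
for name, col_type, *_ in rows:
    print(name, col_type)
# id        INTEGER
# tags      VARCHAR[]
# loaded_at TIMESTAMP WITH TIME ZONE
```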
whoa! this is very handy
this is pretty sweet
["1", schema_int_type, int_type], | ||
["'1'", string_type, string_type], | ||
["true", "bool", "BOOL"], | ||
["'2013-11-03 00:00:00-07'::timestamp", "TIMESTAMP", "TIMESTAMP"], |
@jtcohen6 and @sungchun12: noting that I could not for the life of me figure out how to configure this so that it would work with the `timestamptz` typing; the resulting data type in DuckDB will be `TIMESTAMP WITH TIME ZONE` and I'm not sure if the spaces are messing things up, or something like that?
If either of you have a moment to take a look and figure out how it's supposed to work, that would be great, but if not nbd -- I think we can live without it for 1.5.0.
@jwills When I add this line, the tests are passing for me locally:
["'2013-11-03 00:00:00-07'::timestamptz", "TIMESTAMPTZ", "TIMESTAMP WITH TIME ZONE"]
order being:
- SQL to return a value of this type
- type name to put in the yaml file that should match
- type name that we expect DuckDB to return (from `describe`) if there's a mismatch
(I agree this test could do with a clearer `assert`!)
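A small check of where that third string comes from (assumes only the `duckdb` package):

```python
# Confirms the type name DuckDB reports (via describe) for a timestamptz
# literal -- the third element of the fixture row above.
import duckdb

con = duckdb.connect()
name, col_type, *_ = con.execute(
    "DESCRIBE (select '2013-11-03 00:00:00-07'::timestamptz as ts)"
).fetchone()
print(name, col_type)  # ts TIMESTAMP WITH TIME ZONE
```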
gotcha-- thank you, will give it a shot locally now!
having mixed feelings about the fact that my baby has grown to the point where it has a flaky integration test 😭
Thank you so much for all of the help on this @jtcohen6! Looking forward to getting this in folks' hands tomorrow or Friday!
Overview
There are two main components of "model contracts" in v1.5:
1. dbt verifying, before materialization, that the column names + data types returned by the model query match the contract
2. `constraints`, included in the templated DDL during table creation

(2) is pretty trivial to implement, since DuckDB supports constraints with a standard syntax. What's very cool is that DuckDB actually enforces primary/unique key constraints, which most columnar databases / analytical platforms really don't. It also enforces row-level `check` constraints. So this could be a true swap-out for `not_null`, `unique`, `accepted_values` tests.

(1) was a bit trickier. I reimplemented the `get_column_schema_from_query` method, in lieu of implementing `data_type_to_code`, because the `duckdb` Python cursor returns the data type as a string, rather than an integer code. The other unfortunate thing is that some of the type precision seems to be lost (see the cursor comment quoted above).

That will still catch when users implicitly switch out, e.g., a `has_first_order` column from being `0`/`1` to being `true`/`false`. But it can't catch subtler distinctions like int/decimal, timestamp with/without tz, and the contents of lists/structs. That degree of precision is maintained by most other databases' connection cursors. Very open to ideas - I'm sure others know more about this than I do.
Example
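A minimal stand-in sketch of the precision loss described above (not the PR's original example; assumes only the `duckdb` package):

```python
# At the time of this PR, DuckDB's DBAPI cursor reported the same coarse
# type string for INT and DECIMAL columns, so a contract check built on
# cursor.description could not tell them apart.
import duckdb

con = duckdb.connect()
cur = con.execute("select 1 as int_col, 1.5 as decimal_col where false")
for name, type_code, *_ in cur.description:
    print(name, type_code)  # both columns report the same type, e.g. 'NUMBER'
```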