Fix/snowflake source quoting #1326
Conversation
the structure looks good here. you're on the right track but this isn't quite ready to go yet.
core/dbt/adapters/base/impl.py
Outdated
```python
    identifier='__phony__',
).include(identifier=False).quote(database=quote_db, schema=False)

information_schema[relation.database.lower()] = information_schema_rel
```
do we need a duplicate check here?
core/dbt/contracts/graph/manifest.py
Outdated
```python
key = database.lower()
database_counter[key] += 1
if database_counter[key] > 1:
    raise RuntimeError("not unique db name TODO TODO")
```
☝️
this check should be moved to `get_information_schema_mapping`, no?
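The duplicate check being discussed could be sketched as a standalone helper. This is a hypothetical sketch (the function name and error wording are illustrative, not dbt's actual code), replacing the TODO placeholder with a descriptive error:

```python
from collections import defaultdict


def check_unique_databases(databases):
    # Hypothetical sketch: fail fast when two source databases collide
    # case-insensitively, since the information_schema map is keyed on
    # the lowercased database name.
    counter = defaultdict(int)
    for database in databases:
        key = database.lower()
        counter[key] += 1
        if counter[key] > 1:
            raise RuntimeError(
                'Source databases must be unique (case-insensitively); '
                'found duplicate: {}'.format(database)
            )
```

Raising as soon as a duplicate is seen keeps the error close to the offending name, which makes it easy to report in a validation message.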
```diff
-    'identifier': True,
+    'database': source.quoting['database'],
+    'schema': source.quoting['schema'],
+    'identifier': source.quoting['identifier'],
```
this new version is more correct than the old one, right? it guarantees that the rendered relation will exactly match what was supplied in the source spec. i like it
```diff
@@ -781,14 +782,39 @@ def get_catalog(self, manifest):
         Returns an agate.Table of catalog information.
         """
         # make it a list so macros can index into it.
-        context = {'databases': list(manifest.get_used_databases())}
+        context = {
+            'databases': list(manifest.get_used_databases()),
```
seems like this should be removed
I don't think we can do that! All of the get_catalog macros in dbt loop over this value. The alternative would be to make the catalog queries loop over the `information_schema` dict values... unsure if that's better
core/dbt/parser/schemas.py
Outdated
```python
        identifier.endswith(self.quote_character):
    return True, identifier[1:-1]
else:
    return False, identifier
```
if i have any doubt about this branch, it's this fn. the idea of it is fine, but i can see it breaking easily...
do we trim whitespace on identifiers specified in schema.yml?
really good question! I think there's some amount of caveat emptor that needs to happen here. I'm super happy to trim whitespace (this is a good idea) but i imagine that there are always going to be intractable edge cases with an approach like this. If your database starts and ends with quotation marks, then you have bigger problems than dbt's sources not working i think.
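A minimal standalone sketch of the idea with the whitespace-trimming suggestion folded in. The function and constant names here are hypothetical illustrations, not dbt's actual API:

```python
QUOTE_CHARACTER = '"'  # assumption: Snowflake-style double-quote character


def split_quoted_identifier(identifier, quote_character=QUOTE_CHARACTER):
    # Trim surrounding whitespace first, then detect and strip a
    # wrapping pair of quote characters. Returns a tuple of
    # (was_quoted, bare_identifier).
    identifier = identifier.strip()
    if (len(identifier) > 1 and
            identifier.startswith(quote_character) and
            identifier.endswith(quote_character)):
        return True, identifier[1:-1]
    return False, identifier
```

The `len(identifier) > 1` guard covers the degenerate case where the identifier is a single quote character, which would otherwise be "stripped" into an empty string.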
I'd agree with you that this is the part of the PR that I feel the least comfortable with. Happy to explore other ways of doing this, or possibly talk about how to avoid string hackery altogether. I think this functionality results in the nicest UX for users, but I could equivalently imagine:
```yaml
sources:
  - name: snowplow
    database: raw
    quoting:
      database: true
    ....
```
I think that's doable, but i imagine it's unpleasant both for dbt devs and for users. Can think about this a little more for sure
Some comments!
```python
table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                           context_override=context,
                           release=True)

results = self._catalog_filter_table(table, manifest)
return results

@available
```
I don't think you really want this available in macros/materializations, given that you're already injecting its result into the context.
```diff
@@ -8,6 +8,7 @@
 from dbt.node_types import NodeType
 from dbt.logger import GLOBAL_LOGGER as logger
 from dbt import tracking
+from collections import defaultdict
```
I think you don't need this
```diff
@@ -676,12 +676,17 @@ class Hook(APIObject):
     # the manifest search stuff really requires this, sadly
     'resource_type': {
         'enum': [NodeType.Source],
     },
+    'quoting': {
```
these members should probably also be `required` and exist under `properties` as a `'type': 'object'`, right? Something like this:

```python
'quoting': {
    'type': 'object',
    'additionalProperties': False,
    'properties': {
        'database': {'type': 'boolean'},
        'schema': {'type': 'boolean'},
        'identifier': {'type': 'boolean'},
    },
    'required': ['database', 'schema', 'identifier'],
}
```
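What that schema fragment enforces can be mirrored in a small hand-rolled check, avoiding any JSON-schema dependency. This is a hypothetical sketch for illustration; the function name is not from dbt:

```python
QUOTING_KEYS = {'database', 'schema', 'identifier'}


def is_valid_quoting(quoting):
    # Mirrors the proposed schema fragment: an object with exactly the
    # three required keys (no additionalProperties), all booleans.
    if not isinstance(quoting, dict):
        return False
    if set(quoting) != QUOTING_KEYS:
        return False  # covers both missing and extra properties
    return all(isinstance(value, bool) for value in quoting.values())
```

Comparing key sets with `!=` handles `required` and `additionalProperties: False` in one step, which is why the check stays so short.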
```diff
@@ -257,6 +258,7 @@ def build_test_node(self, test_target, package_name, test, root_dir, path,

         source_package = self.all_projects.get(package_name)
         if source_package is None:
+            # TODO : model_name is not defined here...
```
I think s/model_name/test_target fixes this? I'd have to check the types to be 100% on that.
when i run my pep8 linter, it (surprisingly frequently) shows that vars aren't defined. I'll make this quick fix here, but wanted to flag it in the initial commit so i didn't forget about it
Are you using `pylint` instead of `pep8` itself to get those? It's great for this stuff but dbt makes it cry so much I can't stand using it.
think i'm using flake8. Can try to dig up where my configs are defined -- pretty sure I'm not using the stock settings but can't seem to find them right now
Cool, I think that's enough to go on. Maybe we should add it to our CI process instead of `pep8` itself? I'm pretty sure `flake8` is just `pep8` + a mccabe complexity thing + `pyflakes`.
Ok. I'll open an issue about moving it - it seems like something we should do unless we find an unbearable false positive rate. In the meantime I've installed it in my editor so I'll fix stuff as I work on other things.
The behavior in this PR is implemented by #1338, closing this one accordingly.
Information Schema queries for Sources
This PR revamps how dbt handles identifier quoting for sources. It addresses #1317, but also addresses another 0.13.0 bug that we don't currently have an issue for. This second bug was way gnarlier than #1317, and the majority of this PR is intended to fix it.
Source quoting
In 0.13.0, dbt was modified to better support custom databases. As a part of this change, an `information_schema_name` macro was added which, given a database literal, returns the database and schema for the relevant `information_schema`. This macro uses the project-wide `quoting` config to determine if databases should be quoted: the difference between rendering the database literal quoted and rendering it unquoted.
There is no difference between these two queries on pg/redshift (and we don't use BQ's information schema), so the effects of this macro are localized to Snowflake. The key problem here is that dbt is using project-level quoting configs to determine if it should quote source databases. While these seem similar, they're actually super different. Because dbt uses the `quoting` config to both 1) build and 2) query relations, it is internally consistent. For source databases, however, dbt's quoting config has no bearing at all on whether or not the source database should be quoted! The existing `information_schema_name` logic would effectively make it impossible to address a lowercase+quoted source database on Snowflake if database quoting was turned off for the project. This PR divorces the logic of generating `information_schema` relations for sources from the quoting config specified in the `dbt_project.yml` file.

Two approaches
There are two approaches we can take here:

1. Require quoted identifiers to be spelled out explicitly, alongside friendly names, in `schema.yml` files
2. Infer quoting by inspecting the identifiers written in `schema.yml` files (the one implemented here)

The first approach is implemented in the first commit for this PR, and it worked! It also resulted in a pretty unpleasant experience for Snowflake users, who had to write out a separate quoted identifier for every source table.
Effectively every table name would need to be duplicated in order to provide a sane (lower case) name and correct upper-cased identifier. The 90% case for snowflake sources is that they are uppercase/unquoted, so it didn't feel right to impose this burden on dbt users.
The second approach results in a much nicer UX at the expense of some... not ideal... logic in dbt. This makes the users with troublesome source data identifiers feel the pain, which is probably how it should be :). In the approach implemented in this PR, dbt inspects the specified identifiers for sources. If the identifiers include quotes, then dbt will strip those quotes and record that the given identifier should be quoted in information_schema queries. In this version, the verbose schema.yml from the first approach collapses to plain, unquoted table names.
Much nicer! If source tables need to be quoted (b/c they include illegal characters, or are lowercased on Snowflake), users can simply wrap the identifier in quote characters in their `schema.yml`.
When dbt tries to query the `"raw-database"` information schema, it will render the SQL with the database name quoted. So, I think this checks all the boxes.
There's some cleanup work to do here, but wanted to get some initial feedback as I've been chasing my tail here for a couple of days now.
TODO: