Support alternate string formatting strategies for external sources #188

Merged 1 commit on Jun 21, 2023
13 changes: 11 additions & 2 deletions README.md
@@ -234,8 +234,7 @@ FROM read_parquet(['s3://my-bucket/my-sources/source2a.parquet', 's3://my-bucket
```

Note that the value of the `external_location` property does not need to be a path-like string; it can also be a function
-call, which is helpful in the case that you have an external source that is a CSV file which requires special handling for DuckDB
-to load it correctly:
+call, which is helpful in the case that you have an external source that is a CSV file which requires special handling for DuckDB to load it correctly:

```
sources:
@@ -244,8 +243,18 @@ sources:
       - name: flights
         meta:
           external_location: "read_csv('flights.csv', types={'FlightDate': 'DATE'}, names=['FlightDate', 'UniqueCarrier'])"
+          formatter: oldstyle
```

+Note that we need to override the default `str.format` string formatting strategy for this example
+because the `types={'FlightDate': 'DATE'}` argument to the `read_csv` function will be interpreted by
+`str.format` as a template to be matched on, which will cause a `KeyError: "'FlightDate'"` when we attempt
+to parse the source in a dbt model. The `formatter` configuration option for the source indicates whether
+we should use `newstyle` string formatting (the default), `oldstyle` string formatting, or `template` string
+formatting. You can read up on the strategies the various string formatting techniques use at this
+[Stack Overflow answer](https://stackoverflow.com/questions/13451989/pythons-many-ways-of-string-formatting-are-the-older-ones-going-to-be-depre) and see examples of their use
+in this [dbt-duckdb integration test](https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/adapter/test_sources.py).
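
The failure described above can be reproduced in plain Python, without dbt or DuckDB. This is a minimal sketch: the `fields` dict is a hypothetical stand-in for whatever the adapter passes to the formatter, and the variable names are illustrative only.

```python
from string import Template

# Hypothetical stand-in for the values available to the formatter.
fields = {"name": "flights", "identifier": "flights"}

location = (
    "read_csv('flights.csv', types={'FlightDate': 'DATE'}, "
    "names=['FlightDate', 'UniqueCarrier'])"
)

# newstyle: the braces are parsed as a replacement field whose name is
# the quoted text 'FlightDate', so the lookup fails with a KeyError.
try:
    location.format_map(fields)
    newstyle_error = None
except KeyError as exc:
    newstyle_error = exc

# oldstyle: braces are ordinary characters for the % operator, and a
# mapping on the right-hand side may go unused without raising.
oldstyle_result = location % fields

# template: only $-prefixed placeholders are substituted.
template_result = Template(location).substitute(fields)
```

With `oldstyle` or `template`, the `read_csv(...)` call passes through unchanged because it contains no `%` or `$` characters.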

#### Writing to external files

We support creating dbt models that are backed by external files via the `external` materialization strategy:
21 changes: 15 additions & 6 deletions dbt/adapters/duckdb/relation.py
@@ -1,4 +1,5 @@
from dataclasses import dataclass
+from string import Template
from typing import Any
from typing import Optional
from typing import Type
@@ -26,12 +27,20 @@ def create_from_source(cls: Type[Self], source: SourceDefinition, **kwargs: Any)
             if DuckDBConnectionManager._ENV is not None:
                 # No connection means we are probably in the dbt parsing phase, so don't load yet.
                 DuckDBConnectionManager.env().load_source(plugin_name, source_config)
-        elif "external_location" in source_config.meta:
-            # Call str.format with the schema, name and identifier for the source so that they
-            # can be injected into the string; this helps reduce boilerplate when all
-            # of the tables in the source have a similar location based on their name
-            # and/or identifier.
-            ext_location = source_config["external_location"].format(**source_config.as_dict())
+        elif "external_location" in source_config:
+            ext_location_template = source_config["external_location"]
+            formatter = source_config.get("formatter", "newstyle")
+            if formatter == "newstyle":
+                ext_location = ext_location_template.format_map(source_config.as_dict())
+            elif formatter == "oldstyle":
+                ext_location = ext_location_template % source_config.as_dict()
+            elif formatter == "template":
+                ext_location = Template(ext_location_template).substitute(source_config.as_dict())
+            else:
+                raise ValueError(
+                    f"Formatter {formatter} not recognized. Must be one of 'newstyle', 'oldstyle', or 'template'."
+                )
+
             # If it's a function call or already has single quotes, don't add them
             if "(" not in ext_location and not ext_location.startswith("'"):
                 ext_location = f"'{ext_location}'"
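
To try the dispatch logic from this hunk outside the adapter, it can be lifted into a standalone function. The sketch below is an approximation, not the adapter's code: `render_external_location` and its `fields` argument are hypothetical names, with `fields` standing in for `source_config.as_dict()`.

```python
from string import Template
from typing import Any, Dict


def render_external_location(template: str, formatter: str, fields: Dict[str, Any]) -> str:
    """Hypothetical helper that mirrors the formatter branch in the hunk."""
    if formatter == "newstyle":
        location = template.format_map(fields)
    elif formatter == "oldstyle":
        location = template % fields
    elif formatter == "template":
        location = Template(template).substitute(fields)
    else:
        raise ValueError(
            f"Formatter {formatter} not recognized. "
            "Must be one of 'newstyle', 'oldstyle', or 'template'."
        )
    # Bare paths get single quotes; function calls and already-quoted
    # strings are left alone, as in the hunk.
    if "(" not in location and not location.startswith("'"):
        location = f"'{location}'"
    return location


# Assumed field values, for illustration only.
fields = {"name": "seeds_source", "identifier": "seeds_source"}

bare_path = render_external_location("/tmp/{name}.csv", "newstyle", fields)
function_call = render_external_location(
    "read_csv_auto('/tmp/${name}.csv')", "template", fields
)
```

Here `bare_path` comes back wrapped in single quotes, while `function_call` is returned as a `read_csv_auto(...)` expression with only the placeholder replaced.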
12 changes: 9 additions & 3 deletions tests/functional/adapter/test_sources.py
@@ -22,14 +22,20 @@
       - name: seeds_ost
         identifier: "seeds_other_source_table"
         config:
-          external_location: "read_csv_auto('/tmp/{identifier}.csv')"
+          external_location: "read_csv_auto('/tmp/%(identifier)s.csv')"
+          formatter: oldstyle
+      - name: seeds_other_source_table
+        config:
+          external_location: "read_csv_auto('/tmp/${name}.csv')"
+          formatter: template
"""

models_source_model_sql = """select * from {{ source('external_source', 'seeds_source') }}
"""

-models_multi_source_model_sql = """select * from {{ source('external_source', 'seeds_source') }}
-inner join {{ source('external_source', 'seeds_ost') }} USING (id)
+models_multi_source_model_sql = """select s.* from {{ source('external_source', 'seeds_source') }} s
+inner join {{ source('external_source', 'seeds_ost') }} oldstyle USING (id)
+inner join {{ source('external_source', 'seeds_other_source_table') }} tmpl USING (id)
"""
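
The two new source tables use different placeholder syntaxes but appear intended to resolve to the same CSV file, which is what lets the three-way join on `id` line up. The sketch below checks that in isolation; the two dicts are assumed stand-ins for each table's `name` and `identifier` (with `identifier` assumed to default to `name` when not set).

```python
from string import Template

# Assumed field values for the two tables in the test schema.
seeds_ost = {"name": "seeds_ost", "identifier": "seeds_other_source_table"}
seeds_other = {
    "name": "seeds_other_source_table",
    "identifier": "seeds_other_source_table",
}

# formatter: oldstyle uses %(key)s placeholders.
oldstyle_location = "read_csv_auto('/tmp/%(identifier)s.csv')" % seeds_ost

# formatter: template uses ${key} placeholders.
template_location = Template("read_csv_auto('/tmp/${name}.csv')").substitute(seeds_other)

# The pre-change newstyle form of the first table, for comparison.
newstyle_location = "read_csv_auto('/tmp/{identifier}.csv')".format_map(seeds_ost)
```

All three strings name `/tmp/seeds_other_source_table.csv`, so the change to the test exercises the new formatters without needing a new fixture file.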

