
Add an asset block which can contain structured data for use in the compilation context #1318

Closed
drewbanin opened this issue Feb 24, 2019 · 2 comments
Labels
help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors stale Issues that have gone stale

Comments

@drewbanin

Feature

Feature description

A good idea, via @jtcohen6. dbt should support the specification of assets: free-form json, csv, yml, and sql files in dbt projects. Depending on the filetype supplied in the asset block, dbt should return an object of the corresponding type (str, dict, list, etc.) when the asset is referenced.

Use cases:

  • supply static SQL for building empty tables for ETL-ish tasks (like loading from Spectrum, or from S3 on Snowflake) via operations
  • provide mapping values directly in the compilation context, as an alternative to seeding a csv and then pulling the data back into the context with a query

Example usage:

-- assets/ddl_snowplow_events.sql (call this whatever you want)

{% asset ddl_snowplow_event_table(type='sql') %}

create table snowplow.event (
  app_id varchar(128) encode lzo,
  collector_tstamp timestamp encode raw,
  ...
);

{% endasset %}

Later:

{% call statement() %}
  {{ asset('ddl_snowplow_event_table') }}
{% endcall %}
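For the structured-data use case, a JSON asset might return a dict. A hypothetical sketch, mirroring the syntax above (neither the asset block nor the asset() context function exists in dbt; the file and key names are illustrative):

-- assets/country_mapping.json (hypothetical)

{% asset country_mapping(type='json') %}
{"US": "United States", "DE": "Germany"}
{% endasset %}

Then, in a model, asset() would return a dict to iterate over:

select
  country_code,
  case
  {% for code, name in asset('country_mapping').items() %}
    when country_code = '{{ code }}' then '{{ name }}'
  {% endfor %}
  end as country_name
from {{ ref('events') }}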

Prior art:

  • dbt supports the specification of variables, but these are mired in scoping and access quirks. Moreover, specifying variables in the dbt_project.yml file can be unwieldy for large assets, like static SQL DDL or mapping json data.
  • You can approximate this today with a macro that returns a static value, but that feels like a big hack (see the sketch below)
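For reference, the macro workaround does work in dbt today, since a macro renders to a plain string. A minimal sketch, reusing the DDL from the example above:

-- macros/ddl_snowplow_event_table.sql

{% macro ddl_snowplow_event_table() %}
create table snowplow.event (
  app_id varchar(128) encode lzo,
  collector_tstamp timestamp encode raw
);
{% endmacro %}

-- then, anywhere a string of SQL is needed:
{% call statement() %}
  {{ ddl_snowplow_event_table() }}
{% endcall %}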

Who will this benefit?

This is the biggest question IMO. It appears that there is utility in providing a mechanism for augmenting the dbt compilation context, but I think we'd need clearer use cases before we can prioritize this feature. What is the benefit of an asset over a macro for the SQL DDL use case? And what are some real-world use cases for structured data, like JSON or CSV assets? How would folks use these?

@drewbanin drewbanin added the help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors label Feb 24, 2019
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 18, 2019

Hypothetically, it would be really cool to give dbt the power, terraform-style, to document and run all manner of admin/config operations on the warehouse. The ultimate appeal is the ability to define verbs as macros (functions for operations or hooks), and to define the nouns those verbs act on in asset blocks.

Core Functionality

  • Defining user groups/roles to feed into grant statements, which are executed as run or model hooks (see the sketch after this list)
  • Defining mapping relationships:
    • (a) in a way that is poorly served by the CSV format, especially for sparse or hierarchical data. The current approach is to write a macro that returns a Jinja dictionary; it would be so much better to define a JSON/YML dictionary
    • (b) that would be better handled in templated SQL than in executed SQL, i.e. a mapping better accomplished by ordered when ... ilike clauses than by a regex join
    • (c) in a way that encourages ongoing user input, e.g. a list of metrics (and the elements of their definitions) to be included in a final metrics table. Metric definitions in YML would feel like configuration, letting macros handle all the SQL
  • Everything currently accomplished by the analysis folder: reporting or ad hoc queries that dynamically compile macro or ref calls
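As an illustration of the grants idea, an asset block holding a YML mapping could feed a hook macro. Everything below is hypothetical except the macro/hook machinery, and the role and privilege names are made up:

-- assets/grant_spec.yml (hypothetical)

{% asset grant_spec(type='yml') %}
reporter:
  - select
transformer:
  - select
  - insert
{% endasset %}

-- macros/apply_grants.sql, run e.g. as a post-hook; asset() would return a dict
{% macro apply_grants(relation) %}
  {% for role, privileges in asset('grant_spec').items() %}
    grant {{ privileges | join(', ') }} on {{ relation }} to {{ role }};
  {% endfor %}
{% endmacro %}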

+ Operations

  • Redshift: analyze or vacuum only specific tables
  • My biggest potential use case: External data sources. My pipe (!) dream is to share config info, especially column names, between external stages/table creation and source definition.
    • Redshift/Spectrum: Define properties of external schema + tables (including columns) as YML, then compile SQL for operations using macros. (Complication: can only drop/create/alter external tables as owner or superuser.)
    • Snowflake: Define external stages and snowpipes as an operation. Run refresh pipe, on all pipes known to dbt, as a separate and regularly scheduled operation.
  • Other common behavior that needs to live outside the context of a standard dbt run:
    • Logging cost/usage statistics for a specified list of tables, especially on Snowflake and BigQuery
    • Defining UDFs as SQL assets, fed through a database-specific macro that creates the UDFs in the session (see the sketch after this list)
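The UDF idea might look something like this. The asset syntax is hypothetical; the function body is ordinary Snowflake SQL, and all names are illustrative:

-- assets/udf_parse_useragent.sql (hypothetical)

{% asset udf_parse_useragent(type='sql') %}
create or replace function analytics.parse_useragent(ua varchar)
returns varchar
as
$$
  split_part(ua, '/', 1)
$$;
{% endasset %}

-- operation that (re)creates the UDF in the current session
{% macro create_udfs() %}
  {% call statement() %}
    {{ asset('udf_parse_useragent') }}
  {% endcall %}
{% endmacro %}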

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 28, 2021
@github-actions github-actions bot closed this as completed Jan 4, 2022