
View data summary as part of dbt run. #3979

Closed
akashshah59 opened this issue Sep 29, 2021 · 3 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@akashshah59

Describe the feature

dbt has proved to be a great tool for organizing our data science ETLs. However, to view the results of an intermediate transformation, or the final output of an entire pipeline, data scientists have to log in to Snowflake and summarize the data there, using a LIMIT clause or computing basic statistics on their features. Moreover, as data scientists we frequently want to experiment with a feature and view the results immediately, which means going back into the data warehouse UI again and again.

While the current dbt project structure lets us scientists focus on code rather than boilerplate CREATE statements, it would additionally be useful to have a feature or flag that displays intermediate results from the views or tables created as part of the dbt run command.

An additional thought would be to return the intermediate results as a dataframe-like structure, allowing us to manipulate and play around with that data after performing a dbt run. As I visualize it: dbt view-data <model_name>

Describe alternatives you've considered

  1. dbt run --view-data True
  2. dbt view-data <model_name>

Additional context

This feature should not be database-specific; however, we use dbt with Snowflake, so it would be great to start there.
I understand that this might be difficult to implement for larger analytics jobs, especially those involving large data volumes, potentially gigabytes. We were hoping to get around that with some sampling strategy.
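One way to sketch the sampling idea for Snowflake specifically is to wrap the model's relation in a SAMPLE clause instead of scanning the whole table. The helper below is hypothetical (there is no such function in dbt); it only shows what the generated SQL might look like, assuming Snowflake's row-based SAMPLE (n ROWS) syntax:

```python
def build_sample_query(relation: str, n_rows: int = 10) -> str:
    """Build a Snowflake query returning roughly n_rows random rows
    from `relation`, instead of selecting the entire (possibly huge) table."""
    # Snowflake supports `SAMPLE (<n> ROWS)` for fixed-size row sampling.
    return f"select * from {relation} sample ({n_rows} rows)"

print(build_sample_query("analytics.my_model", 10))
```

Other warehouses would need different syntax (e.g. TABLESAMPLE variants), which is part of why a built-in feature would have to be adapter-aware.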

Who will this benefit?

Data scientists, engineers, and everyone who makes frequent changes to their dbt models.

Are you interested in contributing this feature?

Yes.

@akashshah59 akashshah59 added enhancement New feature or request triage labels Sep 29, 2021
@akashshah59 akashshah59 changed the title View data as part of dbt run. View data summary as part of dbt run. Sep 29, 2021
@jtcohen6 jtcohen6 removed the triage label Sep 30, 2021
@jtcohen6
Contributor

jtcohen6 commented Sep 30, 2021

Really cool ideas @akashshah59! There's a lot you're touching on here; I'll try to break some of it down.

Option 1: Show table samples in the terminal while dbt is running

This follows your proposal for dbt run --view-data True.

There's actually a way to do this in dbt today, but it's limited exclusively to dbt seed: a --show flag that's sort of undocumented and was originally implemented a long time ago (#618).

$ dbt seed --show
Running with dbt=0.21.0-rc1
Found 1 model, 1 test, 1 snapshot, 0 analyses, 350 macros, 0 operations, 1 seed file, 0 sources, 0 exposures

19:56:50 | Concurrency: 5 threads (target='dev')
19:56:50 |
19:56:50 | 1 of 1 START seed file dbt_jcohen.country_codes...................... [RUN]
19:56:50 | 1 of 1 OK loaded seed file dbt_jcohen.country_codes.................. [INSERT 247 in 0.16s]
19:56:50 |
19:56:50 | Finished running 1 seed in 0.46s.

Random sample of table: dbt_jcohen.country_codes
------------------------------------------------
| name             | two_letter_iso_code | three_letter_iso_... |
| ---------------- | ------------------- | -------------------- |
| Bolivia          | BO                  | BOL                  |
| Swaziland        | SZ                  | SWZ                  |
| Christmas Island | CX                  | CXR                  |
| Uganda           | UG                  | UGA                  |
| Switzerland      | CH                  | CHE                  |
| Andorra          | AD                  | AND                  |
| Montenegro       | ME                  | MNE                  |
| Chile            | CL                  | CHL                  |
| Comoros          | KM                  | COM                  |
| Nauru            | NR                  | NRU                  |
| ...              | ...                 | ...                  |


Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

In practice, this isn't so useful, but it is pretty cool. There's a related feature request (#3265) to do this for test failures, which is conceivable in conjunction with the --store-failures feature: the results are already in a table, ready to be sampled. I just managed to hack on this a bit locally, and I left a comment with some details: #3265 (comment). Again, this doesn't feel tremendously useful, or like it would be a game-changing feature, but it is fun.

Option 2: Use dbt to run interactive queries against your database

This is more like your proposal of a dedicated view-data command. I think this is the real answer—it's the ability to run arbitrary bits of dbt-SQL against your data warehouse, leveraging your existing connection details. The workflow looks a bit like:

$ dbt run -s my_model

Followed by:

$ dbt run-sql "select * from {{ ref('my_model') }} limit 10"

Or, to follow your example, a handy-dandy CLI command that shortcuts the above:

$ dbt sample my_model
$ dbt view-data <model_name>

That's exactly the kind of development workflow that the dbt Server exists to enable. The initial version (dbt rpc) is a bit janky today, so we're actually in the process of reworking it currently. Hoping to share more details soon.
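Under the hood, any such command would first have to render {{ ref('my_model') }} into the model's actual relation name before sending the query to the warehouse. A minimal stdlib sketch of that rendering step is below; the RELATIONS lookup table is hypothetical, and real dbt renders full Jinja against the project manifest rather than using a regex:

```python
import re

# Hypothetical mapping from model name to its materialized relation;
# dbt resolves this from the project manifest in practice.
RELATIONS = {"my_model": "analytics.dbt_jcohen.my_model"}

def render_refs(sql: str) -> str:
    """Replace {{ ref('name') }} calls in a dbt-SQL snippet with the
    corresponding relation names (regex sketch, not dbt's Jinja engine)."""
    pattern = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")
    return pattern.sub(lambda m: RELATIONS[m.group(1)], sql)

print(render_refs("select * from {{ ref('my_model') }} limit 10"))
```

The hard part a real server solves is everything around this line: connection handling, project context, macros, and adapters.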

An additional thought would be to return the intermediate results as a dataframe-like structure

Agree! The existing Server sort of does this, by storing results as an agate table (dataframe) and then returning them as a JSON object. I think this kind of workflow would be most compelling in a notebook application, where the notebook persists SQL cell results as data frames (available to R/python context), and a dbt Server makes it possible to run dbt-SQL in that notebook, with access to your full project context.
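The results-to-JSON shape that flow implies can be sketched with the standard library alone; agate does the heavy lifting inside dbt, and the column names and rows below are made up for illustration:

```python
import json

def rows_to_records(columns, rows):
    """Convert tabular query results (column names plus row tuples)
    into a list of dicts, i.e. the JSON shape a server could return
    and a notebook could load straight into a dataframe."""
    return [dict(zip(columns, row)) for row in rows]

columns = ["name", "two_letter_iso_code"]
rows = [("Bolivia", "BO"), ("Uganda", "UG")]
print(json.dumps(rows_to_records(columns, rows)))
```

From there, something like pandas.DataFrame(records) would recover the dataframe in the notebook context.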

coda

Of course, if you're really committed to doing it all from the command line, and avoiding a data warehouse console, there are some cool SQL-from-CLI tools (e.g. whale) :)

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label May 22, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
