
View data summary as part of dbt run. #3979

Closed
akashshah59 opened this issue Sep 29, 2021 · 3 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@akashshah59

Describe the feature

dbt has proved to be a great tool for organizing our data science ETLs. However, to view the results of an intermediate transformation, or the final output of an entire pipeline, data scientists have to log in to Snowflake and summarize the data there, using a LIMIT clause or computing basic statistics on their features. Moreover, as data scientists we frequently want to experiment with a feature and view the results immediately, which means going back into the data warehouse UI again and again.

While the current dbt project structure lets us scientists focus on code rather than boilerplate CREATE statements, it would additionally be useful to have a feature or flag that displays intermediate results from the views or tables created as part of the dbt run command.

An additional thought would be to return the intermediate results as a dataframe-like structure, allowing us to manipulate and play around with that data after performing a dbt run. As I visualize it: dbt view-data <model_name>

Describe alternatives you've considered

  1. dbt run --view-data True
  2. dbt view-data <model_name>

Additional context

This feature should not be database-specific; however, we use dbt with Snowflake, so it would be great to start there.
I understand that this might be difficult to implement for larger analytics jobs, especially those involving large data volumes, potentially gigabytes. We were hoping to get around that with some sampling strategy.
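One way to sketch the sampling idea for Snowflake specifically is to wrap the model's relation in a SAMPLE clause instead of scanning the whole table. The helper below is hypothetical (there is no such function in dbt); it only shows what the generated SQL might look like, assuming Snowflake's row-based SAMPLE (n ROWS) syntax:

```python
def build_sample_query(relation: str, n_rows: int = 10) -> str:
    """Build a Snowflake query returning roughly n_rows random rows
    from `relation`, instead of selecting the entire (possibly huge) table."""
    # Snowflake supports `SAMPLE (<n> ROWS)` for fixed-size row sampling.
    return f"select * from {relation} sample ({n_rows} rows)"

print(build_sample_query("analytics.my_model", 10))
```

Other warehouses would need different syntax (e.g. TABLESAMPLE variants), which is part of why a built-in feature would have to be adapter-aware.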

Who will this benefit?

Data scientists, engineers, and everyone who makes frequent changes to their dbt models.

Are you interested in contributing this feature?

Yes.

@akashshah59 akashshah59 added enhancement New feature or request triage labels Sep 29, 2021
@akashshah59 akashshah59 changed the title View data as part of dbt run. View data summary as part of dbt run. Sep 29, 2021
@jtcohen6 jtcohen6 removed the triage label Sep 30, 2021
@jtcohen6
Contributor

jtcohen6 commented Sep 30, 2021

Really cool ideas @akashshah59! There's a lot you're touching on here; I'll try to break some of it down.

Option 1: Show table samples in the terminal while dbt is running

This follows your proposal for dbt run --view-data True.

There's actually a way to do this in dbt today, but it's limited exclusively to dbt seed: a --show flag that's sort of undocumented and was originally implemented a long time ago (#618).

$ dbt seed --show
Running with dbt=0.21.0-rc1
Found 1 model, 1 test, 1 snapshot, 0 analyses, 350 macros, 0 operations, 1 seed file, 0 sources, 0 exposures

19:56:50 | Concurrency: 5 threads (target='dev')
19:56:50 |
19:56:50 | 1 of 1 START seed file dbt_jcohen.country_codes...................... [RUN]
19:56:50 | 1 of 1 OK loaded seed file dbt_jcohen.country_codes.................. [INSERT 247 in 0.16s]
19:56:50 |
19:56:50 | Finished running 1 seed in 0.46s.

Random sample of table: dbt_jcohen.country_codes
------------------------------------------------
| name             | two_letter_iso_code | three_letter_iso_... |
| ---------------- | ------------------- | -------------------- |
| Bolivia          | BO                  | BOL                  |
| Swaziland        | SZ                  | SWZ                  |
| Christmas Island | CX                  | CXR                  |
| Uganda           | UG                  | UGA                  |
| Switzerland      | CH                  | CHE                  |
| Andorra          | AD                  | AND                  |
| Montenegro       | ME                  | MNE                  |
| Chile            | CL                  | CHL                  |
| Comoros          | KM                  | COM                  |
| Nauru            | NR                  | NRU                  |
| ...              | ...                 | ...                  |


Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

In practice, this isn't so useful, but it is pretty cool. There's a related feature request (#3265) to do this for test failures, which is conceivable in conjunction with the --store-failures feature: the results are already in a table, ready to be sampled. I just managed to hack on this a bit locally, and I left a comment with some details: #3265 (comment). Again, this doesn't feel tremendously useful, or like it would be a game-changing feature, but it is fun.

Option 2: Use dbt to run interactive queries against your database

This is more like your proposal of a dedicated view-data command. I think this is the real answer—it's the ability to run arbitrary bits of dbt-SQL against your data warehouse, leveraging your existing connection details. The workflow looks a bit like:

$ dbt run -s my_model

Followed by:

$ dbt run-sql "select * from {{ ref('my_model') }} limit 10"

Or, to follow your example, a handy-dandy CLI command that shortcuts the above:

$ dbt sample my_model
$ dbt view-data <model_name>

That's exactly the kind of development workflow that the dbt Server exists to enable. The initial version (dbt rpc) is a bit janky today, so we're actually in the process of reworking it currently. Hoping to share more details soon.
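Under the hood, any such command would first have to render {{ ref('my_model') }} into the model's actual relation name before sending the query to the warehouse. A minimal stdlib sketch of that rendering step is below; the RELATIONS lookup table is hypothetical, and real dbt renders full Jinja against the project manifest rather than using a regex:

```python
import re

# Hypothetical mapping from model name to its materialized relation;
# dbt resolves this from the project manifest in practice.
RELATIONS = {"my_model": "analytics.dbt_jcohen.my_model"}

def render_refs(sql: str) -> str:
    """Replace {{ ref('name') }} calls in a dbt-SQL snippet with the
    corresponding relation names (regex sketch, not dbt's Jinja engine)."""
    pattern = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")
    return pattern.sub(lambda m: RELATIONS[m.group(1)], sql)

print(render_refs("select * from {{ ref('my_model') }} limit 10"))
```

The hard part a real server solves is everything around this line: connection handling, project context, macros, and adapters.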

An additional thought would be to return the intermediate results as a dataframe-like structure

Agree! The existing Server sort of does this, by storing results as an agate table (dataframe) and then returning them as a JSON object. I think this kind of workflow would be most compelling in a notebook application, where the notebook persists SQL cell results as data frames (available to R/python context), and a dbt Server makes it possible to run dbt-SQL in that notebook, with access to your full project context.
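The results-to-JSON shape that flow implies can be sketched with the standard library alone; agate does the heavy lifting inside dbt, and the column names and rows below are made up for illustration:

```python
import json

def rows_to_records(columns, rows):
    """Convert tabular query results (column names plus row tuples)
    into a list of dicts, i.e. the JSON shape a server could return
    and a notebook could load straight into a dataframe."""
    return [dict(zip(columns, row)) for row in rows]

columns = ["name", "two_letter_iso_code"]
rows = [("Bolivia", "BO"), ("Uganda", "UG")]
print(json.dumps(rows_to_records(columns, rows)))
```

From there, something like pandas.DataFrame(records) would recover the dataframe in the notebook context.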

coda

Of course, if you're really committed to doing it all from the command line, and avoiding a data warehouse console, there are some cool SQL-from-CLI tools (e.g. whale) :)

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label May 22, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
