expression_is_true is costly when applied to a large table #683

Closed
4 tasks
basdunn opened this issue Sep 21, 2022 · 5 comments · Fixed by #686
Labels
bug Something isn't working good first issue

Comments

basdunn commented Sep 21, 2022

Describe the bug

When running an expression_is_true test, I noticed that the test scanned ~500 GB of data (on BigQuery), which in my opinion is extremely costly for a simple test.

Because of the way the test is set up (SELECT * in the final statement), the total cost of the test is the same as running SELECT * FROM TABLE_THAT_WE_TEST, which we know can be quite expensive for long and wide tables.

Steps to reproduce

  1. Create a large table
  2. Run SELECT * FROM LARGE_TABLE on BigQuery and check the cost
  3. Run an expression_is_true test against LARGE_TABLE and see that it is equally costly as the SELECT *
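As a concrete sketch of step 3 (model, column, and expression names here are hypothetical, for illustration only), a schema.yml entry like the following is enough to trigger the full-table scan:

```yaml
# schema.yml — hypothetical example; LARGE_TABLE and columns a/b/c are placeholders
models:
  - name: large_table
    tests:
      - dbt_utils.expression_is_true:
          expression: "a + b = c"
```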

Expected results

I expect a simple test to be really cheap; I don't want to have to factor in the cost of a simple column test while developing.

Actual results

It's expensive.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • [x] bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

<output goes here>

Additional context

Add any other context about the problem here. For example, if you think you know which line of code is causing the issue.

Are you interested in contributing the fix?

Sure!

expression_is_true.sql:

{% test expression_is_true(model, expression, column_name=None, condition='1=1') %}
{# T-SQL has no boolean data type so we use 1=1 which returns TRUE #}
{# ref https://stackoverflow.com/a/7170753/3842610 #}
  {{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, column_name, condition)) }}
{% endtest %}

{% macro default__test_expression_is_true(model, expression, column_name, condition) %}

with meet_condition as (
    select * from {{ model }} where {{ condition }}
)

select
    1 -- Change the * to a fixed single column, there might be some relevant info you could pass here for debugging, but just not ALL columns.
from meet_condition
{% if column_name is none %}
where not({{ expression }})
{%- else %}
where not({{ column_name }} {{ expression }})
{%- endif %}

{% endmacro %}
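For context, this is roughly what the unmodified test (before the change suggested in the comment above) compiles to for a hypothetical model and expression; the select * in the final statement is why BigQuery bills every column of the model:

```sql
-- approximate compiled test (hypothetical model `my_model`, expression `a + b = c`)
with meet_condition as (
    select * from `my_project`.`my_dataset`.`my_model` where 1=1
)

select
    *  -- selecting all columns makes the scan as expensive as SELECT * on the model
from meet_condition
where not(a + b = c)
```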
@basdunn basdunn added bug Something isn't working triage labels Sep 21, 2022

basdunn commented Sep 21, 2022

BTW, I haven't checked whether the same thing is happening in other tests. It might be valuable to check.

@joellabes (Contributor)

Hi @basdunn, thanks for opening this! There is prior art in dbt-labs/dbt-core#4777 if you wanted to follow the same pattern: select * if store_failures is enabled, and just the expression if not.


elyobo commented Sep 25, 2022

I'd noticed this a few times and came to raise an issue today, but found that @basdunn had already raised it. Thanks!

I think the prior art could be improved by allowing the specification of a list of columns to include: the 1 isn't all that informative when debugging a failing test. A specified list of relevant columns (used when storing failures) would be more useful, and still cheaper than the whole table for people whose tables are big enough to justify specifying them.


elyobo commented Sep 26, 2022

I've added a PR to address the suggestion here, but if someone points me at some useful examples I'd be interested in also adding the ability to specify a set of columns to extract, as that seems like more useful behaviour to me.

@joellabes (Contributor)

adding the ability to specify a set of columns to extract as well

@elyobo my gut reaction here is that this would be something that would make sense to define across multiple tests. So I'd recommend you open an issue in dbt-core. Something like this would be cool:

models:
  - name: my_model
    tests:
      - dbt_utils.expression_is_true:
          expression: a + b = c
          debugging_columns: [a, b, c, d] # horrible name, do not use this
    columns:
      - name: id
        tests:
          - not_null:
              debugging_columns: [id, e, f] # see that this is available to any test

And then we'd do something like

{% macro default__get_debugging_columns(expression, debugging_columns=none) %}
    {% if should_store_failures() and debugging_columns is not none %}
      {{ debugging_columns | join(", ") }}
    {% else %}
      * {# Should we default to pulling everything out where that access isn't billed? If not, then we don't need a second BQ-specific version of this #}
    {% endif %}
{% endmacro %}

{% macro bigquery__get_debugging_columns(expression, debugging_columns=none) %}
    {% if should_store_failures() %}
      {% if debugging_columns is not none %}
        {{ debugging_columns | join(", ") }}
      {% else %}
        *
      {% endif %}
    {% else %}
      {{ expression }} {# or maybe just `1` 🤷 #}
    {% endif %}
{% endmacro %}
