Optimize BaseRelation.matches() #6844

peterallenwebb · 2023-02-02T15:36:16Z

resolves #6842

Description

This PR optimizes the BaseRelation.matches() function to avoid costly string processing and comparison operations that run a very large number of times during dbt build on certain large projects. On the large scenario described in #6842, it cut my local run time for dbt build from 23m to 14m.

As written, we would lose the ApproximateMatchError exception, since determining whether a relation is an approximate match was a large part of the time spent. We'll need to determine whether that is justified by the savings, or if there is a better way to avoid performing the check a large number of times.

At any rate, there is a lot of room for improvement in this bottleneck.

Checklist

I have read the contributing guide and understand what's expected of me
I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
I have run changie new to create a changelog entry

github-actions · 2023-02-02T15:36:41Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

peterallenwebb · 2023-02-02T15:50:49Z

@boxysean If you have a chance to kick the tires on this change, please do and let me know how it works for you. If it looks good to you, I'll discuss it with other stakeholders.

jtcohen6 · 2023-02-03T11:20:10Z

@peterallenwebb This is awesome!!

Perf bottlenecks in dbt

First, I want to offer some higher-level context for this change:

There are some "fixed costs" at the start of every run: parsing (files → manifest), graph generation (manifest → graph), cache population (run queries to build up relational cache on adapter).
All of those have to happen on the main thread, before we can do anything else, and they scale with the size of the overall project, rather than the number of nodes actually selected to run. That's particularly frustrating for developers in large projects who may just want to be running dbt build --select one_model. (There is an experimental config to modify cache population based on node selection.)
Once all those fixed costs are paid, we move into a queue. Now, each node can run, in parallel, on an independent thread—up to the number of --threads, depending on the shape of the DAG (interdependencies), and only the subset of nodes actually selected to run.
So: I would expect BaseRelation.get_relation() to run within those threads, and only for nodes that are actually selected to run. This does add up to a lot of time if running thousands of models, but I'd expect it to scale in proportion with that number of models, and to be parallelized across multiple threads. (Exception: In the case of deferral, we run adapter.get_relation as a fixed-cost step earlier on, to determine which upstream models do/don't exist in the target namespace.)
Finally, once we actually get into running models against the warehouse, dbt doesn't tend to be the performance bottleneck anymore. We're "I/O bound" by the time it takes to actually execute each query on the warehouse, which can be anywhere from <1s (create view) to many minutes (create table, merge). At this point, when users are thinking about performance, they're thinking about database performance, and dbt's role is mostly around making sure it's templating the right DDL/DML, with just the right keywords or magic incantations that make the data platform purr.
Of course, dbt can also try to avoid running duplicative queries. That's the motivation behind having the relational cache to begin with, populated at the beginning & accessed/updated as we go along. Those cache lookups should be much faster (overall) than running the same metadata queries, over and over, against the warehouse, while materializing every single model ("does this relation already exist?" "is it a table?").
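
To make the cache motivation concrete, here is a minimal sketch of "populate once, look up many times". FakeWarehouse and RelationCache are hypothetical names invented for illustration and are not dbt's actual classes; the only point is that the metadata query count stays at one no matter how many models ask "does this relation already exist?".

```python
from typing import Dict, Optional, Tuple


class FakeWarehouse:
    """Hypothetical stand-in for a data platform; counts metadata queries."""

    def __init__(self, relations: Dict[Tuple[str, str], str]) -> None:
        self._relations = relations  # (schema, name) -> relation type
        self.metadata_queries = 0

    def list_relations(self, schema: str) -> Dict[str, str]:
        # Each call represents one round trip to the warehouse.
        self.metadata_queries += 1
        return {
            name: kind
            for (rel_schema, name), kind in self._relations.items()
            if rel_schema == schema
        }


class RelationCache:
    """Hypothetical cache: filled once per schema, then answered from memory."""

    def __init__(self, warehouse: FakeWarehouse) -> None:
        self._warehouse = warehouse
        self._by_schema: Dict[str, Dict[str, str]] = {}

    def populate(self, schema: str) -> None:
        self._by_schema[schema] = self._warehouse.list_relations(schema)

    def relation_type(self, schema: str, name: str) -> Optional[str]:
        # No warehouse access here; a miss simply returns None.
        return self._by_schema.get(schema, {}).get(name)


warehouse = FakeWarehouse({("dbt_jcohen", "my_model"): "view"})
cache = RelationCache(warehouse)
cache.populate("dbt_jcohen")  # the fixed cost, paid once up front

# A thousand "does it exist, and is it a table?" questions, zero extra queries.
answers = [cache.relation_type("dbt_jcohen", "my_model") for _ in range(1000)]
```

The sketch ignores cache updates during the run and says nothing about how the lookup itself is implemented, which is the part this PR is concerned with.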

This PR: proposed trade-off

As written, we would lose the ApproximateMatchError exception, since determining whether a relation is an approximate match was a large part of the time spent. We'll need to determine whether that is justified by the savings, or if there is a better way to avoid performing the check a large number of times.

I could be open to making this change. The ApproximateMatchError is not particularly delightful for end users to see today. It is our attempt to surface an explicit error when dbt has missed matching up a user's defined relation to one in the relation cache, because of a very subtle discrepancy in casing/quoting. The alternative, unfortunately, is even more confusing behavior, whatever the fallout of missing the match turns out to be. As it is, though, it's already pretty hard to debug, and the UX loss of removing this extra detection does feel outweighed by the UX improvement of a significant speedup.

A little history:

The original logic here goes all the way back to 2018: Implement relations api #727
We modified this logic ~15 months ago (Adjust logic when finding approx matches for model or test matching #4076) as the resolution to an issue that was quite tricky to debug (Cache miss when alias includes explicit quotes, instead of configuring quoting: {identifier: true} #3835)

A concrete example

Here's a simple case for reproducing when this exception would be helpful. On Snowflake, the default case is uppercase (both ANSI-compliant and annoying), and quoted identifiers are case-sensitive. This was the bane of my existence in 2018; since then, we've disabled quoting for relation identifiers by default, and it's much more pleasant.

So if I create a model like:

-- models/my_model.sql
select 1 as id

When I dbt run, it templates out a SQL statement like:

create or replace view analytics.dbt_jcohen.my_model as (select 1 as id);

In Snowflake, this relation has a much more boisterous name: ANALYTICS.DBT_JCOHEN.MY_MODEL. It's unquoted, ergo case-insensitive, ergo uppercase. What happens, though, if I turn quoting on, for all dbt-created relations in my project?

# dbt_project.yml
quoting:
  identifier: true

Now, dbt is going to try to template a SQL statement like:

create or replace view analytics.dbt_jcohen."my_model" as (select 1 as id);

But we don't even get there, because first dbt populates the adapter cache, then it tries to match up my model with an entry in the cache, and it sees there's an almost but not quite matching entry. And we stop the whole thing in its tracks, because we want to avoid an ugly scenario.

$ dbt run
...
11:12:25  Compilation Error in model my_model (models/my_model.sql)
11:12:25    When searching for a relation, dbt found an approximate match. Instead of guessing
11:12:25    which relation to use, dbt will move on. Please delete "ANALYTICS"."DBT_JCOHEN"."MY_MODEL", or rename it to be less ambiguous.
11:12:25    Searched for: ANALYTICS.DBT_JCOHEN.my_model
11:12:25    Found: "ANALYTICS"."DBT_JCOHEN"."MY_MODEL"
11:12:25
11:12:25    > in macro create_or_replace_view (macros/materializations/models/view/create_or_replace_view.sql)
11:12:25    > called by macro materialization_view_snowflake (macros/materializations/view.sql)
11:12:25    > called by model my_model (models/my_model.sql)
...
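
For readers who want the distinction spelled out, here is a minimal sketch of "exact" versus "approximate" matching for the identifier in the log above. The two helper functions are hypothetical simplifications written for this example, not the code in core/dbt/adapters/base/relation.py. The assumption is that a quoted component compares case-sensitively, while the approximate check ignores both case and quote characters.

```python
def is_exact_match(cached: str, searched: str, quoted: bool) -> bool:
    """Hypothetical rule: quoted parts are case-sensitive, unquoted are not."""
    if quoted:
        return cached == searched
    return cached.lower() == searched.lower()


def is_approximate_match(cached: str, searched: str, quote_char: str = '"') -> bool:
    """Hypothetical rule: ignore case and surrounding quote characters."""
    return cached.lower().strip(quote_char) == searched.lower().strip(quote_char)


cached_identifier = "MY_MODEL"    # what the cache holds (see "Found:" above)
searched_identifier = "my_model"  # what the model asks for with quoting on

exact = is_exact_match(cached_identifier, searched_identifier, quoted=True)
approximate = is_approximate_match(cached_identifier, searched_identifier)

# "Almost but not quite": this is the condition that triggers the error.
would_raise = approximate and not exact
```

With quoting turned off, the same pair would count as an exact match, which is why the default configuration never hits this path.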

If I try the same, having checked out your branch, I don't get that exception—the model succeeds!—because we didn't get a match, and we didn't check for an approximate match either. dbt successfully created a view named analytics.dbt_jcohen."my_model". Of course, depending on which of these queries I run in Snowflake, I will actually be querying a different view:

select * from analytics.dbt_jcohen.my_model;
select * from analytics.dbt_jcohen."my_model";
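
Why those two queries hit different views can be sketched in a few lines. This is a simplified model of Snowflake identifier resolution under two assumptions: unquoted parts fold to uppercase, and quoted parts keep their case. The function names are made up for the example, and the parsing is deliberately naive (no dots or escaped quotes inside quoted names).

```python
def resolve_identifier(raw: str) -> str:
    """Fold an unquoted part to uppercase; keep a quoted part verbatim."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    return raw.upper()


def resolve_qualified_name(name: str) -> str:
    """Resolve each dot-separated part (assumes no dots inside quotes)."""
    return ".".join(resolve_identifier(part) for part in name.split("."))


unquoted = resolve_qualified_name("analytics.dbt_jcohen.my_model")
quoted = resolve_qualified_name('analytics.dbt_jcohen."my_model"')

# unquoted -> ANALYTICS.DBT_JCOHEN.MY_MODEL
# quoted   -> ANALYTICS.DBT_JCOHEN.my_model
two_different_views = unquoted != quoted
```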

It's a gross situation, no question. But, in keeping with what I said above, I'm not convinced that the ApproximateMatchError exception does a whole lot to make the situation less gross—it just shoves the grossness in the user's face, earlier and a bit more explicitly.

boxysean · 2023-02-03T12:31:34Z

Thanks for the explanation @jtcohen6!

I'd be curious to see some real-world results, but I won't have time in the next 1-2 weeks to review due to travel. A similar analysis to what I did here would help us determine the impact of @peterallenwebb's proposed change. I will ask my client to see if they could support.

I'd also be curious to see some unit tests on get_relation() 🙈

peterallenwebb · 2023-02-03T14:43:30Z

@jtcohen6 Yes, thanks very much for this clear explanation of where the practical performance concerns really are! I'll keep it in mind as I inevitably continue to tinker with performance. I definitely don't feel strongly about getting this change in if it is unlikely to make an impact under real world conditions.

jtcohen6 · 2023-02-03T17:40:28Z

@peterallenwebb If this proves to be a significant perf boost in an "in-the-wild" scenario, I'd be supportive of moving forward! Sounds like the next step here is testing with a real large project. We do have one of these for our own internal analytics :)

mikealfare · 2023-02-06T16:45:33Z

core/dbt/adapters/base/relation.py

-            target = self.create(database=database, schema=schema, identifier=identifier)
-            raise ApproximateMatchError(target, self)
+        if database is None and schema is None and identifier is None:
+            raise dbt.exceptions.DbtRuntimeError(


I would put this first since you don't need to run self._is_exactish_match(). I'm assuming that method is the expensive method. I'd also rephrase the if clause:

if not any((database, schema, identifier)): raise dbt.exceptions.DbtRuntimeError(...)

mikealfare · 2023-02-06T17:01:14Z

core/dbt/adapters/base/relation.py

-                "Tried to match relation, but no search path was passed!"
-            )
+        if identifier is not None and not self._is_exactish_match(
+            ComponentName.Identifier, identifier


I like the way it was originally written. I think the performance pickup comes from two places:

not looking for the approximate match

not exiting the for loop once a match was found

I think it could look something like this:

if not any((identifier, schema, database)):
    raise dbt.exceptions.DbtRuntimeError(...)

search = filter_null_values(
    {
        ComponentName.Identifier: identifier,
        ComponentName.Schema: schema,
        ComponentName.Database: database,
    }
)
return any(
    self._is_exactish_match(existing_components, new_component)
    for existing_components, new_component in search.items()
)

I'm pretty sure any() will lazily evaluate each element in the generator and then stop when it finds one, which is what you're trying to do.
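
A quick standalone check of that short-circuit behavior, using a hypothetical check() helper that records which components were actually evaluated. It also shows all() for comparison: if the intent is that every supplied component must match (an assumption on my part, not something stated in this thread), all() gives the same lazy evaluation but stops at the first mismatch instead of the first match.

```python
calls = []


def check(component: str, matches: bool) -> bool:
    """Hypothetical stand-in for an expensive per-component comparison."""
    calls.append(component)
    return matches


components = [("identifier", False), ("schema", True), ("database", True)]

# any() consumes the generator lazily and stops at the first True.
calls.clear()
any_result = any(check(name, ok) for name, ok in components)
any_calls = list(calls)  # ["identifier", "schema"]

# all() is just as lazy, but stops at the first False.
calls.clear()
all_result = all(check(name, ok) for name, ok in components)
all_calls = list(calls)  # ["identifier"]
```

Note that the early exit only happens with a generator expression; building a list first would evaluate every element before any() or all() sees it.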

peterallenwebb · 2023-02-13T18:59:58Z

@jtcohen6 @boxysean I'm probably going to close this particular PR for now, since there are risks to any effective improvement, and we have not demonstrated that it is needed in the field.

That said, I have some interesting parting observations...

Our current strategy for looking up relations in the cache is compute-intensive, since it has to account for all the ways a relation name might be quoted or cased. Our strategy also takes time linear in the number of tables in the cache. The entire list of tables in the cache is scanned for a match every time a relation is looked up. A lookup in a cache with 1000 entries will be ten times slower than one with 100 on average. With some development effort we could make this a constant-time lookup with much lower overhead.
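
To illustrate the difference in shape, here is a rough sketch comparing a linear scan with a dictionary keyed on a normalized name. It is not the quick/dirty implementation mentioned below, and normalize(), make_key(), and build_index() are hypothetical helpers. The normalization (strip quotes, lowercase) is an assumption that deliberately collapses casing and quoting differences, so a real version would still need to respect the quoting policy.

```python
from typing import Dict, List, Optional, Tuple

RelationKey = Tuple[str, str, str]  # (database, schema, identifier)


def normalize(part: str) -> str:
    """Assumed normalization: drop surrounding quotes and ignore case."""
    return part.strip('"').lower()


def make_key(database: str, schema: str, identifier: str) -> RelationKey:
    return (normalize(database), normalize(schema), normalize(identifier))


def linear_lookup(
    relations: List[RelationKey], wanted: RelationKey
) -> Optional[RelationKey]:
    """Scan every cached relation: cost grows with the size of the cache."""
    wanted_key = make_key(*wanted)
    for relation in relations:
        if make_key(*relation) == wanted_key:
            return relation
    return None


def build_index(relations: List[RelationKey]) -> Dict[RelationKey, RelationKey]:
    """Pay the normalization cost once, when the cache is populated."""
    return {make_key(*relation): relation for relation in relations}


relations = [("ANALYTICS", "DBT_JCOHEN", f"MODEL_{i}") for i in range(1000)]
index = build_index(relations)

wanted = ("analytics", "dbt_jcohen", "model_999")
by_scan = linear_lookup(relations, wanted)  # worst case: 1000 comparisons
by_index = index.get(make_key(*wanted))     # one hash lookup on average
```

Both calls return the same relation; only the amount of work per lookup differs.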

It's not clear how important this bottleneck is in real-world production scenarios, but the anonymized client project which @boxysean provided me has the following runtimes for dbt build on my local machine:

Stock dbt: 23 minutes
With the optimization from this PR: 14 minutes
With a quick/dirty implementation of constant-time lookup: 7 minutes
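
For scale, the relative savings implied by those three figures can be worked out directly. This is plain arithmetic on the local timings above, not a new measurement.

```python
stock_minutes = 23
pr_minutes = 14
constant_time_minutes = 7


def percent_reduction(before: float, after: float) -> float:
    """Share of the original run time that was eliminated, in percent."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the new run is."""
    return before / after


pr_reduction = round(percent_reduction(stock_minutes, pr_minutes))              # 39
lookup_reduction = round(percent_reduction(stock_minutes, constant_time_minutes))  # 70
pr_speedup = round(speedup(stock_minutes, pr_minutes), 1)                       # 1.6
lookup_speedup = round(speedup(stock_minutes, constant_time_minutes), 1)        # 3.3
```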

As @jtcohen6 has pointed out to me, our multithreading model might blunt the impact of the compute savings in real-world scenarios. It's still interesting how much overhead is being spent on this operation, though.

jtcohen6 · 2023-02-13T19:58:53Z

It's still interesting how much overhead is being spent on this operation, though.

Agree that it's very interesting. If you think the right next step is to close this specific PR for now, given some unknowns in the risks & benefits, I won't argue. I do think we should keep #6842 open as a promising lead to revisit for perf improvements in the future.

github-actions · 2023-08-13T01:45:28Z

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

github-actions · 2023-08-21T01:44:55Z

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

ct-2018: optimize BaseRelaton.matches()

b399b37

peterallenwebb requested a review from a team as a code owner February 2, 2023 15:36

peterallenwebb requested a review from mikealfare February 2, 2023 15:36

cla-bot bot added the cla:yes label Feb 2, 2023

peterallenwebb changed the title ~~Optimize BaseRelaton.matches()~~ Optimize BaseRelation.matches() Feb 3, 2023

mikealfare reviewed Feb 6, 2023


jtcohen6 mentioned this pull request Apr 10, 2023

New command: dbt clone #7258

Closed

9 tasks

github-actions bot added the stale Issues that have gone stale label Aug 13, 2023

github-actions bot closed this Aug 21, 2023

jtcohen6 mentioned this pull request Aug 23, 2023

[CT-2723] [spike+] Maximally parallelize dbt clone operations, a different mechanism for processing a queue #7914

Closed

peterallenwebb deleted the paw/ct-2018-get-relation-perf branch May 16, 2024 14:13
