Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make node selection O(n) #1615

Merged
merged 4 commits into from
Jul 17, 2019
Merged

Make node selection O(n) #1615

merged 4 commits into from
Jul 17, 2019

Conversation

beckjake
Copy link
Contributor

Fixes #1611

The culprit was get_ancestor_ephemeral_nodes, which looked at the set of all selected models and then asked for all ancestors of each... if you select all your models, that means for each model we will go through all models. Oops!

This fix is kind of a hack, but it just adds all ephemeral nodes to the run-list. Since ephemeral nodes run really fast (because they don't run), that seems fine. We will have to compile each ephemeral node regardless of your selection, but that is still O(n) 🙂

When we implement partial parsing/compilation it's possible this will become a pain point again, but only if you have an enormous number of ephemeral models and are selecting a tiny subset of your graph. I'll be quite happy to revisit the problem then, as we definitely could choose to memoize node selection (it would make the case of many overlapping +my_model calls faster, too).

I also removed the manifest.to_flat_graph() call that happens in context creation, which is another O(n^2) thing. This is kind of a breaking change but I'm not sorry. We never documented the graph entry in the environment, so people shouldn't have been using it anyway.

Old:

$ time dbt compile
Running with dbt=0.14.0
WARNING: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.my_new_package.example

Found 1000 models, 3000 tests, 0 snapshots, 0 analyses, 116 macros, 0 operations, 0 seed files, 0 sources

12:42:50 | Concurrency: 2 threads (target='default')
12:42:50 |
12:45:37 | Done.
dbt compile  269.66s user 4.44s system 100% cpu 4:33.21 total

New:

$ time dbt compile
Running with dbt=0.14.0
WARNING: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.my_new_package.example

Found 1000 models, 3000 tests, 0 snapshots, 0 analyses, 116 macros, 0 operations, 0 seed files, 0 sources

12:39:51 | Concurrency: 2 threads (target='default')
12:39:51 |
12:40:49 | Done.
dbt compile  95.04s user 2.90s system 100% cpu 1:37.79 total

Copy link
Member

@cmcarthur cmcarthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me. @drewbanin you should have a look too esp the part about removing graph

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment / question, but overall looks very good to me! Tried as hard as I could to break this and couldn't do it :)

core/dbt/context/common.py Show resolved Hide resolved
@drewbanin
Copy link
Contributor

One tiny other thing to note here: since we're always selecting all ephemeral models, dbt won't show a warning if a provided selector doesn't match any models (but > 0 ephemeral models exist in the project):

WARNING: Nothing to do. Try checking your model configs and model specification args

Can we just change the check in runnable.py to warn if there are no non-ephemeral nodes selected?

@beckjake
Copy link
Contributor Author

beckjake commented Jul 17, 2019

@drewbanin I added an early bail instead: if you have selected no nodes, the node selector won't add in the ephemeral nodes. That was a lot easier, hope it's ok!

@beckjake
Copy link
Contributor Author

The windows test failure you'll see in a bit is redshift's concurrent transaction nonsense, I really don't want to re-run those tests.

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Ship it

addins = self.get_ancestor_ephemeral_nodes(selected)
# if you haven't selected any nodes, return that so we can give the
# nice "no models selected" message.
if not selected:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great

@beckjake beckjake merged commit e46800f into dev/0.14.1 Jul 17, 2019
@beckjake beckjake deleted the fix/linear-time-selection branch July 17, 2019 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Node selection is quadratic in running time
3 participants