Large projects see significant slowdown in resolve_graph using v0.21.0 #4012

Closed · isaacsantelli opened this issue Oct 6, 2021 · 10 comments · Fixed by #4022 or #4898
Labels: bug (Something isn't working), performance, regression

@isaacsantelli

Describe the bug

Since upgrading to 0.21.0 I have seen a significant slowdown in how long dbt takes to run; this affects both dbt compile and dbt run. I am running on a Redshift data warehouse, though I don't know whether the issue is Redshift-specific.

Steps To Reproduce

Simply run dbt run or dbt compile; the issue occurs whether partial parsing is enabled or disabled.

Expected behavior

I expected it to run significantly faster; instead I saw a 3-5x slowdown.

Screenshots and log output

[two screenshots attached in the original issue]

System information

Which database are you using dbt with?

  • postgres
  • [√] redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.21.0
   latest version: 0.21.0

Up to date!

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:
Mac OS Big Sur 11.3.1
The output of python --version:
Python 3.7.5

Additional context

Add any other context about the problem here.

@isaacsantelli isaacsantelli added bug Something isn't working triage labels Oct 6, 2021
@jtcohen6
Contributor

jtcohen6 commented Oct 6, 2021

@isaacsantelli Weird! Thanks for the bug report.

It looks like, once dbt starts running nodes (Concurrency: stdout log line), everything is proceeding quite quickly. So it's got to be something with:

  • project parsing (though I see you've got partial parsing enabled)
  • metadata queries to build the adapter cache

Could you check the debug-level logs (logs/dbt.log), and see where the greatest gaps in timestamps are? Based on the messages that dbt is logging along the way, we should be able to figure out which steps have caused the significant slowdown.
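
(A minimal sketch, not part of dbt, of one way to surface the largest timestamp gaps in a debug log; the logs/dbt.log path and the timestamp format are assumptions based on the excerpts in this thread.)

import re
from datetime import datetime

# Match the leading timestamp on each debug log line,
# e.g. "2021-10-06 15:00:22.947280 (MainThread): ...".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

def largest_gaps(path="logs/dbt.log", top_n=5):
    entries = []
    with open(path) as f:
        for line in f:
            match = TS_RE.match(line)
            if match:
                ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
                entries.append((ts, line.rstrip()))
    # Pair each timestamped line with the next one and compute the delay between them.
    gaps = [
        ((after_ts - before_ts).total_seconds(), before, after)
        for (before_ts, before), (after_ts, after) in zip(entries, entries[1:])
    ]
    for seconds, before, after in sorted(gaps, reverse=True)[:top_n]:
        print(f"{seconds:8.1f}s gap\n  before: {before}\n  after:  {after}\n")

if __name__ == "__main__":
    largest_gaps()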

@jtcohen6 jtcohen6 added redshift and removed triage labels Oct 6, 2021
@isaacsantelli
Author

isaacsantelli commented Oct 6, 2021

Attached is the log from today: dbt.log

@jtcohen6
Contributor

jtcohen6 commented Oct 6, 2021

@isaacsantelli Thanks for the quick response. It looks like all metadata queries are running quite quickly. The weird thing I'm noticing is a significant slowdown associated with firing some anonymous usage events:

2021-10-06 15:00:22.947280 (MainThread): Running with dbt=0.21.0
2021-10-06 15:00:23.204965 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=False, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', parse_only=False, partial_parse=None, profile=None, profiles_dir='/Users/isaacsantelli/.dbt', project_dir=None, record_timing_info=None, rpc_method='compile', select=None, selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=None, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-06 15:00:23.205623 (MainThread): Tracking: tracking
2021-10-06 15:00:23.224004 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x106899150>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x106f65490>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x106f653d0>]}
--- NO DELAY HERE ---
2021-10-06 15:00:24.006624 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-06 15:00:24.007105 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-06 15:00:24.173568 (MainThread): Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '9d8ddbe5-286c-4be5-898e-4b23a11ac320', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10835ed10>]}
--- 60 SECOND DELAY ---
2021-10-06 15:01:35.086751 (MainThread): Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': '9d8ddbe5-286c-4be5-898e-4b23a11ac320', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1086192d0>]}
2021-10-06 15:01:35.087205 (MainThread): Found 313 models, 2798 tests, 5 snapshots, 0 analyses, 821 macros, 1 operation, 33 seed files, 236 sources, 0 exposures
--- 40 SECOND DELAY ---
2021-10-06 15:02:13.265683 (MainThread): 
2021-10-06 15:02:13.267495 (MainThread): Acquiring new redshift connection "master".

I don't know the details of your development/deployment environment, and whether those requests might be passing through some sort of proxy, or if the issue is with event collection on our end. Could you try disabling anonymous usage tracking, and see if that speeds up your invocations?

@ilmari-aalto

I'm experiencing the same issue, and disabling usage tracking doesn't help.

Here's the log entry before disabling usage tracking:

2021-10-06 16:38:21.524362 (MainThread): Running with dbt=0.21.0
2021-10-06 16:38:21.828803 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=False, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', parse_only=False, partial_parse=True, profile=None, profiles_dir='/Users/ilmari.aalto/.dbt', project_dir=None, record_timing_info=None, rpc_method='compile', select=None, selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=None, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-06 16:38:21.829559 (MainThread): Tracking: tracking
2021-10-06 16:38:21.841164 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10ae64a60>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10ae86610>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10ae90130>]}
2021-10-06 16:38:22.400905 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-06 16:38:22.401256 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-06 16:38:22.588468 (MainThread): Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '5043baeb-8ae1-40c4-aabb-1a273ced6e46', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10cfef100>]}
--- 40 SECOND DELAY ---
2021-10-06 16:39:01.165051 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-06 16:39:01.225095 (MainThread): Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': '5043baeb-8ae1-40c4-aabb-1a273ced6e46', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10cfefc70>]}
2021-10-06 16:39:01.225543 (MainThread): Found 335 models, 2713 tests, 3 snapshots, 5 analyses, 397 macros, 3 operations, 10 seed files, 135 sources, 0 exposures
--- 20 SECOND DELAY ---
2021-10-06 16:39:19.281222 (MainThread): 
2021-10-06 16:39:19.282803 (MainThread): Acquiring new redshift connection "master".

Here's after:

2021-10-06 16:45:46.255823 (MainThread): Running with dbt=0.21.0
2021-10-06 16:45:46.872313 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=False, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', parse_only=False, partial_parse=True, profile=None, profiles_dir='/Users/ilmari.aalto/.dbt', project_dir=None, record_timing_info=None, rpc_method='compile', select=None, selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=None, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-06 16:45:46.872989 (MainThread): Tracking: do not track
2021-10-06 16:45:47.574988 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-06 16:45:47.575376 (MainThread): Partial parsing enabled, no changes found, skipping parsing
--- 40 SECOND DELAY ---
2021-10-06 16:46:25.705940 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-06 16:46:25.765966 (MainThread): Found 335 models, 2713 tests, 3 snapshots, 5 analyses, 397 macros, 3 operations, 10 seed files, 135 sources, 0 exposures
--- 20 SECOND DELAY ---
2021-10-06 16:46:43.499142 (MainThread): 
2021-10-06 16:46:43.500753 (MainThread): Acquiring new redshift connection "master".

The output of dbt --version:

installed version: 0.21.0
   latest version: 0.21.0

Up to date!

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:
Mac OS Big Sur 11.4

The output of python3 --version:
Python 3.8.12

@isaacsantelli
Author

I set send_anonymous_usage_stats to False and the issue is persisting. Attached is an updated copy of my dbt log:
dbt.log

@isaacsantelli
Author

The logs show a gap of about 2 minutes with nothing logged in between:

2021-10-06 17:10:36.620771 (MainThread): Tracking: do not track
2021-10-06 17:10:37.441933 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-06 17:10:37.442459 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-06 17:12:06.817106 (MainThread): Found 313 models, 2798 tests, 5 snapshots, 0 analyses, 821 macros, 1 operation, 33 seed files, 236 sources, 0 exposures

@isaacsantelli
Author

Also, in case it's helpful: we have a DC2 Redshift cluster.

@jtcohen6
Contributor

jtcohen6 commented Oct 7, 2021

@ilmari-aalto @isaacsantelli Thanks for that additional info. I've been able to reproduce the issue by compiling a project with a lot of tests:

2021-10-07 10:57:02.979374 (MainThread): Running with dbt=0.21.0
2021-10-07 10:57:03.073803 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.compile.CompileTask'>, debug=True, defer=None, exclude=None, full_refresh=False, log_cache_events=False, log_format='default', parse_only=False, partial_parse=None, profile='garage-postgres', profiles_dir='/Users/jerco/.dbt', project_dir=None, record_timing_info=None, rpc_method='compile', select=None, selector_name=None, single_threaded=False, state=None, strict=False, target=None, test_new_parser=False, threads=None, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', version_check=True, warn_error=False, which='compile', write_json=True)
2021-10-07 10:57:03.074672 (MainThread): Tracking: tracking
2021-10-07 10:57:03.098361 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x110b2b670>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x111051550>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x111051760>]}
2021-10-07 10:57:04.412107 (MainThread): Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
2021-10-07 10:57:04.412464 (MainThread): Partial parsing enabled, no changes found, skipping parsing
2021-10-07 10:57:04.699027 (MainThread): Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'ef646720-f445-4177-afd1-9e835e0e389e', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1102d7dc0>]}
2021-10-07 10:58:59.490064 (MainThread): write_gpickle is deprecated and will be removed in 3.0.Use ``pickle.dump(G, path, protocol)``
2021-10-07 10:58:59.579928 (MainThread): Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'ef646720-f445-4177-afd1-9e835e0e389e', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11100b280>]}
2021-10-07 10:58:59.580330 (MainThread): Found 2000 models, 6000 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 0 sources, 0 exposures
2021-10-07 10:59:06.360928 (MainThread): 
2021-10-07 10:59:06.366090 (MainThread): Acquiring new postgres connection "master".

Based on the logging, I believe the primary delay is cropping up after parsing (manifest construction), during graph construction. (I'm not sure about the second delay, between Found ... and the first metadata query.)

@leahwicz I'm pretty sure this is related to the new dependency linking added for the test blocking feature in dbt build. It looks like we acknowledged that performance might be an issue several weeks ago: #3792 (comment).

To confirm this, I added timestamped lines to the beginning + end of resolve_graph:

Running with dbt=0.21.0
Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
Partial parsing enabled, no changes found, skipping parsing
13:05:58 | Start graph resolution
13:07:48 | End graph resolution
Found 2000 models, 6000 tests, 0 snapshots, 0 analyses, 162 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

This issue isn't unique to Redshift, and it's especially painful because partial parsing can't help here—users are going to feel this pain during every invocation of a runnable task.

I think our options are:

  • (long-term ideal) Refactor the approach to graph linking with performance front of mind
  • (workable, temporary) Turn off the new dependency linking for non-build tasks, to undo the regression in performance
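
For illustration, here is a minimal sketch of the idea behind the fix that eventually landed ("Use child_map to find tests for nodes in resolve_graph", per the commits referenced below): instead of rescanning every test for every node while linking the graph, build a node-to-dependent-tests index once and do a single lookup per node. The data structures are hypothetical stand-ins, not the actual dbt-core internals.

from collections import defaultdict

def tests_per_node_slow(nodes, tests):
    # O(len(nodes) * len(tests)): rescans every test for every node,
    # roughly the shape of work that slows down graph linking on large projects.
    return {
        node: [t for t, depends_on in tests.items() if node in depends_on]
        for node in nodes
    }

def tests_per_node_fast(nodes, tests):
    # Build a child_map once (node -> tests that depend on it);
    # each node then costs a single dictionary lookup.
    child_map = defaultdict(list)
    for test, depends_on in tests.items():
        for node in depends_on:
            child_map[node].append(test)
    return {node: child_map.get(node, []) for node in nodes}

# Tiny example: three models, tests keyed by the models they depend on.
nodes = ["model.a", "model.b", "model.c"]
tests = {
    "test.not_null_a": ["model.a"],
    "test.unique_b": ["model.b"],
    "test.relationship_ab": ["model.a", "model.b"],
}
assert tests_per_node_slow(nodes, tests) == tests_per_node_fast(nodes, tests)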

@jtcohen6 jtcohen6 changed the title Significant Slowdown in dbt 0.21.0 on redshift Large projects see significant slowdown in resolve_graph using v0.21.0 Oct 7, 2021
@ilmari-aalto

Thanks for looking at it quickly @jtcohen6! I wonder what's behind the second delay, but perhaps it was always there and I just didn't pay attention to it 😛. To be clear, the slowness doesn't affect our production runs, where the delay isn't relevant, but it does slow down interactive, local development.

I always thought our project was small to mid-size, but now I'm seeing it can be considered large 😄

@joshuataylor
Contributor

Our project is fair sized:
Found 218 models, 1409 tests, 0 snapshots, 0 analyses, 405 macros, 0 operations, 5 seed files, 79 sources, 0 exposures

0.21.0 is very slow to start after that line, even with partial parsing enabled. We're on Snowflake here.

gshank added a commit that referenced this issue Oct 14, 2021:
  [#4012] Performance: Use child_map to find tests for nodes in resolve_graph (#4057)
iknox-fa pushed a commit that referenced this issue Feb 8, 2022:
  …_graph (#4022) (automatic commit by git-black, original commit: fd7c95d)