Optimize ORM usage, db_session instantiation, and tuning #3365
This change improves performance for ingesting deduplicated Signal instances. By optimizing ORM usage across the flow (primarily by no longer loading unnecessary relationships, columns, and rows), this path is about 4x faster: processing 500 deduplicated instances now takes roughly 10 seconds.
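As an illustration of the kind of ORM trimming involved, here is a minimal sketch with toy models (these are not Dispatch's real schema; SQLAlchemy's `load_only` and `noload` options are real, but the model and column names are assumptions):

```python
# Minimal sketch (toy models, not Dispatch's real schema): restrict an ORM
# query to the columns and relationships the code path actually uses.
from sqlalchemy import JSON, Column, DateTime, ForeignKey, Integer, create_engine
from sqlalchemy.orm import (
    declarative_base,
    load_only,
    noload,
    relationship,
    sessionmaker,
)

Base = declarative_base()

class Case(Base):
    __tablename__ = "case"
    id = Column(Integer, primary_key=True)
    signal_instances = relationship("SignalInstance", back_populates="case")

class SignalInstance(Base):
    __tablename__ = "signal_instance"
    id = Column(Integer, primary_key=True)
    case_id = Column(Integer, ForeignKey("case.id"))
    created_at = Column(DateTime(timezone=True))
    raw = Column(JSON)  # large payload we want to keep out of memory
    case = relationship("Case", back_populates="signal_instances")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Load only the columns this path reads, and skip every relationship;
# without these options the ORM can pull far more data than needed.
case = (
    session.query(Case)
    .options(load_only(Case.id), noload("*"))
    .filter(Case.id == 1)
    .one_or_none()
)
```

Restricting queries this way avoids fetching and deserializing columns and relationships the code path never touches.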
The `create_signal_messages` function had the most room for improvement: it was unnecessarily loading all signal_instances and their associated raw data. It previously took about 3 seconds (and progressively longer as more and more signals are ingested). It is now about 5-6x faster.

Before:
DEBUG:function.elapsed.time.dispatch.plugins.dispatch_slack.case.messages.create_signal_messages: 3.1490371670006425
After:
DEBUG:function.elapsed.time.dispatch.plugins.dispatch_slack.case.messages.create_signal_messages: 0.056947083001432475:/Users/wshel/Projects/dispatch/src/dispatch/decorators.py:wrapper:185
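A sketch of the shape of that fix, continuing the toy models above (the actual query in `create_signal_messages` may select different fields; this illustrates the single-column pattern):

```python
# Continuing the toy models from the sketch above. The slow version was
# effectively `for instance in case.signal_instances: ...`, which loads
# every instance row including its large raw payload. Selecting a single
# small column keeps the payloads in the database.
def signal_instance_ids(db_session, case_id: int) -> list[int]:
    rows = (
        db_session.query(SignalInstance.id)  # one small column, no raw JSON
        .filter(SignalInstance.case_id == case_id)
        .order_by(SignalInstance.id)
        .all()
    )
    return [row.id for row in rows]
```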
Creating a new case from scratch (non-deduped) now takes about 6 seconds, down from roughly 30:
DEBUG:function.elapsed.time.dispatch.signal.scheduled.process_signal_instance: 29.435003541992046
This improvement is primarily due to removing the creation of external resources (Google Docs, Drive, etc.) from this path.
The default dedupe filter is also faster: it now fetches a single row to determine whether to dedupe, instead of every row in the time window, and selects only the column it needs (case_id).
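A sketch of that pattern, continuing the toy models above (the real default dedupe filter matches on more criteria than a simple time window; the filter here is a stand-in):

```python
# Continuing the toy models above. One matching row is enough to decide
# whether to dedupe, so LIMIT 1 plus a single-column select replaces
# loading every SignalInstance in the window.
from datetime import datetime, timedelta, timezone

def dedupe_target_case_id(db_session, window_minutes=60):
    """Return the case_id to dedupe into, or None if no recent match."""
    since = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    return (
        db_session.query(SignalInstance.case_id)  # only the needed column
        .filter(SignalInstance.created_at >= since)
        .filter(SignalInstance.case_id.isnot(None))
        .order_by(SignalInstance.created_at.desc())
        .limit(1)
        .scalar()  # one row or None, instead of .all()
    )
```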
Testing