Update insert job function to avoid joining on symlinks for jobs that have no symlinks #2144
Problem
Typical Marquez installations don't create a large number of new jobs on a regular basis. In a small number of installations, however, many new jobs are created, and each one executes the rewrite_jobs_fqn_table function, putting stress on the backing database. Most of the compute cost of this function is in computing the symlinks and aliases for jobs, even when the inserted job has no symlink.
Closes: #ISSUE-NUMBER
Solution
Adding a check for the symlink field and offering a lower cost query in cases when no symlink is present (the norm) radically reduces the database compute load in Marquez installations that frequently create a large number of new jobs.
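As a rough illustration of the idea (not the actual Marquez migration), the branch on the symlink field can be sketched as follows. Everything here is an assumption made for the example: the two-table schema, the column names, the `insert_job` helper, and the use of SQLite in place of PostgreSQL. The only point is that the recursive symlink walk is skipped when the inserted job has no symlink.

```python
import sqlite3


def make_db():
    """Create a hypothetical, heavily simplified schema (not Marquez's real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (
            uuid TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            symlink_target_uuid TEXT      -- NULL in the common case: no symlink
        );
        CREATE TABLE jobs_fqn (
            uuid TEXT PRIMARY KEY,
            job_fqn TEXT NOT NULL,
            resolved_uuid TEXT NOT NULL   -- job reached after following symlinks
        );
        """
    )
    return conn


def insert_job(conn, uuid, name, symlink_target_uuid=None):
    """Insert a job and its jobs_fqn row; return which path was taken."""
    conn.execute(
        "INSERT INTO jobs VALUES (?, ?, ?)", (uuid, name, symlink_target_uuid)
    )

    if symlink_target_uuid is None:
        # Cheap path: no symlink, so the job resolves to itself and no
        # join or recursion over the jobs table is needed.
        conn.execute("INSERT INTO jobs_fqn VALUES (?, ?, ?)", (uuid, name, uuid))
        return "fast"

    # Expensive path: follow the symlink chain with a recursive CTE.
    # UNION (rather than UNION ALL) deduplicates rows, so a cycle terminates.
    row = conn.execute(
        """
        WITH RECURSIVE chain(uuid, target) AS (
            SELECT uuid, symlink_target_uuid FROM jobs WHERE uuid = ?
            UNION
            SELECT j.uuid, j.symlink_target_uuid
            FROM jobs j JOIN chain c ON j.uuid = c.target
        )
        SELECT uuid FROM chain WHERE target IS NULL
        """,
        (uuid,),
    ).fetchone()

    # Fall back to the job itself if the chain is broken or cyclic.
    resolved = row[0] if row else uuid
    conn.execute("INSERT INTO jobs_fqn VALUES (?, ?, ?)", (uuid, name, resolved))
    return "slow"


conn = make_db()
paths = [
    insert_job(conn, "a", "job_a"),                           # no symlink
    insert_job(conn, "b", "job_b", symlink_target_uuid="a"),  # b -> a
    insert_job(conn, "c", "job_c", symlink_target_uuid="b"),  # c -> b -> a
]
resolved = dict(conn.execute("SELECT uuid, resolved_uuid FROM jobs_fqn"))
```

In this sketch only jobs `b` and `c` pay for the recursive query; job `a`, which stands in for the common no-symlink case, is written with a single plain insert.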
The following graph shows query count, query latency, and database CPU utilization under a test load in which many new jobs are created. The test load was several days of real production OpenLineage events replayed on a dev instance. To verify the results, I ran the same test twice for both the old query and the new one. To reproduce the same load, I renamed all of the jobs in the database so that the replayed events show up as new jobs that invoke the job creation query. Under heavy load, the old job creation query drives database CPU utilization to 100% and query latency as high as 2 seconds. Under the same load, the new query drives CPU utilization to around 30% and query latency to around 300 microseconds.
Note that the query latency in this graph is shown at log scale (right axis). Otherwise, the latency for the new query would be indistinguishable from 0.
Checklist
CHANGELOG.md updated with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
.sql database schema migration versioned according to Flyway's naming convention (if relevant)