Fix recursive views perf #2043

collado-mike · 2022-07-20T21:49:32Z

Problem

In #1928, we introduced a jobs_view, which used recursive queries to construct a job's fully qualified name (FQN) by collecting the job's ancestry and joining the names. Additionally, #1946 introduced job symlinks, which also used recursive queries to find the last link in a chain of symlinks and always return the ultimate target job. While initial testing showed those queries performed reasonably well on read requests, we didn't sufficiently load test the OpenLineage write API. Unfortunately, under sustained load in a real production environment, the write API (which repeatedly queries the jobs_view either directly or via the runs_view) puts considerable strain on our Postgres database.

Solution

One of the goals with using a view was to consistently construct job FQNs even when ancestors are renamed (using the symlink feature). This allows us to avoid big migration tasks updating the runs table with new job names when, e.g., an Airflow DAG or a Spark application is renamed, thus renaming all of its child tasks. However, constructing the FQN on every read is too computationally expensive, so this PR changes the view to compute the FQN on write. This continues to use the view, but now there is a TRIGGER on INSERT to compute the FQN for a job, as well as to traverse the symlink chain and write the new FQN and symlink target to a new table called jobs_fqn. This table also collects the past aliases of a job, so we can serve requests that still refer to the old job name, as described in the original symlink issue.
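To illustrate the compute-on-write idea, here is a minimal Python sketch. It is not the SQL trigger from this PR: the in-memory dictionaries, the function names, and the `.` separator used to join ancestor names are all assumptions made for illustration. It only shows the shape of the work that moves from read time to insert time: walk the ancestry once, follow the symlink chain once, and store the result.

```python
from typing import Dict, NamedTuple, Optional


class Job(NamedTuple):
    name: str
    parent: Optional[str]          # id of the parent job, if any
    symlink_target: Optional[str]  # id of the job this one points to, if any


# Hypothetical stand-ins for the jobs table and the jobs_fqn table.
jobs: Dict[str, Job] = {}
jobs_fqn: Dict[str, dict] = {}


def compute_fqn(job_id: str) -> str:
    """Walk up the parent chain and join the names (separator is assumed)."""
    parts = []
    seen = set()
    current: Optional[str] = job_id
    while current is not None and current not in seen:
        seen.add(current)
        job = jobs[current]
        parts.append(job.name)
        current = job.parent
    return ".".join(reversed(parts))


def resolve_symlink(job_id: str) -> str:
    """Follow symlink targets until reaching a job with no target."""
    seen = set()
    current = job_id
    while jobs[current].symlink_target is not None and current not in seen:
        seen.add(current)
        current = jobs[current].symlink_target
    return current


def insert_job(job_id: str, name: str,
               parent: Optional[str] = None,
               symlink_target: Optional[str] = None) -> dict:
    """Plays the role of the INSERT trigger: store the job, then store
    its precomputed FQN and ultimate symlink target."""
    jobs[job_id] = Job(name, parent, symlink_target)
    jobs_fqn[job_id] = {
        "fqn": compute_fqn(job_id),
        "target": resolve_symlink(job_id),
    }
    return jobs_fqn[job_id]
```

Reads then become a plain lookup in `jobs_fqn` instead of a recursive traversal. This sketch does not refresh the stored FQNs of descendants when an ancestor is renamed, and it does not track past aliases.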

Writes now go to the jobs_view so that the trigger can capture them, so the code references the view instead of the original jobs table. This also allowed me to return the full row from the insert query rather than just the job UUID, as it previously did. I also simplified the uniqueness constraint by adding a parent_job_id_string field, which holds the parent job id converted to a string, or an empty string if the parent is null. We couldn't use a null UUID to enforce uniqueness for jobs that have no parents because Postgres treats NULL values as distinct from one another in unique constraints.
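The NULL behavior behind the parent_job_id_string column can be reproduced with a small sketch. This one uses Python's built-in `sqlite3` module as a stand-in for Postgres, since SQLite also treats NULLs as distinct in a UNIQUE constraint. The table layout and the `namespace`, `name`, and `parent_job_uuid` column names are assumptions for illustration, not the actual Marquez schema.

```python
import sqlite3


def rows_after_duplicate_insert(use_string_column: bool) -> int:
    """Insert the same parentless job twice and return how many rows exist."""
    # Pick which column takes part in the uniqueness constraint.
    unique_col = "parent_job_id_string" if use_string_column else "parent_job_uuid"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " namespace TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " parent_job_uuid TEXT,"                # NULL when the job has no parent
        " parent_job_id_string TEXT NOT NULL,"  # '' when the job has no parent
        f" UNIQUE (namespace, name, {unique_col}))"
    )
    for _ in range(2):
        parent = None  # a job without a parent
        try:
            conn.execute(
                "INSERT INTO jobs VALUES (?, ?, ?, ?)",
                ("ns", "job", parent, parent or ""),
            )
        except sqlite3.IntegrityError:
            pass  # the duplicate was rejected by the unique constraint
    count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    conn.close()
    return count
```

With the constraint on the nullable UUID column, both inserts succeed and the duplicate slips through; with the constraint on the string column, the second insert is rejected.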

Given the jobs_view and associated write function are likely to change in the future, I made their definitions repeatable migrations so that future changes can be captured in version control (I went ahead and did the same for the runs_view for the same reason, though unrelated to the purpose of this PR).

I also included an additional test for the DAG renaming use case.

The following snapshot shows performance in a load test both before and after the proposed code change. Both tests sustained a rate of 900 requests per minute to the OpenLineage write API. I used the same RDS db.m5.large instance for both tests (cleared the DB structure between tests).

(note that #2041 is included in the test deployment)

With the current code at HEAD, the DB's CPU utilization climbs to 95%+ within about an hour, then hovers just below 100%. The request graph shows the request rate dropping after that because pods are consistently being restarted due to failed health checks (can't access the database), so eventually only about 50% of requests are being handled.

Conversely, the DB CPU utilization never rises above ~50% with the new code changes. The request rate stays pretty steady throughout the test (there are still some OOM restarts, but that's consistent with what we see in production with an older version of Marquez).

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

You've signed-off your work
Your changes are accompanied by tests (if relevant)
Your change contains a small diff and is self-contained
You've updated any relevant documentation (if relevant)
You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
You've included a header in any source code files (if relevant)


codecov · 2022-07-20T21:52:02Z

Codecov Report

Merging #2043 (35e1a14) into main (ee44ae0) will decrease coverage by 0.02%.
The diff coverage is n/a.

❗ Current head 35e1a14 differs from pull request most recent head 6435a0e. Consider uploading reports for the commit 6435a0e to get more accurate results

@@             Coverage Diff              @@
##               main    #2043      +/-   ##
============================================
- Coverage     78.81%   78.79%   -0.03%     
+ Complexity     1013     1011       -2     
============================================
  Files           200      200              
  Lines          5579     5573       -6     
  Branches        422      422              
============================================
- Hits           4397     4391       -6     
  Misses          730      730              
  Partials        452      452

Impacted Files	Coverage Δ
api/src/main/java/marquez/db/JobDao.java	`100.00% <ø> (ø)`
api/src/main/java/marquez/db/RunDao.java	`92.50% <ø> (ø)`


wslulciuc

I left a few minor comments, but overall the changes seem reasonable (I double checked the SQL 😉). Thanks for the detailed writeup (as usual), @collado-mike! Also, since the DB write performance issues were observed only after deploying to production, having a not too exhaustive load test in CI with a report can help spot any DB bottlenecks sooner. I've opened #2047.

api/src/main/resources/marquez/db/migration/V45__update_jobs_view_rule.sql

api/src/main/resources/marquez/db/migration/R__1_jobs_view_and_rewrite_function.sql

api/src/main/resources/marquez/db/migration/R__2_runs_view.sql


collado-mike added 5 commits July 20, 2022 12:27

Changed RunDao to use simple RunRow rather than ExtendedRunRow where …

7e0f47a

…possible Signed-off-by: Michael Collado <[email protected]>

Remove need to query JobRow on run completion

5c88d5b

Signed-off-by: Michael Collado <[email protected]>

Refactor jobs_view to use job_fqn table

cf5872b

Signed-off-by: Michael Collado <[email protected]>

Update changelog

5fc8be6

Signed-off-by: Michael Collado <[email protected]>

Move jobs_view and runs_view to repeatable migrations so that future …

d0738d8

…changes can be easily compared in version control Signed-off-by: Michael Collado <[email protected]>

collado-mike requested a review from wslulciuc July 20, 2022 21:49

Base automatically changed from runs_row_reduction to main July 26, 2022 20:31

wslulciuc mentioned this pull request Jul 26, 2022

Add API load test in CI #2047

Closed

wslulciuc approved these changes Jul 26, 2022


collado-mike and others added 2 commits July 26, 2022 16:16

Merge branch 'main' into fix_recursive_views_perf

35e1a14

Address comments for column names and migration files

6435a0e

Signed-off-by: Michael Collado <[email protected]>

collado-mike enabled auto-merge (squash) July 27, 2022 18:33

collado-mike merged commit 2c21ab0 into main Jul 27, 2022

collado-mike deleted the fix_recursive_views_perf branch July 27, 2022 18:38
