Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix recursive views perf #2043

Merged
merged 7 commits into from
Jul 27, 2022
Merged

Fix recursive views perf #2043

merged 7 commits into from
Jul 27, 2022

Conversation

collado-mike
Copy link
Collaborator

Problem

In #1928, we introduced a jobs_view, which used recursive queries to construct a job's fully qualified name by collecting the job's ancestry and joining the names. Additionally, #1946 introduced job symlinks, which also used recursive queries to find the last link in a chain of symlinks and always return the ultimate target job. Unfortunately, while initial testing showed those queries performed reasonably well on read requests, we didn't sufficiently load test the OpenLineage write API. Unfortunately, under sustained load in a real production environment, the write API (which repeatedly queries the jobs_view either directly or via the runs_view) puts considerable strain on our Postgres database.

Solution

One of the goals with using a view was to consistently construct job FQNs even when ancestors are renamed (using the symlink feature). This allows us to avoid big migration tasks updating the runs table with new job names when, e.g., an Airflow DAG or a Spark application is renamed, thus renaming all of its child tasks. However, constructing the FQN on every read is too computationally expensive, so this PR changes the view to compute the FQN on write. This continues to use the view, but now there is a TRIGGER on INSERT to compute the FQN for a job, as well as to traverse the symlink chain and write the new FQN and symlink target to a new table called jobs_fqn. This table also collects the past aliases of a job, so we can serve requests that still refer to the old job name, as described in the original symlink issue.

I utilized writes to the jobs_view in order to capture the write by the trigger, so the code now references the view instead of the original jobs table. This also allowed me to return the full row from the insert query rather than just the job UUID, as it previously did. I also simplified the uniqueness constraint by adding a parent_job_id_string field, which just converts the parent job id to a string or an empty string if null (we couldn't use a null UUID to enforce uniqueness for jobs that have no parents because NULL != NULL in Postgres).

Given the jobs_view and associated write function are likely to change in the future, I made their definitions repeatable migrations so that future changes can be captured in version control (I went ahead and did the same for the runs_view for the same reason, though unrelated to the purpose of this PR).

I also included an additional test for the DAG renaming use case.

The following snapshot shows performance in a load test both before and after the proposed code change. Both tests sustained a rate of 900 requests per minute to the OpenLineage write API. I used the same RDS db.m5.large instance for both tests (cleared the DB structure between tests).

(note that #2041 is included in the test deployment)

Screen Shot 2022-07-20 at 12 08 43 PM

With the current code at HEAD, the DB's CPU utilization climbs to 95%+ within about an hour, then hovers just below 100%. The request graph shows the request rate dropping after that because pods are consistently being restarted due to failed health checks (can't access the database), so eventually only about 50% of requests are being handled.

Conversely, the DB CPU utilization never rises about ~50% with the new code changes. The request rate stays pretty steady throughout the test (there are still some OOM restarts, but that's consistent with what we see in production with an older version of Marquez).

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: Michael Collado <[email protected]>
…changes can be easily compared in version control

Signed-off-by: Michael Collado <[email protected]>
@codecov
Copy link

codecov bot commented Jul 20, 2022

Codecov Report

Merging #2043 (35e1a14) into main (ee44ae0) will decrease coverage by 0.02%.
The diff coverage is n/a.

❗ Current head 35e1a14 differs from pull request most recent head 6435a0e. Consider uploading reports for the commit 6435a0e to get more accurate results

@@             Coverage Diff              @@
##               main    #2043      +/-   ##
============================================
- Coverage     78.81%   78.79%   -0.03%     
+ Complexity     1013     1011       -2     
============================================
  Files           200      200              
  Lines          5579     5573       -6     
  Branches        422      422              
============================================
- Hits           4397     4391       -6     
  Misses          730      730              
  Partials        452      452              
Impacted Files Coverage Δ
api/src/main/java/marquez/db/JobDao.java 100.00% <ø> (ø)
api/src/main/java/marquez/db/RunDao.java 92.50% <ø> (ø)

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Base automatically changed from runs_row_reduction to main July 26, 2022 20:31
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minors comments, but overall the changes seem reasonable (I double checked the SQL 😉). Thanks for the detailed writeup (as usual), @collado-mike! Also, since the DB write performance issues were observed only after deploying to production, having a not too exhaustive load test in CI with a report can help spot any DB bottlenecks sooner. I've opened #2047

@collado-mike collado-mike enabled auto-merge (squash) July 27, 2022 18:33
@collado-mike collado-mike merged commit 2c21ab0 into main Jul 27, 2022
@collado-mike collado-mike deleted the fix_recursive_views_perf branch July 27, 2022 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants