Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update SQL in backfill script for facet tables to improve performance #2461

Merged
merged 1 commit into from
Mar 30, 2023

Conversation

collado-mike
Copy link
Collaborator

Problem

The backfill script to populate the facet tables relies on some SQL that consistently times out on large Marquez installations. E.g., the initial lock query relies on ordering the lineage_events table, which takes more than 10 minutes (I don't know how long it actually takes because I stopped increasing the statement timeout at this point) just to determine that the installation is too large to apply the migration. Additionally, populating the facets takes more than 8 mins per batch, even when reducing the batch size from 10,000 to 1,000. In an installation of ~30M records, the migration at that rate will take multiple weeks to complete.

Solution

The SQL changes here introduce a temp table that tracks the runs that need to be migrated and use that table to handle sorting the runs by time. Using the run_uuid allows us to take advantage of the index on lineage_events so that fetching the events happens much faster. As there is typically two events per run, it effectively doubles the batch size. Using the default chunk size of 10,000, with an effective batch size of ~20,000, the query now completes in about 45 seconds.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Mar 29, 2023
@collado-mike collado-mike marked this pull request as draft March 29, 2023 17:34
… on large installations

Signed-off-by: Michael Collado <[email protected]>
@collado-mike collado-mike force-pushed the fix/backfill_facet_table_sql_improvements branch from 514c2d5 to e29e062 Compare March 29, 2023 17:58
@collado-mike collado-mike marked this pull request as ready for review March 29, 2023 18:04
@codecov
Copy link

codecov bot commented Mar 29, 2023

Codecov Report

Merging #2461 (e29e062) into main (60581e3) will increase coverage by 0.71%.
The diff coverage is 89.47%.

@@             Coverage Diff              @@
##               main    #2461      +/-   ##
============================================
+ Coverage     83.60%   84.32%   +0.71%     
+ Complexity     1213      985     -228     
============================================
  Files           231      191      -40     
  Lines          5520     4560     -960     
  Branches        266      231      -35     
============================================
- Hits           4615     3845     -770     
+ Misses          762      600     -162     
+ Partials        143      115      -28     
Impacted Files Coverage Δ
...a/marquez/db/migrations/V57_1__BackfillFacets.java 85.96% <89.47%> (-4.58%) ⬇️

... and 40 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤩

@wslulciuc wslulciuc merged commit e7abc0a into main Mar 30, 2023
@wslulciuc wslulciuc deleted the fix/backfill_facet_table_sql_improvements branch March 30, 2023 16:39
Xavier-Cliquennois pushed a commit to Xavier-Cliquennois/marquez that referenced this pull request Jul 26, 2023
… on large installations (MarquezProject#2461)

Signed-off-by: Michael Collado <[email protected]>
Signed-off-by: Xavier-Cliquennois <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants