Add index on jobs_fqn namespace and fqn to optimize read queries #2357
Codecov Report
@@             Coverage Diff              @@
##               main    #2357      +/-   ##
============================================
- Coverage     77.11%   76.72%    -0.40%
+ Complexity     1234     1177      -57
============================================
  Files           228      222       -6
  Lines          5572     5354     -218
  Branches        447      429      -18
============================================
- Hits           4297     4108     -189
+ Misses          775      768       -7
+ Partials        500      478      -22
I was confused a couple of times by a pattern in Marquez when a table has both columns I think I would be in favour of:
Just in case someone would like the ability to rename namespaces in future ;-)
TBH, I don't know when that pattern started. I assume it was meant to make it easy to query the jobs/runs tables by namespace without incurring the cost of a join. Given the tiny size of the namespaces table and the fact that we always query by a single namespace, it was probably an unnecessary optimization. Here are the two queries I compared with EXPLAIN:

SELECT * FROM jobs_fqn
WHERE namespace_name='abcdefg'
AND job_fqn='a_job_name';

SELECT j.* FROM jobs_fqn j
INNER JOIN namespaces n ON j.namespace_uuid=n.uuid
WHERE n.name='abcdefg'
AND job_fqn='a_job_name';

Technically, the planner's cost estimate for the join is super high. In reality, the two queries execute in practically the same amount of time.
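The two query shapes above can be tried side by side with a small sketch. This is only an illustration under stated assumptions: it uses an in-memory SQLite database from Python's standard library rather than PostgreSQL, the table definitions are a simplified guess limited to the columns the queries reference, and the sample rows are made up. It shows that the denormalized lookup and the join return the same rows; it says nothing about PostgreSQL's actual costs.

```python
import sqlite3

# Assumption: simplified stand-ins for the Marquez tables, with only the
# columns that the two queries in the comment actually reference.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE namespaces (uuid TEXT PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE jobs_fqn (
        uuid           TEXT PRIMARY KEY,
        namespace_uuid TEXT REFERENCES namespaces(uuid),
        namespace_name TEXT,  -- denormalized copy of namespaces.name
        job_fqn        TEXT
    );
    -- Made-up sample data for illustration only.
    INSERT INTO namespaces VALUES ('ns-1', 'abcdefg'), ('ns-2', 'other');
    INSERT INTO jobs_fqn VALUES
        ('j-1', 'ns-1', 'abcdefg', 'a_job_name'),
        ('j-2', 'ns-1', 'abcdefg', 'another_job'),
        ('j-3', 'ns-2', 'other',   'a_job_name');
""")

# Query 1: filter on the denormalized namespace_name column, no join.
denormalized = conn.execute("""
    SELECT * FROM jobs_fqn
    WHERE namespace_name = 'abcdefg'
      AND job_fqn = 'a_job_name'
""").fetchall()

# Query 2: resolve the namespace through a join on namespace_uuid.
joined = conn.execute("""
    SELECT j.* FROM jobs_fqn j
    INNER JOIN namespaces n ON j.namespace_uuid = n.uuid
    WHERE n.name = 'abcdefg'
      AND job_fqn = 'a_job_name'
""").fetchall()

print(denormalized == joined)
```

Both forms select the same job row here, which is why the choice between them comes down to the query plan and to how easy it should be to rename a namespace later.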
@collado-mike That's interesting. Seems like Postgres prefers filtering first.
@pawel-big-lebowski, @collado-mike: I can provide some background on the denormalization efforts on some of the tables. To reduce (possibly expensive) joins, some tables have redundant data. One example being the
🛳️ it!
I did test out an index on Here's the plan if I do a join on the two tables along with an index on
Here's the plan with the proposed index.
Again, it is a slight gain, though TBH, I think it's probably an unnecessary optimization. But I don't propose we start undoing this pattern in this PR. That job is better left to a dedicated task with research into all of the impacted queries.
Signed-off-by: Michael Collado <[email protected]>
Problem
An index on the jobs table's name and namespace_name columns has existed for a long time, but no index was added to the newer jobs_fqn table's namespace_name and job_fqn columns.

Solution
Adds an index on the jobs_fqn table using the namespace_name and job_fqn columns.

EXPLAIN plans on a Marquez installation with ~100,000 jobs for the query
Original EXPLAIN plan:
New EXPLAIN plan:
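As a rough, locally runnable illustration of the change described under Solution, the sketch below creates a composite index on namespace_name and job_fqn and compares the query plan before and after. It is not the PostgreSQL output from the Marquez installation: it assumes an in-memory SQLite database and its EXPLAIN QUERY PLAN, a simplified jobs_fqn table, synthetic rows, and a hypothetical index name (jobs_fqn_namespace_name_job_fqn_idx).

```python
import sqlite3

# Assumption: a simplified jobs_fqn table with synthetic rows; SQLite stands
# in for PostgreSQL, so the plan text differs from a real Marquez database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs_fqn (uuid TEXT PRIMARY KEY, namespace_name TEXT, job_fqn TEXT)"
)
conn.executemany(
    "INSERT INTO jobs_fqn VALUES (?, ?, ?)",
    [(f"uuid-{i}", f"ns-{i % 10}", f"job-{i}") for i in range(1000)],
)

QUERY = "SELECT * FROM jobs_fqn WHERE namespace_name = ? AND job_fqn = ?"
PARAMS = ("ns-7", "job-7")


def plan(connection):
    """Return SQLite's plan description for QUERY as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


plan_before = plan(conn)

# Hypothetical index name; the column pair is the one named in the PR.
conn.execute(
    "CREATE INDEX jobs_fqn_namespace_name_job_fqn_idx "
    "ON jobs_fqn (namespace_name, job_fqn)"
)

plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index, SQLite reports a full scan of jobs_fqn; with it, the plan switches to a search that uses the new index on both columns. In a PostgreSQL deployment the equivalent check would be running EXPLAIN on the same query before and after the migration.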
Checklist
CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
.sql database schema migration according to Flyway's naming convention (if relevant)