
Materialized CTE performance bottleneck #720

Closed
mdavidn opened this issue Oct 10, 2022 · 3 comments
mdavidn commented Oct 10, 2022

I wanted to document a performance bottleneck I encountered when enqueuing approximately 150,000 small, low-priority jobs at once with 20 workers. The scope GoodJob::Lockable.advisory_lock materializes a CTE that sorts and returns a list of all candidate jobs. This query takes about three seconds to lock each job under these conditions. This pegs the database CPU.
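For context, the query generated by that scope is roughly of the following shape (a simplified sketch, not GoodJob's exact SQL; the advisory-lock key derivation shown here is illustrative):

```sql
-- Simplified sketch of the advisory-lock query (not GoodJob's exact SQL).
-- The materialized CTE sorts and returns *every* candidate job before the
-- outer query attempts to take a single advisory lock.
WITH rows AS MATERIALIZED (
  SELECT id
  FROM good_jobs
  WHERE finished_at IS NULL
  ORDER BY priority DESC NULLS LAST, created_at ASC
)
SELECT *
FROM good_jobs
WHERE id IN (
  SELECT id FROM rows
  WHERE pg_try_advisory_lock(
    ('x' || substr(md5(id::text), 1, 16))::bit(64)::bigint  -- illustrative key
  )
  LIMIT 1
);
```

With ~150,000 unfinished rows, the CTE sorts and materializes all of them on every lock attempt, which is where the ~3 seconds per job goes.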

The performance of this query can be dramatically improved, to 0.1 ms, by making two changes:

  1. Limit the number of rows materialized by the CTE to the number of workers across all good_job processes. The other workers can hold at most one fewer advisory lock than that, so the materialized rows always include at least one lockable job.
  2. Create an index matching the query's sort and conditions:
    USING btree (priority DESC NULLS LAST, created_at ASC) WHERE finished_at IS NULL
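Concretely, the second change might be written as follows (a sketch; the index name is made up, and good_jobs is GoodJob's default table):

```sql
-- Partial btree index matching the locking query's ORDER BY and WHERE.
-- Name and exact definition here are illustrative, not GoodJob's.
CREATE INDEX index_good_jobs_on_priority_created_at_unfinished
  ON good_jobs
  USING btree (priority DESC NULLS LAST, created_at ASC)
  WHERE finished_at IS NULL;

-- Change 1 would add a LIMIT to the CTE's SELECT, e.g. `... LIMIT 20`
-- for a deployment with 20 workers, so the sort can stop early and be
-- served directly from this index.
```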

The number of workers can change over time but is generally stable. The number could be cached in each process, perhaps refreshing at some interval after the query returns no available jobs.
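The caching idea could be sketched along these lines (illustrative Ruby, not GoodJob code; the block passed to the cache is a stand-in for however the process would count workers):

```ruby
# Illustrative sketch: cache a slowly-changing value (e.g. the total worker
# count across all good_job processes) and refresh it only once a TTL has
# elapsed, so the count query doesn't run on every lock attempt.
class CachedValue
  def initialize(ttl:, &fetcher)
    @ttl = ttl           # seconds before a refresh is allowed
    @fetcher = fetcher   # block that computes the fresh value
    @value = nil
    @fetched_at = nil
    @mutex = Mutex.new
  end

  # Return the cached value, refetching it only when stale.
  def value
    @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if @value.nil? || now - @fetched_at >= @ttl
        @value = @fetcher.call
        @fetched_at = now
      end
      @value
    end
  end
end

calls = 0
worker_count = CachedValue.new(ttl: 60) { calls += 1; 20 }
worker_count.value # computes once
worker_count.value # served from cache within the TTL
```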

@bensheldon (Owner) commented

@mdavidn Thank you so much for opening this issue 🙏🏻 I've known GoodJob has less-than-great performance characteristics at large numbers of jobs, and I was waiting to see if anyone was actually pushing those limits.

I really appreciate you digging into the solutions too. I'm thinking there are some quick improvements here:

  1. Obviously a better index.
  2. I think setting a static, configurable upper bound on the CTE size would be an improvement, since it prevents the query from trying to materialize the entire table. I dunno if 1k is too small. I'm imagining that if someone is running a thousand GoodJob threads across all their processes, they're probably running up against this issue already. And hopefully that will defer the need to make the configuration dynamic (which I think would be complex).
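Such an upper bound might be exposed as a Rails configuration option along these lines (a sketch only; queue_select_limit is a hypothetical option name, not something this comment commits to):

```ruby
# config/initializers/good_job.rb — sketch of a static upper bound on the
# number of candidate rows the locking CTE may materialize.
# `queue_select_limit` is a hypothetical option name here.
Rails.application.configure do
  config.good_job.queue_select_limit = 1_000
end
```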


mdavidn commented Oct 11, 2022

I like the configuration option with a moderately high default. There are better places to tune the maximum number of workers in large deployments, like in Terraform.

@bensheldon (Owner) commented

@mdavidn I think this has been addressed by those two PRs (#726, #727) from @mitchellhenke 🎉
