
Materialized CTE performance bottleneck #720

Closed
mdavidn opened this issue Oct 10, 2022 · 3 comments
mdavidn commented Oct 10, 2022

I wanted to document a performance bottleneck I encountered when enqueuing approximately 150,000 small, low-priority jobs at once with 20 workers. The scope GoodJob::Lockable.advisory_lock materializes a CTE that sorts and returns a list of all candidate jobs. This query takes about three seconds to lock each job under these conditions. This pegs the database CPU.
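For context, the query generated by that scope is roughly of the following shape (a simplified sketch, not GoodJob's exact SQL; the advisory-lock key derivation shown here is illustrative):

```sql
-- Simplified sketch of the advisory-lock query (not GoodJob's exact SQL).
-- The materialized CTE sorts and returns *every* candidate job before the
-- outer query attempts to take a single advisory lock.
WITH rows AS MATERIALIZED (
  SELECT id
  FROM good_jobs
  WHERE finished_at IS NULL
  ORDER BY priority DESC NULLS LAST, created_at ASC
)
SELECT *
FROM good_jobs
WHERE id IN (
  SELECT id FROM rows
  WHERE pg_try_advisory_lock(
    ('x' || substr(md5(id::text), 1, 16))::bit(64)::bigint  -- illustrative key
  )
  LIMIT 1
);
```

With ~150,000 unfinished rows, the CTE sorts and materializes all of them on every lock attempt, which is where the ~3 seconds per job goes.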

The performance of this query can be dramatically improved, to 0.1 ms, by making two changes:

  1. Limit the number of rows materialized by the CTE to the number of workers across all good_job processes. The other workers can hold at most one fewer advisory lock than that, so the materialized rows always include at least one lockable job.
  2. Create an index matching the query's sort and conditions:
    USING btree (priority DESC NULLS LAST, created_at ASC) WHERE finished_at IS NULL
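Concretely, the second change might be written as follows (a sketch; the index name is made up, and good_jobs is GoodJob's default table):

```sql
-- Partial btree index matching the locking query's ORDER BY and WHERE.
-- Name and exact definition here are illustrative, not GoodJob's.
CREATE INDEX index_good_jobs_on_priority_created_at_unfinished
  ON good_jobs
  USING btree (priority DESC NULLS LAST, created_at ASC)
  WHERE finished_at IS NULL;

-- Change 1 would add a LIMIT to the CTE's SELECT, e.g. `... LIMIT 20`
-- for a deployment with 20 workers, so the sort can stop early and be
-- served directly from this index.
```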

The number of workers can change over time but is generally stable. The number could be cached in each process, perhaps refreshing at some interval after the query returns no available jobs.
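The caching idea could be sketched along these lines (illustrative Ruby, not GoodJob code; the block passed to the cache is a stand-in for however the process would count workers):

```ruby
# Illustrative sketch: cache a slowly-changing value (e.g. the total worker
# count across all good_job processes) and refresh it only once a TTL has
# elapsed, so the count query doesn't run on every lock attempt.
class CachedValue
  def initialize(ttl:, &fetcher)
    @ttl = ttl           # seconds before a refresh is allowed
    @fetcher = fetcher   # block that computes the fresh value
    @value = nil
    @fetched_at = nil
    @mutex = Mutex.new
  end

  # Return the cached value, refetching it only when stale.
  def value
    @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if @value.nil? || now - @fetched_at >= @ttl
        @value = @fetcher.call
        @fetched_at = now
      end
      @value
    end
  end
end

calls = 0
worker_count = CachedValue.new(ttl: 60) { calls += 1; 20 }
worker_count.value # computes once
worker_count.value # served from cache within the TTL
```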

@bensheldon (Owner) commented

@mdavidn Thank you so much for opening this issue 🙏🏻 I've known GoodJob has less-than-great performance characteristics at large numbers of jobs, and I was waiting to see if anyone was actually pushing those limits.

I really appreciate you digging into the solutions too. I'm thinking there are some quick improvements here:

  1. Obviously a better index.
  2. I think setting a static, configurable upper bound on the CTE size would be an improvement, since it prevents the query from trying to materialize the entire table. I dunno if 1k is too small. I'm imagining that if someone is running a thousand GoodJob threads across all their processes, they're probably running up against this issue already. And hopefully that will defer the need to make the configuration dynamic (which I think would be complex).
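Such an upper bound might be exposed as a Rails configuration option along these lines (a sketch only; queue_select_limit is a hypothetical option name, not something this comment commits to):

```ruby
# config/initializers/good_job.rb — sketch of a static upper bound on the
# number of candidate rows the locking CTE may materialize.
# `queue_select_limit` is a hypothetical option name here.
Rails.application.configure do
  config.good_job.queue_select_limit = 1_000
end
```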


mdavidn commented Oct 11, 2022

I like the configuration option with a moderately high default. There are better places to tune the maximum number of workers in large deployments, like in Terraform.

@bensheldon (Owner) commented

@mdavidn I think this has been addressed by those two PRs (#726, #727) from @mitchellhenke 🎉
