Race conditions in ActiveJob concurrency extension #325

Closed
codyrobbins opened this issue Aug 10, 2021 · 3 comments

Comments

@codyrobbins
Contributor

I’ve been testing out the brand-new concurrency controls for use in a production system, and unfortunately in practice they don’t appear to work correctly. There seem to be two separate issues, one with the controls around enqueue and one around perform, but they both boil down to race conditions. I’ve created a pull request that I believe addresses the issues we’re seeing in our particular case, but I’m not familiar enough with the GoodJob or ActiveJob internals to be sure my changes don’t introduce other problems I’m unaware of.

In case it makes a difference, I’ve been running into these particular issues with the following concurrency limits:

good_job_control_concurrency_with(
  perform_limit: 1,
  enqueue_limit: 2,
  key: […]
)

The goal here is to ensure the job is never performed more than once at the same time, while allowing up to two additional jobs to be enqueued so that they act as a debounce rather than a throttle on job performance.

For enqueue, the problem appears to be that the advisory lock is released at the end of the before_enqueue callback, before the job actually gets enqueued. This allows multiple jobs to be enqueued simultaneously and makes the advisory lock essentially useless. The check in question is:

GoodJob::Job.new.with_advisory_lock(key: key, function: "pg_advisory_lock") do
  # TODO: Why is `unscoped` necessary? Nested scope is bleeding into subsequent query?
  enqueue_concurrency = GoodJob::Job.unscoped.where(concurrency_key: key).unfinished.count

  # The job has not yet been enqueued, so check if adding it will go over the limit
  throw :abort if enqueue_concurrency + 1 > limit
end

Immediately after the advisory lock is relinquished at the end of this block, but before control returns to the code that actually does the enqueueing, another job can enter the critical section and calculate an invalid enqueue_concurrency, because it won’t take into account the job that has just been allowed to enqueue but hasn’t actually done so yet.

I’m guessing there are other internal GoodJob or ActiveJob callbacks that need to run after this code but before the enqueue occurs, which would exacerbate the problem and make it more likely to occur, although even if there aren’t, I think instructions can still be interleaved in such a way that this happens. In practice it happens a significant proportion of the time for me, and with the above concurrency configuration I was consistently getting more than two jobs enqueued at the same time.
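
To make the interleaving concrete, here is a minimal, framework-free Ruby sketch (not GoodJob code, just the same check-then-act structure): the count is checked inside one critical section, but the "enqueue" happens in a later one, so every thread can pass the check before any of them has enqueued.

lock = Mutex.new
enqueued = [] # stands in for unfinished jobs sharing a concurrency key
limit = 2

threads = 4.times.map do |i|
  Thread.new do
    # Check the count under the lock...
    allowed = lock.synchronize { enqueued.size + 1 <= limit }
    sleep(rand / 100) # ...the window between the check and the actual enqueue...
    # ...then "enqueue" in a separate critical section, as the before_enqueue version does.
    lock.synchronize { enqueued << i } if allowed
  end
end
threads.each(&:join)

puts enqueued.size # regularly prints 3 or 4, despite the limit of 2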

The simple solution I came up with that seems to resolve this problem is changing the before_enqueue callback to an around_enqueue one. This ensures the advisory lock is not released until the enqueue has actually occurred, so no other job can enter the critical section in the meantime.
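
Roughly, the change looks like the sketch below; the helper names (good_job_concurrency_key, good_job_concurrency_config) are assumptions based on the extension’s existing code, and the real PR may differ in detail. The key point is that block.call, which performs the actual enqueue, now happens while the advisory lock is still held:

around_enqueue do |job, block|
  key = job.good_job_concurrency_key
  limit = job.class.good_job_concurrency_config[:enqueue_limit]
  next block.call if key.blank? || limit.blank?

  GoodJob::Job.new.with_advisory_lock(key: key, function: "pg_advisory_lock") do
    enqueue_concurrency = GoodJob::Job.unscoped.where(concurrency_key: key).unfinished.count

    # The job has not yet been enqueued, so check if adding it will go over the limit.
    # Skipping block.call aborts the enqueue without releasing the lock early.
    next if enqueue_concurrency + 1 > limit

    block.call # the enqueue itself now happens inside the critical section
  end
end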

For perform, the problem we’re seeing is a huge amount of lock contention that in practice introduces heavy, unnecessary delays to queue processing, and in certain circumstances appears to block jobs from running indefinitely.

When a job begins to be performed, its row is locked here:

unfinished.priority_ordered.only_scheduled.limit(1).with_advisory_lock(unlock_session: true) do |good_jobs|

The calculation of perform concurrency in before_perform then basically counts the number of jobs that have had their rows locked in this manner:

perform_concurrency = GoodJob::Job.unscoped.where(concurrency_key: key).advisory_locked.count

The problem is that in practice what ends up happening with our configuration is that by the time a job hits the critical section where perform_concurrency is calculated, other jobs have simultaneously already begun performing and had their rows locked in the same way even if they’re stuck waiting for the advisory lock to enter the critical section. Thus they all end up calculating that there are too many jobs being performed and they all back off, despite the fact that none of them are “really” performing yet. The same thing happens repeatedly until the attempts interleave in such a way by chance that this lock contention doesn’t occur. In a few cases, however, I have seen it settle into a relatively stable state where this happens continuously and seemingly indefinitely.

I think the problem is that perform_concurrency is really a calculation of the number of jobs that have entered a pre-performance state—none of them begin “actually” performing until they pass the critical section successfully.

The fix I came up with for this problem is to change the perform concurrency calculation to deterministically pick which jobs “should” be performing right now, and to raise for any job not in that set so that it gets rescheduled. This boils down to ordering the jobs by when they began performing and then allowing the first n jobs to proceed. There may be a few different ways of fixing this issue, but I’m not sure of the relative merits, and this seemed like a straightforward approach that works in practice for our application.

I’m not a concurrency expert by any means, nor overly familiar with ActiveJob in general, so as I said there may be more issues going on here than I’m aware of, and these particular fixes may be unsuitable for a variety of reasons, but hopefully it’s at least a start. Happy to provide more details if that would help surface the problem!

@bensheldon
Owner

@codyrobbins this is fantastic analysis. Thank you for writing this up! And thank you for digging deeply into this new functionality too. I would say that at this point you are experienced 😄

I'll review your PR for the enqueue step. Changing it from a before_enqueue to an around_enqueue makes sense.

For perform, at the time that count is called, the job is executing: GoodJob has fetched and locked the job and handed it off to ActiveJob. So from the perspective of GoodJob, the job is in a performing state.

To explain how enqueue and perform are different in orchestration between GoodJob <--> ActiveJob:

  • On enqueue, ActiveJob hands the job to GoodJob to store in the database
  • On perform, GoodJob fetches the job from the database, locks it, and hands the job to ActiveJob to execute

This explanation is spot on with my understanding ➕ :

The problem is that in practice what ends up happening with our configuration is that by the time a job hits the critical section where perform_concurrency is calculated, other jobs have simultaneously already begun performing and had their rows locked in the same way even if they’re stuck waiting for the advisory lock to enter the critical section.

Would you mind sharing your code for deterministically ordering/counting the jobs? That sounds like a good solution. The concurrency controls are bleeding edge for good and bad, but I want to ensure they work.

@codyrobbins
Contributor Author

codyrobbins commented Aug 10, 2021

@bensheldon Thanks!!

Ah, everything you wrote makes sense. What I was referring to as a ‘pre-perform’ or ‘not “really” performing yet’ state in my comment above might then more properly be called the execution step. In that case, the current code is basically using a count of the jobs in the execution state to limit which jobs enter the perform state, but since the critical section does not wrap entrance to the execution state, we’re left with a situation where lots of jobs can be executing simultaneously, which then blocks perform.

Would you mind sharing your code for deterministically ordering/counting the jobs? That sounds like a good solution. The concurrency controls are bleeding edge for good and bad, but I want to ensure they work.

Sure, the PR actually also includes my fix for the perform step. The relevant changes are here:

https://github.com/codyrobbins/good_job/blob/1769d66644d4774fe029dff17ee5fc8b6f825224/lib/good_job/active_job_extensions/concurrency.rb#L43-L47

It’s basically ordering the locked jobs by performed_at with the earliest first, limiting the number of results to the concurrency limit, and then plucking their IDs into a set. The concurrency limit is then enforced by checking whether the current job’s ID is in the set of jobs “allowed” to be performed right now.
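
In sketch form (paraphrasing the linked diff rather than quoting it, assuming the same scopes and helpers as the excerpts above plus the extension’s ConcurrencyExceededError to trigger a retry), the before_perform check becomes something like:

before_perform do |job|
  key = job.good_job_concurrency_key
  limit = job.class.good_job_concurrency_config[:perform_limit]
  next if key.blank? || limit.blank?

  GoodJob::Job.new.with_advisory_lock(key: key, function: "pg_advisory_lock") do
    # Deterministically pick the jobs that "should" be performing right now:
    # the earliest-performed locked jobs, up to the concurrency limit.
    allowed_active_job_ids = GoodJob::Job.unscoped
                                         .where(concurrency_key: key)
                                         .advisory_locked
                                         .order(performed_at: :asc)
                                         .limit(limit)
                                         .pluck(:active_job_id)

    # Any job outside that set raises, so retry_on reschedules it for later.
    raise GoodJob::ActiveJobExtensions::Concurrency::ConcurrencyExceededError unless allowed_active_job_ids.include?(job.job_id)
  end
end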

@bensheldon
Owner

@codyrobbins thank you so much! I'll get your PR merged and released tomorrow.

Another question while I have you, since you're now a GoodJob concurrency expert: do you have perspective on the question in #317 (comment)?

Should enqueue_limit include jobs that are in a performing state?
