Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimize fetches from empty queues #5

Merged
merged 16 commits into from
Apr 24, 2023

Conversation

sensorsasha
Copy link
Contributor

@sensorsasha sensorsasha commented Apr 20, 2023

Overview

This PR addresses a flaw in the current fetching system. Currently, we have to run rpoplpush for every queue trying to fetch a job.

queues.each do |queue|
job = redis.rpoplpush(queue.pending, queue.inproc)
return UnitOfWork.new(queue, job) if job
@exhausted.add(queue, :ttl => QUEUE_TIMEOUT)
end

The more requests we need to perform before finding a queue with a job to pick up, the more time is wasted. If the system has a lot of high priority but most of the time empty queues, it's even more vulnerable. Every time we have to iterate over empty queues.

I run a benchmark to test how slow is sidekiq-ultimate fetching. Here is a comparison of regular fetch vs existing implementation based on the number of empty queues (X axis):
232784246-2cf2cc67-6779-491e-ab65-7daad311755b

We do have a "queue exhaustion" mechanism

@exhausted.add(queue, :ttl => QUEUE_TIMEOUT)

But it's per sidekiq thread. For bigger fleets of workers, it makes sense to share the list of empty queues. This is exactly what this PR is doing.

Implementation details

For each sidekiq process it adds a background Concurrent::TimerTask. It periodically (controllable by the empty_queues_refresh_interval_sec setting) tries to update both the global list of empty queues (stored in redis) and the local list (stored in a local variable).

Global list update is covered by a global lock (based on redis), so only a single process can update it at the same time.

Local list is stored in the EmptyQueues singleton which is accessible for reads by all the threads.

Every call to EmptyQueues.instance.refresh! requires a local lock implemented using ruby Mutex.

Bonus

  • Fixes many flaky tests because of the shared state
  • Fixes namespacing for redlock locks
  • Adds tests for existing functionality in lib/sidekiq/ultimate/fetch.rb
  • Use a cheaper check before trying to acquire a lock for the resurrection
  • Tests for Sidekiq::Ultimate.setup

@sensorsasha sensorsasha self-assigned this Apr 20, 2023
@codecov-commenter
Copy link

codecov-commenter commented Apr 20, 2023

Codecov Report

Patch coverage: 98.79% and project coverage change: -0.16 ⚠️

Comparison is base (b43c117) 96.84% compared to head (23ac936) 96.68%.

❗ Current head 23ac936 differs from pull request most recent head b060f74. Consider uploading reports for the commit b060f74 to get more accurate results

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master       #5      +/-   ##
==========================================
- Coverage   96.84%   96.68%   -0.16%     
==========================================
  Files           8       15       +7     
  Lines         190      422     +232     
==========================================
+ Hits          184      408     +224     
- Misses          6       14       +8     
Impacted Files Coverage Δ
lib/sidekiq/ultimate/empty_queues.rb 97.22% <97.22%> (ø)
lib/sidekiq/ultimate/configuration.rb 100.00% <100.00%> (ø)
...idekiq/ultimate/empty_queues/refresh_timer_task.rb 100.00% <100.00%> (ø)
lib/sidekiq/ultimate/fetch.rb 97.91% <100.00%> (ø)
lib/sidekiq/ultimate/interval_with_jitter.rb 100.00% <100.00%> (ø)
lib/sidekiq/ultimate/resurrector.rb 93.67% <100.00%> (+1.67%) ⬆️
lib/sidekiq/ultimate/resurrector/lock.rb 100.00% <100.00%> (ø)
lib/sidekiq/ultimate/use_exists_question_mark.rb 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@sensorsasha sensorsasha force-pushed the optimize_fetch_for_empty_queues branch from 02a3349 to 84b880f Compare April 20, 2023 17:00
@sensorsasha sensorsasha force-pushed the optimize_fetch_for_empty_queues branch from 84b880f to a81d710 Compare April 20, 2023 18:35
@sensorsasha sensorsasha marked this pull request as ready for review April 21, 2023 14:27
Copy link
Contributor

@davidbodow-st davidbodow-st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideas for next iteration if performance is still too slow:

  • Avoid stale empty queue lists by checking the global list more quickly if we tried to read during a lock
  • If a queue bottoms out during a cycle, add it to the lists for the rest of the interval so that we avoid double-hitting a known empty queue

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/configuration.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/empty_queues.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/fetch.rb Outdated Show resolved Hide resolved
spec/sidekiq/ultimate/configuration_spec.rb Outdated Show resolved Hide resolved
spec/sidekiq/ultimate/empty_queues_spec.rb Outdated Show resolved Hide resolved
@stefanl-st
Copy link
Contributor

I think we missed a pretty important reviewer here. @ixti please take a look if you wish

Comment on lines +6 to +7
# Silence ruby warnings
$TESTING = true # rubocop:disable Style/GlobalVars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I actually think it's helpful to see them, especially if we need to deal with deprecation warnings for instance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$TESTING is a sidekiq global variable. It does not silence Ruby warnings, when ruby executed with -w argument or RUBYOPT=-w environment variable.

See Ruby globals: https://docs.ruby-lang.org/en/2.7.0/globals_rdoc.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different kind of warnings:

RUBYOPT=-W0 rspec
gems/sidekiq-5.2.10/lib/sidekiq/processor.rb:66: warning: global variable `$TESTING' not initialized
gems/sidekiq-5.2.10/lib/sidekiq/launcher.rb:63: warning: global variable `$TESTING' not initialized
gems/sidekiq-5.2.10/lib/sidekiq/cli.rb:17: warning: global variable `$TESTING' not initialized

Copy link
Contributor

@stefanl-st stefanl-st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @sensorsasha, overall the direction is solid and I think this will be a significant improvement.

On the top level after reading the code I suggest we change the code and wording to treat this more like a cache and therefore use words that are commonly used with caches so use the words global and local cache rather than using the words global and local list.

An example of the change is:

#set_local_list => #update_local_cache

There are many places where need to make changes and it also depends on the context so if you think its a good idea, please take a stab at it and I can review after.

lib/sidekiq/ultimate/configuration.rb Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/configuration.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/empty_queues.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/empty_queues.rb Outdated Show resolved Hide resolved
README.md Outdated
Comment on lines 97 to 114
Specifies how often the list of empty queues should be refreshed.
In a nutshell, this sets the maximum possible delay between when a job was pushed to previously empty queue and earliest the moment when that new job could be picked up.

**Note:** every worker maintains its own local list of empty queues.
Setting this interval to a low value will increase the number of Redis calls needed to check for empty queues, increasing the total load on Redis.

This setting helps manage the tradeoff between performance penalties and latency needed for reliable fetch.
Under the hood, Sidekiq's default fetch occurs with [a single Redis `BRPOP` call](https://redis.io/commands/brpop/) which is passes list of all queues to pluck work from.
In contrast, [reliable fetch uses `LPOPRPUSH`](https://redis.io/commands/rpoplpush/) (or the equivalent `LMOVE` in later Redis versions) to place in progress work into a WIP queue.
However, `LPOPRPUSH` can only check one source queue to pop from at once, and [no multi-key alternative is available](https://github.com/redis/redis/issues/1785), so multiple Redis calls are needed to pluck work if an empty queue is checked.
In order to avoid performance penalties for repeated calls to empty queues, Sidekiq Ultimate therefore maintains a list of recently know empty queues which it will avoid polling for work.

Therefore:
- If your Sidekiq architecture has *a low number of total queues*, the worst case penalty for polling empty queues will be bounded, and it is reasonable to **set a shorter refresh period**.
- If your Sidekiq architecture has a *high number of total queues*, the worst case penalty for polling empty queues is large, and it is recommended to **set a longer refresh period**.
- When adjusting this setting:
- Check that work is consumed appropriately quickly from high priority queues after they bottom out (after increasing the refresh interval)
- Check that backlog work does not accumulate in low priority queues (after decreasing the refresh interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that you added all this detail but in order to make it a bit digestible, please rewrite it as suggested below:

  • Start by explaining the problem this feature solves
  • Explain how it works under the hood (the different types of Redis operations like you do now)
  • Explain what the performance implications are for a shared global queue (also please include the total number of workers into the description, it looks like its missing)
  • Then explain the constant itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbodow-st any thoughts since you proposed this section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanl-st @sensorsasha TBH, I just wrote this off the top of my head, so we can sink more time into refining the structure and wording if needed, though I think it's probably at diminishing returns right now.

I also disagree with the proposed structure, as IMHO, it would provide too much detail for quick reference use cases, which are the majority of documentation readers' time. Here, I assume that most readers will need to learn the "why" just once and the "what was that config option called again?" many times while using the feature.

So, my preferred structure is:

  • Quick reference overview
  • Explain the problem that needs to be solved
  • Explain the tradeoff that the implementation details require us to make

IMO we're already following this structure, and it's mostly optimized for the quick reference use case. If we want to take another pass, then I would recommend:

  • Tighten up the "Therefore" section and move it into the quick reference, for readers who care about making the correct decision without fully building the mental model of the implementation that they need to do that
  • Take a general pass to make the language more concise (could definitely be condensed)

@stefanl-st If you disagree / would still prefer your structure, can you elaborate on which audience you want to optimize for instead of quick reference readers (who would expect the Then explain the constant itself part at the very beginning, IMO)?

Comment on lines 27 to 32
# It specifies how often the list of empty queues should be refreshed.
# In a nutshell, it specifies the maximum possible delay between a job was pushed to previously empty queue and
# the moment when that new job is picked up.
# Note that every worker needs to maintain its own local list of empty queues. Setting this interval to a low
# values will increase the number of redis calls and will increase the load on redis.
# @return [Integer] interval in seconds to refresh the list of empty queues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defer some of this explanation to the readme and link it.

Suggested change
# It specifies how often the list of empty queues should be refreshed.
# In a nutshell, it specifies the maximum possible delay between a job was pushed to previously empty queue and
# the moment when that new job is picked up.
# Note that every worker needs to maintain its own local list of empty queues. Setting this interval to a low
# values will increase the number of redis calls and will increase the load on redis.
# @return [Integer] interval in seconds to refresh the list of empty queues
# Each individual worker ignores attempting to fetch jobs from queues it believes are empty by
# checking the empty queue cached state from the global empty queue cache to speed up performance
# @return [Integer] interval in seconds between global cache refreshes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep it here and link README to these comments. IMO it's easier to keep it up to date than README.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also be fine with that to tighten up the Readme language

lib/sidekiq/ultimate/empty_queues.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/empty_queues.rb Outdated Show resolved Hide resolved
@stefanl-st stefanl-st self-requested a review April 21, 2023 20:49
Copy link
Contributor

@ixti ixti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but IMO there's no need in re-implementing sscan_each. Both redis-rb and redis-client have convenience wrappers for that build-in.

lib/sidekiq/ultimate/configuration.rb Show resolved Hide resolved
lib/sidekiq/ultimate/redis_sscan.rb Outdated Show resolved Hide resolved
Comment on lines 17 to 18
result.uniq! # Cursor is not atomic, so there may be duplicates because of concurrent update operations
result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most bang-ending Ruby operators return nil if there were no changes. That's done to allow do something when changes were made or not:

if arr.uniq!
  puts "we removed some duplicates"
end

Copy link
Contributor

@kevinrobell-st kevinrobell-st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with a lot of the comments about documentation but, in general, I think this approach makes sense.

The only thing in the back of my mind is whether 30 seconds is a sane default for the empty_queues_refresh_interval_sec. Do we know how long queues usually stay empty for?

lib/sidekiq/ultimate/interval_with_jitter.rb Outdated Show resolved Hide resolved
lib/sidekiq/ultimate/resurrector.rb Outdated Show resolved Hide resolved
sensorsasha and others added 2 commits April 24, 2023 17:37
Co-authored-by: Stefan Lynggaard <[email protected]>
Co-authored-by: Kevin Robell <[email protected]>
- Add configuration for THROTTLE_TIMEOUT
- Address proposed naming changes
- Remove RedisSscan in favor of `.sscan_each` from redis-rb
- aed -> heartbeat
- cthulhu -> resurrect
@sensorsasha
Copy link
Contributor Author

LGTM, but IMO there's no need in re-implementing sscan_each. Both redis-rb and redis-client have convenience wrappers for that build-in.

Thank you for pointing this out. I did a brief check of redis-rb docs but missed that for some reason. 🤝

@sensorsasha
Copy link
Contributor Author

The only thing in the back of my mind is whether 30 seconds is a sane default for the empty_queues_refresh_interval_sec. Do we know how long queues usually stay empty for?

Not really. I'd say that 30 seconds looks safe to me. The rest can be tuned for any specific use case.

@sensorsasha
Copy link
Contributor Author

#set_local_list => #update_local_cache

I like this idea. It definitely makes it easier to reason about.I updated the code. @stefanl-st

@sensorsasha sensorsasha merged commit d09d540 into master Apr 24, 2023
@sensorsasha sensorsasha deleted the optimize_fetch_for_empty_queues branch April 24, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants