Minimize fetches from empty queues #5
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5 +/- ##
==========================================
- Coverage 96.84% 96.68% -0.16%
==========================================
Files 8 15 +7
Lines 190 422 +232
==========================================
+ Hits 184 408 +224
- Misses 6 14 +8
... and 2 files with indirect coverage changes
Ideas for next iteration if performance is still too slow:
- Avoid stale empty-queue lists by re-checking the global list sooner if we attempted a read while it was locked
- If a queue bottoms out during a cycle, add it to the lists for the rest of the interval so that we avoid double-hitting a known empty queue
Co-authored-by: David Bodow <[email protected]>
…al_sec - add jitter for refresh interval
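The jitter mentioned in the commit above could look roughly like the sketch below (an assumed approach, not the PR's actual code): each worker offsets its refresh by a small random amount so refreshes don't synchronize across the fleet and all hit Redis at the same instant.

```ruby
# Hypothetical jitter sketch: add up to 10% random slack to the refresh
# interval, chosen independently per worker, to avoid a thundering herd.
base_interval = 30 # seconds, e.g. the empty_queues_refresh_interval_sec setting
jitter = rand(0.0..(base_interval * 0.1))
next_refresh_in = base_interval + jitter

puts next_refresh_in.between?(30.0, 33.0) # true
```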
I think we missed a pretty important reviewer here. @ixti please take a look if you wish
# Silence ruby warnings
$TESTING = true # rubocop:disable Style/GlobalVars
Hmm, I actually think it's helpful to see them, especially if we need to deal with deprecation warnings, for instance.
$TESTING is a Sidekiq global variable. It does not silence Ruby warnings when Ruby is executed with the -w argument or the RUBYOPT=-w environment variable.
See Ruby globals: https://docs.ruby-lang.org/en/2.7.0/globals_rdoc.html
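A small demo of the point (not from the PR): in verbose mode the "not initialized" warning comes from the Ruby interpreter itself, and assigning any value to the global beforehand is what suppresses it — `$TESTING` has no special warning-silencing behavior.

```ruby
# Spawn a child Ruby with -w that reads an uninitialized global: the
# interpreter emits the warning on stderr (merged into stdout here).
warned = IO.popen(["ruby", "-w", "-e", "p $some_global"],
                  err: [:child, :out], &:read)
puts warned.include?("not initialized") # true

# Pre-initializing the global (as `$TESTING = true` does) removes the warning.
quiet = IO.popen(["ruby", "-w", "-e", "$some_global = 1; p $some_global"],
                 err: [:child, :out], &:read)
puts quiet.include?("not initialized") # false
```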
These are a different kind of warning:
RUBYOPT=-W0 rspec
gems/sidekiq-5.2.10/lib/sidekiq/processor.rb:66: warning: global variable `$TESTING' not initialized
gems/sidekiq-5.2.10/lib/sidekiq/launcher.rb:63: warning: global variable `$TESTING' not initialized
gems/sidekiq-5.2.10/lib/sidekiq/cli.rb:17: warning: global variable `$TESTING' not initialized
This is great @sensorsasha, overall the direction is solid and I think this will be a significant improvement.
At the top level, after reading the code, I suggest we change the code and wording to treat this more like a cache, and therefore use terminology commonly used with caches: the words global and local cache rather than global and local list.
An example of the change is:
#set_local_list
=>
#update_local_cache
There are many places where we need to make changes, and it also depends on the context, so if you think it's a good idea, please take a stab at it and I can review after.
README.md
Outdated
Specifies how often the list of empty queues should be refreshed.
In a nutshell, this sets the maximum possible delay between when a job is pushed to a previously empty queue and the earliest moment when that new job could be picked up.

**Note:** every worker maintains its own local list of empty queues.
Setting this interval to a low value will increase the number of Redis calls needed to check for empty queues, increasing the total load on Redis.

This setting helps manage the tradeoff between performance penalties and the latency needed for reliable fetch.
Under the hood, Sidekiq's default fetch occurs with [a single Redis `BRPOP` call](https://redis.io/commands/brpop/) which passes the list of all queues to pluck work from.
In contrast, [reliable fetch uses `RPOPLPUSH`](https://redis.io/commands/rpoplpush/) (or the equivalent `LMOVE` in later Redis versions) to place in-progress work into a WIP queue.
However, `RPOPLPUSH` can only check one source queue at a time, and [no multi-key alternative is available](https://github.com/redis/redis/issues/1785), so multiple Redis calls are needed to pluck work if an empty queue is checked.
To avoid performance penalties for repeated calls to empty queues, Sidekiq Ultimate therefore maintains a list of recently known empty queues which it will avoid polling for work.

Therefore:
- If your Sidekiq architecture has *a low number of total queues*, the worst-case penalty for polling empty queues is bounded, and it is reasonable to **set a shorter refresh period**.
- If your Sidekiq architecture has a *high number of total queues*, the worst-case penalty for polling empty queues is large, and it is recommended to **set a longer refresh period**.
- When adjusting this setting:
  - Check that work is consumed appropriately quickly from high-priority queues after they bottom out (after increasing the refresh interval)
  - Check that backlog work does not accumulate in low-priority queues (after decreasing the refresh interval)
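To make the round-trip arithmetic concrete, here is a toy model (plain Ruby arrays standing in for Redis lists — not the gem's actual code): a multi-key `BRPOP` covers every queue in a single call, while an `RPOPLPUSH`-style reliable fetch costs one call per queue until it finds work.

```ruby
# Toy model: arrays stand in for Redis lists. A multi-key BRPOP-style fetch
# covers every queue in one round trip; an RPOPLPUSH-style reliable fetch
# needs one round trip per queue until a job turns up.
queues = { "critical" => [], "high" => [], "default" => ["job-1"] }

brpop_round_trips = 1 # one blocking call naming all three queues

rpoplpush_round_trips = 0
job = nil
queues.each_value do |list|
  rpoplpush_round_trips += 1
  job = list.pop # RPOP analog; nil when the queue is empty
  break if job
end

p job                    # "job-1"
p rpoplpush_round_trips  # 3 (two empty queues polled first)
p brpop_round_trips      # 1
```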
I really like that you added all this detail, but in order to make it a bit more digestible, please rewrite it as suggested below:
- Start by explaining the problem this feature solves
- Explain how it works under the hood (the different types of Redis operations like you do now)
- Explain what the performance implications are for a shared global queue (also please include the total number of workers in the description; it looks like it's missing)
- Then explain the constant itself
@davidbodow-st any thoughts since you proposed this section?
@stefanl-st @sensorsasha TBH, I just wrote this off the top of my head, so we can sink more time into refining the structure and wording if needed, though I think it's probably at diminishing returns right now.
I also disagree with the proposed structure, as IMHO, it would provide too much detail for quick reference use cases, which are the majority of documentation readers' time. Here, I assume that most readers will need to learn the "why" just once and the "what was that config option called again?" many times while using the feature.
So, my preferred structure is:
- Quick reference overview
- Explain the problem that needs to be solved
- Explain the tradeoff that the implementation details require us to make
IMO we're already following this structure, and it's mostly optimized for the quick reference use case. If we want to take another pass, then I would recommend:
- Tighten up the "Therefore" section and move it into the quick reference, for readers who care about making the correct decision without fully building the mental model of the implementation that they need to do that
- Take a general pass to make the language more concise (could definitely be condensed)
@stefanl-st If you disagree / would still prefer your structure, can you elaborate on which audience you want to optimize for instead of quick reference readers (who would expect the "Then explain the constant itself" part at the very beginning, IMO)?
# It specifies how often the list of empty queues should be refreshed.
# In a nutshell, it specifies the maximum possible delay between a job was pushed to previously empty queue and
# the moment when that new job is picked up.
# Note that every worker needs to maintain its own local list of empty queues. Setting this interval to a low
# values will increase the number of redis calls and will increase the load on redis.
# @return [Integer] interval in seconds to refresh the list of empty queues
Defer some of this explanation to the readme and link it.
Suggested change — replace the comment above with:
# Each individual worker avoids attempting to fetch jobs from queues it believes are empty by
# checking the cached empty-queue state from the global empty queue cache, to speed up performance.
# @return [Integer] interval in seconds between global cache refreshes
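For completeness, using the setting itself presumably looks something like the sketch below — a hypothetical configuration fragment; the PR mentions a `Sidekiq::Ultimate.setup` entry point, but whether it yields a config object like this is an assumption.

```ruby
# Hypothetical configuration sketch; the actual setup interface may differ.
Sidekiq::Ultimate.setup do |config|
  # Refresh the (global and local) empty-queue caches every 30 seconds.
  config.empty_queues_refresh_interval_sec = 30
end
```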
I'd rather keep it here and link README to these comments. IMO it's easier to keep it up to date than README.
I'd also be fine with that to tighten up the Readme language
LGTM, but IMO there's no need to re-implement sscan_each. Both redis-rb and redis-client have built-in convenience wrappers for that.
lib/sidekiq/ultimate/redis_sscan.rb
Outdated
result.uniq! # Cursor is not atomic, so there may be duplicates because of concurrent update operations
result
Most bang methods in Ruby return nil
if there were no changes. That's done to let you do something depending on whether changes were made:
if arr.uniq!
puts "we removed some duplicates"
end
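In other words, the sketch below shows why the reviewed code mutates on one line and returns the array on the next — chaining off `uniq!` would be unsafe, because it returns `nil` when the array already has no duplicates:

```ruby
result = %w[default default mailers]

# Safe: mutate in place, then return the receiver on its own line.
result.uniq!
p result # ["default", "mailers"]

# Unsafe: `uniq!` returns nil when nothing was removed, so never write
# `result = result.uniq!` or return its value directly.
p %w[a b].uniq! # nil
```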
I agree with a lot of the comments about documentation but, in general, I think this approach makes sense.
The only thing in the back of my mind is whether 30 seconds is a sane default for the empty_queues_refresh_interval_sec. Do we know how long queues usually stay empty for?
Co-authored-by: Stefan Lynggaard <[email protected]> Co-authored-by: Kevin Robell <[email protected]>
- Add configuration for THROTTLE_TIMEOUT - Address proposed naming changes - Remove RedisSscan in favor of `.sscan_each` from redis-rb - aed -> heartbeat - cthulhu -> resurrect
Thank you for pointing this out. I did a brief check of redis-rb docs but missed that for some reason. 🤝
Not really. I'd say that 30 seconds looks safe to me. The rest can be tuned for any specific use case.
I like this idea. It definitely makes it easier to reason about. I updated the code. @stefanl-st
Overview
This PR addresses a flaw in the current fetching system. Currently, we have to run
rpoplpush
for every queue when trying to fetch a job.
sidekiq-ultimate/lib/sidekiq/ultimate/fetch.rb
Lines 64 to 69 in b43c117
The more requests we need to perform before finding a queue with a job to pick up, the more time is wasted. If the system has a lot of high-priority but mostly empty queues, it's even more vulnerable: every time, we have to iterate over the empty queues.
I ran a benchmark to test how slow sidekiq-ultimate fetching is. Here is a comparison of regular fetch vs. the existing implementation, based on the number of empty queues (X axis):
We do have a "queue exhaustion" mechanism
sidekiq-ultimate/lib/sidekiq/ultimate/fetch.rb
Line 68 in b43c117
But it's per Sidekiq thread. For bigger fleets of workers, it makes sense to share the list of empty queues. This is exactly what this PR does.
Implementation details
For each Sidekiq process it adds a background `Concurrent::TimerTask`. It periodically (controlled by the `empty_queues_refresh_interval_sec` setting) tries to update both the global list of empty queues (stored in Redis) and the local list (stored in a local variable). The global list update is covered by a global lock (based on Redis), so only a single process can update it at a time.
The local list is stored in the `EmptyQueues` singleton, which is accessible for reads by all threads. Every call to `EmptyQueues.instance.refresh!` requires a local lock implemented using a Ruby `Mutex`.
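Based purely on the description above (class and method names are assumptions, not the PR's actual source), the local side might look roughly like this: a singleton holding the local list, with refreshes serialized through a Mutex.

```ruby
require "singleton"

# Rough sketch of the described design, not the real implementation.
class EmptyQueuesSketch
  include Singleton

  attr_reader :queues

  def initialize
    @queues = []
    @lock = Mutex.new
  end

  # Replace the local list of empty queues. Uses try_lock so that
  # concurrent callers skip the refresh instead of queueing up behind it.
  def refresh!(fetched_queues)
    return false unless @lock.try_lock

    begin
      @queues = fetched_queues
      true
    ensure
      @lock.unlock
    end
  end
end

EmptyQueuesSketch.instance.refresh!(%w[mailers low])
p EmptyQueuesSketch.instance.queues # ["mailers", "low"]
```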
.Bonus
lib/sidekiq/ultimate/fetch.rb
Sidekiq::Ultimate.setup