Minimize fetches from empty queues #5

sensorsasha · 2023-04-20T15:23:51Z

Overview

This PR addresses a flaw in the current fetching system. Currently, when trying to fetch a job, we have to run rpoplpush against each queue in turn until one of them returns a job.

sidekiq-ultimate/lib/sidekiq/ultimate/fetch.rb

Lines 64 to 69 in b43c117

    
    queues.each do |queue|
      job = redis.rpoplpush(queue.pending, queue.inproc)
      return UnitOfWork.new(queue, job) if job

      @exhausted.add(queue, :ttl => QUEUE_TIMEOUT)
    end

The more requests we need to perform before finding a queue with a job to pick up, the more time is wasted. Systems with many high-priority queues that are empty most of the time are especially vulnerable, because every fetch has to iterate over those empty queues first.
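To make the cost concrete, here is a minimal sketch that counts round trips when a job sits behind many empty queues. `FakeRedis`, `fetch_job`, and the key names are hypothetical stand-ins for illustration only; they are not part of sidekiq-ultimate or redis-rb.

```ruby
# Hypothetical in-memory stand-in for Redis that counts round trips.
class FakeRedis
  attr_reader :calls

  def initialize(lists)
    @lists = lists
    @calls = 0
  end

  # Mimics RPOPLPUSH: pop from the tail of source, push onto the head of destination.
  def rpoplpush(source, destination)
    @calls += 1
    job = @lists.fetch(source, []).pop
    (@lists[destination] ||= []).unshift(job) if job
    job
  end
end

# Try each queue in priority order and return the first job found.
def fetch_job(redis, queue_names)
  queue_names.each do |name|
    job = redis.rpoplpush("queue:#{name}", "inproc:#{name}")
    return job if job
  end
  nil
end

# 50 empty high-priority queues in front of one non-empty low-priority queue.
queue_names = (1..50).map { |i| "high_#{i}" } + ["low"]
redis = FakeRedis.new({ "queue:low" => ["job-1"] })

job = fetch_job(redis, queue_names)
puts "fetched #{job} after #{redis.calls} Redis calls"
# prints "fetched job-1 after 51 Redis calls"
```

In this toy setup every fetch pays for 50 wasted calls before reaching the one queue that has work, which is the overhead the PR aims to cut.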

I ran a benchmark to test how slow sidekiq-ultimate fetching is. Here is a comparison of the regular fetch vs. the existing implementation, based on the number of empty queues (X axis):

We do have a "queue exhaustion" mechanism

sidekiq-ultimate/lib/sidekiq/ultimate/fetch.rb

Line 68 in b43c117

@exhausted.add(queue, :ttl => QUEUE_TIMEOUT)

But it's per sidekiq thread. For bigger fleets of workers, it makes sense to share the list of empty queues. This is exactly what this PR is doing.
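As a rough illustration of a per-thread exhaustion list, the sketch below shows a set whose entries expire after a TTL. `ExpiringSet` and its injectable `clock` are assumptions made for this example; the structure behind `@exhausted` in the gem may be implemented differently.

```ruby
# Hypothetical set whose entries expire after a TTL (in seconds).
class ExpiringSet
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @expires_at = {}
  end

  # Remember the item until ttl seconds from now.
  def add(item, ttl:)
    @expires_at[item] = @clock.call + ttl
    self
  end

  # True only while the item's TTL has not elapsed; expired entries are dropped.
  def include?(item)
    deadline = @expires_at[item]
    return false unless deadline
    return true if @clock.call < deadline

    @expires_at.delete(item)
    false
  end
end

# A controllable clock keeps the example deterministic.
now = 0.0
exhausted = ExpiringSet.new(clock: -> { now })
exhausted.add("high", :ttl => 2)
puts exhausted.include?("high") # true while the TTL has not elapsed
now = 3.0
puts exhausted.include?("high") # false once the TTL has elapsed
```

Because each thread would own a separate instance of such a set, every thread has to discover each empty queue on its own, which is why sharing the information helps larger fleets.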

Implementation details

For each sidekiq process it adds a background Concurrent::TimerTask. It periodically (controllable by the empty_queues_refresh_interval_sec setting) tries to update both the global list of empty queues (stored in redis) and the local list (stored in a local variable).

Global list update is covered by a global lock (based on redis), so only a single process can update it at the same time.

Local list is stored in the EmptyQueues singleton which is accessible for reads by all the threads.

Every call to EmptyQueues.instance.refresh! requires a local lock implemented using ruby Mutex.
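The local side of this design can be sketched roughly as follows. `EmptyQueuesSketch` is a hypothetical simplification, not the class from the PR: the block passed to `refresh!` stands in for reading the global list from Redis, and a plain method call stands in for the periodic Concurrent::TimerTask.

```ruby
require "set"
require "singleton"

# Hypothetical sketch of a process-wide list of empty queues.
class EmptyQueuesSketch
  include Singleton

  def initialize
    @local_lock = Mutex.new
    @queues = Set.new.freeze
  end

  # Read used by fetcher threads; returns the current frozen snapshot.
  def queues
    @queues
  end

  # Replaces the local snapshot under a Mutex so refreshes never interleave.
  # The block is assumed to return the names of the queues known to be empty.
  def refresh!
    @local_lock.synchronize do
      @queues = Set.new(yield).freeze
    end
  end
end

empty = EmptyQueuesSketch.instance
empty.refresh! { %w[high_1 high_2] }

# A fetcher would skip queues that are currently believed to be empty.
candidates = %w[high_1 high_2 low].reject { |name| empty.queues.include?(name) }
puts candidates.inspect
# prints ["low"]
```

Swapping in a fresh frozen set on each refresh, rather than mutating a shared one, is one way to keep reads cheap for the fetcher threads; whether the PR does the same is not stated here.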

Bonus

Fixes many tests that were flaky because of shared state
Fixes namespacing for redlock locks
Adds tests for existing functionality in lib/sidekiq/ultimate/fetch.rb
Uses a cheaper check before trying to acquire a lock for the resurrection
Adds tests for Sidekiq::Ultimate.setup

codecov-commenter · 2023-04-20T15:25:28Z

Codecov Report

Patch coverage: 98.79% and project coverage change: -0.16 ⚠️

Comparison is base (b43c117) 96.84% compared to head (23ac936) 96.68%.

❗ Current head 23ac936 differs from pull request most recent head b060f74. Consider uploading reports for the commit b060f74 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master       #5      +/-   ##
==========================================
- Coverage   96.84%   96.68%   -0.16%     
==========================================
  Files           8       15       +7     
  Lines         190      422     +232     
==========================================
+ Hits          184      408     +224     
- Misses          6       14       +8

Impacted Files	Coverage Δ
lib/sidekiq/ultimate/empty_queues.rb	`97.22% <97.22%> (ø)`
lib/sidekiq/ultimate/configuration.rb	`100.00% <100.00%> (ø)`
...idekiq/ultimate/empty_queues/refresh_timer_task.rb	`100.00% <100.00%> (ø)`
lib/sidekiq/ultimate/fetch.rb	`97.91% <100.00%> (ø)`
lib/sidekiq/ultimate/interval_with_jitter.rb	`100.00% <100.00%> (ø)`
lib/sidekiq/ultimate/resurrector.rb	`93.67% <100.00%> (+1.67%)`	⬆️
lib/sidekiq/ultimate/resurrector/lock.rb	`100.00% <100.00%> (ø)`
lib/sidekiq/ultimate/use_exists_question_mark.rb	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

davidbodow-st

Ideas for next iteration if performance is still too slow:

Avoid stale empty queue lists by checking the global list more quickly if we tried to read during a lock
If a queue bottoms out during a cycle, add it to the lists for the rest of the interval so that we avoid double-hitting a known empty queue

stefanl-st · 2023-04-21T19:28:50Z

I think we missed a pretty important reviewer here. @ixti please take a look if you wish

davidbodow-st · 2023-04-21T19:47:02Z

spec/spec_helper.rb

+# Silence ruby warnings
+$TESTING = true # rubocop:disable Style/GlobalVars


Hmm, I actually think it's helpful to see them, especially if we need to deal with deprecation warnings for instance

$TESTING is a Sidekiq global variable. It does not silence the Ruby warnings emitted when Ruby is executed with the -w argument or the RUBYOPT=-w environment variable.

See Ruby globals: https://docs.ruby-lang.org/en/2.7.0/globals_rdoc.html

These are a different kind of warning:

    RUBYOPT=-W0 rspec
    gems/sidekiq-5.2.10/lib/sidekiq/processor.rb:66: warning: global variable `$TESTING' not initialized
    gems/sidekiq-5.2.10/lib/sidekiq/launcher.rb:63: warning: global variable `$TESTING' not initialized
    gems/sidekiq-5.2.10/lib/sidekiq/cli.rb:17: warning: global variable `$TESTING' not initialized

stefanl-st

This is great @sensorsasha, overall the direction is solid and I think this will be a significant improvement.

At the top level, after reading the code, I suggest we change the code and wording to treat this more like a cache, and therefore use the terms commonly used with caches: global and local cache rather than global and local list.

An example of the change is:

#set_local_list => #update_local_cache

There are many places where we need to make changes, and it also depends on the context, so if you think it's a good idea, please take a stab at it and I can review after.

stefanl-st · 2023-04-21T20:20:57Z

README.md

+Specifies how often the list of empty queues should be refreshed.
+In a nutshell, this sets the maximum possible delay between when a job is pushed to a previously empty queue and the earliest moment when that new job can be picked up.
+
+**Note:** every worker maintains its own local list of empty queues.
+Setting this interval to a low value will increase the number of Redis calls needed to check for empty queues, increasing the total load on Redis.
+
+This setting helps manage the tradeoff between performance penalties and latency needed for reliable fetch.
+Under the hood, Sidekiq's default fetch occurs with [a single Redis `BRPOP` call](https://redis.io/commands/brpop/) which is passed the list of all queues to pluck work from.
+In contrast, [reliable fetch uses `RPOPLPUSH`](https://redis.io/commands/rpoplpush/) (or the equivalent `LMOVE` in later Redis versions) to place in-progress work into a WIP queue.
+However, `RPOPLPUSH` can only check one source queue to pop from at once, and [no multi-key alternative is available](https://github.com/redis/redis/issues/1785), so multiple Redis calls are needed to pluck work if an empty queue is checked.
+In order to avoid performance penalties for repeated calls to empty queues, Sidekiq Ultimate therefore maintains a list of recently known empty queues which it will avoid polling for work.
+
+Therefore:
+- If your Sidekiq architecture has *a low number of total queues*, the worst case penalty for polling empty queues will be bounded, and it is reasonable to **set a shorter refresh period**.
+- If your Sidekiq architecture has a *high number of total queues*, the worst case penalty for polling empty queues is large, and it is recommended to **set a longer refresh period**.
+- When adjusting this setting:
+    - Check that work is consumed appropriately quickly from high priority queues after they bottom out (after increasing the refresh interval)
+    - Check that backlog work does not accumulate in low priority queues (after decreasing the refresh interval)


I really like that you added all this detail but in order to make it a bit digestible, please rewrite it as suggested below:

Start by explaining the problem this feature solves

Explain how it works under the hood (the different types of Redis operations like you do now)

Explain what the performance implications are for a shared global queue (also please include the total number of workers in the description; it looks like it's missing)

Then explain the constant itself

@davidbodow-st any thoughts since you proposed this section?

@stefanl-st @sensorsasha TBH, I just wrote this off the top of my head, so we can sink more time into refining the structure and wording if needed, though I think it's probably at diminishing returns right now.

I also disagree with the proposed structure, as IMHO, it would provide too much detail for quick reference use cases, which are the majority of documentation readers' time. Here, I assume that most readers will need to learn the "why" just once and the "what was that config option called again?" many times while using the feature.

So, my preferred structure is:

Quick reference overview

Explain the problem that needs to be solved

Explain the tradeoff that the implementation details require us to make

IMO we're already following this structure, and it's mostly optimized for the quick reference use case. If we want to take another pass, then I would recommend:

Tighten up the "Therefore" section and move it into the quick reference, for readers who care about making the correct decision without fully building the mental model of the implementation that they need to do that

Take a general pass to make the language more concise (could definitely be condensed)

@stefanl-st If you disagree / would still prefer your structure, can you elaborate on which audience you want to optimize for instead of quick reference readers (who would expect the Then explain the constant itself part at the very beginning, IMO)?

stefanl-st · 2023-04-21T20:26:45Z

lib/sidekiq/ultimate/configuration.rb

+      # It specifies how often the list of empty queues should be refreshed.
+      # In a nutshell, it specifies the maximum possible delay between a job was pushed to previously empty queue and
+      # the moment when that new job is picked up.
+      # Note that every worker needs to maintain its own local list of empty queues. Setting this interval to a low
+      # values will increase the number of redis calls and will increase the load on redis.
+      # @return [Integer] interval in seconds to refresh the list of empty queues


Defer some of this explanation to the readme and link it.

Suggested change

-# It specifies how often the list of empty queues should be refreshed.
-# In a nutshell, it specifies the maximum possible delay between a job was pushed to previously empty queue and
-# the moment when that new job is picked up.
-# Note that every worker needs to maintain its own local list of empty queues. Setting this interval to a low
-# values will increase the number of redis calls and will increase the load on redis.
-# @return [Integer] interval in seconds to refresh the list of empty queues
+# Each individual worker ignores attempting to fetch jobs from queues it believes are empty by
+# checking the empty queue cached state from the global empty queue cache to speed up performance
+# @return [Integer] interval in seconds between global cache refreshes

I'd rather keep it here and link README to these comments. IMO it's easier to keep it up to date than README.

I'd also be fine with that to tighten up the Readme language

ixti

LGTM, but IMO there's no need to re-implement sscan_each. Both redis-rb and redis-client have convenience wrappers for that built-in.

ixti · 2023-04-21T22:01:10Z

lib/sidekiq/ultimate/redis_sscan.rb

+          result.uniq! # Cursor is not atomic, so there may be duplicates because of concurrent update operations
+          result


Most bang-ending Ruby methods return nil if there were no changes. That lets the caller do something depending on whether changes were made:

    if arr.uniq!
      puts "we removed some duplicates"
    end
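A short sketch of why this matters when a method's return value comes from `uniq!`. The helper names `dedupe_in_place` and `dedupe_buggy` are made up for this example and do not appear in the PR.

```ruby
# Safe: de-duplicate in place, then return the array itself.
def dedupe_in_place(result)
  result.uniq! # returns nil when there were no duplicates
  result
end

# Pitfall: returning the value of uniq! directly.
def dedupe_buggy(result)
  result.uniq!
end

p dedupe_in_place(%w[a b])  # => ["a", "b"]
p dedupe_buggy(%w[a b])     # => nil
p dedupe_buggy(%w[a a b])   # => ["a", "b"]
p %w[a b].uniq              # => ["a", "b"] (non-bang form always returns an array)
```

Keeping `result` as the last expression, as the diff above does, avoids returning nil; using the non-bang `uniq` is an alternative when allocating a new array is acceptable.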

kevinrobell-st

I agree with a lot of the comments about documentation but, in general, I think this approach makes sense.

The only thing in the back of my mind is whether 30 seconds is a sane default for the empty_queues_refresh_interval_sec. Do we know how long queues usually stay empty for?

sensorsasha · 2023-04-24T17:14:43Z

LGTM, but IMO there's no need to re-implement sscan_each. Both redis-rb and redis-client have convenience wrappers for that built-in.

Thank you for pointing this out. I did a brief check of redis-rb docs but missed that for some reason. 🤝

sensorsasha · 2023-04-24T17:19:29Z

The only thing in the back of my mind is whether 30 seconds is a sane default for the empty_queues_refresh_interval_sec. Do we know how long queues usually stay empty for?

Not really. I'd say that 30 seconds looks safe to me. The rest can be tuned for any specific use case.

sensorsasha · 2023-04-24T17:28:08Z

#set_local_list => #update_local_cache

I like this idea. It definitely makes it easier to reason about. I updated the code. @stefanl-st

sensorsasha self-assigned this Apr 20, 2023

sensorsasha force-pushed the optimize_fetch_for_empty_queues branch from 02a3349 to 84b880f Compare April 20, 2023 17:00

Minimize fetches from empty queues

a81d710

sensorsasha force-pushed the optimize_fetch_for_empty_queues branch from 84b880f to a81d710 Compare April 20, 2023 18:35

sensorsasha added 9 commits April 20, 2023 19:39

Do not stop on the first CI fail

79df59f

Fix redlock namespacing

5ec29ab

Update docs

0ca26f6

Update version

8734f99

Update appraisals

ebc76dd

Do not fail if no namespace used

6aaf6e4

Sidekiq prefixes queue names so do we

6bff058

Logging for local list refresh

b0dcc61

Extract RedisSscan

5b8a33a

sensorsasha marked this pull request as ready for review April 21, 2023 14:27

sensorsasha requested review from davidbodow-st, stefanl-st and kevinrobell-st April 21, 2023 14:27

davidbodow-st reviewed Apr 21, 2023

sensorsasha and others added 3 commits April 21, 2023 20:56

Update README.md

c42276f

Co-authored-by: David Bodow <[email protected]>

- rename empty_queues_refresh_interval to empty_queues_refresh_interv…

dd79d89

…al_sec - add jitter for refresh interval

Add jitter to resurrector timers

2ed129d

sensorsasha requested a review from davidbodow-st April 21, 2023 19:25

davidbodow-st approved these changes Apr 21, 2023

stefanl-st approved these changes Apr 21, 2023

stefanl-st self-requested a review April 21, 2023 20:49

ixti approved these changes Apr 21, 2023

kevinrobell-st approved these changes Apr 21, 2023

stefanl-st reviewed Apr 21, 2023

sensorsasha and others added 2 commits April 24, 2023 17:37

Apply suggestions from code review

0d39c72

Co-authored-by: Stefan Lynggaard <[email protected]> Co-authored-by: Kevin Robell <[email protected]>

- List -> Cache

403d428

- Add configuration for THROTTLE_TIMEOUT - Address proposed naming changes - Remove RedisSscan in favor of `.sscan_each` from redis-rb - aed -> heartbeat - cthulhu -> resurrect

Fix type in the docs of empty_queues_cache_refresh_interval_sec

b060f74

sensorsasha merged commit d09d540 into master Apr 24, 2023

sensorsasha deleted the optimize_fetch_for_empty_queues branch April 24, 2023 17:58

sensorsasha mentioned this pull request Nov 19, 2023

Exponentially delay polling of queue that had no elements #1

Closed
