Replace the local worker queues with st3's #115

Draft: wants to merge 4 commits into master

@james7132 (Contributor) commented Apr 22, 2024

Fixes #32. This PR replaces the fixed-size local worker queues with st3's implementation. The existing implementation in this crate is largely the same, but st3's should use considerably fewer atomic operations.

Performance-wise, this is a major improvement in most benchmarks, particularly the single-threaded ones, since pushing no longer requires any atomic operations and popping requires only one. There are regressions, though. multi_thread/executor::spawn_one and multi_thread/static_executor::spawn_one are both more than 8x slower (+725% and +755%), and single_thread/executor::spawn_batch is about 34% slower. It's unclear why spawn_one regresses. My current working theory is that the atomic-free push to local queues is putting the global queue under higher contention.

executor::create        time:   [725.66 ns 726.23 ns 726.98 ns]
                        change: [-32.237% -31.965% -31.775%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

single_thread/executor::spawn_one
                        time:   [923.93 ns 936.62 ns 950.73 ns]
                        change: [-37.180% -33.978% -30.511%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  11 (11.00%) high mild
single_thread/executor::spawn_batch
                        time:   [34.023 µs 36.536 µs 39.778 µs]
                        change: [+22.133% +34.229% +44.880%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [4.6916 ms 4.7248 ms 4.7611 ms]
                        change: [-3.5935% -2.6420% -1.6484%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
single_thread/executor::spawn_recursively
                        time:   [35.261 ms 35.584 ms 35.936 ms]
                        change: [-26.953% -26.069% -25.097%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
single_thread/executor::yield_now
                        time:   [5.4241 ms 5.4290 ms 5.4344 ms]
                        change: [-10.452% -9.0866% -7.9914%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

multi_thread/executor::spawn_one
                        time:   [14.511 µs 14.882 µs 15.177 µs]
                        change: [+674.99% +725.04% +767.31%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_batch
                        time:   [53.164 µs 58.006 µs 63.134 µs]
                        change: [-19.348% -12.758% -5.2161%] (p = 0.00 < 0.05)
                        Performance has improved.
multi_thread/executor::spawn_many_local
                        time:   [27.513 ms 27.608 ms 27.705 ms]
                        change: [+1.8542% +2.4549% +3.0788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 17.7s, or reduce sample count to 20.
multi_thread/executor::spawn_recursively
                        time:   [174.31 ms 174.66 ms 175.04 ms]
                        change: [-1.8165% -1.5293% -1.2438%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
multi_thread/executor::yield_now
                        time:   [23.860 ms 23.931 ms 23.996 ms]
                        change: [-1.8530% -1.4776% -1.0672%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild

single_thread/static_executor::spawn_one
                        time:   [671.98 ns 680.60 ns 689.94 ns]
                        change: [-53.423% -50.975% -48.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
single_thread/static_executor::spawn_many_local
                        time:   [4.4846 ms 4.5148 ms 4.5500 ms]
                        change: [-11.369% -10.414% -9.3355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
single_thread/static_executor::spawn_recursively
                        time:   [24.470 ms 24.599 ms 24.738 ms]
                        change: [-6.6356% -5.5074% -4.4066%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
single_thread/static_executor::yield_now
                        time:   [5.3366 ms 5.3424 ms 5.3486 ms]
                        change: [-6.6970% -6.4629% -6.2391%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

multi_thread/static_executor::spawn_one
                        time:   [13.876 µs 14.197 µs 14.443 µs]
                        change: [+704.49% +755.40% +805.04%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/static_executor::spawn_many_local
                        time:   [4.5342 ms 4.6639 ms 4.7927 ms]
                        change: [-21.861% -19.142% -16.242%] (p = 0.00 < 0.05)
                        Performance has improved.
multi_thread/static_executor::spawn_recursively
                        time:   [43.786 ms 44.057 ms 44.264 ms]
                        change: [-0.6134% -0.0025% +0.5506%] (p = 1.00 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
multi_thread/static_executor::yield_now
                        time:   [23.979 ms 24.052 ms 24.121 ms]
                        change: [+0.4275% +0.8193% +1.1661%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
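
To put the spawn_one regressions in absolute terms, the pre-change time can be estimated from the reported time and relative change. This is a back-of-the-envelope sketch: the helper names are made up for illustration, and the result is only approximate because Criterion derives the change from the full sample distributions rather than from the two point estimates.

```rust
/// Slowdown (or speedup) factor implied by a relative change in percent.
/// A change of +725.04% means the new time is about 8.25x the old time.
fn slowdown_factor(change_percent: f64) -> f64 {
    1.0 + change_percent / 100.0
}

/// Approximate pre-change time, in the same unit as `new_time`.
fn baseline_from_change(new_time: f64, change_percent: f64) -> f64 {
    new_time / slowdown_factor(change_percent)
}
```

Plugging in the median figures above, `baseline_from_change(14.882, 725.04)` puts multi_thread/executor::spawn_one at roughly 1.8 µs before this change, against 14.9 µs after it.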

This is a breaking change: the future returned by Executor::run is no longer Send or Sync.
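
For downstream code, the practical effect is that a run future can no longer be created on one thread and then moved to another. One way to adapt is to create and drive the future on the same thread. The sketch below uses only std and makes some assumptions: `run_not_send` is a hypothetical stand-in for `Executor::run` (it is not Send because it holds an `Rc` across an await point), and `block_on` is a small hand-written helper, not part of this crate's API.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// Waker that unparks the thread blocked in `block_on`.
struct ThreadWaker(thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal single-future driver (illustrative helper, std only).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for a future that is neither Send nor Sync:
/// the `Rc` is alive across the await point.
fn run_not_send() -> impl Future<Output = u32> {
    async {
        let counter = Rc::new(Cell::new(0u32));
        std::future::ready(()).await;
        counter.set(counter.get() + 1);
        counter.get()
    }
}

/// Create the future inside the thread that drives it, so the future itself
/// never crosses a thread boundary. Only the `u32` result is sent back.
fn run_on_dedicated_thread() -> u32 {
    // Building the future out here and moving it into the closure would not
    // compile, because `thread::spawn` requires the closure to be Send.
    thread::spawn(|| block_on(run_not_send()))
        .join()
        .expect("worker thread panicked")
}
```

Code that previously spawned the run future onto another executor or thread pool would need a similar restructuring, or would have to keep the future pinned to the thread that created it.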

Linked issue: Consider using st3 for work-stealing