Round up for the batch size to improve par_iter performance #9814

MJohnson459 · 2023-09-14T19:50:03Z

Objective

The default division for a usize rounds down which means the batch sizes were too small when the max_size isn't exactly divisible by the batch count.

Solution

Changing the division to round up fixes this which can dramatically improve performance when using par_iter.

I created a small example to proof this out and measured some results. I don't know if it's worth committing this permanently so I left it out of the PR for now.

use std::{thread, time::Duration};

use bevy::{
    prelude::*,
    window::{PresentMode, WindowPlugin},
};

fn main() {
    App::new()
        .add_plugins((DefaultPlugins.set(WindowPlugin {
            primary_window: Some(Window {
                present_mode: PresentMode::AutoNoVsync,
                ..default()
            }),
            ..default()
        }),))
        .add_systems(Startup, spawn)
        .add_systems(Update, update_counts)
        .run();
}

#[derive(Component, Default, Debug, Clone, Reflect)]
pub struct Count(u32);

fn spawn(mut commands: Commands) {
    // Worst case
    let tasks = bevy::tasks::available_parallelism() * 5 - 1;
    // Best case
    // let tasks = bevy::tasks::available_parallelism() * 5 + 1;
    for _ in 0..tasks {
        commands.spawn(Count(0));
    }
}

// changing the bounds of the text will cause a recomputation
fn update_counts(mut count_query: Query<&mut Count>) {
    count_query.par_iter_mut().for_each(|mut count| {
        count.0 += 1;
        thread::sleep(Duration::from_millis(10))
    });
}

Results

I ran this four times, with and without the change, with best case (should favour the old maths) and worst case (should favour the new maths) task numbers.

Worst case

Before the change the batches were 9 on each thread, plus the 5 remainder ran on one of the threads in addition. With the change its 10 on each thread apart from one which has 9. The results show a decrease from ~140ms to ~100ms which matches what you would expect from the maths (10 * 10ms vs (9 + 4) * 10ms).

Best case

Before the change the batches were 10 on each thread, plus the 1 remainder ran on one of the threads in addition. With the change its 11 on each thread apart from one which has 5. The results slightly favour the new change but are basically identical as the total time is determined by the worse case which is 11 * 10ms for both tests.

The default usize division rounds down which leads to faulty batch sizes when the max_size isn't exactly divisible.

alice-i-cecile

Great PR description and clear demonstration of the effect. Thanks!

alice-i-cecile

Actually, can you please update the docs to describe this behaviour?

MJohnson459 · 2023-09-14T20:02:13Z

Thanks! I added a small clarification to the docs to be explicit about the rounding. Is there anywhere else the rough maths is mentioned? That is the docs the bevy 0.10 release notes point to and the only place I could find it mentioned.

james7132

I've been trying to construct a degenerate case where this might cause a perf regression but cannot think of one at this moment.

MJohnson459 · 2023-09-15T21:30:07Z

I've been trying to construct a degenerate case where this might cause a perf regression but cannot think of one at this moment.

The worst case I can come up with is if I change the test to have multiple different par_iter systems in the same set and each has a different time. The batch calculation above assumes the executor will evenly spread out the tasks across threads and while this is true for the single case, it isn't true for multiple. It seems that we can hit a case where two of the longer tasks are assigned the same thread.

Here is an example update which shows a sub-optimal allocation. I struggled to check tracy to show this nicely, I don't know why the long ones are shown under the short task.

Anyway, in this case, this change may make things worse (two 10ms vs two 9ms going back to the original example) but the scheduling noise is quite high so I wasn't able to show this well. It also seems the root of the issue here is with the executor, although I don't know if that's something that could even be fixed (I guess it assumes all tasks are equal?). The only thing I can think of would be somehow tell it all tasks from a single par_iter loop are likely to be a similar duration, a hairy assumption.

…ne#9814) # Objective The default division for a `usize` rounds down which means the batch sizes were too small when the `max_size` isn't exactly divisible by the batch count. ## Solution Changing the division to round up fixes this which can dramatically improve performance when using `par_iter`. I created a small example to proof this out and measured some results. I don't know if it's worth committing this permanently so I left it out of the PR for now. ```rust use std::{thread, time::Duration}; use bevy::{ prelude::*, window::{PresentMode, WindowPlugin}, }; fn main() { App::new() .add_plugins((DefaultPlugins.set(WindowPlugin { primary_window: Some(Window { present_mode: PresentMode::AutoNoVsync, ..default() }), ..default() }),)) .add_systems(Startup, spawn) .add_systems(Update, update_counts) .run(); } #[derive(Component, Default, Debug, Clone, Reflect)] pub struct Count(u32); fn spawn(mut commands: Commands) { // Worst case let tasks = bevy::tasks::available_parallelism() * 5 - 1; // Best case // let tasks = bevy::tasks::available_parallelism() * 5 + 1; for _ in 0..tasks { commands.spawn(Count(0)); } } // changing the bounds of the text will cause a recomputation fn update_counts(mut count_query: Query<&mut Count>) { count_query.par_iter_mut().for_each(|mut count| { count.0 += 1; thread::sleep(Duration::from_millis(10)) }); } ``` ## Results I ran this four times, with and without the change, with best case (should favour the old maths) and worst case (should favour the new maths) task numbers. ### Worst case Before the change the batches were 9 on each thread, plus the 5 remainder ran on one of the threads in addition. With the change its 10 on each thread apart from one which has 9. The results show a decrease from ~140ms to ~100ms which matches what you would expect from the maths (`10 * 10ms` vs `(9 + 4) * 10ms`). ![Screenshot from 2023-09-14 20-24-36](https://github.com/bevyengine/bevy/assets/1353401/82099ee4-83a8-47f4-bb6b-944f1e87a818) ### Best case Before the change the batches were 10 on each thread, plus the 1 remainder ran on one of the threads in addition. With the change its 11 on each thread apart from one which has 5. The results slightly favour the new change but are basically identical as the total time is determined by the worse case which is `11 * 10ms` for both tests. ![Screenshot from 2023-09-14 20-48-51](https://github.com/bevyengine/bevy/assets/1353401/4532211d-ab36-435b-b864-56af3370d90e)

Round up for the batch size

e48e72a

The default usize division rounds down which leads to faulty batch sizes when the max_size isn't exactly divisible.

alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work labels Sep 14, 2023

alice-i-cecile approved these changes Sep 14, 2023

View reviewed changes

alice-i-cecile requested changes Sep 14, 2023

View reviewed changes

Add snippet to docs to describe behaviour

55299bf

james7132 self-requested a review September 14, 2023 20:29

james7132 approved these changes Sep 14, 2023

View reviewed changes

alice-i-cecile approved these changes Sep 15, 2023

View reviewed changes

alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Sep 15, 2023

alice-i-cecile added this pull request to the merge queue Sep 18, 2023

Merged via the queue into bevyengine:main with commit 68fa81e Sep 18, 2023
22 checks passed

cart mentioned this pull request Oct 13, 2023

News: Release 0.12 bevyengine/bevy-website#754

Merged

43 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Round up for the batch size to improve par_iter performance #9814

Round up for the batch size to improve par_iter performance #9814

MJohnson459 commented Sep 14, 2023

alice-i-cecile left a comment

alice-i-cecile left a comment •

edited

Loading

MJohnson459 commented Sep 14, 2023

james7132 left a comment

MJohnson459 commented Sep 15, 2023

Round up for the batch size to improve par_iter performance #9814

Round up for the batch size to improve par_iter performance #9814

Conversation

MJohnson459 commented Sep 14, 2023

Objective

Solution

Results

Worst case

Best case

alice-i-cecile left a comment

Choose a reason for hiding this comment

alice-i-cecile left a comment • edited Loading

Choose a reason for hiding this comment

MJohnson459 commented Sep 14, 2023

james7132 left a comment

Choose a reason for hiding this comment

MJohnson459 commented Sep 15, 2023

alice-i-cecile left a comment •

edited

Loading