Establishing More Reliable Benchmarks #655

zicklag · 2020-10-10T16:20:00Z

zicklag
Oct 10, 2020
Collaborator

So I was just working on the Bevy ECS trying to prepare it for scripting, and I had been running the ecs_bench to make sure I didn't introduce performance regressions while modifying one of the really "hot" portions of the code. But then I ran into results that are really making me doubt the usefulness of these micro benchmarks.

The Experiment

Take this simple experiment for example:

# Clone Bevy
$ git clone https://github.com/bevyengine/bevy.git
# Clone my fork of the ecs_bench, which is identical to Carters, but with the non-bevy benches
# commented out for the sake of time.
$ git clone https://github.com/katharostech/ecs_bench.git --branch just-bevy-benches
# Go to the benchmarks
$ cd ecs_bench
# Run the benchmarks
$ cargo bench
pos_vel/bevy            time:   [932.68 ns 937.92 ns 944.68 ns]                          
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
pos_vel/bevy_entity     time:   [937.14 ns 942.81 ns 948.24 ns]                                 
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
pos_vel/bevy_foreach    time:   [959.35 ns 980.90 ns 1.0056 us]                                  
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe
pos_vel/bevy_foreach_entity                                                                             
                        time:   [947.12 ns 953.02 ns 959.04 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

build/bevy              time:   [325.21 us 329.16 us 333.92 us]                       
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

To highlight the number's we'll be looking at:

pos_vel/bevy mean: 937.92 ns
pos_vel/bevy_entity mean: 942.81 ns

OK, now that we've run the benchmark once, edit the benches/pos_vel/bevy.rs file and comment out the bevy_foreach benchmarks so that the bench() function body looks like this:

let world = build();
let resources = Resources::default();
let mut system = move_system.system();
group.bench_function("bevy", |b| b.iter(|| system.run(&world, &resources)));

let world = build();
let resources = Resources::default();
let mut system = move_system_entity.system();
group.bench_function("bevy_entity", |b| b.iter(|| system.run(&world, &resources)));

// Comment these out

// let world = build();
// let resources = Resources::default();
// let mut system = move_system_foreach.system();
// group.bench_function("bevy_foreach", |b| {
//     b.iter(|| system.run(&world, &resources))
// });

// let world = build();
// let resources = Resources::default();
// let mut system = move_system_foreach_entity.system();
// group.bench_function("bevy_foreach_entity", |b| {
//     b.iter(|| system.run(&world, &resources));
// });

Now run the bench again:

$ cargo bench
pos_vel/bevy            time:   [1.8610 us 1.8639 us 1.8672 us]                          
                        change: [+95.808% +97.307% +98.741%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
pos_vel/bevy_entity     time:   [1.9662 us 1.9874 us 2.0106 us]                                 
                        change: [+111.43% +114.15% +116.99%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

build/bevy              time:   [317.80 us 321.66 us 325.78 us]                       
                        change: [-5.8767% -4.4653% -3.1427%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Now lets check our iteration times:

pos_vel/bevy mean: 1.8639 us +97.307%
pos_vel/bevy_entity mean: 1.9874 us +114.15%

So we did not change any bevy code. All we did is comment out some of the benchmarks and it made it 100% slower!

I could be missing something, and maybe this is relatively normal and the benchmarks are still useful if the only variable is bevy's own code, but even in that situation I ran into some really strange performance effects. Again, maybe that is still normal, and that code could be sensitive to changes, but the above experiment is making me a lot less sure about how useful guiding optimization based on at least our current benchmarks is.

How Can We Help This?

So I wanted to start this discussion to start working out how we might be able to create better benchmarks or profiling solutions. I'm open to anything. I've thought about creating real games as benchmarks and then measuring frames per second. Or maybe an intrusive profiler is more useful, but profilers have overhead, so I'm not sure. And maybe we still use the current benchmarks and we just need more different strategies for measuring performance. I don't know, but I think it's definitely important to realize that micro-benchmarking probably isn't going to give us a true impression of Bevy's performance and we're going to want to start understanding how we can track Bevy's performance more reliably.

What do you think?

cart · 2020-10-10T17:09:28Z

cart
Oct 10, 2020
Maintainer

I also noticed the behavior much earlier in Bevy's development and eventually found something "stable", but it was extremely sensitive to changes. I have a feeling that one of the many changes made since then regressed here.

I think a part of this is that micro benchmarks will always have different performance characteristics. But I also think the current bevy_hecs is probably more sensitive to changes than "normal".

Interestingly, my experimental safe(-er) hecs refactor remains nice and stable when you run the "test" above. Its not ready for prime time / I'm not yet sure its the right path to take bevy_ecs on, but I pushed the branch anyway in case you want to test. If it makes bevy perform better on the average case, thats maybe another reason to polish it up and merge it.

2 replies

zicklag Oct 10, 2020
Collaborator Author

Oh, that's interesting. I'll try out your experimental branch. If it was safe-er and stayed more stable across changes that would be nice. Then it would probably be easier to accept contributions by combination of less unsafe and more performance stability.

It seems like we might still want to devise at least a larger variety of benchmarks so that we can measure the impact better. Maybe we can create some bevy_templates and then use them as benchmarks.

Maybe I'll try to throw together some small benchmark games for testing. If we have different full games that should better test how the code performs in the context of a larger set of unrelated code, which is where we see the strange instability in my experiment above.

cart Oct 10, 2020
Maintainer

Yeah i think thats a good call. Maybe a "headless game" is a good idea, as that would eliminate rendering / audio / input from the equation.

zicklag · 2020-10-11T15:32:26Z

zicklag
Oct 11, 2020
Collaborator Author

Found some useful tips performance optimizing here:

https://github.com/flamegraph-rs/flamegraph#systems-performance-work-guided-by-flamegraphs

0 replies

zicklag · 2020-10-13T00:54:52Z

zicklag
Oct 13, 2020
Collaborator Author

I created a repo for bevy benchmark games and added an asteroids-ish game. I still don't know exactly the best metrics to collect or tools to use to actually do benchmarking/profiling, but I'm looking into it. Tools like Linux perf and Valgrind seem useful for this kind of stuff, but I don't know how to use them.

https://github.com/katharostech/bevy_benchmark_games

Just measuring the time it takes for so many frames to run might be fine enough. On Linux the benchmark game is able to use the perf_event crate to count number of CPU cycles and instructions for the game execution, which seems like a useful metric.

Anyway, I'll probably put together one more game example and then start figuring out what kind of metrics to actually collect.

If anybody has profiling or benchmarking experience and could give me some pointers that would be great. 😃

4 replies

zicklag Oct 14, 2020
Collaborator Author

Update the benchmark repo so that it now runs the asteroids game for 2000 frames at a time and grabs the average frame time, the CPU instructions, and the CPU cycles, then it does that 50 times. I'm going to try to get a nice workflow similar to criterion so you can just cargo run and it will collect and compare all of the benchmark metrics for you with graphs and such, too. The difference between this and criterion is mostly that I'm also collecting the CPU instructions data, not just the amount of time it takes to run. I think that metric might be particularly helpful as it won't register differently based on the CPU load I don't think.

zicklag Oct 15, 2020
Collaborator Author

Now I've got the benchmarks creating a report similar to criterion, but also featuring the CPU instructions and CPU cycles graphs:

The cool part is that the CPU instructions graph accurately doesn't change hardly 0.1% when running multiple benchmarks without changes to the code. I think that the CPU instructions graph will be less noisy the timing-based benchmarks probably.

I threw CPU cycles in there because I could, but I don't specifically know what it's useful for. 🤷 Maybe somebody else does, so I'll leave it in there for now.

I think all that's left now is to add a couple more benchmark games.

zicklag Oct 15, 2020
Collaborator Author

Just ported the breakout example as a benchmark:

It doesn't test much specific, like heavy concurrency or anything, but I think that's OK. I think the point should be to slowly build on the benchmarks and make our suites more comprehensive over time similar to unit tests or something like that.

cart Oct 15, 2020
Maintainer

Very cool :)

I have a feeling that these will be very useful!

zicklag · 2020-10-13T15:50:31Z

zicklag
Oct 13, 2020
Collaborator Author

Opened a Rust forum topic to see if we can get any pointers from people with benchmarking tips or experience:

https://users.rust-lang.org/t/guidance-for-profiling-and-benchmarking-for-bevy-game-engine/50062?u=zicklag

3 replies

zicklag Oct 16, 2020
Collaborator Author

An interesting post on the Rust forum mentions that cache alignment could be the cause of weird performance differences caused by commenting out code:

https://users.rust-lang.org/t/compilers-are-black-magic-mostly-good-black-magic/50093/8?u=zicklag

IndianBoy42: Actually I would think that instruction cache alignment is more likely cause than branch prediction weirdness. I remember seeing a couple of other articles/posts where microbenchmarks got weird results that they determined was due to differing cache alignment of the code, which was essentially random based on other parts of the code

cart Oct 16, 2020
Maintainer

Its reassuring to know that this might be an instance of a more generic problem, but it also makes me trust everything so much less now. Bawhaha like why measure anything at all 🤷

zicklag Oct 16, 2020
Collaborator Author

I know, right? Benchmarking is so hard. 🤦

zicklag · 2020-10-16T15:08:01Z

zicklag
Oct 16, 2020
Collaborator Author

@cart I started checking out your archetype_rework branch, but couldn't try it out with the new benchmarks yet because the renderer won't compile with it. I wanted to make sure you thought it made sense for me to get that in a compiling state so we can test it with the benchmarks. I don't think it's far from working, I just have to figure out how to adapt a section of the renderer to work with the new resources setup.

Also, It'd be good for me to get more familiar with the ECS internals anyway so I could start cleaning up that code if you wanted and trying to get it merge ready. I'm not sure how far from the design you were going for that it is, but if the design is essentially there, I could probably fix the remaining failing tests ( I think there's one failing I think, and a new one I wrote that's failing for SOA ) and otherwise clean up the old comments and such.

Edit: actually if we wanted to benchmark it I forgot that I can leave the renderer out and run it headless, like I was already doing anyway, I just have to disable the feature.

3 replies

cart Oct 16, 2020
Maintainer

I'd say hold off for now on getting it merge ready. I still want to spend a bit more time with it before even considering it as an option. If its easy to benchmark it (esp in headless mode) then thats cool, but definitely don't invest too much time in it as my interest in its performance is largely academic at this point.

zicklag Oct 16, 2020
Collaborator Author

OK, sounds good. I'll benchmark it just for for the experiment's sake then and then move on.

BTW, I'm really loving the simplicity of the archetype module in the refactor design. 😃

zicklag Oct 16, 2020
Collaborator Author

Ah, the resources are not fully implemented yet so the benchmarks won't run. I'll just skip benchmarking it for now.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Establishing More Reliable Benchmarks #655

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 12 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Establishing More Reliable Benchmarks #655

zicklag Oct 10, 2020 Collaborator

The Experiment

How Can We Help This?

Replies: 5 comments · 12 replies

cart Oct 10, 2020 Maintainer

zicklag Oct 10, 2020 Collaborator Author

cart Oct 10, 2020 Maintainer

zicklag Oct 11, 2020 Collaborator Author

zicklag Oct 13, 2020 Collaborator Author

zicklag Oct 14, 2020 Collaborator Author

zicklag Oct 15, 2020 Collaborator Author

zicklag Oct 15, 2020 Collaborator Author

cart Oct 15, 2020 Maintainer

zicklag Oct 13, 2020 Collaborator Author

zicklag Oct 16, 2020 Collaborator Author

cart Oct 16, 2020 Maintainer

zicklag Oct 16, 2020 Collaborator Author

zicklag Oct 16, 2020 Collaborator Author

cart Oct 16, 2020 Maintainer

zicklag Oct 16, 2020 Collaborator Author

zicklag Oct 16, 2020 Collaborator Author

zicklag
Oct 10, 2020
Collaborator

Replies: 5 comments 12 replies

cart
Oct 10, 2020
Maintainer

zicklag Oct 10, 2020
Collaborator Author

cart Oct 10, 2020
Maintainer

zicklag
Oct 11, 2020
Collaborator Author

zicklag
Oct 13, 2020
Collaborator Author

zicklag Oct 14, 2020
Collaborator Author

zicklag Oct 15, 2020
Collaborator Author

zicklag Oct 15, 2020
Collaborator Author

cart Oct 15, 2020
Maintainer

zicklag
Oct 13, 2020
Collaborator Author

zicklag Oct 16, 2020
Collaborator Author

cart Oct 16, 2020
Maintainer

zicklag Oct 16, 2020
Collaborator Author

zicklag
Oct 16, 2020
Collaborator Author

cart Oct 16, 2020
Maintainer

zicklag Oct 16, 2020
Collaborator Author

zicklag Oct 16, 2020
Collaborator Author