Noticeable schedules running overhead #11932
Might be related to #11931.
Note that on Discord we looked into this a bit more, and still got a regression vs 0.12 with system stepping off. It worsened the problem, but wasn't fully the cause.
I was able to bisect it to #8453, but not sure what could be causing it yet. Everything obvious seems to be
I tried cfg'ing out the changes to the run function signature and it seemed to get rid of most of the extra overhead. Not sure why that helped, and we probably don't want to be changing function signatures, so I'm not going to PR it. main...hymm:bevy:investigate-empty-system-overhead
I dug into the Empty Schedule Benchmark. The stepping changes touch two places every frame. First, the `begin_frame()` system registration:
```rust
#[cfg(feature = "bevy_debug_stepping")]
{
    use bevy_ecs::schedule::{IntoSystemConfigs, Stepping};
    app.add_systems(Main, Stepping::begin_frame.before(Main::run_main));
}
```
Second, `Schedule::run()` (bevy/crates/bevy_ecs/src/schedule/schedule.rs, lines 338 to 342 at abb8c35):
```rust
#[cfg(feature = "bevy_debug_stepping")]
let skip_systems = match world.get_resource_mut::<Stepping>() {
    None => None,
    Some(mut stepping) => stepping.skipped_systems(self),
};
```
```rust
if let Some(mut skipped_systems) = _skip_systems {
```
- check if `_skip_systems` is `None`
Based on the structure of the benchmark (five empty schedules), the changes to `Schedule::run()` (the `get_resource_mut()`) are being hit 5x more often than `begin_frame()`, but honestly that reflects real-world use.
Stepping Performance Improvement Idea
Even without concrete numbers, there are a couple of ways we can remove the stepping performance impact:
The biggest performance win will be to move the skip list from `Stepping` into the `Schedule`, then have `Stepping` apply changes to the skip list during `begin_frame()`. This eliminates the `get_resource_mut()` for every schedule every frame. A side effect is that `Stepping` loses dynamic schedule order detection, which honestly was a bad idea on my part. Instead we can just use the explicit order the schedules were added to `Stepping`, and provide mechanisms to change the order.
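A rough, self-contained sketch of what this could look like (simplified stand-in types, not Bevy's actual `Schedule`/`Stepping` API):
```rust
// Simplified stand-ins for Schedule and Stepping to illustrate the idea:
// the schedule owns its own skip list, and the stepping controller writes
// into it once per frame instead of the executor fetching a Stepping
// resource on every Schedule::run().
use fixedbitset::FixedBitSet;

struct MySchedule {
    /// Systems to skip this frame; read directly by the executor.
    skipped_systems: FixedBitSet,
}

struct MyStepping {
    /// Skip list computed from the user's stepping commands.
    pending_skips: FixedBitSet,
}

impl MyStepping {
    /// The begin_frame() equivalent: push the skip list into the schedule
    /// so running the schedule never needs resource access.
    fn begin_frame(&self, schedule: &mut MySchedule) {
        schedule.skipped_systems.clone_from(&self.pending_skips);
    }
}

fn main() {
    let stepping = MyStepping { pending_skips: FixedBitSet::with_capacity(8) };
    let mut schedule = MySchedule { skipped_systems: FixedBitSet::with_capacity(8) };
    stepping.begin_frame(&mut schedule);
    // schedule.run() would now consult schedule.skipped_systems directly.
    assert_eq!(schedule.skipped_systems.len(), 8);
}
```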
A second performance improvement is to remove the `begin_frame()` system unless the `Stepping` resource is present. Right now we check for the resource every frame; we could eliminate that if we could add/remove systems at runtime. As that's not currently possible, we can instead add a run condition to `begin_frame()` that checks a global variable in `schedule::stepping` to see if it should run.
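A minimal sketch of that run condition, assuming a hypothetical `STEPPING_IN_USE` flag in `schedule::stepping` (this static and its name are not existing Bevy API):
```rust
// Hypothetical global flag, set when stepping is enabled; checking it is a
// single relaxed atomic load instead of a resource lookup every frame.
use std::sync::atomic::{AtomicBool, Ordering};

pub static STEPPING_IN_USE: AtomicBool = AtomicBool::new(false);

/// Run condition for Stepping::begin_frame.
pub fn stepping_in_use() -> bool {
    STEPPING_IN_USE.load(Ordering::Relaxed)
}

// Registration would then look roughly like:
// app.add_systems(Main, Stepping::begin_frame.run_if(stepping_in_use).before(Main::run_main));
```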
As an alternative to both of these, we can move the needed stepping run state out of the resource and into a global variable. It's effectively the combination of the two options above. The `Stepping` resource either goes away or becomes an interface to that global state. I'm not really a fan of this as it feels very dirty, but it should be mentioned.
That's crazy, just the removal of the
Oh that's very weird. Wait, could this be that
Oh yeah, if we're hitting the allocator there, that might be it.
I jumped the gun, that's actually a call to `Vec::len()`:
```
# verifying the address is `MultithreadedExecutor::run()`
(lldb) image lookup -a 0x00000001032b4d40
Address: breakout[0x00000001032b4d40] (breakout.__TEXT.__text + 53158848)
Summary: breakout`_$LT$bevy_ecs..schedule..executor..multi_threaded..MultiThreadedExecutor$u20$as$u20$bevy_ecs..schedule..executor..SystemExecutor$GT$::run::hdf2d2d0ce9eb657b at multi_threaded.rs:166
# disassemble `MultithreadedExecutor::run()`:
(lldb) x/40i 0x00000001032b4d40
0x1032b4d40: 55 other pushq %rbp
0x1032b4d41: 48 89 e5 other movq %rsp, %rbp
0x1032b4d44: 48 81 ec b0 04 00 00 other subq $0x4b0, %rsp ; imm = 0x4B0
0x1032b4d4b: 48 89 b5 88 fc ff ff other movq %rsi, -0x378(%rbp)
0x1032b4d52: 48 89 f8 other movq %rdi, %rax
0x1032b4d55: 48 8b bd 88 fc ff ff other movq -0x378(%rbp), %rdi
0x1032b4d5c: 48 89 85 90 fc ff ff other movq %rax, -0x370(%rbp)
0x1032b4d63: 48 89 bd 98 fc ff ff other movq %rdi, -0x368(%rbp)
0x1032b4d6a: 48 89 95 a0 fc ff ff other movq %rdx, -0x360(%rbp)
0x1032b4d71: 48 89 8d a8 fc ff ff other movq %rcx, -0x358(%rbp)
0x1032b4d78: 48 89 45 88 other movq %rax, -0x78(%rbp)
0x1032b4d7c: 48 89 7d 90 other movq %rdi, -0x70(%rbp)
0x1032b4d80: 48 89 4d 98 other movq %rcx, -0x68(%rbp)
0x1032b4d84: c6 45 87 00 other movb $0x0, -0x79(%rbp)
0x1032b4d88: c6 45 86 00 other movb $0x0, -0x7a(%rbp)
0x1032b4d8c: c6 45 85 00 other movb $0x0, -0x7b(%rbp)
0x1032b4d90: c6 45 84 00 other movb $0x0, -0x7c(%rbp)
0x1032b4d94: c6 45 83 00 other movb $0x0, -0x7d(%rbp)
0x1032b4d98: c6 45 84 01 other movb $0x1, -0x7c(%rbp)
0x1032b4d9c: 48 83 c7 18 other addq $0x18, %rdi
0x1032b4da0: e8 cb 7d fe ff call callq 0x10329cb70
0x1032b4da5: 48 89 85 b0 fc ff ff other movq %rax, -0x350(%rbp)
0x1032b4dac: eb 3c jump jmp 0x1032b4dea
0x1032b4dae: 48 8b 95 a0 fc ff ff other movq -0x360(%rbp), %rdx
0x1032b4db5: 48 be 00 00 00 00 00 00 00 80 other movabsq $-0x8000000000000000, %rsi ; imm = 0x8000000000000000
0x1032b4dbf: b8 01 00 00 00 other movl $0x1, %eax
0x1032b4dc4: 31 c9 other xorl %ecx, %ecx
0x1032b4dc6: 48 39 32 other cmpq %rsi, (%rdx)
0x1032b4dc9: 48 0f 44 c1 other cmoveq %rcx, %rax
0x1032b4dcd: 48 83 f8 01 other cmpq $0x1, %rax
0x1032b4dd1: 0f 84 43 0c 00 00 cond jump je 0x1032b5a1a
0x1032b4dd7: e9 44 0c 00 00 jump jmp 0x1032b5a20
0x1032b4ddc: 48 89 c1 other movq %rax, %rcx
0x1032b4ddf: 89 d0 other movl %edx, %eax
0x1032b4de1: 48 89 4d a0 other movq %rcx, -0x60(%rbp)
0x1032b4de5: 89 45 a8 other movl %eax, -0x58(%rbp)
0x1032b4de8: eb c4 jump jmp 0x1032b4dae
0x1032b4dea: 48 8b 85 90 fc ff ff other movq -0x370(%rbp), %rax
0x1032b4df1: 48 8b 8d b0 fc ff ff other movq -0x350(%rbp), %rcx
0x1032b4df8: 48 89 88 c8 01 00 00 other movq %rcx, 0x1c8(%rax)
# Check the first call, looks like a Vec alloc -- NOPE, is Vec::len()
(lldb) image lookup -a 0x10329cb70
Address: breakout[0x000000010329cb70] (breakout.__TEXT.__text + 53060080)
Summary: breakout`alloc::vec::Vec$LT$T$C$A$GT$::len::hed1b7d1b42e2eadd at mod.rs:2141
```
I read further into the mangled function name; it's not an alloc, it's a `Vec::len()` call.
I retract my request. I'm making this code change and it's getting complicated quickly. I'll keep poking at it trying to figure out why the argument is causing problems.
Alright, I broke down the disassembly of v0.13.0 with stepping enabled. We're not explicitly allocing from what I can see here with stepping disabled, but what I do see is calls to `drop_in_place()` for `FixedBitSet`:
```
# calls to `drop_in_place()` for FixedBitSet in v0.13 with stepping enabled
dmlary@beast:~/src/bevy/vendor/bevy/benches> grep callq ../disass-run-stepping.txt | grep drop_in_place | grep fixedbitset
0x100445970 <+208>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x10044598c <+236>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x100445b23 <+643>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x1004460fb <+2139>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x10043d400 <+208>: callq 0x1004f4c50 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::h2cecbb3fcde869ed at mod.rs:507
0x10043d8a9 <+1401>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x10043dfc4 <+3220>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
0x10043e020 <+3312>: callq 0x1005c3c50 ; core::ptr::drop_in_place$LT$fixedbitset..FixedBitSet$GT$::h6dcfbf4b71ae901f at mod.rs:507
# Calls for `drop_in_place()` for `Option<FixedBitSet>` in v0.13 with stepping disabled
dmlary@beast:~/src/bevy/vendor/bevy/benches> grep callq ../disass-run-without-stepping-013.txt | grep drop_in_place | grep fixedbitset
0x100484d85 <+101>: callq 0x100434aa0 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
0x100484eb5 <+405>: callq 0x100434aa0 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
0x100489ea9 <+105>: callq 0x100434aa0 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
0x100489ee7 <+167>: callq 0x100434aa0 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
0x10048a773 <+2355>: callq 0x100434aa0 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
# calls to `drop_in_place()` for Option<FixedBitSet> with @hymm's feature gate on the arg
dmlary@beast:~/src/bevy/vendor/bevy/benches> grep callq ../disass-run-without-stepping.txt | grep drop_in_place | grep fixedbitset
0x1004d3b7a <+186>: callq 0x100434e20 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
0x1004d3b93 <+211>: callq 0x100434e20 ; core::ptr::drop_in_place$LT$core..option..Option$LT$fixedbitset..FixedBitSet$GT$$GT$::hc51e23594530a709 at mod.rs:507
```
Just looking at the differences between @hymm's version and v0.13, both with stepping disabled, my best guess is that the difference comes down to the cost of dropping the `Option<FixedBitSet>` argument. For v0.13 with stepping disabled, we do allocate an additional 80 bytes on the stack versus hymm's with stepping disabled, but I wasn't able to find us copying those 80 bytes over. I'm unable to explain this behavior for now. I'll take another look tomorrow.
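For illustration, here is a standalone (non-Bevy) sketch of why a by-value `Option<FixedBitSet>` parameter produces `drop_in_place` calls at the call sites, while a borrowed `Option<&FixedBitSet>` does not:
```rust
// Standalone illustration, not Bevy code: dropping an owned Option<FixedBitSet>
// at the end of the callee frees its heap buffer (the drop_in_place calls seen
// above), while Option<&FixedBitSet> is just a pointer-sized copy.
use fixedbitset::FixedBitSet;

fn run_by_value(skip: Option<FixedBitSet>) -> usize {
    // `skip` is owned here; when it goes out of scope the compiler emits
    // drop_in_place::<Option<FixedBitSet>>(), freeing the buffer if Some.
    skip.map_or(0, |s| s.count_ones(..))
}

fn run_by_ref(skip: Option<&FixedBitSet>) -> usize {
    // Nothing is dropped or freed here; the caller keeps ownership.
    skip.map_or(0, |s| s.count_ones(..))
}

fn main() {
    let bits = FixedBitSet::with_capacity(64);
    println!("{}", run_by_value(Some(bits.clone())));
    println!("{}", run_by_ref(Some(&bits)));
    println!("{}", run_by_value(None));
}
```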
Just to rule it out, I tried `&Option<FixedBitSet>`:
Tried `&Option<FixedBitSet>` for SystemExecutor::run(), but it didn't change the performance hit of the argument. related bevyengine#11932
# Objective

* Fixes #11932 (performance impact when stepping is disabled)

## Solution

The `Option<FixedBitSet>` argument added to `ScheduleExecutor::run()` in #8453 caused a measurable performance impact even when stepping is disabled. This can be seen by the benchmark of running `Schedule::run()` on an empty schedule in a tight loop (#11932 (comment)). I was able to get the same performance results as on 0.12.1 by changing the argument of `ScheduleExecutor::run()` from `Option<FixedBitSet>` to `Option<&FixedBitSet>`. The downside of this change is that `Schedule::run()` now takes about 6% longer (3.7319 ms vs 3.9855 ms) when stepping is enabled.

## Changelog

* Changed the `_skipped_systems` argument of `ScheduleExecutor::run()` from `Option<FixedBitSet>` to `Option<&FixedBitSet>`
* Added a few benchmarks to measure `Schedule::run()` performance with various executors
Bevy version 0.13
Relevant system information
What's performing poorly?
The schedule is running slower than before, which slowed my program down by 20%. This can hurt real-time simulation applications and may significantly lower performance as you scale up to more schedules and systems.
It can be reproduced by running empty schedules:
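A minimal sketch of running empty schedules in a tight loop, assuming plain `bevy_ecs` (the exact reproduction code may differ):
```rust
// Run an empty bevy_ecs Schedule in a tight loop and time it; the per-run
// overhead being reported shows up even with no systems added.
use bevy_ecs::schedule::Schedule;
use bevy_ecs::world::World;
use std::time::Instant;

fn main() {
    let mut world = World::new();
    let mut schedule = Schedule::default();

    let start = Instant::now();
    for _ in 0..1_000_000 {
        schedule.run(&mut world);
    }
    println!("1M empty Schedule::run() calls: {:?}", start.elapsed());
}
```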
Performance table for this code:
With LTO enabled
With LTO disabled
The slowdown is obvious, and I didn't use a profiler because this overhead is not caused by user systems.