Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache parallel iteration spans #9950

Merged
merged 1 commit into from
Sep 30, 2023

Conversation

james7132
Copy link
Member

@james7132 james7132 commented Sep 28, 2023

Objective

We cached system spans in #9390, but another common span seen in most Bevy apps when enabling tracing are Query::par_iter(_mut) related spans.

Solution

Cache them in QueryState. The one downside to this is that we pay for the memory for every Query(State) instantiated, not just those that are used for parallel iteration, but this shouldn't be a significant cost unless the app is creating hundreds of thousands of Query(State)s regularly.

Metrics

Tested against cargo run --profile stress-test --features trace_tracy --example many_cubes. Yellow is this PR, red is main.

sync_simple_transforms:

image

check_visibility:

image

Full frame:

image

@james7132 james7132 added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times A-Diagnostics Logging, crash handling, error reporting and performance analysis and removed A-ECS Entities, components, systems, and events labels Sep 28, 2023
@alice-i-cecile alice-i-cecile added the A-ECS Entities, components, systems, and events label Sep 28, 2023
@hymm hymm self-requested a review September 29, 2023 05:08
@hymm hymm added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Sep 30, 2023
@superdump superdump added this pull request to the merge queue Sep 30, 2023
Merged via the queue into bevyengine:main with commit 95813b8 Sep 30, 2023
27 checks passed
count = len,
);
#[cfg(feature = "trace")]
let task = task.instrument(span);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling instrument is supposed to make the span work across async boundaries, is it not needed anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The task itself never yields to the async executor. There's no async boundary to work across. Best to avoid the overhead and just enter the span inside the task instead.

github-merge-queue bot pushed a commit that referenced this pull request Oct 1, 2023
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including #9950's changes, so this may actually
result in even faster speedups once that's merged in.
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
# Objective
We cached system spans in bevyengine#9390, but another common span seen in most
Bevy apps when enabling tracing are Query::par_iter(_mut) related spans.

## Solution
Cache them in QueryState. The one downside to this is that we pay for
the memory for every Query(State) instantiated, not just those that are
used for parallel iteration, but this shouldn't be a significant cost
unless the app is creating hundreds of thousands of Query(State)s
regularly.

## Metrics
Tested against `cargo run --profile stress-test --features trace_tracy
--example many_cubes`. Yellow is this PR, red is main.

`sync_simple_transforms`:


![image](https://github.com/bevyengine/bevy/assets/3137680/d60f6d69-5586-4424-9d78-aac78992aacd)

`check_visibility`:


![image](https://github.com/bevyengine/bevy/assets/3137680/096a58d2-a330-4a32-b806-09cd524e6e15)

Full frame:


![image](https://github.com/bevyengine/bevy/assets/3137680/3b088cf8-9487-4bc7-a308-026e172d6672)
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually
result in even faster speedups once that's merged in.
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective
We cached system spans in bevyengine#9390, but another common span seen in most
Bevy apps when enabling tracing are Query::par_iter(_mut) related spans.

## Solution
Cache them in QueryState. The one downside to this is that we pay for
the memory for every Query(State) instantiated, not just those that are
used for parallel iteration, but this shouldn't be a significant cost
unless the app is creating hundreds of thousands of Query(State)s
regularly.

## Metrics
Tested against `cargo run --profile stress-test --features trace_tracy
--example many_cubes`. Yellow is this PR, red is main.

`sync_simple_transforms`:


![image](https://github.com/bevyengine/bevy/assets/3137680/d60f6d69-5586-4424-9d78-aac78992aacd)

`check_visibility`:


![image](https://github.com/bevyengine/bevy/assets/3137680/096a58d2-a330-4a32-b806-09cd524e6e15)

Full frame:


![image](https://github.com/bevyengine/bevy/assets/3137680/3b088cf8-9487-4bc7-a308-026e172d6672)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually
result in even faster speedups once that's merged in.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Diagnostics Logging, crash handling, error reporting and performance analysis A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants