Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logging overhead investigations #6046

Open
nical opened this issue Jul 25, 2024 · 8 comments
Open

Logging overhead investigations #6046

nical opened this issue Jul 25, 2024 · 8 comments

Comments

@nical
Copy link
Contributor

nical commented Jul 25, 2024

Context

Bevy folks measured that wgpu's trace-level logging when enabled at build time but disabled at run time can have a significant overhead (15%). This overhead comes from checking a global atomic variable that defines the log level.

In #6017, @Elabajaba proposes gating api logs and resource logs (trace level by default) behind a build flag.

I'm not fond of adding another feature flag for this, so I would like to explore other options first.

Disable trace level logs at build time

The most straightforward option is to note that trace level logging is the most verbose out of 5 logging levels. If the most verbose logging level is too verbose for a bevy application, then bevy could disable trace-level logging in optimized builds. This removes the performance concerns for bevy while letting its dependencies and bevy itself have a log level that has potentially verbose logs that are useful for development or for debugging particularly tricky issues. Firefox, for example, disable both trace and debug log levels in optimized builds by default for the same reason.

Debug builds being orders of magnitude slower than release builds, I believe it is safe to assume trace-level logging overhead to be negligible (unless someone has benchmarks to show otherwise).

I'm tempted to leave it at that. That said, 15% overhead is a lot and there should be ways to mitigate some of it without requiring more build configurations on the wgpu side.

Reduce the overhead of some of the logs

For context, wgpu's repository has a few benchmarks. Removing all trace-level logs yields improvements ranging from 0 to 5% on these benchmarks. This doesn't mean the that reducing the logging overhead will improve bevy by at most 5%, it only means that the highest improvement we can observe through these benchmarks is in that range. We'll have to scale these numbers a bit to transpose to other projects and getting some performance measurements from bevy itself would be valuable.

Resource logs

Resource logs only happen when creating and destroying API objects. For a typical application or game that's fairly rare (in the order of a few of times per frame). Not worth optimizing. Removing them entirely does not affect wgpu's benchmarks.

API logs

API logs can be very verbose. Most of the API log traffic in a typical frame would be in render and compute passes with one log per command (and games tend to submit a lot of commands per render pass). In #6010 I proposed to check the log level once at the beginning of the render pass instead of once per command. The idea was that checking a local variable instead of a global atomic at a high-ish frequency should be faster. I ran the wgpu's render pass benchmarks on two computes and the numbers (1) are puzzling. It is a mixed bag of improvements and regressions in the [-1.4% .. +1.7%] range.

Removing the api logs entirely was just as puzzling. The results (3) show either no changes or regressions.

Why? I don't know, I'm curious. It's easy to speculate but digging into it would take time. I ran the measurements multiple times on two computers and the results are consistent.

Other trace-level logs

There's a fair amount of trace-level logging outside of api and resource logs. Some of them might not be useful anymore and could be removed.

I'll open a PR to remove some of it and maintainers can chime in to "save" any log they care about. We already discussed keeping api, resource and initialization related logs.

Conclusions

Hard to draw conclusions at this point other that optimized builds are faster without trace-level logs, but that focusing on resource and api logs yields inconclusive results.

Measurements

Focusing on the render pass benchmarks here since other benchmarks were mostly not affected by the changes.

(1) Disable all trace-level logs

diff --git a/Cargo.lock b/Cargo.lock
index 8b4e604f4..19dfdc784 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4234,6 +4234,7 @@ dependencies = [
  "bincode",
  "bytemuck",
  "criterion",
+ "log",
  "naga",
  "nanorand",
  "once_cell",
diff --git a/benches/Cargo.toml b/benches/Cargo.toml
index 82207d510..a0391d838 100644
--- a/benches/Cargo.toml
+++ b/benches/Cargo.toml
@@ -44,3 +44,4 @@ profiling.workspace = true
 rayon.workspace = true
 tracy-client = { workspace = true, optional = true }
 wgpu = { workspace = true, features = ["wgsl", "metal", "dx12"] }
+log = { version = "0.4", features = ["release_max_level_debug"] }
[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
   Compiling log v0.4.22
   Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
   Compiling naga v22.0.0 (/home/nical/dev/rust/wgpu/naga)
   Compiling wgpu-hal v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-hal)
   Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
   Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
    Finished bench [optimized + debuginfo] target(s) in 29.22s
     Running benches/root.rs (target/release/deps/root-093d36038254a381)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [24.859 ms 24.907 ms 24.958 ms]
                        thrpt:  [400.68 Kelem/s 401.49 Kelem/s 402.26 Kelem/s]
                 change:
                        time:   [-1.8921% -1.5875% -1.3066%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3239% +1.6132% +1.9286%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) high mild
  4 (4.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [27.182 ms 27.212 ms 27.244 ms]
                        thrpt:  [367.05 Kelem/s 367.48 Kelem/s 367.89 Kelem/s]
                 change:
                        time:   [-5.2213% -4.7559% -4.2893%] (p = 0.00 < 0.05)
                        thrpt:  [+4.4815% +4.9934% +5.5089%]
                        Performance has improved.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [27.899 ms 27.992 ms 28.091 ms]
                        thrpt:  [355.99 Kelem/s 357.25 Kelem/s 358.44 Kelem/s]
                 change:
                        time:   [-3.4614% -3.0480% -2.6443%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7161% +3.1439% +3.5855%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [28.401 ms 28.489 ms 28.581 ms]
                        thrpt:  [349.88 Kelem/s 351.01 Kelem/s 352.10 Kelem/s]
                 change:
                        time:   [-0.9584% -0.5177% -0.0746%] (p = 0.02 < 0.05)
                        thrpt:  [+0.0746% +0.5204% +0.9677%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [2.1742 ms 2.1788 ms 2.1838 ms]
                        thrpt:  [4.5792 Melem/s 4.5896 Melem/s 4.5995 Melem/s]
                 change:
                        time:   [-2.8352% -2.5681% -2.3122%] (p = 0.00 < 0.05)
                        thrpt:  [+2.3669% +2.6358% +2.9179%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [3.6249 ms 3.6396 ms 3.6553 ms]
                        thrpt:  [2.7358 Melem/s 2.7475 Melem/s 2.7587 Melem/s]
                 change:
                        time:   [-1.4089% -0.9266% -0.4124%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4141% +0.9353% +1.4290%]
                        Change within noise threshold.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [4.6839 ms 4.6959 ms 4.7082 ms]
                        thrpt:  [2.1239 Melem/s 2.1295 Melem/s 2.1350 Melem/s]
                 change:
                        time:   [-1.7518% -1.3117% -0.8691%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8767% +1.3291% +1.7831%]
                        Change within noise threshold.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [6.2658 ms 6.2730 ms 6.2807 ms]
                        thrpt:  [1.5922 Melem/s 1.5941 Melem/s 1.5960 Melem/s]
                 change:
                        time:   [-1.7147% -1.4655% -1.2216%] (p = 0.00 < 0.05)
                        thrpt:  [+1.2367% +1.4873% +1.7446%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

(2) Reduce the overhead of api logs in render passes

By checking the global log level once per render pass instead of once per command (#6010).

[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
   Compiling naga v22.0.0 (/home/nical/dev/rust/wgpu/naga)
   Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
   Compiling wgpu-hal v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-hal)
   Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
   Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
    Finished bench [optimized + debuginfo] target(s) in 29.31s
     Running benches/root.rs (target/release/deps/root-c662ca2c1cfd646e)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [25.429 ms 25.497 ms 25.564 ms]
                        thrpt:  [391.17 Kelem/s 392.21 Kelem/s 393.25 Kelem/s]
                 change:
                        time:   [+1.4530% +1.7563% +2.0686%] (p = 0.00 < 0.05)
                        thrpt:  [-2.0267% -1.7260% -1.4321%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  11 (11.00%) high mild
  2 (2.00%) high severe
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [27.644 ms 27.655 ms 27.669 ms]
                        thrpt:  [361.42 Kelem/s 361.60 Kelem/s 361.75 Kelem/s]
                 change:
                        time:   [-1.7414% -1.4481% -1.1609%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1745% +1.4694% +1.7723%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [28.426 ms 28.475 ms 28.528 ms]
                        thrpt:  [350.54 Kelem/s 351.18 Kelem/s 351.79 Kelem/s]
                 change:
                        time:   [-0.3652% -0.0600% +0.2200%] (p = 0.68 > 0.05)
                        thrpt:  [-0.2195% +0.0601% +0.3666%]
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [28.222 ms 28.261 ms 28.318 ms]
                        thrpt:  [353.13 Kelem/s 353.84 Kelem/s 354.34 Kelem/s]
                 change:
                        time:   [+0.8179% +0.9914% +1.1953%] (p = 0.00 < 0.05)
                        thrpt:  [-1.1811% -0.9817% -0.8112%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [2.1659 ms 2.1688 ms 2.1720 ms]
                        thrpt:  [4.6041 Melem/s 4.6108 Melem/s 4.6171 Melem/s]
                 change:
                        time:   [+1.5074% +1.6509% +1.8064%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7743% -1.6241% -1.4850%]
                        Performance has regressed.
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [3.5947 ms 3.6027 ms 3.6115 ms]
                        thrpt:  [2.7689 Melem/s 2.7757 Melem/s 2.7819 Melem/s]
                 change:
                        time:   [+0.1382% +0.5272% +0.8838%] (p = 0.01 < 0.05)
                        thrpt:  [-0.8760% -0.5245% -0.1381%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [4.7490 ms 4.7583 ms 4.7677 ms]
                        thrpt:  [2.0975 Melem/s 2.1016 Melem/s 2.1057 Melem/s]
                 change:
                        time:   [+1.3025% +1.6018% +1.8908%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8557% -1.5765% -1.2857%]
                        Performance has regressed.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [6.3299 ms 6.3500 ms 6.3718 ms]
                        thrpt:  [1.5694 Melem/s 1.5748 Melem/s 1.5798 Melem/s]
                 change:
                        time:   [+0.0756% +0.4846% +0.8959%] (p = 0.03 < 0.05)
                        thrpt:  [-0.8879% -0.4822% -0.0755%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

(3) Remove api logs at build time

diff --git a/wgpu-core/src/lib.rs b/wgpu-core/src/lib.rs
index c46a8f103..e563a4407 100644
--- a/wgpu-core/src/lib.rs
+++ b/wgpu-core/src/lib.rs
@@ -320,7 +320,7 @@ macro_rules! api_log {
 }
 #[cfg(not(feature = "api_log_info"))]
 macro_rules! api_log {
-    ($($arg:tt)+) => (log::trace!($($arg)+))
+    ($($arg:tt)+) => ()
 }
[nical](~/dev/rust/wgpu)$ cargo bench -p wgpu-benchmark -- renderpass
   Compiling wgpu-core v22.0.0 (/home/nical/dev/rust/wgpu/wgpu-core)
   Compiling wgpu v22.0.0 (/home/nical/dev/rust/wgpu/wgpu)
   Compiling wgpu-benchmark v22.0.0 (/home/nical/dev/rust/wgpu/benches)
    Finished bench [optimized + debuginfo] target(s) in 20.61s
     Running benches/root.rs (target/release/deps/root-c662ca2c1cfd646e)
Gnuplot not found, using plotters backend
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon Pro WX 3200 Series (RADV POLARIS12)", vendor: 4098, device: 27009, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2", backend: Vulkan }
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Collecting 100 samples in estRenderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [25.254 ms 25.309 ms 25.367 ms]
                        thrpt:  [394.21 Kelem/s 395.12 Kelem/s 395.98 Kelem/s]
                 change:
                        time:   [-1.0935% -0.7374% -0.4044%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4061% +0.7428% +1.1056%]
                        Change within noise threshold.
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [28.435 ms 28.571 ms 28.711 ms]
                        thrpt:  [348.30 Kelem/s 350.00 Kelem/s 351.68 Kelem/s]
                 change:
                        time:   [+2.8122% +3.3149% +3.8114%] (p = 0.00 < 0.05)
                        thrpt:  [-3.6715% -3.2086% -2.7353%]
                        Performance has regressed.
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [28.811 ms 28.872 ms 28.933 ms]
                        thrpt:  [345.63 Kelem/s 346.36 Kelem/s 347.09 Kelem/s]
                 change:
                        time:   [+1.1175% +1.3940% +1.6723%] (p = 0.00 < 0.05)
                        thrpt:  [-1.6448% -1.3748% -1.1051%]
                        Performance has regressed.
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time): Collecting 100 samples in estiRenderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [28.552 ms 28.637 ms 28.726 ms]
                        thrpt:  [348.12 Kelem/s 349.19 Kelem/s 350.24 Kelem/s]
                 change:
                        time:   [+0.9958% +1.3312% +1.6786%] (p = 0.00 < 0.05)
                        thrpt:  [-1.6509% -1.3138% -0.9859%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time): Collecting 100 samples in estimatRenderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [2.2331 ms 2.2363 ms 2.2396 ms]
                        thrpt:  [4.4651 Melem/s 4.4718 Melem/s 4.4780 Melem/s]
                 change:
                        time:   [+2.8865% +3.1087% +3.3260%] (p = 0.00 < 0.05)
                        thrpt:  [-3.2189% -3.0150% -2.8055%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [3.6632 ms 3.6737 ms 3.6845 ms]
                        thrpt:  [2.7140 Melem/s 2.7221 Melem/s 2.7299 Melem/s]
                 change:
                        time:   [+1.5783% +1.9687% +2.3443%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2906% -1.9307% -1.5537%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [4.7413 ms 4.7583 ms 4.7764 ms]
                        thrpt:  [2.0936 Melem/s 2.1016 Melem/s 2.1091 Melem/s]
                 change:
                        time:   [-0.4009% +0.0016% +0.4750%] (p = 0.97 > 0.05)
                        thrpt:  [-0.4728% -0.0016% +0.4025%]
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high mild
Benchmarking Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time): Collecting 100 samples in estimateRenderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [6.3527 ms 6.3663 ms 6.3800 ms]
                        thrpt:  [1.5674 Melem/s 1.5708 Melem/s 1.5741 Melem/s]
                 change:
                        time:   [-0.1435% +0.2569% +0.6557%] (p = 0.21 > 0.05)
                        thrpt:  [-0.6514% -0.2562% +0.1437%]
                        No change in performance detected.
@Elabajaba
Copy link
Contributor

Elabajaba commented Jul 29, 2024

I ran the renderpass benchmarks through superluminal (with [profile.bench] debug=1 in Cargo.toml), and all logging seems to be completely disabled during them, so I'm not sure how useful they are for diagnosing wgpu's logging overhead.

@Wumpf
Copy link
Member

Wumpf commented Jul 30, 2024

and all logging seems to be completely disabled during them

they are still compiled in though? I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered (either displayed on console or written to disk). Log rendering performance depends on what logger to use (e.g. env_logger) and what target (some terminals are more horribly slow than others). I.e. we want to measure the overhead of the the calls to/interaction with the log crate and this should be included in here, yes?

@nical
Copy link
Contributor Author

nical commented Jul 30, 2024

I checked that they are indeed compiled in. They should be useful if the workload is a good enough proxy for a game (so it would still be useful to check using bevy itself).

I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered

That is my understanding as well.

@Elabajaba
Copy link
Contributor

I assumed so far that the entire logging performance discussion is on the premise that most logs are not rendered

All the bevy performance results have been with only info level logs enabled, and with wgpu's logs filtered out, so there's almost no logging output (basically just device info at the start, and maybe outputting an fps counter once a second).

@nical
Copy link
Contributor Author

nical commented Jul 31, 2024

All the bevy performance results have been with only info level logs enabled

Assuming you mean info level enabled at runtime (log::max_level() >= Info) and none of the log levels filtered out at compile time (log::STATIC_MAX_LEVEL == Trace). Sounds like we are on the same page. That matches wgpu's benchmark configuration as far as trace-level logs are concerned at least.

@nical nical mentioned this issue Jul 31, 2024
4 tasks
@nical
Copy link
Contributor Author

nical commented Aug 5, 2024

A lot of logging was removed in #6065, so it'll be important to pick a recent revision of wgpu for new measurements. Turns out d3d12 had a fair amount more logging than the other backends so it might explain discrepancies between results measured by different people (I've been benchmarking the vulkan backend on linux so far). Backends should produce similar more similar volume of logs now.

@Imberflur
Copy link
Contributor

Imberflur commented Aug 25, 2024

I have previously observed that logs which are ultimately filtered out but which aren't filtered out by the global log::max_level() can be much more expensive. The significant overhead of filtered out logs seen with bevy led me to suspect this was happening.

The benchmarks in wgpu don't include a logger, so the max_level will default to OFF and should filter out all logs fairly efficiently. Since I suspected the significant overhead might be due to missing this max_level filtering, I sought to confirm this by reproducing the logging configuration from the bevy report. However, it wasn't clear too me what the logging configuration was there. So my best guess was to use bevy's current default logging setup and see what that looks like.

I took https://github.com/bevyengine/bevy/blob/0070bdccd873e186a9ccec3ecb6f1d3d83b8710b/crates/bevy_log/src/lib.rs, removed the irrelevant portions, and simplified it to:

use tracing_subscriber::{prelude::*, registry::Registry, EnvFilter};

let filter = "wgpu=error,naga=warn".to_string();
let level = tracing::Level::INFO;
let default_filter = { format!("{},{}", level, filter) };
let filter_layer = EnvFilter::try_from_default_env()
    .or_else(|_| EnvFilter::try_new(&default_filter))
    .unwrap();

let fmt_layer = tracing_subscriber::fmt::Layer::default().with_writer(std::io::stderr);

let subscriber = Registry::default().with(filter_layer).with(fmt_layer);

tracing_log::LogTracer::init().unwrap();
tracing::subscriber::set_global_default(subscriber).unwrap();

Notably, LogTracer::init() is the interface between log and tracing and it isn't configured with a max level (this can be derived from the filter). Ignoring cases where the user's filter selectively toggles on trace for something unrelated (which would set the global max level to that), I suspect fixing this particular aspect would significantly alleviate the issue.

Benchmarks

To test this out, I did some benchmarks starting with the commit before the log cleanup in #6065

Using:

cargo bench -p wgpu-benchmark -- renderpass

(0) 7ff80d6

(1) 7ff80d6 with logging setup to match bevy defaults (compared against 0):

diff
diff --git a/Cargo.lock b/Cargo.lock
index 2dbf69ee7..91b650830 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3661,9 +3661,21 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "c3523ab5a71916ccf420eebdf5521fcef02141234bbc0b8a49f2fdc4544364ef"
 dependencies = [
  "pin-project-lite",
+ "tracing-attributes",
  "tracing-core",
 ]

+[[package]]
+name = "tracing-attributes"
+version = "0.1.27"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "34704c8d6ebcbc939824180af020566b01a7c01f80641264eba0999f6c2b6be7"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn 2.0.72",
+]
+
 [[package]]
 name = "tracing-core"
 version = "0.1.32"
@@ -4251,6 +4263,9 @@ dependencies = [
  "pollster",
  "profiling",
  "rayon",
+ "tracing",
+ "tracing-log",
+ "tracing-subscriber",
  "tracy-client",
  "wgpu",
 ]
diff --git a/benches/Cargo.toml b/benches/Cargo.toml
index 82207d510..75745579b 100644
--- a/benches/Cargo.toml
+++ b/benches/Cargo.toml
@@ -42,5 +42,8 @@ once_cell.workspace = true
 pollster.workspace = true
 profiling.workspace = true
 rayon.workspace = true
+tracing = "0.1.40"
+tracing-log = "0.2.0"
+tracing-subscriber = { version = "0.3.18", features = ["env-filter"] }
 tracy-client = { workspace = true, optional = true }
 wgpu = { workspace = true, features = ["wgsl", "metal", "dx12"] }
diff --git a/benches/benches/root.rs b/benches/benches/root.rs
index 064617783..c4aa3529c 100644
--- a/benches/benches/root.rs
+++ b/benches/benches/root.rs
@@ -6,6 +6,24 @@ mod renderpass;
 mod resource_creation;
 mod shader;

+fn init_logging() {
+    use tracing_subscriber::{prelude::*, registry::Registry, EnvFilter};
+
+    let filter = "wgpu=error,naga=warn".to_string();
+    let level = tracing::Level::INFO;
+    let default_filter = { format!("{},{}", level, filter) };
+    let filter_layer = EnvFilter::try_from_default_env()
+        .or_else(|_| EnvFilter::try_new(&default_filter))
+        .unwrap();
+
+    let fmt_layer = tracing_subscriber::fmt::Layer::default().with_writer(std::io::stderr);
+
+    let subscriber = Registry::default().with(filter_layer).with(fmt_layer);
+
+    tracing_log::LogTracer::init().unwrap();
+    tracing::subscriber::set_global_default(subscriber).unwrap();
+}
+
 struct DeviceState {
     adapter_info: wgpu::AdapterInfo,
     device: wgpu::Device,
@@ -14,6 +32,7 @@ struct DeviceState {

 impl DeviceState {
     fn new() -> Self {
+        init_logging();
         #[cfg(feature = "tracy")]
         tracy_client::Client::start();
Bench results
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [19.018 ms 19.237 ms 19.463 ms]
                        thrpt:  [513.79 Kelem/s 519.84 Kelem/s 525.83 Kelem/s]
                 change:
                        time:   [+4.0045% +5.6972% +7.4766%] (p = 0.00 < 0.05)
                        thrpt:  [-6.9565% -5.3901% -3.8503%]
                        Performance has regressed.
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [19.151 ms 19.368 ms 19.608 ms]
                        thrpt:  [510.01 Kelem/s 516.32 Kelem/s 522.18 Kelem/s]
                 change:
                        time:   [+0.2567% +1.9933% +3.6663%] (p = 0.02 < 0.05)
                        thrpt:  [-3.5366% -1.9544% -0.2560%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [21.326 ms 21.612 ms 21.909 ms]
                        thrpt:  [456.44 Kelem/s 462.72 Kelem/s 468.92 Kelem/s]
                 change:
                        time:   [+6.1976% +7.9215% +9.7927%] (p = 0.00 < 0.05)
                        thrpt:  [-8.9192% -7.3400% -5.8359%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [23.281 ms 23.517 ms 23.763 ms]
                        thrpt:  [420.82 Kelem/s 425.23 Kelem/s 429.53 Kelem/s]
                 change:
                        time:   [+3.1529% +4.6281% +6.1486%] (p = 0.00 < 0.05)
                        thrpt:  [-5.7925% -4.4234% -3.0566%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [1.3855 ms 1.3922 ms 1.4008 ms]
                        thrpt:  [7.1390 Melem/s 7.1830 Melem/s 7.2174 Melem/s]
                 change:
                        time:   [-3.5067% -2.4237% -1.3786%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3979% +2.4839% +3.6342%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [2.1364 ms 2.1495 ms 2.1654 ms]
                        thrpt:  [4.6180 Melem/s 4.6524 Melem/s 4.6807 Melem/s]
                 change:
                        time:   [-4.1807% -3.4251% -2.6688%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7420% +3.5466% +4.3631%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [2.9378 ms 2.9557 ms 2.9753 ms]
                        thrpt:  [3.3609 Melem/s 3.3833 Melem/s 3.4040 Melem/s]
                 change:
                        time:   [-2.7514% -1.7341% -0.8381%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8452% +1.7647% +2.8292%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [4.1497 ms 4.1708 ms 4.1995 ms]
                        thrpt:  [2.3812 Melem/s 2.3976 Melem/s 2.4098 Melem/s]
                 change:
                        time:   [-0.7302% -0.0654% +0.7020%] (p = 0.87 > 0.05)
                        thrpt:  [-0.6971% +0.0654% +0.7356%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

(2) 1 with max level passed to LogTracer (compared against 1):

NOTE: The benchmarks varied a bit for me and I didn't put any effort into creating a stable benchmarking environment. I.E. I've see differences as significant as -10% in other runs comparing 1 vs 2.

diff
diff --git a/benches/benches/root.rs b/benches/benches/root.rs
index c4aa3529c..d3cea8afe 100644
--- a/benches/benches/root.rs
+++ b/benches/benches/root.rs
@@ -7,7 +7,8 @@ mod resource_creation;
 mod shader;

 fn init_logging() {
-    use tracing_subscriber::{prelude::*, registry::Registry, EnvFilter};
+    use tracing_log::AsLog;
+    use tracing_subscriber::{filter::LevelFilter, prelude::*, registry::Registry, EnvFilter};

     let filter = "wgpu=error,naga=warn".to_string();
     let level = tracing::Level::INFO;
@@ -15,12 +16,13 @@ fn init_logging() {
     let filter_layer = EnvFilter::try_from_default_env()
         .or_else(|_| EnvFilter::try_new(&default_filter))
         .unwrap();
+    let max_level = filter_layer.max_level_hint().unwrap_or(LevelFilter::TRACE);

     let fmt_layer = tracing_subscriber::fmt::Layer::default().with_writer(std::io::stderr);

     let subscriber = Registry::default().with(filter_layer).with(fmt_layer);

-    tracing_log::LogTracer::init().unwrap();
+    tracing_log::LogTracer::init_with_filter(max_level.as_log()).unwrap();
     tracing::subscriber::set_global_default(subscriber).unwrap();
 }
Bench results
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon RX 6500 XT (RADV NAVI24)", vendor: 4098, device: 29759, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2-arch1.1", backend: Vulkan }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [17.957 ms 18.197 ms 18.443 ms]
                        thrpt:  [542.22 Kelem/s 549.53 Kelem/s 556.90 Kelem/s]
                 change:
                        time:   [-7.0169% -5.4019% -3.8395%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9928% +5.7104% +7.5464%]
                        Performance has improved.
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [18.031 ms 18.242 ms 18.463 ms]
                        thrpt:  [541.63 Kelem/s 548.19 Kelem/s 554.59 Kelem/s]
                 change:
                        time:   [-7.4238% -5.8135% -4.2781%] (p = 0.00 < 0.05)
                        thrpt:  [+4.4692% +6.1723% +8.0191%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [19.707 ms 19.954 ms 20.217 ms]
                        thrpt:  [494.64 Kelem/s 501.16 Kelem/s 507.42 Kelem/s]
                 change:
                        time:   [-9.3162% -7.6710% -5.9887%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3702% +8.3083% +10.273%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [21.766 ms 22.014 ms 22.280 ms]
                        thrpt:  [448.83 Kelem/s 454.25 Kelem/s 459.43 Kelem/s]
                 change:
                        time:   [-7.7734% -6.3879% -4.9644%] (p = 0.00 < 0.05)
                        thrpt:  [+5.2238% +6.8237% +8.4286%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [1.4572 ms 1.4641 ms 1.4723 ms]
                        thrpt:  [6.7922 Melem/s 6.8302 Melem/s 6.8626 Melem/s]
                 change:
                        time:   [+4.3357% +5.1656% +6.0088%] (p = 0.00 < 0.05)
                        thrpt:  [-5.6682% -4.9119% -4.1555%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [2.1512 ms 2.1587 ms 2.1686 ms]
                        thrpt:  [4.6114 Melem/s 4.6324 Melem/s 4.6486 Melem/s]
                 change:
                        time:   [-0.3865% +0.4300% +1.2540%] (p = 0.30 > 0.05)
                        thrpt:  [-1.2385% -0.4282% +0.3880%]
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [3.0014 ms 3.0118 ms 3.0243 ms]
                        thrpt:  [3.3065 Melem/s 3.3202 Melem/s 3.3318 Melem/s]
                 change:
                        time:   [+1.1392% +1.8993% +2.6694%] (p = 0.00 < 0.05)
                        thrpt:  [-2.6000% -1.8639% -1.1264%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [4.2125 ms 4.2370 ms 4.2632 ms]
                        thrpt:  [2.3456 Melem/s 2.3602 Melem/s 2.3739 Melem/s]
                 change:
                        time:   [+0.6454% +1.5873% +2.4859%] (p = 0.00 < 0.05)
                        thrpt:  [-2.4256% -1.5625% -0.6413%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

(3) 9c6ae1b with changes from 1 (compared against 1)

Here we look at the impact of the log cleanup when logs aren't being filtered out immediately using the max level.

Bench results
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon RX 6500 XT (RADV NAVI24)", vendor
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [17.930 ms 18.136 ms 18.342 ms]
                        thrpt:  [545.21 Kelem/s 551.40 Kelem/s 557.74 Kelem/s]
                 change:
                        time:   [-7.8911% -6.4672% -5.1828%] (p = 0.00 < 0.05)
                        thrpt:  [+5.4661% +6.9144% +8.5671%]
                        Performance has improved.
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [18.663 ms 18.876 ms 19.093 ms]
                        thrpt:  [523.74 Kelem/s 529.76 Kelem/s 535.81 Kelem/s]
                 change:
                        time:   [-10.380% -8.7941% -7.3251%] (p = 0.00 < 0.05)
                        thrpt:  [+7.9041% +9.6421% +11.583%]
                        Performance has improved.
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [19.739 ms 19.983 ms 20.241 ms]
                        thrpt:  [494.05 Kelem/s 500.44 Kelem/s 506.62 Kelem/s]
                 change:
                        time:   [-11.147% -9.8038% -8.4584%] (p = 0.00 < 0.05)
                        thrpt:  [+9.2400% +10.869% +12.546%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [22.674 ms 22.884 ms 23.099 ms]
                        thrpt:  [432.92 Kelem/s 436.98 Kelem/s 441.04 Kelem/s]
                 change:
                        time:   [-8.4276% -7.2068% -5.9400%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3151% +7.7665% +9.2032%]
                        Performance has improved.
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [1.2765 ms 1.2800 ms 1.2835 ms]
                        thrpt:  [7.7910 Melem/s 7.8126 Melem/s 7.8340 Melem/s]
                 change:
                        time:   [-11.257% -10.572% -9.9545%] (p = 0.00 < 0.05)
                        thrpt:  [+11.055% +11.822% +12.685%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [2.2134 ms 2.2244 ms 2.2367 ms]
                        thrpt:  [4.4709 Melem/s 4.4956 Melem/s 4.5180 Melem/s]
                 change:
                        time:   [-6.9191% -5.7771% -4.6521%] (p = 0.00 < 0.05)
                        thrpt:  [+4.8791% +6.1313% +7.4334%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [3.0021 ms 3.0148 ms 3.0274 ms]
                        thrpt:  [3.3032 Melem/s 3.3170 Melem/s 3.3310 Melem/s]
                 change:
                        time:   [-0.6289% +0.0223% +0.6818%] (p = 0.95 > 0.05)
                        thrpt:  [-0.6772% -0.0223% +0.6329%]
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [4.3580 ms 4.3786 ms 4.4017 ms]
                        thrpt:  [2.2719 Melem/s 2.2838 Melem/s 2.2946 Melem/s]
                 change:
                        time:   [+3.9027% +4.7068% +5.4975%] (p = 0.00 < 0.05)
                        thrpt:  [-5.2110% -4.4953% -3.7561%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

(4) 1 with trace level logs statically disabled (compared against 1)

diff
diff --git a/Cargo.lock b/Cargo.lock
index 91b650830..2bee5ccd0 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4257,6 +4257,7 @@ dependencies = [
  "bincode",
  "bytemuck",
  "criterion",
+ "log",
  "naga",
  "nanorand",
  "once_cell",
diff --git a/benches/Cargo.toml b/benches/Cargo.toml
index 75745579b..848faa22d 100644
--- a/benches/Cargo.toml
+++ b/benches/Cargo.toml
@@ -47,3 +47,4 @@ tracing-log = "0.2.0"
 tracing-subscriber = { version = "0.3.18", features = ["env-filter"] }
 tracy-client = { workspace = true, optional = true }
 wgpu = { workspace = true, features = ["wgsl", "metal", "dx12"] }
+log = { version = "0.4", features = ["release_max_level_debug"] }
Bench results
Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "AMD Radeon RX 6500 XT (RADV NAVI24)", vendor: 4098, device: 29759, device_type: DiscreteGpu, driver: "radv", driver_info: "Mesa 24.1.2-arch1.1", backend: Vulkan }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [16.526 ms 16.722 ms 16.925 ms]
                        thrpt:  [590.83 Kelem/s 598.02 Kelem/s 605.12 Kelem/s]
                 change:
                        time:   [-15.560% -14.270% -12.828%] (p = 0.00 < 0.05)
                        thrpt:  [+14.716% +16.646% +18.428%]
                        Performance has improved.
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [16.774 ms 16.978 ms 17.194 ms]
                        thrpt:  [581.60 Kelem/s 589.01 Kelem/s 596.17 Kelem/s]
                 change:
                        time:   [-17.401% -16.105% -14.695%] (p = 0.00 < 0.05)
                        thrpt:  [+17.226% +19.196% +21.067%]
                        Performance has improved.
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [18.891 ms 19.123 ms 19.354 ms]
                        thrpt:  [516.70 Kelem/s 522.94 Kelem/s 529.34 Kelem/s]
                 change:
                        time:   [-14.086% -12.715% -11.318%] (p = 0.00 < 0.05)
                        thrpt:  [+12.763% +14.568% +16.396%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [19.948 ms 20.128 ms 20.317 ms]
                        thrpt:  [492.21 Kelem/s 496.83 Kelem/s 501.29 Kelem/s]
                 change:
                        time:   [-17.670% -16.499% -15.353%] (p = 0.00 < 0.05)
                        thrpt:  [+18.138% +19.760% +21.463%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [1.2535 ms 1.2583 ms 1.2637 ms]
                        thrpt:  [7.9135 Melem/s 7.9473 Melem/s 7.9775 Melem/s]
                 change:
                        time:   [-10.514% -10.062% -9.5665%] (p = 0.00 < 0.05)
                        thrpt:  [+10.579% +11.188% +11.750%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [2.0902 ms 2.1048 ms 2.1228 ms]
                        thrpt:  [4.7107 Melem/s 4.7510 Melem/s 4.7843 Melem/s]
                 change:
                        time:   [-2.1637% -1.2816% -0.2271%] (p = 0.01 < 0.05)
                        thrpt:  [+0.2276% +1.2982% +2.2115%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [2.9169 ms 2.9393 ms 2.9661 ms]
                        thrpt:  [3.3714 Melem/s 3.4021 Melem/s 3.4283 Melem/s]
                 change:
                        time:   [-1.6563% -0.8193% +0.0956%] (p = 0.10 > 0.05)
                        thrpt:  [-0.0955% +0.8260% +1.6842%]
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [4.1185 ms 4.1365 ms 4.1571 ms]
                        thrpt:  [2.4055 Melem/s 2.4175 Melem/s 2.4281 Melem/s]
                 change:
                        time:   [-2.0112% -1.0797% -0.1977%] (p = 0.02 < 0.05)
                        thrpt:  [+0.1981% +1.0915% +2.0525%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

I think it would be interesting to see the impact of the change from 2 in bevy's many_cubes example and how it compares to statically disabling trace level logging.

Going back to the topic of selectively enabling trace level logs. In my previous investigations, tracing's event generation handles this a lot better than log because the filtering result is cached in a per call site static. I think it would be worth considering the use of tracing in wgpu to mitigate the performance impact of enabling trace level logs for unrelated code. I noticed that tracing_log has a interest cache that users can enable to do some caching of the filtering. I haven't experimented with this, but since that has to occur behind a function call and is otherwise limited by not being part of the logging macro, I suspect there is still some noticeable overhead compared to using tracing directly.


Debug builds being orders of magnitude slower than release builds, I believe it is safe to assume trace-level logging overhead to be negligible (unless someone has benchmarks to show otherwise).

In my experience, the typical slowness of the default debug build configuration leads to enabling optimizations for the debug/development profile to make things actually usable. So I don't think the slowness of default debug builds can be relied on to make the logging overhead negligible.

@nical
Copy link
Contributor Author

nical commented Aug 26, 2024

Thanks @Imberflur! Sounds like there is some digging to do on the bevy side to configure the tracing/logging in the most efficient way.

I'm operating under the assumption that overhead associated with logs that are filtered out by log::max_level() are wgpu's responsibility, but if log::max_level() is not configured to filter out trace-level logs, then the associated overhead should be fixable by the downstream crate/app.

I'll rebase #6010 and see if it makes a difference on top of your init_logging function (with LogTracer::init_with_filter(tracing_log::log::LevelFilter::Info)).

Going back to the topic of selectively enabling trace level logs. In my previous investigations, tracing's event generation handles this a lot better than log because the filtering result is cached in a per call site static. I think it would be worth considering the use of tracing in wgpu to mitigate the performance impact of enabling trace level logs for unrelated code.

Interesting. @ErichDonGubler recently suggested switching to tracing because it has more bells and whistles. The other maintainers felt that wgpu didn't need the extra capability, but if we get free performance out of switching, then it sounds like it would make sense to do so.

bushrat011899 added a commit to bushrat011899/bevy that referenced this issue Aug 30, 2024
Previous misunderstanding lead to an incorrect *Performance* section implying the log filter string was the cause of performance loss. Rather, there is performance to be _gained_ by using a compile time feature of the `log` crate to eliminate logging calls entirely. Updated information is based on the [`log` documentation](https://docs.rs/log/#compile-time-filters) and benchmark 4 from [this GitHub comment](gfx-rs/wgpu#6046 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

4 participants