Compute pass benchmark #5767

Merged
merged 12 commits into from
Jul 14, 2024

@Wumpf Wumpf commented Jun 2, 2024

Connections

Description
Adds a benchmark for compute pass recording, very similar to what we have for render passes.

The prime motivation for this was to figure out whether the extensive changes I made to compute pass recording made performance worse or better; there are good reasons to expect either. The short answer: pass time improved by 4-10% compared to before I started!! 🥳
Even better, including submit time the improvements are 10-30%, but this is very likely not associated with the compute pass recording refactors :)

Unfortunately, those changes landed over quite a long period of time, so unless someone bisects this carefully we won't know exactly what caused it. It could be that the "fully consume the pass" change caused these improvements (we now make use of the fact that a pass can't be submitted twice), but then again this is probably a wash: before the compute pass lifetime refactor work started, a compute pass was a very simple data structure, whereas now it has extensive resource ownership. So it's just as likely that something else caused this.
For this comparison, I backported the benchmarks to c1291bd. To check it out yourself, use the before-computepass-work-with-benches branch on my fork.
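To illustrate what "fully consume the pass" can mean in Rust terms, here is a minimal sketch. It is not wgpu's actual implementation: `CommandEncoder`, `ComputePass`, `finish`, and `record_example` are hypothetical stand-ins. The idea it shows is that a method taking `self` by value makes a second submission of the same pass a compile error, which in turn lets the pass's contents be moved out rather than copied.

```rust
// Hypothetical stand-ins for illustration only; these are NOT wgpu's types.
#[derive(Debug, Default)]
struct CommandEncoder {
    // Commands recorded so far, kept as plain strings for this sketch.
    commands: Vec<String>,
}

struct ComputePass {
    // Workgroup counts of each recorded dispatch.
    dispatches: Vec<(u32, u32, u32)>,
}

impl ComputePass {
    fn new() -> Self {
        ComputePass { dispatches: Vec::new() }
    }

    fn dispatch(&mut self, x: u32, y: u32, z: u32) {
        self.dispatches.push((x, y, z));
    }

    // Takes `self` by value: the pass is moved into this call and cannot be
    // used afterwards, so it can never be finished twice.
    fn finish(self, encoder: &mut CommandEncoder) {
        // Because the pass is owned here, its contents are moved out
        // instead of being cloned.
        for (x, y, z) in self.dispatches {
            encoder.commands.push(format!("dispatch({x}, {y}, {z})"));
        }
    }
}

fn record_example() -> CommandEncoder {
    let mut encoder = CommandEncoder::default();
    let mut pass = ComputePass::new();
    pass.dispatch(1, 1, 1);
    pass.dispatch(4, 2, 1);
    pass.finish(&mut encoder);
    // pass.finish(&mut encoder); // would not compile: use of moved value `pass`
    encoder
}

fn main() {
    let encoder = record_example();
    println!("{:?}", encoder.commands);
}
```

Whether this kind of ownership transfer is actually where the measured gains come from is exactly the open question above; the sketch only shows the mechanism.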

Raw results comparing c1291bd1312a77be73954856d0e7728877232033 against this branch:

Computepass: Single Threaded/1 computepasses x 10000 dispatches (Computepass Time)
                        time:   [18.441 ms 18.719 ms 19.010 ms]
                        thrpt:  [526.03 Kelem/s 534.23 Kelem/s 542.28 Kelem/s]
                 change:
                        time:   [-6.1982% -4.3270% -2.4471%] (p = 0.00 < 0.05)
                        thrpt:  [+2.5085% +4.5227% +6.6077%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Computepass: Single Threaded/2 computepasses x 5000 dispatches (Computepass Time)
                        time:   [18.392 ms 18.560 ms 18.735 ms]
                        thrpt:  [533.77 Kelem/s 538.80 Kelem/s 543.73 Kelem/s]
                 change:
                        time:   [-8.6884% -7.5122% -6.2705%] (p = 0.00 < 0.05)
                        thrpt:  [+6.6900% +8.1224% +9.5151%]
                        Performance has improved.
Computepass: Single Threaded/4 computepasses x 2500 dispatches (Computepass Time)
                        time:   [19.154 ms 19.341 ms 19.535 ms]
                        thrpt:  [511.89 Kelem/s 517.04 Kelem/s 522.08 Kelem/s]
                 change:
                        time:   [-13.050% -11.257% -9.5528%] (p = 0.00 < 0.05)
                        thrpt:  [+10.562% +12.685% +15.008%]
                        Performance has improved.
Computepass: Single Threaded/8 computepasses x 1250 dispatches (Computepass Time)
                        time:   [20.198 ms 20.400 ms 20.610 ms]
                        thrpt:  [485.20 Kelem/s 490.21 Kelem/s 495.10 Kelem/s]
                 change:
                        time:   [-10.854% -9.1939% -7.4321%] (p = 0.00 < 0.05)
                        thrpt:  [+8.0288% +10.125% +12.176%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Computepass: Single Threaded/1 computepasses x 10000 dispatches (Submit Time)
                        time:   [10.087 ms 10.181 ms 10.281 ms]
                        thrpt:  [972.70 Kelem/s 982.18 Kelem/s 991.37 Kelem/s]
                 change:
                        time:   [-35.718% -34.659% -33.555%] (p = 0.00 < 0.05)
                        thrpt:  [+50.501% +53.043% +55.564%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Computepass: Single Threaded/2 computepasses x 5000 dispatches (Submit Time)
                        time:   [11.028 ms 11.129 ms 11.234 ms]
                        thrpt:  [890.17 Kelem/s 898.55 Kelem/s 906.79 Kelem/s]
                 change:
                        time:   [-32.267% -31.091% -29.847%] (p = 0.00 < 0.05)
                        thrpt:  [+42.546% +45.120% +47.638%]
                        Performance has improved.
Computepass: Single Threaded/4 computepasses x 2500 dispatches (Submit Time)
                        time:   [12.368 ms 12.456 ms 12.545 ms]
                        thrpt:  [797.11 Kelem/s 802.85 Kelem/s 808.52 Kelem/s]
                 change:
                        time:   [-28.125% -27.134% -26.125%] (p = 0.00 < 0.05)
                        thrpt:  [+35.363% +37.239% +39.131%]
                        Performance has improved.
Computepass: Single Threaded/8 computepasses x 1250 dispatches (Submit Time)
                        time:   [13.707 ms 13.818 ms 13.936 ms]
                        thrpt:  [717.56 Kelem/s 723.68 Kelem/s 729.57 Kelem/s]
                 change:
                        time:   [-24.102% -23.164% -22.189%] (p = 0.00 < 0.05)
                        thrpt:  [+28.516% +30.147% +31.756%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Computepass: Multi Threaded/2 threads x 5000 dispatch
                        time:   [9.8718 ms 9.9380 ms 10.016 ms]
                        thrpt:  [998.43 Kelem/s 1.0062 Melem/s 1.0130 Melem/s]
                 change:
                        time:   [-9.8552% -8.8156% -7.7884%] (p = 0.00 < 0.05)
                        thrpt:  [+8.4462% +9.6678% +10.933%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
Computepass: Multi Threaded/4 threads x 2500 dispatch
                        time:   [5.7890 ms 5.8287 ms 5.8719 ms]
                        thrpt:  [1.7030 Melem/s 1.7157 Melem/s 1.7274 Melem/s]
                 change:
                        time:   [-14.697% -13.393% -12.090%] (p = 0.00 < 0.05)
                        thrpt:  [+13.753% +15.464% +17.229%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Computepass: Multi Threaded/8 threads x 1250 dispatch
                        time:   [4.1858 ms 4.2230 ms 4.2613 ms]
                        thrpt:  [2.3467 Melem/s 2.3680 Melem/s 2.3890 Melem/s]
                 change:
                        time:   [-31.207% -29.893% -28.594%] (p = 0.00 < 0.05)
                        thrpt:  [+40.045% +42.640% +45.364%]
                        Performance has improved.

Computepass: Bindless/1000 dispatch
                        time:   [146.86 ms 147.21 ms 147.61 ms]
                        thrpt:  [6.7748 Kelem/s 6.7930 Kelem/s 6.8094 Kelem/s]
                 change:
                        time:   [+0.6461% +1.8619% +2.7813%] (p = 0.00 < 0.05)
                        thrpt:  [-2.7060% -1.8279% -0.6419%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Computepass: Empty Submit with 60000 Resources
                        time:   [481.52 µs 484.35 µs 487.44 µs]
                        change: [-80.991% -79.937% -78.934%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
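For reading the numbers above: the `thrpt` lines appear to be the dispatch count divided by the measured time, and the throughput change follows from the time change as 1 / (1 + Δt) − 1. A small sketch of that arithmetic, assuming the element count is the total number of dispatches (the function names here are made up for illustration and are not part of the benchmark code):

```rust
/// Throughput in elements per second for `elements` processed in `time_ms` milliseconds.
fn throughput_per_sec(elements: u64, time_ms: f64) -> f64 {
    elements as f64 / (time_ms / 1000.0)
}

/// Converts a relative change in time (e.g. -0.04327 for -4.327%)
/// into the corresponding relative change in throughput.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

fn main() {
    // First single-threaded case: 10000 dispatches, median time 18.719 ms.
    let thrpt = throughput_per_sec(10_000, 18.719);
    println!("throughput: {:.2} Kelem/s", thrpt / 1000.0);

    // Median time change of -4.3270% for the same case.
    let change = throughput_change(-0.043270);
    println!("throughput change: {:+.4}%", change * 100.0);
}
```

This is also why a time reduction of about 35% in the submit-time cases shows up as a throughput gain of over 50%: the two percentages are not symmetric.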

Testing
it is a test!

Checklist

  • Run cargo fmt.
  • Run cargo clippy. If applicable, add:
    • --target wasm32-unknown-unknown
    • --target wasm32-unknown-emscripten
  • Run cargo xtask test to run tests.
  • Add change to CHANGELOG.md. See simple instructions inside file.

Wumpf commented Jun 30, 2024

Despite some mitigations, this benchmark is failing spuriously on Linux.
I need to look into that before merging, even if it shows up green on the pending re-run (mostly curious whether the same thing always fails).

@Wumpf Wumpf merged commit d3edbc5 into gfx-rs:trunk Jul 14, 2024
25 checks passed
@Wumpf Wumpf deleted the compute-pass-benchmark branch July 14, 2024 20:13