Slow parallelism on large number of threads #10

Open
Narsil opened this issue Jul 16, 2023 · 0 comments

Hi,

While investigating the crate's performance, I found that enabling parallelism can be highly detrimental to performance.
This only occurs on machines with a lot of cores (and therefore threads).

Here is the bench I added: https://github.com/Narsil/gemm/tree/bench_rayon

On a regular desktop (8 cores) I see:

parallelism-8-f32-nnn-gemm-6×2304×768
                        time:   [176.79 µs 182.91 µs 186.61 µs]
                        change: [-9.5273% -4.5251% -0.1034%] (p = 0.10 > 0.05)
                        No change in performance detected.

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [685.08 µs 686.80 µs 687.87 µs]
                        change: [-1.0090% -0.5130% -0.1028%] (p = 0.04 < 0.05)
                        Change within noise threshold.

parallelism-8-f32-nnt-gemm-6×2304×768
                        time:   [433.25 µs 444.28 µs 459.26 µs]
                        change: [+9.9070% +13.388% +16.764%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

parallelism-none-f32-nnt-gemm-6×2304×768
                        time:   [1.3960 ms 1.4004 ms 1.4051 ms]
                        change: [+14.439% +15.374% +16.258%] (p = 0.00 < 0.05)
                        Performance has regressed.

Which is sort of OK: with parallelism 8 it is indeed ~3.5x faster than with no parallelism (about 3.8x for nnn and 3.2x for nnt), so there is some speedup.

However on 48 cores:

parallelism-48-f32-nnn-gemm-6×2304×768
                        time:   [2.2364 ms 2.2723 ms 2.3164 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [752.12 µs 752.97 µs 753.81 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

parallelism-48-f32-nnt-gemm-6×2304×768
                        time:   [2.3022 ms 2.3255 ms 2.3660 ms]

parallelism-none-f32-nnt-gemm-6×2304×768
                        time:   [789.54 µs 789.93 µs 790.39 µs]

There is a big slowdown from over-parallelism: with parallelism 48, both cases are roughly 3x slower than with no parallelism at all.

The flamegraph actually shows this pretty well:
[image: flamegraph]

Is there anything we can do to help here?
I'm under the impression that using a simple par_chunks instead of par_iter, maybe with some length heuristics, could help spawn only a small number of threads when the matmul is small enough.
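
To make the "length heuristics" idea concrete, here is a minimal sketch using only the standard library. It is not the crate's code and does not use rayon: `capped_threads`, `column_ranges`, and the `min_work_per_thread` threshold are hypothetical names and values I made up for illustration, and the threshold would need tuning on real hardware.

```rust
use std::ops::Range;

/// Hypothetical heuristic: choose how many threads to use for an m×n×k matmul.
/// The thread count grows with the amount of work (m*n*k multiply-adds) and is
/// capped by the number of available threads. `min_work_per_thread` is an
/// assumed tuning knob, not a value taken from the crate.
fn capped_threads(
    m: usize,
    n: usize,
    k: usize,
    available: usize,
    min_work_per_thread: usize,
) -> usize {
    let work = m.saturating_mul(n).saturating_mul(k);
    let by_work = work / min_work_per_thread.max(1);
    // Always use at least one thread, never more than are available.
    by_work.clamp(1, available.max(1))
}

/// Split `n_cols` output columns into at most `threads` contiguous ranges,
/// similar in spirit to chunking the work instead of handing out one item
/// at a time.
fn column_ranges(n_cols: usize, threads: usize) -> Vec<Range<usize>> {
    let threads = threads.max(1);
    // Ceiling division; keep the step at least 1 so `step_by` never panics.
    let chunk = ((n_cols + threads - 1) / threads).max(1);
    (0..n_cols)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(n_cols))
        .collect()
}
```

With an assumed threshold of 1,000,000 multiply-adds per thread, the 6×2304×768 case (about 10.6M multiply-adds) would still use all 8 threads on the desktop, but only 10 of the 48 threads on the larger machine, each handling one contiguous range of columns. Whether that actually removes the slowdown would have to be checked with the bench above.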
