
computation performs slower than cpu version in benchmark #246

Open
viirya opened this issue Oct 20, 2022 · 3 comments
viirya commented Oct 20, 2022

Hi, I'm running some benchmarks comparing the compute code from metal-rs against a CPU version.

I basically benchmark the compute example, which does a sum operation, against a CPU version that simply loops over the input slice while summing it up.

I scale the input data size as 1024 * factor. In all cases, the metal-rs compute version performs worse than the CPU version. E.g.,

sum (metal), factor: 90 time:   [465.91 µs 479.72 µs 495.96 µs]                                      
                        change: [-2.5272% +1.6099% +6.1871%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

sum (cpu), factor: 90   time:   [32.132 µs 32.140 µs 32.148 µs]    
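For reference, the CPU version is essentially a plain loop over the slice, roughly like this (a minimal sketch; the `cpu_sum` name and the `f32` element type are my assumptions, not the exact benchmark code):

```rust
// Sketch of the CPU-side baseline: a straightforward serial sum
// over an f32 slice, matching "loops over the input slice while
// summing it up" above.
fn cpu_sum(data: &[f32]) -> f32 {
    let mut total = 0.0f32;
    for &x in data {
        total += x;
    }
    total
}
```

In the actual benchmark the input would be a `Vec<f32>` of length `1024 * factor`.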

Is this benchmark result expected? I assumed the Metal version would speed up the operation and be faster than the CPU.

Do you have any idea or suggestion?

@grovesNL
Collaborator

You might want to look at benchmarking only the actual compute operation (not device initialization, copying data into buffers, etc.). You generally want to reuse as many GPU resources as possible, so those setup costs may be skewing your benchmarks if they aren't excluded.

Even then depending on the size, it still might not beat the CPU version. It really depends on the exact kinds of computations you're doing. For sum operations specifically you might look into how to perform prefix sums on the GPU, then compare against prefix sums on the CPU (e.g., using SIMD).
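For example, a serial CPU sum has a single dependency chain, which limits what the compiler can vectorize; splitting the work across several independent accumulators makes it SIMD-friendly. A rough sketch (the `simd_friendly_sum` name is just illustrative):

```rust
// Sketch: summing with several independent accumulators breaks the
// serial dependency chain, letting the compiler auto-vectorize the
// inner loop. The remainder that doesn't fill a full chunk is
// summed separately at the end.
fn simd_friendly_sum(data: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let chunks = data.chunks_exact(8);
    let rest = chunks.remainder();
    for chunk in chunks {
        for i in 0..8 {
            acc[i] += chunk[i];
        }
    }
    acc.iter().sum::<f32>() + rest.iter().sum::<f32>()
}
```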


viirya commented Oct 21, 2022

Since the GPU has a unified memory model, I assume we are not counting the cost of copying data into buffers.

I tried to revamp the benchmark by reusing the initialized device across all runs. The good news is that the GPU runs improved by about 30%, but they are still significantly slower than the CPU.

sum (metal), factor: 90 time:   [295.16 µs 301.82 µs 308.61 µs]                                                    
                        change: [-37.171% -34.824% -32.562%] (p = 0.00 < 0.05)                                     
                        Performance has improved.                                                                  
                                                                                                                   
sum (cpu), factor: 90   time:   [28.919 µs 28.923 µs 28.927 µs]                                                    
                        change: [-0.2977% -0.0365% +0.1904%] (p = 0.80 > 0.05)                                     
                        No change in performance detected.                                                         

So I guess you're right that the sum computation itself is the bottleneck. I'm looking at prefix sum algorithms on the GPU to see if they can improve the performance further.
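The structure I'm looking at is the Hillis–Steele style inclusive scan. Modeled on the CPU it looks roughly like this (a sketch of the algorithm only; on the GPU each pass of the outer loop runs in parallel, one thread per element within a threadgroup):

```rust
// CPU model of a Hillis–Steele inclusive scan: log2(n) passes,
// where pass k adds the element `offset = 2^k` positions back.
// On a GPU every iteration of the inner loop executes in parallel.
fn inclusive_scan(data: &[f32]) -> Vec<f32> {
    let mut cur = data.to_vec();
    let mut offset = 1;
    while offset < cur.len() {
        let prev = cur.clone();
        for i in offset..cur.len() {
            cur[i] = prev[i] + prev[i - offset];
        }
        offset *= 2;
    }
    cur
}
```

The last element of the scan is the total sum, so the same kernel structure can be used per threadgroup for a reduction.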

@Congyuwang

You will still have to copy into a buffer, since the data needs to be properly aligned.
