fix sort bug #2344

Closed
wants to merge 11 commits into from

xaellison (Contributor)

No description provided.

xaellison changed the title from "attempt to add pipeline step" to "fix sort bug" on Apr 25, 2024

maleadt (Member) commented Apr 25, 2024

You can reproduce this locally by restricting the number of threads kernel1 can use while setting the thread count for kernel2 higher than that limit (which shouldn't matter, since it's the block size of a kernel we won't execute):

diff --git a/src/sorting.jl b/src/sorting.jl
index 7dd563831..3568aaafe 100644
--- a/src/sorting.jl
+++ b/src/sorting.jl
@@ -909,7 +909,7 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)
     # compile kernels (using Int32 for indexing, if possible, yielding a 70% speedup)
     I = c_len <= typemax(Int32) ? Int32 : Int
     args1 = (c, I(c_len), one(I), one(I), one(I), by, lt, Val(rev), Val(dims))
-    kernel1 = @cuda launch=false comparator_small_kernel(args1...)
+    kernel1 = @cuda maxthreads=896 launch=false comparator_small_kernel(args1...)

     config1 = launch_configuration(kernel1.fun, shmem = threads -> bitonic_shmem(c, threads))
     args2 = (c, I(c_len), one(I), one(I), by, lt, Val(rev), Val(dims))
@@ -917,6 +917,7 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)
     config2 = launch_configuration(kernel2.fun, shmem = threads -> bitonic_shmem(c, threads))
     # blocksize for kernel2 MUST be a power of 2
     threads2 = prevpow(2, config2.threads)
+    threads2 = 1024 # doesn't matter since we'll pick kernel1

     # determines cutoff for when to use kernel1 vs kernel2
     log_threads = threads2 |> log2 |> Int

julia> CUDA.bitonic_sort!(CUDA.rand(Int32, (2, 2, 50000)); dims=3)
ERROR: Number of threads per block exceeds kernel limit (1024 > 896).
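
The mismatch behind this error can be sketched in plain Julia, without a GPU. This is only an illustration under stated assumptions, not the fix that was eventually merged: the two limits are the values from the diff above (896 for kernel1, 1024 for kernel2), and `safe_blocksize` is a hypothetical helper that does not exist in CUDA.jl. It assumes the block size derived from kernel2's configuration is also used when launching kernel1, which is what the error message suggests.

```julia
# Values taken from the reproduction above; they are stand-ins for what
# the launch configuration of each kernel would report.
kernel1_max_threads = 896    # limit imposed via `@cuda maxthreads=896`
kernel2_threads     = 1024   # the hard-coded `threads2 = 1024`

# Deriving the block size from kernel2 alone gives a value that
# kernel1 cannot be launched with.
threads2 = prevpow(2, kernel2_threads)
exceeds_limit = threads2 > kernel1_max_threads

# Hypothetical guard: bound the block size by both kernels' limits
# before rounding down to a power of 2.
safe_blocksize(limit1, limit2) = prevpow(2, min(limit1, limit2))

safe_threads = safe_blocksize(kernel1_max_threads, kernel2_threads)
```

With these numbers, `threads2` is 1024 and exceeds the 896-thread limit of kernel1, while `safe_threads` comes out as 512, the largest power of 2 that both kernels accept.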

maleadt (Member) commented Apr 27, 2024

#2353

maleadt closed this Apr 27, 2024