PTX: byval results in local memory usage #92
Bisected to #16.
And reduced to something smaller:

```julia
function kernel(y1, y2)
    y = threadIdx().x == 1 ? y1 : y2
    @inbounds y[] = 0
    return
end

@cuda kernel(CUDA.rand(1), CUDA.rand(1))
```

With the old rewrite pass this yields:
Using …
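For anyone reproducing this from the Julia side, one way to see whether the generated code uses local memory is to dump the PTX for the launch and look for `ld.local`/`st.local`. A minimal sketch using CUDA.jl's `@device_code_ptx` reflection macro (with `kernel` as defined in the snippet above):

```julia
using CUDA  # assumes `kernel` is defined as in the snippet above

# Print the PTX generated for this launch; byval-induced spills show up as
# `.local` declarations and ld.local/st.local instructions.
CUDA.@device_code_ptx @cuda kernel(CUDA.rand(1), CUDA.rand(1))
```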
Some further reduced IR:

```llvm
; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @fast({ [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }) {
entry:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %ptr1 = extractvalue { [1 x i64], i8 addrspace(1)* } %1, 1
  %ptr2 = extractvalue { [1 x i64], i8 addrspace(1)* } %0, 1
  %ptr = select i1 %3, i8 addrspace(1)* %ptr1, i8 addrspace(1)* %ptr2
  %typed_ptr = bitcast i8 addrspace(1)* %ptr to float addrspace(1)*
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}
```

```llvm
; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define dso_local void @slow({ [1 x i64], i8 addrspace(1)* }* byval, { [1 x i64], i8 addrspace(1)* }* byval) {
top:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %p_arr = select i1 %3, { [1 x i64], i8 addrspace(1)* }* %0, { [1 x i64], i8 addrspace(1)* }* %1
  %p_ptr = getelementptr inbounds { [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }* %p_arr, i64 0, i32 1
  %p_typed_ptr = bitcast i8 addrspace(1)** %p_ptr to float addrspace(1)**
  %typed_ptr = load float addrspace(1)*, float addrspace(1)** %p_typed_ptr, align 1
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}
```

cc @vchuravy
Even shorter:

```llvm
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

; kernel: store to one of two 'arrays' (struct with ptr), selected based on the thread idx

; byval results in st.local/ld.local
define void @fast({ i8* }* byval %p_arr1, { i8* }* byval %p_arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %p_arr = select i1 %pred, { i8* }* %p_arr1, { i8* }* %p_arr2
  %p_ptr = getelementptr { i8* }, { i8* }* %p_arr, i64 0, i32 0
  %ptr = load i8*, i8** %p_ptr
  store i8 0, i8* %ptr
  ret void
}

; byval manually performed
define void @slow({ i8* } %arr1, { i8* } %arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %arr = select i1 %pred, { i8* } %arr1, { i8* } %arr2
  %ptr = extractvalue { i8* } %arr, 0
  store i8 0, i8* %ptr
  ret void
}
```
Going to reopen this, as we ideally want to re-enable …
Ah, I felt so close to matching the performance of CUDA.jl to CUDAdrv.jl etc. and finally being able to migrate to the latest and greatest CUDA.jl in production code :( . Looking forward to a fix :)
Performance is identical: this issue has been fixed, as has JuliaGPU/CUDA.jl#799. I just reopened it because we ultimately want to go back to the old behavior if we manage to do so without paying a performance penalty.
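For what it's worth, one way to double-check this on a given CUDA.jl version is to compile the reduced kernel without launching it and query its resource usage. A small sketch assuming a recent CUDA.jl with its `launch=false` compilation path and the `CUDA.registers`/`CUDA.memory` introspection helpers:

```julia
using CUDA

function kernel(y1, y2)
    y = threadIdx().x == 1 ? y1 : y2
    @inbounds y[] = 0
    return
end

# Compile without launching, then inspect the kernel's resource usage;
# a non-zero `local` entry would indicate the byval spill described above.
k = @cuda launch=false kernel(CUDA.rand(1), CUDA.rand(1))
@show CUDA.registers(k)
@show CUDA.memory(k)
```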
Maybe it is a different problem with the same code then? … works fine; however, with the latest (which uses …) I see a rollback to the original problem (non-linear growth of execution time).
This code is now covered by the CUDA.jl benchmark suite, JuliaGPU/CUDA.jl@af86b69.
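Not the benchmark-suite code itself, but for local comparisons a rough timing harness along these lines is usually enough; the arrays and sizes here are placeholders, and `CUDA.@sync` ensures the asynchronous GPU work finishes inside the measurement:

```julia
using CUDA, BenchmarkTools

# Rough host-side timing sketch (not the actual benchmark-suite code).
a = CUDA.rand(Float32, 1024, 1024)
b = CUDA.rand(Float32, 1024, 1024)
@btime CUDA.@sync ($a .+= $b)
```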
Trying to migrate to CUDA.jl from CUDAdrv.jl (and Julia 1.3.1), but I still can't due to performance issues. Here is an (artificial) example I managed to boil down to. I have this test code,
add_kernel.jl
, where I have an array of 2D arrays and I want to add it to another array of 2D arrays. In Julia 1.3.1 with CUDAdrv (Win 10, GTX 2070, I have only one card) I get the following numbers:
There is a slight overhead from adding the z-slicing code in the case of just one slice, and then the time grows reasonably linearly as the number of z slices increases. However, with Julia 1.5.2 and CUDA.jl I get the following numbers:
i.e. totally unreasonable growth with slices 2 and 3 (yet the same performance with a simple add kernel, and even with just one slice). Looks like an issue in CUDA.jl.
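Since add_kernel.jl is attached rather than inlined, here is only a hypothetical sketch of the kind of kernel described, modelled as a 3D CuArray whose third dimension holds the z slices; all names, sizes, and the launch configuration are assumptions, not the attached code:

```julia
using CUDA

# Hypothetical stand-in for the attached add_kernel.jl: add each 2D z slice
# of `src` into the corresponding slice of `dst`.
function add_slices!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(dst, 1) && j <= size(dst, 2)
        for k in 1:size(dst, 3)   # loop over z slices
            @inbounds dst[i, j, k] += src[i, j, k]
        end
    end
    return
end

dst = CUDA.rand(Float32, 512, 512, 3)
src = CUDA.rand(Float32, 512, 512, 3)
threads = (16, 16)
blocks = cld.(size(dst)[1:2], threads)
@cuda threads=threads blocks=blocks add_slices!(dst, src)
```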
Here is how I set up the Julia 1.3.1 environment,
and this is how I set up the Julia 1.5.2 environment.
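The actual setup commands are in the attachments above; as an illustration only, a typical environment for the two versions (an assumption, not the reporter's exact setup) might look like:

```julia
using Pkg

# Julia 1.3.1 (assumed): the pre-CUDA.jl stack.
Pkg.add(["CUDAdrv", "CUDAnative", "CuArrays"])

# Julia 1.5.2 (assumed): the unified CUDA.jl package.
Pkg.add("CUDA")
```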
And this is the code to call the kernels in 1.3.1:
add_kernel_1.3.1.jl
This is the code to call the kernels in 1.5.2:
add_kernel_1.5.2.jl
And I call it like this:

```
C:\Bin\Julia-1.3.1\bin\julia.exe add_kernel_1.3.1.jl
C:\Bin\Julia 1.5.2\bin\julia.exe add_kernel_1.5.2.jl
```
aj