
PTX: byval results in local memory usage #92

Open

anj00 opened this issue Oct 3, 2020 · 9 comments · Fixed by JuliaGPU/CUDA.jl#653
Labels
ptx Stuff about the NVIDIA PTX back-end.

Comments

anj00 commented Oct 3, 2020

I'm trying to migrate from CUDAdrv.jl (and Julia 1.3.1) to CUDA.jl, but still can't due to performance issues. Here is an (artificial) example I managed to boil it down to. In this test code, add_kernel.jl, I have an array of 2D arrays that I want to add to another array of 2D arrays:

const threads = 256

# simple kernel adding two matrices
function kernel_add_mat(n, x1, x2, y)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

@inline get_inputs3(indx_y, a, b, c)                                        = (a, b, c)
@inline get_inputs3(indx_y, a1, a2, b1, b2, c1, c2)                         = indx_y == 1 ? (a1, b1, c1) : (a2, b2, c2)
@inline get_inputs3(indx_y, a1, a2, a3, b1, b2, b3, c1, c2, c3)             = indx_y == 1 ? (a1, b1, c1) : indx_y == 2 ? (a2, b2, c2) : (a3, b3, c3)

# kernel adding arrays of matrices
function kernel_add_mat_z_slices(n, vararg...)
    x1, x2, y = get_inputs3(blockIdx().y, vararg...)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

function add_z_slices!(y, x1, x2)
    m1, n1 = size(x1[1]) #get size of first slice
    blocks = (m1 * n1 + threads - 1) ÷ threads
    #get length(x1) more blocks than needed to process 1 slice
    @cuda blocks = blocks, length(x1) threads = threads kernel_add_mat_z_slices(m1 * n1, x1..., x2..., y...)
end

function add!(y, x1, x2)
    m1, n1 = size(x1)
    blocks = (m1 * n1 + threads - 1) ÷ threads
    @cuda blocks = blocks, 1          threads = threads kernel_add_mat(m1 * n1, x1, x2, y)
end

num_z_slices = 3
Random.seed!(1)

#m, n = 7, 5          # tiny to measure overhead
#m, n = 521, 111
#m, n = 1521, 1111
#m, n = 3001, 1511    # prime numbers to test memory access correctness
m, n = 3072, 1536    # 256 multiplier
#m, n = 6007, 3001    # prime numbers to test memory access correctness
    
x1 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
x2 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
y1 = [similar(x1[1]) for i = 1:num_z_slices]

# reference bare-bones add on GPU
add!(y1[1], x1[1], x2[1])
print("add!                       "); 
@btime begin add!($y1[1], $x1[1], $x2[1]); synchronize() end

#adding arrays in an array
for slices = 1:num_z_slices
    add_z_slices!(y1[1:slices], x1[1:slices], x2[1:slices])
    print("add_z_slices!, slices = $slices  "); 
    @btime begin add_z_slices!($y1[1:$slices], $x1[1:$slices], $x2[1:$slices]); synchronize() end
end

On Julia 1.3.1 with CUDAdrv.jl (Windows 10, GTX 2070; I have only one card) I get the following numbers:

add!                         167.600 μs (35 allocations: 1.11 KiB)
add_z_slices!, slices = 1    171.101 μs (75 allocations: 2.97 KiB)
add_z_slices!, slices = 2    313.199 μs (93 allocations: 4.09 KiB)
add_z_slices!, slices = 3    456.600 μs (111 allocations: 5.41 KiB) 

There is a slight overhead from the z-slicing code in the one-slice case, and then the time grows reasonably linearly as the number of z slices increases. However, with Julia 1.5.2 and CUDA.jl I get the following numbers:

add!                         167.100 μs (14 allocations: 400 bytes)
add_z_slices!, slices = 1    169.800 μs (56 allocations: 2.33 KiB)
add_z_slices!, slices = 2    4.536 ms (68 allocations: 3.02 KiB)
add_z_slices!, slices = 3    2.435 ms (80 allocations: 3.83 KiB)

i.e. completely unreasonable growth with 2 and 3 slices (yet the same performance with the simple add kernel, and even with just one slice). This looks like an issue in CUDA.jl.
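For anyone reproducing this, the generated PTX can be inspected directly from Julia with CUDA.jl's `@device_code_ptx` reflection macro. This is a minimal sketch (it assumes a working CUDA.jl installation and a GPU; the kernel here is illustrative, mimicking the branchy `get_inputs3` selection pattern above, and is not part of the original report):

```julia
using CUDA

# Illustrative kernel: select one of two destination arrays based on the
# block index, similar to what get_inputs3 does above.
function pick_and_store(a, b)
    dst = blockIdx().y == 1 ? a : b
    @inbounds dst[threadIdx().x] = 0f0
    return
end

a = CUDA.zeros(Float32, 32)
b = CUDA.zeros(Float32, 32)

# Print the PTX for this kernel. A `__local_depot` allocation or any
# `st.local` / `ld.local` instructions indicate local-memory spills of
# the kernel arguments, which is the symptom discussed in this issue.
@device_code_ptx @cuda threads=32 pick_and_store(a, b)
```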

Here is how I set up the Julia 1.3.1 environment:

using Pkg
Pkg.activate("CUDAdrv")
pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDAdrv"         => v"5.0.1",
               "CUDAnative"      => v"2.7.0",
               "CuArrays"        => v"1.6.0",
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

And this is how I set up the Julia 1.5.2 environment:

using Pkg
Pkg.activate("Cuda")

pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDA"         => v"2.0.0"
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

And this is the code that calls the kernels on 1.3.1, add_kernel_1.3.1.jl:

using Pkg
Pkg.activate("CUDAdrv")

using BenchmarkTools, Printf, Random
using CUDAdrv: CuDevice, CuContext, DeviceSet, CuDefaultStream, synchronize, unsafe_destroy!, CUctx_flags, devices

use_dev = 0
dev = CuDevice(use_dev)
ctx = CuContext(dev)
using CUDAnative
CUDAnative.device!(use_dev)
using CuArrays
include("add_kernel.jl")

This is the code that calls the kernels on 1.5.2, add_kernel_1.5.2.jl:

using Pkg
Pkg.activate("Cuda")
using BenchmarkTools, Printf, Random, CUDA
include("add_kernel.jl")

And I run them like this:
C:\Bin\Julia-1.3.1\bin\julia.exe add_kernel_1.3.1.jl
C:\Bin\Julia 1.5.2\bin\julia.exe add_kernel_1.5.2.jl

aj

@anj00 anj00 changed the title CUDA.jl is 5x slower than CUDAdrv.jl in certain mem access patterns CUDA.jl is 5-14x slower than CUDAdrv.jl in certain mem access patterns Oct 3, 2020
maleadt (Member) commented Oct 7, 2020

Bisected to #16.

maleadt (Member) commented Oct 7, 2020

And reduced to something smaller:

function kernel(y1, y2)
    y = threadIdx().x == 1 ? y1 : y2
    @inbounds y[] = 0
    return
end

@cuda kernel(CUDA.rand(1), CUDA.rand(1))

With the old rewrite pass this yields:

.visible .func kernel(
        .param .align 8 .b8 kernel_param_0[16],
        .param .align 8 .b8 kernel_param_1[16]
)
{
        .reg .pred      %p<2>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<4>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [kernel_param_0+8];
        ld.param.u64    %rd2, [kernel_param_1+8];
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd3, %rd2, %rd1, %p1;
        mov.u32         %r2, 0;
        st.global.u32   [%rd3], %r2;
        ret;
                                        // -- End function
}

Using byval results in local stores:

.visible .func kernel(
        .param .align 8 .b8 kernel_param_0[16],
        .param .align 8 .b8 kernel_param_1[16]
)
{
        .local .align 8 .b8     __local_depot0[32];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<2>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<32>;

// %bb.0:                               // %top
        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
        add.u64         %rd1, %SP, 0;
        add.u64         %rd2, %SPL, 0;
        ld.param.u64    %rd3, [kernel_param_1+8];
        ld.param.u64    %rd4, [kernel_param_1];
        st.local.u64    [%rd2], %rd4;
        st.local.u64    [%rd2+8], %rd3;
        add.u64         %rd5, %SP, 16;
        add.u64         %rd6, %SPL, 16;
        ld.param.u64    %rd7, [kernel_param_0+8];
        ld.param.u64    %rd8, [kernel_param_0];
        st.local.u64    [%rd6], %rd8;
        st.local.u64    [%rd6+8], %rd7;
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd9, %rd5, %rd1, %p1;
        ld.u8   %rd10, [%rd9+8];
        ld.u8   %rd11, [%rd9+9];
        shl.b64         %rd12, %rd11, 8;
        or.b64          %rd13, %rd12, %rd10;
        ld.u8   %rd14, [%rd9+10];
        ld.u8   %rd15, [%rd9+11];
        shl.b64         %rd16, %rd15, 8;
        or.b64          %rd17, %rd16, %rd14;
        shl.b64         %rd18, %rd17, 16;
        or.b64          %rd19, %rd18, %rd13;
        ld.u8   %rd20, [%rd9+12];
        ld.u8   %rd21, [%rd9+13];
        shl.b64         %rd22, %rd21, 8;
        or.b64          %rd23, %rd22, %rd20;
        ld.u8   %rd24, [%rd9+14];
        ld.u8   %rd25, [%rd9+15];
        shl.b64         %rd26, %rd25, 8;
        or.b64          %rd27, %rd26, %rd24;
        shl.b64         %rd28, %rd27, 16;
        or.b64          %rd29, %rd28, %rd23;
        shl.b64         %rd30, %rd29, 32;
        or.b64          %rd31, %rd30, %rd19;
        mov.u32         %r2, 0;
        st.global.u32   [%rd31], %r2;
        ret;
                                        // -- End function
}

Some further reduced IR:

; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @fast({ [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }) {
entry:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %ptr1 = extractvalue { [1 x i64], i8 addrspace(1)* } %1, 1
  %ptr2 = extractvalue { [1 x i64], i8 addrspace(1)* } %0, 1
  %ptr = select i1 %3, i8 addrspace(1)* %ptr1, i8 addrspace(1)* %ptr2
  %typed_ptr = bitcast i8 addrspace(1)* %ptr to float addrspace(1)*
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}
; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define dso_local void @slow({ [1 x i64], i8 addrspace(1)* }* byval, { [1 x i64], i8 addrspace(1)* }* byval) {
top:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %p_arr = select i1 %3, { [1 x i64], i8 addrspace(1)* }* %0, { [1 x i64], i8 addrspace(1)* }* %1
  %p_ptr = getelementptr inbounds { [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }* %p_arr, i64 0, i32 1
  %p_typed_ptr = bitcast i8 addrspace(1)** %p_ptr to float addrspace(1)**
  %typed_ptr = load float addrspace(1)*, float addrspace(1)** %p_typed_ptr, align 1
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}

cc @vchuravy

@maleadt maleadt changed the title CUDA.jl is 5-14x slower than CUDAdrv.jl in certain mem access patterns Byval lowering results in local memory usage Oct 7, 2020
maleadt (Member) commented Oct 7, 2020

Even shorter:

target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

; kernel: store to one of two 'arrays' (struct with ptr), selected based on the thread idx

; byval results in st.local/ld.local
define void @fast({ i8* }* byval %p_arr1, { i8* }* byval %p_arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %p_arr = select i1 %pred, { i8* }* %p_arr1, { i8* }* %p_arr2
  %p_ptr = getelementptr { i8* }, { i8* }* %p_arr, i64 0, i32 0
  %ptr = load i8*, i8** %p_ptr
  store i8 0, i8* %ptr
  ret void
}

; byval manually performed
define void @slow({ i8* } %arr1, { i8* } %arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %arr = select i1 %pred, { i8* } %arr1, { i8* } %arr2
  %ptr = extractvalue { i8* } %arr, 0
  store i8 0, i8* %ptr
  ret void
}
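The PTX below can be regenerated from the IR above with LLVM's llc. This is a sketch; it assumes an LLVM build with the NVPTX back-end on PATH, and `byval.ll` is a hypothetical file name holding the IR:

```shell
# Lower the reduced IR to PTX with the NVPTX back-end.
llc -mtriple=nvptx64-nvidia-cuda -O3 byval.ll -o byval.ptx

# Any hits here mean the byval arguments were spilled to local memory.
grep -nE 'st\.local|ld\.local|__local_depot' byval.ptx
```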
.visible .func fast(
        .param .align 8 .b8 fast_param_0[8],
        .param .align 8 .b8 fast_param_1[8]
)
{
        .local .align 8 .b8     __local_depot0[16];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<2>;
        .reg .b16       %rs<2>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<9>;

// %bb.0:
        mov.u64         %SPL, __local_depot0;
        add.u64         %rd2, %SPL, 0;
        ld.param.u64    %rd3, [fast_param_1];
        st.local.u64    [%rd2], %rd3;
        add.u64         %rd5, %SPL, 8;
        ld.param.u64    %rd6, [fast_param_0];
        st.local.u64    [%rd5], %rd6;
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd7, %rd5, %rd2, %p1;
        ld.local.u64    %rd8, [%rd7];
        mov.u16         %rs1, 0;
        st.u8   [%rd8], %rs1;
        ret;
                                        // -- End function
}
        // .globl       slow            // -- Begin function slow
.visible .func slow(
        .param .align 8 .b8 slow_param_0[8],
        .param .align 8 .b8 slow_param_1[8]
)                                       // @slow
{
        .reg .pred      %p<2>;
        .reg .b16       %rs<2>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<4>;

// %bb.0:
        ld.param.u64    %rd1, [slow_param_0];
        ld.param.u64    %rd2, [slow_param_1];
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd3, %rd1, %rd2, %p1;
        mov.u16         %rs1, 0;
        st.u8   [%rd3], %rs1;
        ret;
                                        // -- End function
}

@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Oct 7, 2020
@maleadt maleadt changed the title Byval lowering results in local memory usage PTX: byval results in local memory usage Oct 7, 2020
@maleadt maleadt added the ptx Stuff about the NVIDIA PTX back-end. label Oct 7, 2020
@maleadt maleadt reopened this Apr 8, 2021
maleadt (Member) commented Apr 8, 2021

Going to reopen this, as we ideally want to re-enable byval (there are other bugs when not using byval, e.g. passing [1 x i128] to kernels). I had hoped that https://reviews.llvm.org/D98469 would fix the original issue, but it doesn't look like it does. It probably needs to be extended to look past the select.

anj00 (Author) commented Apr 14, 2021

Ah, it felt so close to matching the performance of CUDAdrv.jl etc. with CUDA.jl, and finally being able to migrate production code to the latest and greatest CUDA.jl :( . Looking forward to the fix landing so I can come back :)

maleadt (Member) commented Apr 14, 2021

Performance is identical: this issue has been fixed, as has JuliaGPU/CUDA.jl#799. I just reopened it because we ultimately want to go back to the old behavior if we manage to do so without paying a performance penalty.

anj00 (Author) commented Apr 14, 2021

Maybe it is a different problem with the same code, then?
The test code I posted, add_kernel.jl, with this set of packages

  [6e4b80f9] BenchmarkTools v0.6.0
  [052768ef] CUDA v3.0.0
  [61eb1bfa] GPUCompiler v0.11.2

works fine

add!                         162.600 μs (256 allocations: 4.09 KiB)
add_z_slices!, slices = 1    164.400 μs (262 allocations: 5.31 KiB)
add_z_slices!, slices = 2    307.000 μs (236 allocations: 5.27 KiB)
add_z_slices!, slices = 3    449.100 μs (198 allocations: 5.16 KiB)

However, with the latest versions (which use GPUCompiler v0.11.3)

  [6e4b80f9] BenchmarkTools v0.6.0
  [052768ef] CUDA v3.0.1

I see a regression to the original problem (non-linear growth of execution time):

add!                         161.900 μs (128 allocations: 2.09 KiB)
add_z_slices!, slices = 1    165.200 μs (204 allocations: 4.41 KiB)
add_z_slices!, slices = 2    591.100 μs (276 allocations: 5.89 KiB)
add_z_slices!, slices = 3    2.425 ms (2972 allocations: 48.50 KiB)

maleadt (Member) commented Apr 14, 2021

Oh, nice catch; that change (which I had applied locally to test the actual situation) slipped into bd5bea2... I'll revert it.
EDIT: reverted in 625bb92

maleadt (Member) commented Apr 28, 2021

This code is now covered by the CUDA.jl benchmark suite, JuliaGPU/CUDA.jl@af86b69.
