
PTX: byval results in local memory usage #92

Open

anj00 opened this issue Oct 3, 2020 · 9 comments · Fixed by JuliaGPU/CUDA.jl#653
Labels
ptx Stuff about the NVIDIA PTX back-end.

Comments

anj00 commented Oct 3, 2020

I'm trying to migrate from CUDAdrv.jl (and Julia 1.3.1) to CUDA.jl, but still can't due to performance issues. Here is an (artificial) example I managed to boil it down to. In this test code, add_kernel.jl, I have an array of 2D arrays that I want to add to another array of 2D arrays:

const threads = 256

# simple kernel adding two matrices
function kernel_add_mat(n, x1, x2, y)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

@inline get_inputs3(indx_y, a, b, c)                                        = (a, b, c)
@inline get_inputs3(indx_y, a1, a2, b1, b2, c1, c2)                         = indx_y == 1 ? (a1, b1, c1) : (a2, b2, c2)
@inline get_inputs3(indx_y, a1, a2, a3, b1, b2, b3, c1, c2, c3)             = indx_y == 1 ? (a1, b1, c1) : indx_y == 2 ? (a2, b2, c2) : (a3, b3, c3)

# kernel adding arrays of matrices
function kernel_add_mat_z_slices(n, vararg...)
    x1, x2, y = get_inputs3(blockIdx().y, vararg...)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

function add_z_slices!(y, x1, x2)
    m1, n1 = size(x1[1]) #get size of first slice
    blocks = (m1 * n1 + threads - 1) ÷ threads
    #get length(x1) more blocks than needed to process 1 slice
    @cuda blocks = blocks, length(x1) threads = threads kernel_add_mat_z_slices(m1 * n1, x1..., x2..., y...)
end

function add!(y, x1, x2)
    m1, n1 = size(x1)
    blocks = (m1 * n1 + threads - 1) ÷ threads
    @cuda blocks = blocks, 1          threads = threads kernel_add_mat(m1 * n1, x1, x2, y)
end

num_z_slices = 3
Random.seed!(1)

#m, n = 7, 5          # tiny to measure overhead
#m, n = 521, 111
#m, n = 1521, 1111
#m, n = 3001, 1511    # prime numbers to test memory access correctness
m, n = 3072, 1536    # 256 multiplier
#m, n = 6007, 3001    # prime numbers to test memory access correctness
    
x1 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
x2 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
y1 = [similar(x1[1]) for i = 1:num_z_slices]

# reference bare-bones add on GPU
add!(y1[1], x1[1], x2[1])
print("add!                       "); 
@btime begin add!($y1[1], $x1[1], $x2[1]); synchronize() end

#adding arrays in an array
for slices = 1:num_z_slices
    add_z_slices!(y1[1:slices], x1[1:slices], x2[1:slices])
    print("add_z_slices!, slices = $slices  "); 
    @btime begin add_z_slices!($y1[1:$slices], $x1[1:$slices], $x2[1:$slices]); synchronize() end
end

On Julia 1.3.1 with CUDAdrv.jl (Windows 10, GTX 2070; I have only one card) I get the following numbers:

add!                         167.600 μs (35 allocations: 1.11 KiB)
add_z_slices!, slices = 1    171.101 μs (75 allocations: 2.97 KiB)
add_z_slices!, slices = 2    313.199 μs (93 allocations: 4.09 KiB)
add_z_slices!, slices = 3    456.600 μs (111 allocations: 5.41 KiB) 

There is a slight overhead from the z-slicing code in the one-slice case, and then the time grows reasonably linearly as the number of z slices increases. However, with Julia 1.5.2 and CUDA.jl I get the following numbers:

add!                         167.100 μs (14 allocations: 400 bytes)
add_z_slices!, slices = 1    169.800 μs (56 allocations: 2.33 KiB)
add_z_slices!, slices = 2    4.536 ms (68 allocations: 3.02 KiB)
add_z_slices!, slices = 3    2.435 ms (80 allocations: 3.83 KiB)

i.e. completely unreasonable growth with 2 and 3 slices (yet the same performance with the simple add kernel, and even with just one slice). This looks like an issue in CUDA.jl.
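For anyone reproducing this, the generated PTX can be inspected directly from Julia with CUDA.jl's `@device_code_ptx` reflection macro. This is a minimal sketch (it assumes a working CUDA.jl installation and a GPU; the kernel here is illustrative, mimicking the branchy `get_inputs3` selection pattern above, and is not part of the original report):

```julia
using CUDA

# Illustrative kernel: select one of two destination arrays based on the
# block index, similar to what get_inputs3 does above.
function pick_and_store(a, b)
    dst = blockIdx().y == 1 ? a : b
    @inbounds dst[threadIdx().x] = 0f0
    return
end

a = CUDA.zeros(Float32, 32)
b = CUDA.zeros(Float32, 32)

# Print the PTX for this kernel. A `__local_depot` allocation or any
# `st.local` / `ld.local` instructions indicate local-memory spills of
# the kernel arguments, which is the symptom discussed in this issue.
@device_code_ptx @cuda threads=32 pick_and_store(a, b)
```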

Here is how I set up the Julia 1.3.1 environment:

using Pkg
Pkg.activate("CUDAdrv")
pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDAdrv"         => v"5.0.1",
               "CUDAnative"      => v"2.7.0",
               "CuArrays"        => v"1.6.0",
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

And this is how I set up the Julia 1.5.2 environment:

using Pkg
Pkg.activate("Cuda")

pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDA"         => v"2.0.0"
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

And this is the code that calls the kernels on 1.3.1, add_kernel_1.3.1.jl:

using Pkg
Pkg.activate("CUDAdrv")

using BenchmarkTools, Printf, Random
using CUDAdrv: CuDevice, CuContext, DeviceSet, CuDefaultStream, synchronize, unsafe_destroy!, CUctx_flags, devices

use_dev = 0
dev = CuDevice(use_dev)
ctx = CuContext(dev)
using CUDAnative
CUDAnative.device!(use_dev)
using CuArrays
include("add_kernel.jl")

This is the code that calls the kernels on 1.5.2, add_kernel_1.5.2.jl:

using Pkg
Pkg.activate("Cuda")
using BenchmarkTools, Printf, Random, CUDA
include("add_kernel.jl")

And I run them like this:
C:\Bin\Julia-1.3.1\bin\julia.exe add_kernel_1.3.1.jl
C:\Bin\Julia 1.5.2\bin\julia.exe add_kernel_1.5.2.jl

aj

@anj00 anj00 changed the title CUDA.jl is 5x slower than CUDAdrv.jl in certain mem access patterns CUDA.jl is 5-14x slower than CUDAdrv.jl in certain mem access patterns Oct 3, 2020
maleadt (Member) commented Oct 7, 2020

Bisected to #16.

maleadt (Member) commented Oct 7, 2020

And reduced to something smaller:

function kernel(y1, y2)
    y = threadIdx().x == 1 ? y1 : y2
    @inbounds y[] = 0
    return
end

@cuda kernel(CUDA.rand(1), CUDA.rand(1))

With the old rewrite pass this yields:

.visible .func kernel(
        .param .align 8 .b8 kernel_param_0[16],
        .param .align 8 .b8 kernel_param_1[16]
)
{
        .reg .pred      %p<2>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<4>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [kernel_param_0+8];
        ld.param.u64    %rd2, [kernel_param_1+8];
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd3, %rd2, %rd1, %p1;
        mov.u32         %r2, 0;
        st.global.u32   [%rd3], %r2;
        ret;
                                        // -- End function
}

Using byval results in local stores:

.visible .func kernel(
        .param .align 8 .b8 kernel_param_0[16],
        .param .align 8 .b8 kernel_param_1[16]
)
{
        .local .align 8 .b8     __local_depot0[32];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<2>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<32>;

// %bb.0:                               // %top
        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
        add.u64         %rd1, %SP, 0;
        add.u64         %rd2, %SPL, 0;
        ld.param.u64    %rd3, [kernel_param_1+8];
        ld.param.u64    %rd4, [kernel_param_1];
        st.local.u64    [%rd2], %rd4;
        st.local.u64    [%rd2+8], %rd3;
        add.u64         %rd5, %SP, 16;
        add.u64         %rd6, %SPL, 16;
        ld.param.u64    %rd7, [kernel_param_0+8];
        ld.param.u64    %rd8, [kernel_param_0];
        st.local.u64    [%rd6], %rd8;
        st.local.u64    [%rd6+8], %rd7;
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd9, %rd5, %rd1, %p1;
        ld.u8   %rd10, [%rd9+8];
        ld.u8   %rd11, [%rd9+9];
        shl.b64         %rd12, %rd11, 8;
        or.b64          %rd13, %rd12, %rd10;
        ld.u8   %rd14, [%rd9+10];
        ld.u8   %rd15, [%rd9+11];
        shl.b64         %rd16, %rd15, 8;
        or.b64          %rd17, %rd16, %rd14;
        shl.b64         %rd18, %rd17, 16;
        or.b64          %rd19, %rd18, %rd13;
        ld.u8   %rd20, [%rd9+12];
        ld.u8   %rd21, [%rd9+13];
        shl.b64         %rd22, %rd21, 8;
        or.b64          %rd23, %rd22, %rd20;
        ld.u8   %rd24, [%rd9+14];
        ld.u8   %rd25, [%rd9+15];
        shl.b64         %rd26, %rd25, 8;
        or.b64          %rd27, %rd26, %rd24;
        shl.b64         %rd28, %rd27, 16;
        or.b64          %rd29, %rd28, %rd23;
        shl.b64         %rd30, %rd29, 32;
        or.b64          %rd31, %rd30, %rd19;
        mov.u32         %r2, 0;
        st.global.u32   [%rd31], %r2;
        ret;
                                        // -- End function
}

Some further reduced IR:

; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @fast({ [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }) {
entry:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %ptr1 = extractvalue { [1 x i64], i8 addrspace(1)* } %1, 1
  %ptr2 = extractvalue { [1 x i64], i8 addrspace(1)* } %0, 1
  %ptr = select i1 %3, i8 addrspace(1)* %ptr1, i8 addrspace(1)* %ptr2
  %typed_ptr = bitcast i8 addrspace(1)* %ptr to float addrspace(1)*
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}
; ModuleID = 'julia'
source_filename = "julia"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define dso_local void @slow({ [1 x i64], i8 addrspace(1)* }* byval, { [1 x i64], i8 addrspace(1)* }* byval) {
top:
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = icmp eq i32 %2, 0
  %p_arr = select i1 %3, { [1 x i64], i8 addrspace(1)* }* %0, { [1 x i64], i8 addrspace(1)* }* %1
  %p_ptr = getelementptr inbounds { [1 x i64], i8 addrspace(1)* }, { [1 x i64], i8 addrspace(1)* }* %p_arr, i64 0, i32 1
  %p_typed_ptr = bitcast i8 addrspace(1)** %p_ptr to float addrspace(1)**
  %typed_ptr = load float addrspace(1)*, float addrspace(1)** %p_typed_ptr, align 1
  store float 0.000000e+00, float addrspace(1)* %typed_ptr, align 4
  ret void
}

cc @vchuravy

@maleadt maleadt changed the title CUDA.jl is 5-14x slower than CUDAdrv.jl in certain mem access patterns Byval lowering results in local memory usage Oct 7, 2020
maleadt (Member) commented Oct 7, 2020

Even shorter:

target triple = "nvptx64-nvidia-cuda"

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

; kernel: store to one of two 'arrays' (struct with ptr), selected based on the thread idx

; byval results in st.local/ld.local
define void @fast({ i8* }* byval %p_arr1, { i8* }* byval %p_arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %p_arr = select i1 %pred, { i8* }* %p_arr1, { i8* }* %p_arr2
  %p_ptr = getelementptr { i8* }, { i8* }* %p_arr, i64 0, i32 0
  %ptr = load i8*, i8** %p_ptr
  store i8 0, i8* %ptr
  ret void
}

; byval manually performed
define void @slow({ i8* } %arr1, { i8* } %arr2) {
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %pred = icmp eq i32 %tid, 0
  %arr = select i1 %pred, { i8* } %arr1, { i8* } %arr2
  %ptr = extractvalue { i8* } %arr, 0
  store i8 0, i8* %ptr
  ret void
}
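The PTX below can be regenerated from the IR above with LLVM's llc. This is a sketch; it assumes an LLVM build with the NVPTX back-end on PATH, and `byval.ll` is a hypothetical file name holding the IR:

```shell
# Lower the reduced IR to PTX with the NVPTX back-end.
llc -mtriple=nvptx64-nvidia-cuda -O3 byval.ll -o byval.ptx

# Any hits here mean the byval arguments were spilled to local memory.
grep -nE 'st\.local|ld\.local|__local_depot' byval.ptx
```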
.visible .func fast(
        .param .align 8 .b8 fast_param_0[8],
        .param .align 8 .b8 fast_param_1[8]
)
{
        .local .align 8 .b8     __local_depot0[16];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .pred      %p<2>;
        .reg .b16       %rs<2>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<9>;

// %bb.0:
        mov.u64         %SPL, __local_depot0;
        add.u64         %rd2, %SPL, 0;
        ld.param.u64    %rd3, [fast_param_1];
        st.local.u64    [%rd2], %rd3;
        add.u64         %rd5, %SPL, 8;
        ld.param.u64    %rd6, [fast_param_0];
        st.local.u64    [%rd5], %rd6;
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd7, %rd5, %rd2, %p1;
        ld.local.u64    %rd8, [%rd7];
        mov.u16         %rs1, 0;
        st.u8   [%rd8], %rs1;
        ret;
                                        // -- End function
}
        // .globl       slow            // -- Begin function slow
.visible .func slow(
        .param .align 8 .b8 slow_param_0[8],
        .param .align 8 .b8 slow_param_1[8]
)                                       // @slow
{
        .reg .pred      %p<2>;
        .reg .b16       %rs<2>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<4>;

// %bb.0:
        ld.param.u64    %rd1, [slow_param_0];
        ld.param.u64    %rd2, [slow_param_1];
        mov.u32         %r1, %tid.x;
        setp.eq.s32     %p1, %r1, 0;
        selp.b64        %rd3, %rd1, %rd2, %p1;
        mov.u16         %rs1, 0;
        st.u8   [%rd3], %rs1;
        ret;
                                        // -- End function
}

@maleadt maleadt transferred this issue from JuliaGPU/CUDA.jl Oct 7, 2020
@maleadt maleadt changed the title Byval lowering results in local memory usage PTX: byval results in local memory usage Oct 7, 2020
@maleadt maleadt added the ptx Stuff about the NVIDIA PTX back-end. label Oct 7, 2020
@maleadt maleadt reopened this Apr 8, 2021
maleadt (Member) commented Apr 8, 2021

Going to reopen this, as we ideally want to re-enable byval (there are other bugs when not using byval, e.g. passing [1 x i128] to kernels). I had hoped that https://reviews.llvm.org/D98469 would fix the original issue, but it doesn't look like it does. It probably needs to be extended to look past the select.

anj00 (Author) commented Apr 14, 2021

Ah, it felt so close to matching the performance of CUDAdrv.jl etc. with CUDA.jl, and finally being able to migrate production code to the latest and greatest CUDA.jl :( . Looking forward to the fix landing so I can come back :)

maleadt (Member) commented Apr 14, 2021

Performance is identical: this issue has been fixed, as has JuliaGPU/CUDA.jl#799. I just reopened it because we ultimately want to go back to the old behavior if we manage to do so without paying a performance penalty.

anj00 (Author) commented Apr 14, 2021

Maybe it is a different problem with the same code, then?
The test code I posted, add_kernel.jl, with this set of packages

  [6e4b80f9] BenchmarkTools v0.6.0
  [052768ef] CUDA v3.0.0
  [61eb1bfa] GPUCompiler v0.11.2

works fine

add!                         162.600 μs (256 allocations: 4.09 KiB)
add_z_slices!, slices = 1    164.400 μs (262 allocations: 5.31 KiB)
add_z_slices!, slices = 2    307.000 μs (236 allocations: 5.27 KiB)
add_z_slices!, slices = 3    449.100 μs (198 allocations: 5.16 KiB)

However, with the latest versions (which use GPUCompiler v0.11.3)

  [6e4b80f9] BenchmarkTools v0.6.0
  [052768ef] CUDA v3.0.1

I see a regression to the original problem (non-linear growth of execution time):

add!                         161.900 μs (128 allocations: 2.09 KiB)
add_z_slices!, slices = 1    165.200 μs (204 allocations: 4.41 KiB)
add_z_slices!, slices = 2    591.100 μs (276 allocations: 5.89 KiB)
add_z_slices!, slices = 3    2.425 ms (2972 allocations: 48.50 KiB)

maleadt (Member) commented Apr 14, 2021

Oh, nice catch; that change (which I had applied locally to test the actual situation) slipped into bd5bea2... I'll revert it.
EDIT: reverted in 625bb92

maleadt (Member) commented Apr 28, 2021

This code is now covered by the CUDA.jl benchmark suite, JuliaGPU/CUDA.jl@af86b69.
