
Intermittent load_store_tg test failure #428

Closed
maleadt opened this issue Sep 24, 2024 · 3 comments · Fixed by #432

Comments

@maleadt
Member

maleadt commented Sep 24, 2024

Error in testset device/intrinsics:
Test Failed at /Users/tim/Julia/pkg/Metal/test/device/intrinsics.jl:388
  Expression: Array(a) == Array(b)
   Evaluated: Float32[0.010908663 0.45874983 … 0.45182294 0.19928169; 0.049171805 0.6918573 … 0.7434447 0.11716604; … ; 0.72251517 0.859477 … 0.5544929 0.733749; 0.024094284 0.85571766 … 0.8398908 0.4304701] == Float32[0.0 0.0 … 0.45182294 0.19928169; 0.0 0.0 … 0.7434447 0.11716604; … ; 0.0 0.0 … 0.5544929 0.733749; 0.0 0.0 … 0.8398908 0.4304701]

Observed on:

macOS 15.0.0, Darwin 24.0.0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

Julia packages:
- Metal.jl: 1.3.0
- GPUArrays: 10.3.1
- GPUCompiler: 0.27.8
- KernelAbstractions: 0.9.27
- ObjectiveC: 3.1.0
- LLVM: 9.1.2
- LLVMDowngrader_jll: 0.3.0+1

1 device:
- Apple M3 Pro (80.000 KiB allocated)
@christiangnrd
Contributor

I can reproduce on an M2 Max with Julia 1.10 and 1.11.

It takes between 350 and 23,000 iterations to reproduce using the code below.

using Metal, Test

count = 0
for _ in 1:50000
    for typ in [Float16, Float32]
        function kernel(a::MtlDeviceArray{T}, b::MtlDeviceArray{T}) where {T}
            pos = thread_position_in_threadgroup_2d()

            tg_a = MtlThreadGroupArray(T, (8, 8))
            tg_a[pos.x, pos.y] = a[pos.x, pos.y]
            sg_a = simdgroup_load(tg_a)

            tg_b = MtlThreadGroupArray(T, (8, 8))
            simdgroup_store(sg_a, tg_b)
            b[pos.x, pos.y] = tg_b[pos.x, pos.y]

            return
        end

        a = MtlArray(rand(typ, 8, 8))
        b = MtlArray(zeros(typ, 8, 8))
        @metal threads=(8, 8) kernel(a, b)
        if Array(a) != Array(b)
            @test Array(a) == Array(b)
        end
    end
    count += 1
end

@christiangnrd
Contributor

After adding a threadgroup_barrier(Metal.MemoryFlagThreadGroup) call, as shown below, this didn't fail in 1,000,000 iterations. Is this just a fluke? I still don't fully understand what barriers do, so I don't know whether this actually fixes the problem or just hides it.

using Metal, Test

count = 0
for _ in 1:50000
    for typ in [Float16, Float32]
        function kernel(a::MtlDeviceArray{T}, b::MtlDeviceArray{T}) where {T}
            pos = thread_position_in_threadgroup_2d()

            tg_a = MtlThreadGroupArray(T, (8, 8))
            tg_a[pos.x, pos.y] = a[pos.x, pos.y]
            threadgroup_barrier(Metal.MemoryFlagThreadGroup)
            sg_a = simdgroup_load(tg_a)

            tg_b = MtlThreadGroupArray(T, (8, 8))
            simdgroup_store(sg_a, tg_b)
            b[pos.x, pos.y] = tg_b[pos.x, pos.y]

            return
        end

        a = MtlArray(rand(typ, 8, 8))
        b = MtlArray(zeros(typ, 8, 8))
        @metal threads=(8, 8) kernel(a, b)
        if Array(a) != Array(b)
            @test Array(a) == Array(b)
        end
    end
    count += 1
end

@maleadt
Member Author

maleadt commented Sep 26, 2024

Yeah, that does seem correct. I guess the simdgroup_load could result in a different thread reading the threadgroup array, so we need a memory fence for the threadgroup memory (i.e. threadgroup_barrier(Metal.MemoryFlagThreadGroup)) to ensure the correct memory is read by the SIMD operation afterwards.

Possibly we also need a fence after the simdgroup_store, since the wavefront doesn't store directly to device memory but goes through the threadgroup array first. I'll add one to be sure.
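Putting both suggestions together, the fixed kernel might look like the sketch below. This is only an illustration combining the barrier from the earlier comment with the second fence proposed here; the exact placement landed in the fix PR may differ.

```julia
using Metal

function kernel(a::MtlDeviceArray{T}, b::MtlDeviceArray{T}) where {T}
    pos = thread_position_in_threadgroup_2d()

    tg_a = MtlThreadGroupArray(T, (8, 8))
    tg_a[pos.x, pos.y] = a[pos.x, pos.y]
    # Fence 1: make every thread's write to tg_a visible to the whole
    # threadgroup before the SIMD lanes cooperatively read it.
    threadgroup_barrier(Metal.MemoryFlagThreadGroup)
    sg_a = simdgroup_load(tg_a)

    tg_b = MtlThreadGroupArray(T, (8, 8))
    simdgroup_store(sg_a, tg_b)
    # Fence 2: make the cooperative store to tg_b visible before each
    # thread copies its own element out to device memory.
    threadgroup_barrier(Metal.MemoryFlagThreadGroup)
    b[pos.x, pos.y] = tg_b[pos.x, pos.y]

    return
end
```

Launched as in the reproducer above, with @metal threads=(8, 8) kernel(a, b).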
