Improve efficiency of casting complex numbers to complex vectors. #9

Merged
merged 1 commit into from
Apr 17, 2023


heltonmc

Partially addresses #8. This does not re-implement the approaches discussed there, but it improves the efficiency of converting a tuple of complex numbers to a complex vector.
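For readers unfamiliar with the conversion, here is a minimal sketch of what "tuple of complex numbers to complex vector" means: the interleaved (re, im) pairs are split into one group of real parts and one group of imaginary parts. `SplitComplex` and `split_elementwise` are hypothetical names used for illustration only; they are not the SIMDMath types or functions, and the actual `ComplexVec` layout may differ.

```julia
# Hypothetical stand-in for a split complex vector: real and imaginary
# parts live in separate tuples. This is NOT the SIMDMath.ComplexVec type.
struct SplitComplex{N,T}
    re::NTuple{N,T}
    im::NTuple{N,T}
end

# Element-by-element conversion: each real and imaginary part is read on
# its own, similar in spirit to the scalar loads in the "Before" IR.
function split_elementwise(z::NTuple{N,Complex{T}}) where {N,T}
    return SplitComplex{N,T}(map(real, z), map(imag, z))
end

a = 1.1 + 1.3im
b = 1.1 + 1.6im
c = 2.1 + 1.6im
d = 2.1 + 1.9im

v = split_elementwise((a, b, c, d))
println(v.re)  # real parts, in input order
println(v.im)  # imaginary parts, in input order
```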

Before...

a = 1.1 + 1.3im
b = 1.1 + 1.6im
c = 2.1 + 1.6im
d = 2.1 + 1.9im

julia> @code_llvm debuginfo=:none ComplexVec((a, b, c, d))
define void @julia_ComplexVec_2450([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, {}* nonnull readonly %1, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %2) #0 {
top:
  %3 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 0, i64 0
  %4 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 0, i64 1
  %5 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 1, i64 0
  %6 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 1, i64 1
  %7 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 2, i64 0
  %8 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 2, i64 1
  %9 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 3, i64 0
  %10 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 3, i64 1
  %11 = load double, double* %3, align 8
  %12 = load double, double* %5, align 8
  %13 = load double, double* %7, align 8
  %14 = load double, double* %9, align 8
  %15 = insertelement <4 x double> undef, double %11, i32 0
  %16 = insertelement <4 x double> %15, double %12, i32 1
  %17 = insertelement <4 x double> %16, double %13, i32 2
  %18 = insertelement <4 x double> %17, double %14, i32 3
  %19 = load double, double* %4, align 8
  %20 = load double, double* %6, align 8
  %21 = load double, double* %8, align 8
  %22 = load double, double* %10, align 8
  %23 = insertelement <4 x double> undef, double %19, i32 0
  %24 = insertelement <4 x double> %23, double %20, i32 1
  %25 = insertelement <4 x double> %24, double %21, i32 2
  %26 = insertelement <4 x double> %25, double %22, i32 3
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %18, <4 x double>* %.sroa.0.0..sroa_idx, align 32
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
  store <4 x double> %26, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
  ret void
}


julia> @code_native debuginfo=:none ComplexVec((a, b, c, d))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_julia_ComplexVec_2452          ; -- Begin function julia_ComplexVec_2452
	.p2align	2
_julia_ComplexVec_2452:                 ; @julia_ComplexVec_2452
	.cfi_startproc
; %bb.0:                                ; %top
	add	x9, x1, #16                     ; =16
	add	x10, x1, #24                    ; =24
	add	x11, x1, #48                    ; =48
	add	x12, x1, #56                    ; =56
	ldp	d0, d1, [x1]
	ld1	{ v0.d }[1], [x9]
	ldp	d2, d3, [x1, #32]
	ld1	{ v2.d }[1], [x11]
	ld1	{ v1.d }[1], [x10]
	ld1	{ v3.d }[1], [x12]
	stp	q0, q2, [x8]
	stp	q1, q3, [x8, #32]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols

After...

julia> @code_llvm debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
define void @julia_ComplexVec3_2553([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %1) #0 {
top:
  %2 = bitcast [4 x [2 x double]]* %1 to <8 x double>*
  %3 = load <8 x double>, <8 x double>* %2, align 8
  %res.i = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %res.i2 = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %res.i, <4 x double>* %.sroa.0.0..sroa_idx, align 32
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
  store <4 x double> %res.i2, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
  ret void
}

julia> @code_native debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_julia_ComplexVec3_2577         ; -- Begin function julia_ComplexVec3_2577
	.p2align	2
_julia_ComplexVec3_2577:                ; @julia_ComplexVec3_2577
	.cfi_startproc
; %bb.0:                                ; %top
	ld2	{ v0.2d, v1.2d }, [x0], #32
	ld2	{ v2.2d, v3.2d }, [x0]
	stp	q0, q2, [x8]
	stp	q1, q3, [x8, #32]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols
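
The two `shufflevector` masks in the "After" IR pick the even lanes (`0, 2, 4, 6`) for the real parts and the odd lanes (`1, 3, 5, 7`) for the imaginary parts of one 8-wide load. The plain-Julia sketch below only illustrates that index pattern; `deinterleave` is a hypothetical helper, it allocates, and it is not how the PR implements the conversion.

```julia
# Illustration of the even/odd lane selection used by the shuffle masks.
# LLVM masks are 0-based; Julia indices are 1-based, so <0,2,4,6> becomes
# 1:2:end and <1,3,5,7> becomes 2:2:end.
function deinterleave(z::NTuple{N,Complex{T}}) where {N,T}
    flat = reinterpret(T, collect(z))    # [re1, im1, re2, im2, ...]
    re_parts = Tuple(flat[1:2:end])      # even lanes: real parts
    im_parts = Tuple(flat[2:2:end])      # odd lanes: imaginary parts
    return re_parts, im_parts
end

re_parts, im_parts = deinterleave((1.1 + 1.3im, 1.1 + 1.6im, 2.1 + 1.6im, 2.1 + 1.9im))
println(re_parts)
println(im_parts)
```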

@codecov-commenter

Codecov Report

Merging #9 (bf302fe) into main (abf7ec4) will decrease coverage by 0.66%.
The diff coverage is 87.50%.


@@            Coverage Diff             @@
##             main       #9      +/-   ##
==========================================
- Coverage   93.47%   92.82%   -0.66%     
==========================================
  Files           6        6              
  Lines         230      237       +7     
==========================================
+ Hits          215      220       +5     
- Misses         15       17       +2     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/SIMDMath.jl | 100.00% <ø> (ø) |
| src/types.jl | 89.47% <87.50%> (-10.53%) ⬇️ |


@heltonmc

After some benchmarks, this does improve performance slightly (median of roughly 80.8 ns on master versus 77.7 ns with this PR) and greatly reduces the amount of LLVM IR and native code generated.

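# `x` and `y` are not defined in this comment; from their use below they are
# presumably equal-length collections of 4-tuples of complex numbers.

# Accumulate x[i] * y[i] using the SIMDMath complex vector type.
function test(x, y)
    @assert length(x) == length(y)
    out = SIMDMath.ComplexVec{4, Float64}((0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
    for i in eachindex(x)
        out = SIMDMath.fadd(out, SIMDMath.fmul(SIMDMath.ComplexVec(x[i]), SIMDMath.ComplexVec(y[i])))
    end
    return out
end

# Same accumulation on plain tuples of complex numbers (relies on the auto-vectorizer).
function test2(x, y)
    @assert length(x) == length(y)
    out = (0.0 + 0.0im, 0.0 + 0.0im, 0.0 + 0.0im, 0.0 + 0.0im)
    for i in eachindex(x)
        out = @. out + x[i] * y[i]
    end
    return out
end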
# base (auto-vectorizer)
julia> @benchmark test2($x, $y)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  92.932 ns … 137.261 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     93.020 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.420 ns ±   1.941 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄▃                              ▁▃▁   ▁                    ▁
  ████▆▃▄▃▃▃▁▃▁▅▁▄▁▄▄▃▄▃▃▄▄▅▃▃▁▃▁▃▁▆███▅▅███▆▅▃▄▁▄▅▃▅▃▅▆▆▅▆▅▅▆ █
  92.9 ns       Histogram: log(frequency) by time      97.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# master branch
julia> @benchmark test($x, $y)
BenchmarkTools.Trial: 10000 samples with 969 evaluations.
 Range (min … max):  80.753 ns … 116.185 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     80.840 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.129 ns ±   1.557 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▂                                 ▁▃                      ▁
  ████▆▅▄▄▁▁▃▁▄▃▁▃▄▄▃▃▅▄▄▃▁▃▁▁▁▁▃▁▁▁▁▆███▅▆▄▄▄▁▁▁▄▁▅▃▁▅▅▅▄▅▆▆▅ █
  80.8 ns       Histogram: log(frequency) by time      85.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# this PR
julia> @benchmark test($x, $y)
BenchmarkTools.Trial: 10000 samples with 970 evaluations.
 Range (min … max):  77.620 ns … 114.777 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     77.749 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   78.005 ns ±   1.677 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃▂                                 ▂▂                     ▁
  █████▆▄▄▃▃▃▃▄▃▃▁▄▅▃▁▄▃▄▃▄▄▁▁▁▁▃▄▃▃▄▁▃███▇▅▆▁▄▄▄▁▃▁▃▁▄▄▄▅▄▅▆▆ █
  77.6 ns       Histogram: log(frequency) by time      82.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
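
For anyone wanting to sanity-check the two accumulation styles without SIMDMath, here is a rough plain-Julia sketch. The inputs `x` and `y` are made up (the thread does not show how they were built), and `accumulate_interleaved` / `accumulate_split` are hypothetical names. The split version uses the textbook product (a + bi)(c + di) = (ac - bd) + (ad + bc)i lane by lane; whether `SIMDMath.fmul` and `SIMDMath.fadd` compute it in exactly this way is an assumption.

```julia
# Hypothetical inputs: vectors of 4-tuples of ComplexF64 with small
# integer-valued parts, so both versions give bit-identical sums.
n = 16
x = [ntuple(k -> complex(Float64(i + k), Float64(i - k)), 4) for i in 1:n]
y = [ntuple(k -> complex(Float64(k), Float64(i)), 4) for i in 1:n]

# Tuple-of-complex accumulation, in the style of `test2`.
function accumulate_interleaved(x, y)
    out = ntuple(_ -> 0.0 + 0.0im, 4)
    for i in eachindex(x)
        out = out .+ x[i] .* y[i]
    end
    return out
end

# Same sum with real and imaginary parts kept in separate tuples.
function accumulate_split(x, y)
    acc_re = ntuple(_ -> 0.0, 4)
    acc_im = ntuple(_ -> 0.0, 4)
    for i in eachindex(x)
        xr, xi = map(real, x[i]), map(imag, x[i])
        yr, yi = map(real, y[i]), map(imag, y[i])
        acc_re = acc_re .+ xr .* yr .- xi .* yi   # ac - bd
        acc_im = acc_im .+ xr .* yi .+ xi .* yr   # ad + bc
    end
    return acc_re, acc_im
end

ref = accumulate_interleaved(x, y)
split_re, split_im = accumulate_split(x, y)
println(split_re == map(real, ref) && split_im == map(imag, ref))
```

With general floating-point inputs the two versions may differ in the last bits, so an approximate comparison such as `isapprox` would be the safer check there.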

This is a clear improvement over the current method, so I will merge it. The discussion in the other thread can stay open, since the question of which format to use is separate from improving the implementation of one method.

@heltonmc heltonmc merged commit 6d2955e into main Apr 17, 2023
@heltonmc heltonmc deleted the complexcast branch April 17, 2023 01:40