Improve efficiency of casting complex numbers to complex vectors. #9

heltonmc · 2023-04-13T18:53:16Z

Partially addresses #8. This does not re-implement the approach; it only improves the efficiency of converting a tuple of complex numbers to a complex vector.

Before...

a = 1.1 + 1.3im
b = 1.1 + 1.6im
c = 2.1 + 1.6im
d = 2.1 + 1.9im

julia> @code_llvm debuginfo=:none ComplexVec((a, b, c, d))
define void @julia_ComplexVec_2450([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, {}* nonnull readonly %1, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %2) #0 {
top:
  %3 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 0, i64 0
  %4 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 0, i64 1
  %5 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 1, i64 0
  %6 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 1, i64 1
  %7 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 2, i64 0
  %8 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 2, i64 1
  %9 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 3, i64 0
  %10 = getelementptr inbounds [4 x [2 x double]], [4 x [2 x double]]* %2, i64 0, i64 3, i64 1
  %11 = load double, double* %3, align 8
  %12 = load double, double* %5, align 8
  %13 = load double, double* %7, align 8
  %14 = load double, double* %9, align 8
  %15 = insertelement <4 x double> undef, double %11, i32 0
  %16 = insertelement <4 x double> %15, double %12, i32 1
  %17 = insertelement <4 x double> %16, double %13, i32 2
  %18 = insertelement <4 x double> %17, double %14, i32 3
  %19 = load double, double* %4, align 8
  %20 = load double, double* %6, align 8
  %21 = load double, double* %8, align 8
  %22 = load double, double* %10, align 8
  %23 = insertelement <4 x double> undef, double %19, i32 0
  %24 = insertelement <4 x double> %23, double %20, i32 1
  %25 = insertelement <4 x double> %24, double %21, i32 2
  %26 = insertelement <4 x double> %25, double %22, i32 3
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %18, <4 x double>* %.sroa.0.0..sroa_idx, align 32
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
  store <4 x double> %26, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
  ret void
}


julia> @code_native debuginfo=:none ComplexVec((a, b, c, d))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_julia_ComplexVec_2452          ; -- Begin function julia_ComplexVec_2452
	.p2align	2
_julia_ComplexVec_2452:                 ; @julia_ComplexVec_2452
	.cfi_startproc
; %bb.0:                                ; %top
	add	x9, x1, #16                     ; =16
	add	x10, x1, #24                    ; =24
	add	x11, x1, #48                    ; =48
	add	x12, x1, #56                    ; =56
	ldp	d0, d1, [x1]
	ld1	{ v0.d }[1], [x9]
	ldp	d2, d3, [x1, #32]
	ld1	{ v2.d }[1], [x11]
	ld1	{ v1.d }[1], [x10]
	ld1	{ v3.d }[1], [x12]
	stp	q0, q2, [x8]
	stp	q1, q3, [x8, #32]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols

After...

julia> @code_llvm debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
define void @julia_ComplexVec3_2553([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %1) #0 {
top:
  %2 = bitcast [4 x [2 x double]]* %1 to <8 x double>*
  %3 = load <8 x double>, <8 x double>* %2, align 8
  %res.i = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %res.i2 = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %res.i, <4 x double>* %.sroa.0.0..sroa_idx, align 32
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
  store <4 x double> %res.i2, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
  ret void
}

julia> @code_native debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_julia_ComplexVec3_2577         ; -- Begin function julia_ComplexVec3_2577
	.p2align	2
_julia_ComplexVec3_2577:                ; @julia_ComplexVec3_2577
	.cfi_startproc
; %bb.0:                                ; %top
	ld2	{ v0.2d, v1.2d }, [x0], #32
	ld2	{ v2.2d, v3.2d }, [x0]
	stp	q0, q2, [x8]
	stp	q1, q3, [x8, #32]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols
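
The two `shufflevector` masks in the LLVM IR above (`<0, 2, 4, 6>` and `<1, 3, 5, 7>`) amount to de-interleaving the flat `(re, im, re, im, ...)` layout of the input tuple into one vector of real parts and one of imaginary parts. A plain-Julia model of that data movement, as a rough sketch only: `deinterleave` is a hypothetical name used for illustration, it is not part of SIMDMath, and it says nothing about how `ComplexVec` is actually implemented.

```julia
# Hypothetical helper (not SIMDMath API): models the even/odd lane split in plain Julia.
function deinterleave(z::NTuple{N, Complex{T}}) where {N, T}
    # Flatten to (re1, im1, re2, im2, ...), matching the in-memory layout of the tuple.
    flat = ntuple(i -> isodd(i) ? real(z[(i + 1) ÷ 2]) : imag(z[i ÷ 2]), Val(2N))
    # LLVM masks are 0-based: <0, 2, 4, 6> picks 1-based slots 1, 3, 5, 7 (real parts).
    res = ntuple(i -> flat[2i - 1], Val(N))
    # <1, 3, 5, 7> picks 1-based slots 2, 4, 6, 8 (imaginary parts).
    ims = ntuple(i -> flat[2i], Val(N))
    return res, ims
end

a = 1.1 + 1.3im
b = 1.1 + 1.6im
c = 2.1 + 1.6im
d = 2.1 + 1.9im
res, ims = deinterleave((a, b, c, d))
```

With the values from the description, `res` holds the four real parts and `ims` the four imaginary parts, in input order.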

codecov-commenter · 2023-04-13T18:57:19Z

Codecov Report

Merging #9 (bf302fe) into main (abf7ec4) will decrease coverage by 0.66%.
The diff coverage is 87.50%.


@@            Coverage Diff             @@
##             main       #9      +/-   ##
==========================================
- Coverage   93.47%   92.82%   -0.66%     
==========================================
  Files           6        6              
  Lines         230      237       +7     
==========================================
+ Hits          215      220       +5     
- Misses         15       17       +2

Impacted Files	Coverage Δ
src/SIMDMath.jl	`100.00% <ø> (ø)`
src/types.jl	`89.47% <87.50%> (-10.53%)`	⬇️


heltonmc · 2023-04-17T01:35:43Z

After some benchmarks, this does improve performance slightly, and it greatly reduces the amount of LLVM IR and native code generated.

function test(x, y)
    @assert length(x) == length(y)
    out = SIMDMath.ComplexVec{4, Float64}((0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
    for i in eachindex(x)
        out = SIMDMath.fadd(out, SIMDMath.fmul(SIMDMath.ComplexVec(x[i]), SIMDMath.ComplexVec(y[i])))
    end
    return out
end
function test2(x, y)
    @assert length(x) == length(y)
    out = (0.0 + 0.0im, 0.0 + 0.0im, 0.0 + 0.0im, 0.0 + 0.0im)
    for i in eachindex(x)
        out = @. out + x[i] * y[i]
    end
    return out
end
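
The inputs `x` and `y` are not shown in this comment. A minimal sketch of one possible setup, assuming each is a vector of 4-tuples of `ComplexF64` (which is what `ComplexVec(x[i])` and the broadcast in `test2` appear to expect). The length `n = 32` and the random values are arbitrary assumptions, not the data behind the timings here.

```julia
# Assumed input shape: vectors of 4-tuples of ComplexF64; the length 32 is arbitrary.
n = 32
x = [ntuple(_ -> rand(ComplexF64), 4) for _ in 1:n]
y = [ntuple(_ -> rand(ComplexF64), 4) for _ in 1:n]

# Scalar reference for the lane-wise sum of products, independent of SIMDMath.
reference = ntuple(k -> sum(x[i][k] * y[i][k] for i in 1:n), 4)
```

`test2(x, y)` should agree with `reference` up to floating-point rounding, which gives a quick sanity check before timing either function.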

# base (auto-vectorizer)
julia> @benchmark test2($x, $y)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  92.932 ns … 137.261 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     93.020 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.420 ns ±   1.941 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄▃                              ▁▃▁   ▁                    ▁
  ████▆▃▄▃▃▃▁▃▁▅▁▄▁▄▄▃▄▃▃▄▄▅▃▃▁▃▁▃▁▆███▅▅███▆▅▃▄▁▄▅▃▅▃▅▆▆▅▆▅▅▆ █
  92.9 ns       Histogram: log(frequency) by time      97.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# master branch
julia> @benchmark test($x, $y)
BenchmarkTools.Trial: 10000 samples with 969 evaluations.
 Range (min … max):  80.753 ns … 116.185 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     80.840 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.129 ns ±   1.557 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▂                                 ▁▃                      ▁
  ████▆▅▄▄▁▁▃▁▄▃▁▃▄▄▃▃▅▄▄▃▁▃▁▁▁▁▃▁▁▁▁▆███▅▆▄▄▄▁▁▁▄▁▅▃▁▅▅▅▄▅▆▆▅ █
  80.8 ns       Histogram: log(frequency) by time      85.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# this PR
julia> @benchmark test($x, $y)
BenchmarkTools.Trial: 10000 samples with 970 evaluations.
 Range (min … max):  77.620 ns … 114.777 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     77.749 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   78.005 ns ±   1.677 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃▂                                 ▂▂                     ▁
  █████▆▄▄▃▃▃▃▄▃▃▁▄▅▃▁▄▃▄▃▄▄▁▁▁▁▃▄▃▃▄▁▃███▇▅▆▁▄▄▄▁▃▁▃▁▄▄▄▅▄▅▆▆ █
  77.6 ns       Histogram: log(frequency) by time      82.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is a clear improvement over the current method (median 77.7 ns versus 80.8 ns on master, roughly 4% faster), so I will merge. The other thread (#8) can stay open, since the question of which format to use is separate from improving the implementation of one method.

improve complex casting

bf302fe

heltonmc mentioned this pull request Apr 13, 2023

Improve casting of complex tuples to complex vectors.. Perhaps consider alternative approach? #8

Open

heltonmc merged commit 6d2955e into main Apr 17, 2023

heltonmc deleted the complexcast branch April 17, 2023 01:40
