cmd/compile: explore optimizations for fixed-length array copies #46529
There’s a mistake in that benchmark,

For large slices, the performance difference of

Other observations:
It looks like the conversion approach is more efficient for very small and for sufficiently large arrays:

```go
package arrays

import "testing"

const N = 32

var r [128][N]byte
var s = make([]byte, N)

func init() {
	println("============= N =", N)
}

func Benchmark_CopyClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		copy(r[i&127][:], s)
	}
}

func Benchmark_ConvertClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = *(*[N]byte)(s)
	}
}
```

Benchmark results:
I actually see an 18% performance decrease when copying 2048 bytes. This seems super platform- and case-dependent. I don't see significant improvements on my machine other than for <=64 bytes. On further testing (I increased the number of arrays from 128 to 256), the results are also sensitive to the number of distinct arrays being copied.

The main factors in whether this provides a significant speedup are:

1. Your machine's block size (64 bytes for my machine). Once the arrays being copied exceed my machine's block size, I don't see a significant improvement.
2. Your machine's cache sizes. Eventually, once the total set of memory I'm copying from/to exceeds the L1 cache size, I start to see slowdowns (number of arrays * size of array > 64KB). The slowdowns become significant once the individual arrays being copied exceed the size of the L1 cache.
Since there hasn't been any movement here since August and we're in the freeze, I'm going to put it into the backlog. @mdempsky Feel free to move it out to 1.19 if you plan to do some more exploration of this for next release.
My experience when it comes to byte slices is illustrated by this function:

```go
import "encoding/binary"

// masks_for_len is assumed here (it was not shown in the original
// comment): masks_for_len[i] keeps the low i bytes of a word.
var masks_for_len = [9]uint64{
	0, 0xff, 0xffff, 0xffffff, 0xffffffff,
	0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff,
}

func next8BytesForKeywordLookup(bytes []byte) (chs uint64) {
	if len(bytes) > 8 {
		return // not a keyword
	}
	if cap(bytes) >= 8 {
		// The full slice expression bytes[0:8:8] is valid whenever
		// cap(bytes) >= 8, even if len(bytes) < 8.
		chs = binary.LittleEndian.Uint64(bytes[0:8:8])
		chs = chs & masks_for_len[len(bytes)]
	} else {
		var b [8]byte
		copy(b[:], bytes)
		chs = binary.LittleEndian.Uint64(b[:])
	}
	return
}
```

The compiler could special-case byte slices and generate
#46505 (comment) reports these results for this benchmark:

We should see if cmd/compile can optimize the first two as efficiently as the last.

One observation, though, is that they have different semantics: `SumNamed` and `SumCopy` need to support partial copies if `len(b) < Size256`, but `SumPointer` panics in this case. We could probably handle `copy(ret[:], b[:Size256])` efficiently though, by rewriting it into `ret = *(*[Size256]byte)(b)` in the frontend.

Marking for Go 1.18 to at least explore possibilities here.
Related #20859.
/cc @randall77 @bradfitz @katiehockman