
cmd/compile: explore optimizations for fixed-length array copies #46529

Open

mdempsky opened this issue Jun 2, 2021 · 6 comments
Labels: NeedsInvestigation, Performance

mdempsky commented Jun 2, 2021

#46505 (comment) reports these results for this benchmark:

BenchmarkNamed-8        201051339                6.263 ns/op
BenchmarkCopy-8         172203873                6.071 ns/op
BenchmarkPointer-8      1000000000               1.031 ns/op

We should see if cmd/compile can optimize the first two as efficiently as the last.

One observation, though: they have different semantics. SumNamed and SumCopy need to support partial copies when len(b) < Size256, whereas SumPointer panics in that case.

We could probably handle copy(ret[:], b[:Size256]) efficiently, though, by rewriting it into ret = *(*[Size256]byte)(b) in the frontend.
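
For reference, a minimal sketch of the three shapes under discussion, reconstructed from this thread (the originals are in #46505; Size256 = 32 is assumed here):

const Size256 = 32 // assumed value, not shown in this thread

// SumNamed copies into a named result; partial copies are allowed.
func SumNamed(b []byte) (ret [Size256]byte) {
	copy(ret[:], b)
	return
}

// SumCopy copies into a local array and returns it by value.
func SumCopy(b []byte) [Size256]byte {
	var ret [Size256]byte
	copy(ret[:], b)
	return ret
}

// SumPointer converts the slice directly (Go 1.17+); it panics
// if len(b) < Size256.
func SumPointer(b []byte) [Size256]byte {
	return *(*[Size256]byte)(b)
}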

Marking for Go 1.18 to at least explore possibilities here.

Related #20859.

/cc @randall77 @bradfitz @katiehockman

mdempsky added the NeedsInvestigation label Jun 2, 2021
mdempsky added this to the Go1.18 milestone Jun 2, 2021

tmthrgd commented Jun 2, 2021

There’s a mistake in that benchmark: BenchmarkCopy measures SumNamed instead of SumCopy, which makes it identical to BenchmarkNamed. I have no idea whether that affects the results, though.

bcmills added the Performance label Jun 3, 2021

zigo101 commented Aug 7, 2021

For large slices, the performance difference between SumNamed and SumPointer is very small when they are expanded (inlined manually): https://play.golang.org/p/WovKO4U-xRk

SumCopy is much slower for large slices. It looks like SumCopy's stack frame is larger than the others', so maybe it copies the data twice.

Other observations:

  • Named results sometimes have a positive performance impact and sometimes a negative one.
  • The effects of manual inlining and automatic inlining can differ.
  • For small slices, a manually expanded SumNamed is slower than a manually expanded SumPointer (see the sketch below).
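
For context, "manually expanded" above means writing the function body out at the call site instead of calling it. Roughly, reusing the assumed Size256 shapes from the first sketch:

func caller(b []byte) {
	// SumNamed expanded manually: the copy happens at the call site.
	var ret1 [Size256]byte
	copy(ret1[:], b)

	// SumPointer expanded manually: a direct slice-to-array-pointer
	// conversion (Go 1.17+); panics if len(b) < Size256.
	ret2 := *(*[Size256]byte)(b)

	_, _ = ret1, ret2 // keep both results live
}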

zigo101 commented Aug 23, 2021

It looks like the conversion approach is more efficient for very small and for sufficiently large arrays:

package arrays

import "testing"

// N is the array size in bytes; edit it and re-run to reproduce each row below.
const N = 32

var r [128][N]byte      // 128 destination arrays, cycled through by the benchmarks
var s = make([]byte, N) // the source slice, exactly N bytes long

func init() {
	println("============= N =", N)
}

// Benchmark_CopyClone fills a fixed-size array with the builtin copy.
func Benchmark_CopyClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		copy(r[i&127][:], s)
	}
}

// Benchmark_ConvertClone dereferences a slice-to-array-pointer
// conversion (Go 1.17+) instead.
func Benchmark_ConvertClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = *(*[N]byte)(s)
	}
}
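
To reproduce a row of the results below, run go test -bench=Clone and edit the constant N between runs; the println in init labels each run with its N.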

Benchmark results:

============= N = 32
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_CopyClone-4      	258112560	         4.515 ns/op
Benchmark_ConvertClone-4   	783765790	         1.507 ns/op

============= N = 64
Benchmark_CopyClone-4      	229087599	         5.080 ns/op
Benchmark_ConvertClone-4   	589875326	         2.004 ns/op

============= N = 96
Benchmark_CopyClone-4      	181417629	         6.356 ns/op
Benchmark_ConvertClone-4   	281229399	         4.241 ns/op

============= N = 128
Benchmark_CopyClone-4      	192070240	         6.065 ns/op
Benchmark_ConvertClone-4   	224375317	         5.349 ns/op

============= N = 256
Benchmark_CopyClone-4      	137411401	         8.610 ns/op
Benchmark_ConvertClone-4   	135560024	         8.805 ns/op

============= N = 2048
Benchmark_CopyClone-4      	12372487	        94.05 ns/op
Benchmark_ConvertClone-4   	11931901	        97.88 ns/op

============= N = 2560
Benchmark_CopyClone-4      	 9082016	       127.2 ns/op
Benchmark_ConvertClone-4   	10472389	       110.6 ns/op

============= N = 4096
Benchmark_CopyClone-4      	 5103782	       226.3 ns/op
Benchmark_ConvertClone-4   	 6493408	       180.8 ns/op

============= N = 8192
Benchmark_CopyClone-4      	 2470774	       433.1 ns/op
Benchmark_ConvertClone-4   	 3447882	       344.9 ns/op

============= N = 65536
Benchmark_CopyClone-4      	  151039	      7240 ns/op
Benchmark_ConvertClone-4   	  242791	      4149 ns/op

willbeason commented Aug 24, 2021

I actually see an 18% performance decrease when copying 2048 bytes, so this seems highly platform- and case-dependent.

On my machine I don't see significant improvements except for copies of <= 64 bytes.

On further testing (I increased the number of arrays from 128 to 256), the results are also sensitive to the number of distinct arrays being copied. The main factors in whether this provides a significant speedup are:

1. Your machine's cache-line (block) size (64 bytes for my machine)
2. The L1 cache size per core (256 KB / 4 cores = 64 KB for me)

Once the arrays being copied exceed my machine's cache-line size, I don't see a significant improvement.

Eventually, once the total set of memory I'm copying from/to exceeds the L1 cache size, I start to see slowdowns (number of arrays × size of array > 64 KB; e.g., 128 arrays × 2048 bytes is already 256 KB).

The slowdowns become significant once the individual arrays being copied exceed the size of the L1 cache.

mknyszek commented

Since there hasn't been any movement here since August and we're in the freeze, I'm going to put it into the backlog.

@mdempsky Feel free to move it out to 1.19 if you plan to do some more exploration of this for the next release.

mknyszek modified the milestones: Go1.18 → Backlog Nov 10, 2021
rip-create-your-account commented

My experience with byte slices is that cap(buf) is almost always much larger than the fixed-length destination array, while len(buf) tends to be smaller. This is a very common outcome for a slice after n, err := conn.Read(buf) followed by process(buf[:n]). In lexers, string interners, and other byte-slice processing code I have code like the following to take advantage of this, to great effect.

import "encoding/binary"

// masks_for_len[n] keeps only the low n bytes of a little-endian uint64.
// (Not shown in the original comment; reconstructed here.)
var masks_for_len = [9]uint64{
	0, 0xff, 0xffff, 0xffffff, 0xffffffff,
	0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff,
}

func next8BytesForKeywordLookup(bytes []byte) (chs uint64) {
	if len(bytes) > 8 {
		return // not a keyword
	}
	if cap(bytes) >= 8 {
		// The backing array has at least 8 bytes, so one wide load is safe.
		chs = binary.LittleEndian.Uint64(bytes[0:8:8])
		chs = chs & masks_for_len[len(bytes)]
	} else {
		var b [8]byte
		copy(b[:], bytes)
		chs = binary.LittleEndian.Uint64(b[:])
	}
	return
}

The compiler could special-case byte slices and generate a cap(bytes) >= 8 branch for small copies, and with 128-bit and 256-bit AVX moves it could do the same for larger copies before switching to rep movsb. That would allow me to get rid of some of this code.
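
For illustration, a hedged sketch of what such a compiler-generated fast path could look like if written out by hand, preserving copy's partial-copy semantics (all identifiers here are illustrative, not actual compiler output):

import "encoding/binary"

// keepLow[n] keeps only the low n bytes of a little-endian uint64.
var keepLow = [9]uint64{
	0, 0xff, 0xffff, 0xffffff, 0xffffffff,
	0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff,
}

// copyUpTo8 behaves like copy(dst[:], src) but performs one wide
// load/store pair when the backing array is known to be large enough.
func copyUpTo8(dst *[8]byte, src []byte) {
	n := len(src)
	if n > 8 {
		n = 8
	}
	if cap(src) >= 8 {
		v := binary.LittleEndian.Uint64(src[0:8:8]) // wide load is safe: cap(src) >= 8
		old := binary.LittleEndian.Uint64(dst[:])
		m := keepLow[n] // take n bytes from src, keep the rest of dst
		binary.LittleEndian.PutUint64(dst[:], v&m | old&^m)
		return
	}
	copy(dst[:], src) // slow path: ordinary copy
}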
