cmd/compile: unexpected performance difference accessing slices with different caps #27857

iand · 2018-09-25T16:56:57Z

Using tip (go version devel +b794ca64d2 Mon Sep 3 07:14:25 2018 +0000 linux/amd64)

While implementing some bounds check optimizations in image/draw CL 136935 I came across an unexpected performance difference between accessing elements of a slice s created like this

s := spix[i : i+4 : len(spix)]

and

s := spix[i : i+4 : i+4]

Both forms eliminate bound checks for accessing elements 0-3 of s but the second form has a significant performance improvement over the first (see CL for benchmarks).

Building with go build -gcflags="-d=ssa/check_bce/debug=1" shows no difference in bounds checks between the two forms so presumably it has something to do with the resulting size of the slice.

Attached is a simple program and disassembly derived from the image/draw code that demonstrates the effect. You can see that the assembly generated for the second form is quite a bit shorter.

bounds.go.txt
bounds.asm.txt

The text was updated successfully, but these errors were encountered:

randall77 · 2018-09-25T17:30:07Z

Go has a special case when slicing and generating the empty slice.
This has the potential to generate a pointer to the next object in memory, like this:

x := make([]int, 4)
x = x[4:]

The resulting x has capacity 0, and if we just computed its base pointer as x.ptr += 4*sizeof(int) then x.ptr would point to the next object in memory after the [4]int that was allocated. This is bad because it could mistakenly retain a random object in the heap.

So when slicing, we need to somehow avoid contructing such a pointer. Currently we do this by avoiding the add when the capacity is 0. So we do x.ptr += newcap == 0 ? 0 : 4*sizeof(int). And we do it without a conditional, using shift and mask tricks.

When you do spix[i:i+4:i+4] the new capacity is 4, and the compiler knows that. So the conditional in the expression above constant folds, and we just get x.ptr += i*sizeof(int). When you do spix[i:i+4:len(spix)], it's not obvious whether the resulting slice has capacity 0 or not, so the arithmetic is still all there.

The optimization that would fix this is that if the resulting slice is known to have nonzero length (we know that here because i+4-i > 0), we don't need to guard against next object pointers.

gopherbot · 2018-09-25T18:12:29Z

Change https://golang.org/cl/136935 mentions this issue: image/draw: optimize bounds checks in loops

randall77 · 2018-09-25T18:31:34Z

s/the new capacity is 0/the new capacity is 4/
s/x.ptr += 0/x.ptr += i*sizeof(int)/

Use subslices with known length and cap to give bounds checking hints to the compiler. Improves over the earlier pointer based optimizations in https://go-review.googlesource.com/c/go/+/14093 for GlyphOver but not for FillOver so the latter is left unchanged. See #27857 for discussion of small caps used in subslices. name old time/op new time/op delta FillOver-8 607µs ± 1% 609µs ± 1% ~ (p=0.447 n=9+10) FillSrc-8 23.0µs ± 1% 22.9µs ± 2% ~ (p=0.412 n=9+10) CopyOver-8 647µs ± 0% 560µs ± 0% -13.43% (p=0.000 n=9+10) CopySrc-8 19.3µs ± 1% 19.1µs ± 2% -0.66% (p=0.029 n=10+10) NRGBAOver-8 697µs ± 1% 651µs ± 1% -6.64% (p=0.000 n=10+10) NRGBASrc-8 405µs ± 1% 347µs ± 0% -14.23% (p=0.000 n=10+10) YCbCr-8 432µs ± 2% 431µs ± 1% ~ (p=0.764 n=10+9) Gray-8 164µs ± 1% 139µs ± 1% -15.44% (p=0.000 n=10+10) CMYK-8 498µs ± 0% 461µs ± 0% -7.49% (p=0.000 n=10+9) GlyphOver-8 220µs ± 0% 199µs ± 0% -9.52% (p=0.000 n=9+10) RGBA-8 3.81ms ± 5% 3.79ms ± 5% ~ (p=0.549 n=9+10) Paletted-8 1.73ms ± 0% 1.73ms ± 1% ~ (p=0.278 n=10+9) GenericOver-8 11.0ms ± 2% 11.0ms ± 1% ~ (p=0.842 n=9+10) GenericMaskOver-8 5.29ms ± 1% 5.30ms ± 0% ~ (p=0.182 n=9+10) GenericSrc-8 4.24ms ± 1% 4.24ms ± 0% ~ (p=0.436 n=9+9) GenericMaskSrc-8 7.89ms ± 1% 7.90ms ± 2% ~ (p=0.631 n=10+10) Change-Id: I6fe1b21bb5e255826cbfdd2e73efd5858cd5557c Reviewed-on: https://go-review.googlesource.com/136935 Reviewed-by: Brad Fitzpatrick <[email protected]> Run-TryBot: Brad Fitzpatrick <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>

gopherbot · 2018-09-25T22:26:57Z

Change https://golang.org/cl/137495 mentions this issue: image: optimize bounds checking for At and Set methods

Use a subslice of the pixel data to give the compiler hints for bounds checking. Only do this for image formats that require 4 or more slice reads/writes. See #27857 for discussion of small cap sizes. name old time/op new time/op delta At/rgba-8 18.8ns ± 2% 18.5ns ± 1% -1.49% (p=0.026 n=10+10) At/rgba64-8 22.2ns ± 2% 21.1ns ± 3% -4.51% (p=0.000 n=10+10) At/nrgba-8 18.8ns ± 2% 18.7ns ± 2% ~ (p=0.467 n=10+10) At/nrgba64-8 21.9ns ± 2% 21.0ns ± 2% -4.15% (p=0.000 n=10+9) At/alpha-8 14.3ns ± 1% 14.3ns ± 2% ~ (p=0.543 n=10+10) At/alpha16-8 6.43ns ± 1% 6.47ns ± 1% ~ (p=0.053 n=10+10) At/gray-8 14.4ns ± 2% 14.6ns ± 5% ~ (p=0.194 n=10+10) At/gray16-8 6.52ns ± 1% 6.55ns ± 2% ~ (p=0.610 n=10+10) At/paletted-8 4.17ns ± 1% 4.21ns ± 2% ~ (p=0.095 n=9+10) Set/rgba-8 39.2ns ± 2% 40.1ns ± 4% +2.45% (p=0.007 n=10+10) Set/rgba64-8 46.2ns ± 3% 43.3ns ± 3% -6.11% (p=0.000 n=10+10) Set/nrgba-8 39.2ns ± 1% 39.7ns ± 5% ~ (p=0.407 n=10+10) Set/nrgba64-8 45.6ns ± 3% 42.9ns ± 3% -5.83% (p=0.000 n=9+10) Set/alpha-8 35.0ns ± 3% 34.1ns ± 2% -2.43% (p=0.017 n=10+10) Set/alpha16-8 36.3ns ± 4% 35.8ns ± 5% ~ (p=0.254 n=10+10) Set/gray-8 19.8ns ± 1% 19.7ns ± 0% -0.69% (p=0.002 n=8+6) Set/gray16-8 36.0ns ± 1% 36.4ns ± 2% +1.08% (p=0.037 n=10+10) Set/paletted-8 39.1ns ± 0% 39.6ns ± 1% +1.30% (p=0.000 n=10+10) RGBAAt-8 3.72ns ± 1% 3.58ns ± 1% -3.76% (p=0.000 n=9+10) RGBASetRGBA-8 4.35ns ± 1% 3.70ns ± 1% -14.92% (p=0.000 n=10+10) RGBA64At-8 5.08ns ± 1% 3.69ns ± 1% -27.40% (p=0.000 n=9+9) RGBA64SetRGBA64-8 6.65ns ± 2% 3.63ns ± 0% -45.35% (p=0.000 n=10+9) NRGBAAt-8 3.72ns ± 1% 3.59ns ± 1% -3.55% (p=0.000 n=10+10) NRGBASetNRGBA-8 4.05ns ± 0% 3.71ns ± 1% -8.57% (p=0.000 n=9+10) NRGBA64At-8 4.99ns ± 1% 3.69ns ± 0% -26.07% (p=0.000 n=10+9) NRGBA64SetNRGBA64-8 6.66ns ± 1% 3.64ns ± 1% -45.40% (p=0.000 n=10+10) AlphaAt-8 1.44ns ± 1% 1.44ns ± 0% ~ (p=0.176 n=10+7) AlphaSetAlpha-8 1.60ns ± 2% 1.56ns ± 0% -2.62% (p=0.000 n=10+6) Alpha16At-8 2.87ns ± 1% 2.92ns ± 2% +1.67% (p=0.001 n=10+10) AlphaSetAlpha16-8 3.26ns ± 1% 3.35ns ± 1% +2.68% (p=0.012 n=8+3) Fixes #14884 Change-Id: Ia0383530596a550e1b1c7aafce5220e5e0935a53 Reviewed-on: https://go-review.googlesource.com/137495 Reviewed-by: Brad Fitzpatrick <[email protected]>

mundaym · 2019-03-15T11:03:13Z

@randall77

So when slicing, we need to somehow avoid contructing such a pointer. Currently we do this by avoiding the add when the capacity is 0. So we do x.ptr += newcap == 0 ? 0 : 4*sizeof(int). And we do it without a conditional, using shift and mask tricks.

I wonder if we should revisit using 'shift and mask tricks' here. It reminds me a bit of #26306 (i.e. the extra pointer arithmetic might be interfering with speculative loads). Since setting the pointer to nil only usually happens on the last iteration of a loop, perhaps we should just emit the obvious code instead:

if likely(newcap > 0) {
     x.ptr += 4*sizeof(int)
} else {
     x.ptr = nil
}

Might be worth an experiment. Plus prove wouldn't need a special case to optimize the branch away.

randall77 · 2019-03-15T14:50:34Z

I'm happy for someone to experiment with this.
The special case here is going to be ugly either way, I expect.

josharian · 2019-03-22T16:03:06Z

I wonder if we should revisit using 'shift and mask tricks' here.

Aside: There are also a few other alternatives (of the pointer arithmetic tricks) variety here.

We currently do:

SUBQ k, i
NEGQ i
SARQ $63, i
ANDQ i, off

We could also do:

SUBQ k, i
SETNE x
NEGL x
ANDQ x, off

Or even:

SUBQ k, i
SETNE i
MULQ i, off

I hacked these in locally. They all execute at basically identical speeds, but the middle form encodes shortest.

To use the middle form throughout, in practice we'd need to have a OpReslice that can pass all the way through to lowering. (This is because we treat SUBQ and SUBQborrow differently, which inhibits CSE. And because other architectures might want to use their own specialized slicing tricks, e.g. conditional instructions on ARM.)

A dedicated OpReslice might make the prove pass simpler as well, since (I believe) the prove pass doesn't care whether the resulting pointer is at the beginning or in the middle of the object.

josharian · 2019-03-22T16:15:45Z

(Or another variant of the middle form: SUBQ; SETEQ; DECQ; ANDQ, which might be even shorter. I'll explore if folks think it is worth it.)

randall77 · 2019-03-22T16:20:34Z

SETNE only sets the low byte of the target register. You'd need to add an XORL x, x before the subtraction or MOVL $0, AX before the SETNE.

josharian · 2019-03-22T16:34:47Z

Thanks! It still appears to be a minor win. The bigger question is about moving slicing from SSA gen to a dedicated op.

randall77 · 2019-03-22T16:37:00Z

A dedicated op is fine. It should make it easier to generate better code for arm as well.

mundaym · 2019-03-22T16:38:43Z

we'd need to have a OpReslice

One thing I was wondering is whether we should subtract the length from the capacity when decomposing slices in SSA. That way we wouldn't need to update the capacity when re-slicing. It would mean slightly more work when doing append-like operations though. We'd also need to re-add the length to the capacity when storing the slice back to memory.

For example:

for len(x) > 0 {
    ...
    x = x[n:]
}

Currently compiles to:

for x.len > 0 {
    ...
    x.cap += n
    x.len += n
    x.ptr += SliceMask(n, cap)
}

If we instead replaced the capacity (x.cap) with the length subtracted from the length (x.ext) then the loop could be done as:

x.ext = x.cap - x.len
for x.len > 0 {
    ...
    x.len += n
    t := n
    if unlikely(x.len == 0) {
        t = SliceMask(n, x.ext)
    }
    x.ptr += t
}
x.cap = x.len + x.ext

This involves a branch but crucially means we don't have to load/update the capacity in the likely path of the loop, reducing register pressure. I started hacking on this idea but haven't finished it yet - seems to be relatively simple to achieve though.

Anyway, just thought I'd bring this up since I'm not sure if this idea would compliment or conflict with the OpReslice idea.

josharian · 2019-03-22T20:54:45Z

One thing I was wondering is whether we should subtract the length from the capacity when decomposing slices in SSA.

Interesting. How often do we slice in a loop as opposed to straight line code? Seems worth exploring.

Another, crazier, idea is to change the underlying representation in memory of all slices to be {ptr, len, cap-len/extra/slack}. This would make calling cap involve addition, but I think that calling cap is much rarer than slicing, appending, etc. This would probably break so much code as to be impossible, though.

Anyway, just thought I'd bring this up since I'm not sure if this idea would compliment or conflict with the OpReslice idea.

Conflict, I think. The idea behind OpReslice is to bundle a bunch of these calculations into a single op (easier to deal with during prove, etc.) and do the complex decomposition on a per-arch basis. I'm quite content to let you experiment with your ideas first, since they seem quite promising. And if you end up not liking the results, we can experiment some with OpReslice etc.

josharian · 2019-03-27T14:15:33Z

One argument in favor of writing out the if statement: For code like this

if lo < hi {
  s = s[lo:hi]
}

Then we can entirely avoid adjusting ptr, whereas with masking tricks we always end up masking. (We could add a special check for boundedness, like we do with shifts, and perhaps should, but with a branch, it falls out for free.) Code example from discussion at https://go-review.googlesource.com/c/go/+/169518

josharian · 2019-04-29T02:02:52Z

@mundaym if you want a real life test case, the code in #31586 (comment) spends about 50% of its time reslicing in a loop. TL;DR: go get nhooyr.io/websocket, git checkout 40b4, go test -bench=Mask/4096//nhooyr, line 74 (b = b[8:]) is >50% of execution time. (cc @nhooyr)

zigo101 · 2021-08-20T07:19:16Z

So when slicing, we need to somehow avoid contructing such a pointer. Currently we do this by avoiding the add when the capacity is 0. So we do x.ptr += newcap == 0 ? 0 : 4*sizeof(int). And we do it without a conditional, using shift and mask tricks.

When you do spix[i:i+4:i+4] the new capacity is 4, and the compiler knows that. So the conditional in the expression above constant folds, and we just get x.ptr += i*sizeof(int). When you do spix[i:i+4:len(spix)], it's not obvious whether the resulting slice has capacity 0 or not, so the arithmetic is still all there.

I'm not very understanding why spix[i:i+4:len(spix)] is not obvious to prove that the resulting slice has a capacity larger than zero. At least it is obvious for humans.

zigo101 · 2023-10-10T17:49:45Z

It looks the performances of the two ways are on par now with Go 1.21.2. Any change has ever made for this?

[edit]: sorry, it looks no change here. It is just that some benchmarks becomes slice length sensitive since Go 1.21.

bcmills added the Performance label Sep 25, 2018

bcmills added this to the Unplanned milestone Sep 25, 2018

randall77 added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 25, 2018

josharian mentioned this issue Apr 29, 2019

cmd/compile: optimize XOR masking code #31586

Closed

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: unexpected performance difference accessing slices with different caps #27857

cmd/compile: unexpected performance difference accessing slices with different caps #27857

iand commented Sep 25, 2018

randall77 commented Sep 25, 2018 •

edited by rasky

Loading

gopherbot commented Sep 25, 2018

randall77 commented Sep 25, 2018

gopherbot commented Sep 25, 2018

mundaym commented Mar 15, 2019 •

edited

Loading

randall77 commented Mar 15, 2019

josharian commented Mar 22, 2019

josharian commented Mar 22, 2019

randall77 commented Mar 22, 2019

josharian commented Mar 22, 2019

randall77 commented Mar 22, 2019

mundaym commented Mar 22, 2019 •

edited

Loading

josharian commented Mar 22, 2019

josharian commented Mar 27, 2019

josharian commented Apr 29, 2019

zigo101 commented Aug 20, 2021 •

edited

Loading

zigo101 commented Oct 10, 2023 •

edited

Loading

cmd/compile: unexpected performance difference accessing slices with different caps #27857

cmd/compile: unexpected performance difference accessing slices with different caps #27857

Comments

iand commented Sep 25, 2018

randall77 commented Sep 25, 2018 • edited by rasky Loading

gopherbot commented Sep 25, 2018

randall77 commented Sep 25, 2018

gopherbot commented Sep 25, 2018

mundaym commented Mar 15, 2019 • edited Loading

randall77 commented Mar 15, 2019

josharian commented Mar 22, 2019

josharian commented Mar 22, 2019

randall77 commented Mar 22, 2019

josharian commented Mar 22, 2019

randall77 commented Mar 22, 2019

mundaym commented Mar 22, 2019 • edited Loading

josharian commented Mar 22, 2019

josharian commented Mar 27, 2019

josharian commented Apr 29, 2019

zigo101 commented Aug 20, 2021 • edited Loading

zigo101 commented Oct 10, 2023 • edited Loading

randall77 commented Sep 25, 2018 •

edited by rasky

Loading

mundaym commented Mar 15, 2019 •

edited

Loading

mundaym commented Mar 22, 2019 •

edited

Loading

zigo101 commented Aug 20, 2021 •

edited

Loading

zigo101 commented Oct 10, 2023 •

edited

Loading