Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64: loop array indexing inefficiencies #34810

Closed
kunalspathak opened this issue Apr 10, 2020 · 13 comments
Closed

ARM64: loop array indexing inefficiencies #34810

kunalspathak opened this issue Apr 10, 2020 · 13 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Apr 10, 2020

public int Test()
{
    int[] arr = new int[10];
    int i = 0;
    while (i < 9)
    {
        if (i >= 2) 
        {
            arr[i] = 1;  // <---- IG04
        }
        i++;
    }
    return 0;
}

The line arr[i] = 1 generates the following code to calculate the address of element to save the value.

...
G_M8556_IG04:
        93407C22          sxtw    x2, x1
        D37EF442          lsl     x2, x2, #2
        91004042          add     x2, x2, #16
        52800023          mov     w3, #1
        B8226803          str     w3, [x0, x2]
...

vs. how x64 generates:

G_M27956_IG04:
       4863CA               movsxd   rcx, edx
       C744881001000000     mov      dword ptr [rax+4*rcx+16], 1

The ARM64 pattern can be optimized to use post-index addressing mode using:

# x1 contains <<base address of arr>>+16
mov w0, 1
str w0, [x1], 4

category:cq
theme:optimization
skill-level:intermediate
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Apr 10, 2020
@kunalspathak
Copy link
Member Author

@BruceForstall
Copy link
Member

A case of a simple optimization where a STR or LDR immediately followed by an address variable addition could be transformed to subsume the addition into the STR/LDR instruction was discussed here in the context of the intrinsics.

It looks like what you are suggesting would be a sequence of transformations, loop induction variable strength reduction, where we either wouldn't maintain i separately, or would maintain both i and <array base> + 16 + i * 4 in the loop.

cc @AndyAyersMS

@kunalspathak
Copy link
Member Author

Just verified that gcc seems to do that optimization but clang doesn't.
https://godbolt.org/z/Wp9Xhu

@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 13, 2020
@BruceForstall BruceForstall added this to the Future milestone Apr 13, 2020
@TamarChristinaArm
Copy link
Contributor

TamarChristinaArm commented Apr 14, 2020

Just verified that gcc seems to do that optimization but clang doesn't.
https://godbolt.org/z/Wp9Xhu

Clang uses a more complicated addressing mode but also equally valid. (your example is missing an -O1).

        str     w8, [x9, x8, lsl #2]
        add     x8, x8, #1              // =1
        cmp     x8, #10                 // =10
        b.ne    .LBB0_1

Of course simpler addressing modes are always preferred :)

@BruceForstall BruceForstall changed the title ARM64: Use post-index addressing mode to access array elements ARM64: loop array indexing inefficiencies Apr 23, 2020
@BruceForstall
Copy link
Member

Note that we have to be careful with ref/byref creation and reporting. E.g., hoisting <array base> + 16 out of the loop to create a pointer to the array element base would create a byref pointer that needs to be reported. Note the comment in fgMorphArrayIndex:

// Be careful to only create the byref pointer when the full index expression is added to the array reference.
// We don't want to create a partial byref address expression that doesn't include the full index offset:
// a byref must point within the containing object. It is dangerous (especially when optimizations come into
// play) to create a "partial" byref that doesn't point exactly to the correct object; there is risk that
// the partial byref will not point within the object, and thus not get updated correctly during a GC.
// This is mostly a risk in fully-interruptible code regions.

@BruceForstall
Copy link
Member

The PR where this comment was introduced: dotnet/coreclr#17524

@AndyAyersMS
Copy link
Member

Right, if we have an address computation where the full computation tree has a mixture of positive and negative adjustments to the address, we need to be careful not to reassociate too broadly; all the intermediate results must be addresses within the bounds of the parent object.

@BruceForstall
Copy link
Member

Given that ARM doesn't have base + scaled index + offset addressing mode, it seems like we really need to be able to hoist <object base> [ref] + <array first element offset> [native int] out of a loop as a byref.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall modified the milestones: Future, 6.0.0 Nov 25, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@kunalspathak
Copy link
Member Author

Definitely, this will not happen in .NET 6.0.

@kunalspathak kunalspathak modified the milestones: 6.0.0, Future Jun 4, 2021
@JulieLeeMSFT JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021
@EgorBo
Copy link
Member

EgorBo commented Oct 7, 2021

I think it worth moving this to 7.0 as I'd expect noticeable perf improvements from it:
image

I tried to implement it via https://github.com/dotnet/runtime/pull/60085/files and even emitted something similar but it needs more work.

@BruceForstall BruceForstall modified the milestones: Future, 7.0.0 Oct 7, 2021
@EgorBo
Copy link
Member

EgorBo commented Nov 1, 2021

I made some progress on this and re-assigning to myself if you don't mind

@JulieLeeMSFT
Copy link
Member

@EgorBo you said that you completed this work. Can you link your PR and close this issue?

@EgorBo EgorBo closed this as completed Feb 24, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

No branches or pull requests

7 participants