Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid zeroing memory on stackalloc? #4384

Closed
RossNordby opened this issue Jul 23, 2015 · 57 comments
Closed

Avoid zeroing memory on stackalloc? #4384

RossNordby opened this issue Jul 23, 2015 · 57 comments
Labels
area-System.Memory enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@RossNordby
Copy link

The current stackalloc implementation uses a tight loop of push instructions to allocate and fill in the memory. For example, stackalloc int[16384] yields:

000007FE8D8A05C4  mov         eax,10000h
000007FE8D8A05C9  push        0
000007FE8D8A05CB  push        0
000007FE8D8A05CD  sub         rax,10h
000007FE8D8A05D1  jne         000007FE8D8A05C9

Given that the use of stackalloc is often performance related, it would be nice if the stack pointer just jumped directly to the end and left the memory as-is. This appears to be permitted by the C# spec, section 18.8: "The content of the newly allocated memory is undefined."

Changing this behavior would nastily break applications that relied on the zeroing (despite spec). I'm not sure how much of a problem this would be. It's a niche feature that tends to be used for very specific reasons, and older JIT versions seemed to leave the memory as-is (according to some postings circa 2003).

@BruceForstall
Copy link
Member

In your example, the JIT has been told to force a zero initialization. There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

@mikedn
Copy link
Contributor

mikedn commented Jul 23, 2015

Changing this behavior would nastily break applications that relied on the zeroing (despite spec).

Note that zeroing is not always done even today. The following code prints 42, that value remains on the stack from Foo:

static unsafe void Main() {
    Foo();
    int* p = stackalloc int[16384];
    Console.WriteLine(p[0]);
}
static unsafe void Foo() {
    int* p = stackalloc int[16384];
    for (int i = 0; i < 16384; i++)
        p[i] = 42;
}

@RossNordby
Copy link
Author

Interesting! I should have looked deeper.

If my new understanding is right, the JIT just obeys the method IL header CorILMethod_InitLocals flag when choosing how to allocate, and III.3.47 of the CLI spec suggests that the JIT has no choice but to obey. Darn.

I guess any remaining degrees of freedom would be over on the roslyn side of things.

@mikedn
Copy link
Contributor

mikedn commented Jul 24, 2015

It's even more interesting if you check the C# specification and you find this text in 19.8:

The content of the newly allocated memory is undefined.

Go figure. I think that the IL spec is a bit messed up for requiring localalloc to follow the localsinit flag. That's why in my above example no zeroing is done, Main doesn't contain any locals and there's no localsinit flag. Perhaps Roslyn should just refrain from using localsinit in unsafe methods, it's of no use anyway.

@RossNordby
Copy link
Author

Not setting localsinit for unsafe methods sounds like a good option. I haven't found anything that would restrict that in the specs yet, given that verifiability is already out the window. I think I might go write up an issue on that.

Another option might be to introduce a separate IL instruction that does not have the same baggage as localloc, allowing stackalloc to be implemented according to its looser C# spec. I think this might run afoul of the blanket implications of localsinit, though:

A “localsinit flag” that indicates whether the local variables and memory pool (§I.12.3.2.4) should be initialized by the CLI.

At best, it would be a weird corner case. Also, while I'm not familiar with the process behind doing such a thing, I would guess a change to Roslyn's localsinit behavior would be easier if acceptable.

@benaadams
Copy link
Member

@mikedn I'm not seeing this (re localsinit); even declaring the stackalloc'd variable results in a localsinit being emitted?

char* output = stackalloc char[length];
.method private hidebysig static string MultiBlockAsciiString(
    class MemoryBlock block,
    int32 offset,
    int32 length) cil managed
{
  // Code size       17 (0x11)
  .maxstack  4
  .locals init ([0] char* output)
  IL_0000:  ldarg.2
  IL_0001:  conv.u
  IL_0002:  ldc.i4.2
  IL_0003:  mul.ovf.un
  IL_0004:  localloc
  IL_0006:  stloc.0
  IL_0007:  ldarg.0
  IL_0008:  ldarg.1
  IL_0009:  ldarg.2
  IL_000a:  ldloc.0
  IL_000b:  call       string ::MultiBlockAsciiIter(
                                                class MemoryBlock,
                                                int32,
                                                int32,
                                                char*)
  IL_0010:  ret
}

@benaadams
Copy link
Member

Correction: localsinit being dropped does happen but there are some extra caveats; trying to narrow down

@benaadams
Copy link
Member

Haven't quite narrowed it down; but if you call a method it needs to be first param; and some sorts of work triggers it; possibly for loop even with the index being not local.

Pattern that seems to work:

1..function that declares via statckalloc and does no other work other than
2. returning result of another function that takes the stackalloc as first param

Can pass though needed vars from first function to second. Bit of a contort...

See as example: https://github.com/aspnet/KestrelHttpServer/pull/312/files#diff-7042cc5345ca45a84e36433636d7f10fR15

@choikwa
Copy link
Contributor

choikwa commented Nov 1, 2016

There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

@BruceForstall Could you share where this might be? I saw that CoreCLR generates 0 initialization of stack locals very late in the prolog in one of the Kestrel hotspots.

@benaadams
Copy link
Member

@choikwa you can do it with stackalloc, but you have to do a weird pattern of stackallocing in one function and nothing else except passing that variable into another function - then the function doing the stackalloc won't zero.

@BruceForstall
Copy link
Member

For stackalloc, the code is in CodeGen::genLclHeap(). Zero init of stack locals in the normal prolog is in CodeGen::genZeroInitFrame()

@choikwa
Copy link
Contributor

choikwa commented Nov 1, 2016

Is there any particular reason why zero init of stack locals must be done at the prologue *generation? Why not expose this to the optimizer which can dead-store-eliminate or push down init. where it's only necessary?

@BruceForstall
Copy link
Member

(I believe) we only zero init local vars with GC types that are untracked (thus don't get individual operation GC state change information). There may be other cases, such as potentially uninitialized locals in verifiable code.

@choikwa
Copy link
Contributor

choikwa commented Nov 2, 2016

Cursory search for zero init of locals:

  • Above method, in the prolog generation by tracking lvMustInit field of LclVarDsc
    • tracking seems to be done at Compiler::fgInterBlockLocalVarLiveness().

      I am confused by this comment; I would be thrilled if someone could explain to me the need for volatility and must-init.

    /* Mark all pointer variables live on exit from a 'finally'
       block as either volatile for non-GC ref types or as
       'explicitly initialized' (volatile and must-init) for GC-ref types */
  • Compiler::fgInlinePrependStatements() for zero init of inlinee's locals by inserting assignment trees

The BOTR doesn't seem to describe why the former is needed or GC's relation to the stack locals. It would be nice if there are no hard restrictions to moving zero-init from prolog to earlier pass.

@BruceForstall
Copy link
Member

We don't enregister anything live in/out of handlers. That's because an exception in a "try" can happen at any time, and the handler needs to be able to find the current value of the variable in that case. The OS/VM doesn't preserve registers from the "try" to the handler, so the only place we can find the variable is on the stack. Thus, all writes to the variable must be to the stack. We could do better here in some cases, but that's the current status.

We only need to explicitly zero non-arg, GC-ref types because we're going to report them to the GC, and they must always have a valid value (zero is valid). If we knew the value was always defined before entering every EH region that had a handler using that value, we might be able to avoid force zero initializing it in the prolog. This could be improved with some better analysis.

It looks like Compiler::fgInlinePrependStatements() only zero inits locals for cases where the IL for a function requires it for the inlinee.

@jkotas
Copy link
Member

jkotas commented Nov 2, 2016

@BruceForstall @choikwa This issue is about unnecessary zero-initialization for stackalloc in C# / localloc in IL.

@jkotas
Copy link
Member

jkotas commented Jun 12, 2017

We need to do something about this for Span<T> to work end-to-end.

@jkotas
Copy link
Member

jkotas commented Jun 15, 2017

I have look into this some more:

  • Run CoreCLR tests with a hack that suppressed the init-locals bit for everything. No issues in the JIT found. (~30 tests written in IL that were exercising the init-locals bit failed - by design.)
  • Found one place in corelib that depends on the stackalloc memory to be zero-initialized: https://github.com/dotnet/coreclr/blob/97c58ac4fce27b7796206a59eea0ca27cb49fe1a/src/mscorlib/src/System/RtType.cs#L621
  • Observed that clearing the init-locals bit helps code quality with “out” arguments. The storage for “out” arguments is always zero-initialized today because of the init-locals bits, but the zero-initialization is not necessary (if the out arguments do not contain GC pointers).

The current options that I am thinking about are - from the most preferred to least preferred:

  1. Change C# compiler to not set the init-locals bit for unsafe methods; or even for everything in assemblies compiled with /unsafe or with a new compiler switch. Works for existing runtimes. Potentially breaks code like the corelib example above.
  2. Introduce option for ILLinker to clear the zero init locals bit on all methods in assembly. It would be the start of turning ILLinker into a full IL-to-IL optimizer. There are number of optimization ideas that we do not have a good place to put today – they are hard/expensive/breaking/complex to implement in either Roslyn or JIT. Proposal: IL optimization step roslyn#15929
  3. Introduce a new CompilationRelaxations.NoZeroInitLocals flag and make folks to explicitly add it if they need fast stackalloc or other benefits of no init-locals bit. It is a bit odd design because of one piece data invalidates other piece of data. Does not work for existing runtimes.

Having a new intrinsic method that does not zero init, or having flexibility to control this at method granularity, seems to be unnecessary. Assembly level granularity should be sufficient for this.

@benaadams
Copy link
Member

benaadams commented Jun 15, 2017

Change C# compiler to not set the init-locals bit for unsafe methods

Probably works; and unsafe blocks.

for everything in assemblies compiled with /unsafe or with a new compiler switch

Probably a bit heavy?

MethodImpl flag?

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

@jkotas
Copy link
Member

jkotas commented Jun 15, 2017

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

This is variant of 3. I think that folks who care about this would want it for all their code, without need to annotate every method.

@benaadams
Copy link
Member

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

@jkotas
Copy link
Member

jkotas commented Jun 15, 2017

Right - that's the idea.

@jkotas
Copy link
Member

jkotas commented Jun 15, 2017

@russellhadley The ILLink option above is the easiest one to start with. Could you please look into it for ILLinker?

@jkotas
Copy link
Member

jkotas commented Aug 31, 2017

Stephen and I have tried to switch CoreCLR to use managed version of number parsing and formatting. It is straightforward port of C++ version of code. The C++ version of the code is using stack allocated scratch buffers. We found that it is impossible to get C# version close to C++ version because of the zero initialization imposses 15%-20% penalty. The changes are under https://github.com/jkotas/coreclr/tree/corertnumbercode branch for now. The microbenchmark tested is to run ((Int32)12345).ToString("C"); in a loop.

To unblock progress on the CoreLib work, I planning to add a workaround to suppress unnecessary zero initialization for CoreLib, linked to this issue. This workaround should be removed once we have permanent solution for this problem.

@Drawaes
Copy link
Contributor

Drawaes commented Aug 31, 2017

I have hit this perf issue before when using stackalloc in a tight loop for scratchspace for crypto blocks (16bytes). Is there an option for adding a new keyword?

stackalloc nozero byte[16];

or similar? I realise that the spec says that stackalloc won't necessarily zero, but I expect we are too far down the path to change behaviour in code that probably doesn't have a lot of safe guards anyway. As always people probably have come to expect it to be zeroed but an extra keyword or different keyword for it would leave existing code and allow an "opt in"

@benaadams
Copy link
Member

Also related "JIT: consider optimizing small fixed-sized stackalloc" https://github.com/dotnet/coreclr/issues/8542 to help stackalloc functions inline

@benaadams
Copy link
Member

benaadams commented Sep 2, 2017

If the input value of output is important then it should be ref and then you init to null (and the C# compiler I believe would force you to). out means its input value is not important.

byte* output = null;
SomeInteropFunction(ref output);

out and ref are functionally the same; though semantically different (i.e ref needs init first, out needs init before return in callee)

If you interop it doesn't change whether you use ref or out so should use the one that agrees; i.e. if the input is important then it should be ref

@jakobbotsch
Copy link
Member

out output will pass a pointer to the output variable to the method. In either case this will not be a null pointer. In P/invokes your code is equivalent to
SomeInteropFunction(&output);
The value of the 'output' variable does not matter unless the p/invoked function dereferences the pointer.

@Drawaes
Copy link
Contributor

Drawaes commented Sep 2, 2017

Sure so the c might be

void MyMethod ( void** output);

So I love the idea of init being ditched. I worry that it will cause the kind of managed/unmanaged bugs that lead to double freeing, segfaults and general nastiness.

That is why from where I am sitting it needs to be opt in. I hate to think what horrible interop code there is in some point of sale machine or similar somewhere that this could break.

Heck there is a lot of historical finance system interop code I would need to go double check.

@Drawaes
Copy link
Contributor

Drawaes commented Sep 2, 2017

A key word or an attribute would work fine. Even though init "isn't in the spec" it's observed behaviour for what 10 years ?

@GSPP
Copy link

GSPP commented Sep 3, 2017

Maybe make the assembly opt into this new faster behavior by setting an attribute? That way the core libraries can make use of this enhancement easily.

It could look like:

[assembly: CodeOptimizationCompatibility(Level = CodeOptimizationLevel.V5_0)]

Or similar. The Level could trigger many optimizations that are currently not being done for compat reasons. This would be similar to the SQL Server compatibility level. It's a per-database setting enabling general improvements with each new version. In my opinion the SQL Server compatibility level has been a very useful concept.

@Drawaes
Copy link
Contributor

Drawaes commented Sep 3, 2017

I like it, it's opt in so doesn't break anything but you don't have to tag every method.

@omariom
Copy link
Contributor

omariom commented Sep 7, 2017

@benaadams

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

With @VSadov's work on ref like types it will be soon allowed.
Span<byte> buffer = stackalloc byte[10];

@tannergooding
Copy link
Member

Why do we do a tight loop of pushes, rather than a single adjustment of the stack plus a call to ZeroMemory (or equivalent)?

Also what are the numbers for the overall perf improvement from stripping the flag?

@BruceForstall
Copy link
Member

CodeGen::genZeroInitFrame() uses different code sequences for different conditions: https://github.com/dotnet/coreclr/blob/6aa119dcdb88ba1157191600a992aa94eee59b22/src/jit/codegencommon.cpp#L6977

@tannergooding
Copy link
Member

@BruceForstall, I would think that. for anything that is > 32 bytes, ZeroMemory would probably be faster (especially for large buffers) as it may take advantage of the underlying hardware (such as using rep movsb or writing up to 512-bits per iteration on hardware that support those sequences as being fast).

It would also be interesting to determine whether non-temporal zeroing would be faster for most cases (I would guess the common scenario is for a user to initialize the bytes themselves, so zeroing without polluting the cache might be better).

@tannergooding
Copy link
Member

In either case, if we had some numbers for how much stripping the init flag would save, it might be possible to convince the compiler team to add a flag.

The init flag impacts all locals, not just stackalloc, and I imagine (for more complicated methods) the JIT isn't able to always elide the zeroing. So knowing the benefit across the board would be useful.

@erozenfeld
Copy link
Member

Note that clearing initlocals flag will still leave lots of zero-initializations of structs with GC fields even though they can be avoided in many cases.

I recently opened several items related to this: dotnet/coreclr#13822 (alreday fixed), dotnet/coreclr#13823, dotnet/coreclr#13825, dotnet/coreclr#13827. Also, ILLink now has the ability to clear initlocals unconditionally. I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

@benaadams
Copy link
Member

it may take advantage of the underlying hardware (such as using rep movsb

Already does this; though that causes issues as isn't necessarily the most efficient way https://github.com/dotnet/coreclr/issues/13827

@tannergooding
Copy link
Member

I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

Getting perf numbers for this would be great. If the performance improvement is actually measurable for normal applications (or even things like CoreFX, Roslyn, etc) then it becomes worthwhile to suggest the compiler should have support for stripping the flag integrated into it.

@ahsonkhan
Copy link
Member

@RussKeldorph, can you please triage/prioritize this issue. Ty

@RussKeldorph
Copy link
Contributor

RussKeldorph commented Dec 16, 2017

@ahsonkhan What do you mean by triage/prioritize? (It's already triaged) By when do you need it? Can you summarize why it is more important now? I'm just looking for a summary to help us weigh it against other tasks.

@dotnet/jit-contrib

@RussKeldorph RussKeldorph removed their assignment Dec 16, 2017
@ahsonkhan
Copy link
Member

ahsonkhan commented Dec 16, 2017

By when do you need it?

2.1, and it is already marked as such. So please ignore my comment. I didn't notice the milestone.

We have some experimental APIs in corefxlab that use stackalloc spans and this could improve their performance.

For example:
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.Experimental/System/Buffers/Text/BufferReader.cs#L209
https://github.com/dotnet/corefxlab/blob/master/src/System.Azure.Experimental/System/Azure/CosmosDbAuthorizationHeader.cs#L28
https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Primitives/System/Text/Formatters/Formatter_floats.cs#L73
etc.

@jkotas
Copy link
Member

jkotas commented Dec 16, 2017

The current plan is:

  • C# will keep zero initializing by default. Changing the default would be too breaking.
  • We have a set of issues opened to make the zero initialization done by the JIT more efficient or reduce need for it (Move ILVerify tool to dotnet/runtime repo #13827, dotnet/coreclr#13823, dotnet/coreclr#13825)
  • Folks who really want to get the last bit of performance by avoiding zero initialization can use custom ILLinker step (Add ILLink.CustomsSteps.dll. linker#159) when they know what they are doing. We do this for CoreLib today (via VM hack, but we should switch to the ILLinker), and we plan to experiment with this in CoreFX (Update ILLink version. corefx#25956). Based on the results of these experiments, we may consider introducing a more streamlined way to do this in future. @ahsonkhan You should consider experimenting with it in CoreFXLab as well if you believe that it would help.

I do not think there is anything actionable for 2.1 on this issue currently. Setting the milestone to future.

@MusorGaming
Copy link

Do people at Microsoft still write documentation?
https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/stackalloc
Doesn't mention Span.
Doesn't say whether it DOES or DOES NOT, in theory or practice zero memory or not.

This is important.

@josetr
Copy link

josetr commented Mar 28, 2019

If you think you would benefit from this feature I have created a package that makes it easy to control this flag for a specific method or for all methods within a type/assembly using a custom attribute InitLocals(false).

https://github.com/josetr/InitLocals

Goes without saying that you shouldn't use it in places where you have no tests since you may rely on this zero-initialization behavior.

Also, don't forget to benchmark your code because it's very likely that you don't need it at all.

@jkotas jkotas closed this as completed Sep 16, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Memory enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests