
[NativeAOT-LLVM] Add a portable implementation of RhGetCodeTarget #2333

Merged · 7 commits into dotnet:feature/NativeAOT-LLVM · Jul 17, 2023

Conversation

@SingleAccretion (Author) commented Jul 6, 2023

In this implementation:

  1. The compiler generates an array of tuples { Stub, Target }. This is done for both simple and instantiating stubs.
  2. At runtime, this information is sorted, with lookups done via binary search (a sketch follows this list).
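
To make the mechanism concrete, here is a minimal sketch of the lookup half of the scheme; the names (`StubTargetPair`, `StubTargetMap`, `GetTargetOfStub`) are illustrative rather than taken from the PR:

```csharp
using System;

// Illustrative layout of one compiler-generated { Stub, Target } tuple.
struct StubTargetPair
{
    public nuint Stub;   // address of the unboxing (or instantiating) stub
    public nuint Target; // address of the method body the stub forwards to
}

static class StubTargetMap
{
    // In the PR the pairs come from a compiler-generated blob and are sorted
    // once at runtime; here they are simply assumed sorted by stub address.
    static StubTargetPair[] s_pairs = Array.Empty<StubTargetPair>();

    public static nuint GetTargetOfStub(nuint address)
    {
        StubTargetPair[] pairs = s_pairs;
        int lo = 0, hi = pairs.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            nuint key = pairs[mid].Stub;
            if (key == address)
                return pairs[mid].Target; // found a stub: unwrap it
            if (key < address)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return address; // not a stub: the address is already the code target
    }
}
```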

Costs:

  1. The size of the array is ~5KB for the HelloWasm test, which has ~400 simple unboxing stubs and ~100 instantiating ones. For a medium-sized application with ~50K stubs, that comes to 400K on 32 bit (two pointer-sized fields per entry, so 8 bytes each) and double that on 64 bit.
  2. Binary search is not "cheap" and, unfortunately, we will often hit it on the startup path because of ComputeLdftnReverseLookup. Fortunately, we should rarely hit it in steady state, as it will only be used for Delegate.MethodInfo and Delegate.CreateDelegate (the latter only in rare scenarios).
  3. The managed code that supports all of this clocks in at about 5K, which is rather large, but that is mostly a reflection of our codegen not being great.

Code structure choices:

  1. ILC - layered under MetadataManager like the other maps used by reflection.
  2. Runtime - layered in CoreLib for consistency with the existing (non-portable) implementation.

Contributes to #2307.

@@ -116,39 +122,24 @@ public void InitializeDebugInfo(MethodDebugInformation debugInfo)
public void InitializeLocalTypes(TypeDesc[] localTypes)
{
}
}

internal sealed class LlvmMethodBodyNode : LLVMMethodCodeNode, IMethodBodyNode
SingleAccretion (Author):

I deleted the LlvmMethodBodyNode/LLVMMethodCodeNode fork; it did not appear to serve any meaningful purpose. MethodCodeNode does not use such a pattern either.


public override IEnumerable<DependencyListEntry> GetStaticDependencies(NodeFactory factory)
public ISymbolNode GetUnboxingThunkTarget(NodeFactory factory)
SingleAccretion (Author):

The ISpecialUnboxThunkNode implementation is copied from MethodCodeNode.

@SingleAccretion SingleAccretion marked this pull request as ready for review July 6, 2023 22:56
@SingleAccretion (Author):

@dotnet/nativeaot-llvm

@jkotas (Member) commented Jul 7, 2023

This may be interesting to upstream to address dotnet/runtime#73906

// The mapping blob lays out all simple stub entries first, followed by the instantiating stub entries.
uint32_t instantiatingStubCount = allMappingsCount - simpleStubCount;

if (!initMap(m_unboxingStubTargetsMap, &allMappings->Data[0], simpleStubCount) ||
    !initMap(m_unboxingAndInstantiatingStubTargetsMap, &allMappings->Data[simpleStubCount], instantiatingStubCount))
jkotas (Member):

The core runtime should not need this mapping for anything. Is that correct?

Would it make more sense to initialize this map lazily and do the whole thing in C#, like we do that for other maps that are needed by reflection?

SingleAccretion (Author):

> The core runtime should not need this mapping for anything. Is that correct?

Yes. It's only consumed at the CoreLib layer and above.

> Would it make more sense to initialize this map lazily and do the whole thing in C#, like we do that for other maps that are needed by reflection?

I considered it. The main problem is that the functionality is used by delegate equality, which seems low-level enough to be non-trivial to initialize robustly in managed code.

Additionally, the SHash closed hashing table implementation is very nicely fitted to this use case; a managed dictionary would use 3x+ more memory (see the sketch below).
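
For context on the memory claim: a closed hashing (open addressing) table keeps its entries inline in flat arrays, with no per-entry object, stored hash code, or chain link, which is where most of the savings over a Dictionary<K, V> come from. A minimal sketch of the idea (not the actual SHash code; nuint stands in for pointer-sized keys and values):

```csharp
using System;

// Minimal open-addressing map with linear probing. Capacity management and
// removal are elided; 0 marks an empty slot, so 0 is not a valid key.
sealed class ClosedHashMap
{
    private readonly nuint[] _keys;
    private readonly nuint[] _values;

    public ClosedHashMap(int powerOfTwoCapacity)
    {
        _keys = new nuint[powerOfTwoCapacity];
        _values = new nuint[powerOfTwoCapacity];
    }

    public void Add(nuint key, nuint value)
    {
        nuint mask = (nuint)_keys.Length - 1;
        nuint i = key & mask;        // the pointer itself acts as the hash
        while (_keys[i] != 0)        // linear probing on collision
            i = (i + 1) & mask;
        _keys[i] = key;
        _values[i] = value;
    }

    public bool TryGetValue(nuint key, out nuint value)
    {
        nuint mask = (nuint)_keys.Length - 1;
        for (nuint i = key & mask; _keys[i] != 0; i = (i + 1) & mask)
        {
            if (_keys[i] == key)
            {
                value = _values[i];
                return true;
            }
        }
        value = 0;
        return false;
    }
}
```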

jkotas (Member):

Does it need to be used by delegate equality in the first place? It looks like it is a left-over from some flavor of multi-module mode that allowed multiple unboxing stubs to exist for a given method. That should not be a thing in current native AOT. You can try to upstream removing it from delegate equality.

Even if it is still needed for some reason, delegate equality should not be used by low-level runtime code where the lazy initialization would lead to problems.

A reverse map like this one is a common need for reflection-like functionality, and I think it would be fine to implement this reverse map in a similar way. We have a few of those; for example, look for FunctionPointersToOffsets. (That one is also more complicated than it needs to be, due to support for multi-module flavors that we do not care about anymore.)

In fact, it looks like this reverse map was implemented in managed code at some point. There are unused UnboxingAndInstantiatingStubMap and UnboxingAndInstantiatingStubMapEntry types that one would need in order to implement reverse maps like this.
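
For illustration, the shape such a managed, lazily initialized reverse map might take, modeled loosely on the FunctionPointersToOffsets pattern described above. All names here are hypothetical, reading the compiler-generated blob is elided, and .NET 5+ is assumed (where nuint is comparable):

```csharp
using System;
using System.Threading;

// Hypothetical lazily initialized reverse map: build a sorted snapshot on
// first use, then answer all queries with Array.BinarySearch.
static class UnboxingStubReverseMap
{
    private static nuint[] s_sortedStubs; // stub addresses, sorted ascending
    private static nuint[] s_targets;     // parallel to s_sortedStubs

    public static nuint GetTargetOfStub(nuint address)
    {
        if (Volatile.Read(ref s_sortedStubs) == null)
            Initialize();

        int index = Array.BinarySearch(s_sortedStubs, address);
        return index >= 0 ? s_targets[index] : address;
    }

    private static void Initialize()
    {
        // Reading the compiler-generated { Stub, Target } blob is elided.
        nuint[] stubs = Array.Empty<nuint>();
        nuint[] targets = Array.Empty<nuint>();

        // Sort the keys, permuting the targets array in lockstep.
        Array.Sort(stubs, targets);

        // Racing threads may build the snapshot twice, but the results are
        // identical, so last-writer-wins publication is benign.
        s_targets = targets;
        Volatile.Write(ref s_sortedStubs, stubs); // publish the keys last
    }
}
```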

SingleAccretion (Author):

> Does it need to be used by delegate equality in the first place? It looks like it is a left-over from some flavor of multi-module mode that allowed multiple unboxing stubs to exist for a given method. That should not be a thing in current native AOT. You can try to upstream removing it from delegate equality.

Having done some research, I agree it looks unnecessary.

I will pursue its removal and a managed implementation for these maps.

src/coreclr/nativeaot/Runtime/RuntimeInstance.cpp (review thread outdated and resolved)

@SingleAccretion SingleAccretion marked this pull request as draft July 10, 2023 17:03
@SingleAccretion force-pushed the RhGetCodeTarget branch 2 times, most recently from 5d120c4 to ec99e4e on July 13, 2023 16:22
@SingleAccretion SingleAccretion marked this pull request as ready for review July 13, 2023 17:49
@SingleAccretion (Author):

@dotnet/nativeaot-llvm PTAL

In this implementation:
 1) The compiler generates an array of tuples { Stub, Target }.
    This is done for both simple and instantiating stubs.
 2) At runtime startup, pointers into this array are stored in
    two hash tables, with the stubs as keys. This allows for more
    or less efficient target retrieval.

Costs:
  1) The size of the array is ~5KB for the reflection test, which
     has ~500 simple unboxing stubs and ~100 instantiating ones.
     Thus for a medium-sized program, overhead on the order of
     ~50KB is expected.
  2) The hash table construction startup penalty should be on the
     order of 10us for the reflection test and scaling linearly for
     larger programs, which seems acceptable.
dotnet/corert@08d78ae

The original motivation for this was handling import stubs:
```
Function pointer equality comparison was not handling cross-module pointers correctly when optimizations were enabled
(causes target pointers to be wrapped in jump stubs sometimes). The delegate equality comparison was hitting this bug.
```
We do not have import stubs anymore and unwrapping unboxing stubs serves no purpose here.

Microbenchmarks of delegate equality show a ~3x improvement with this change (the first block is before, the second after):
```
Bench_DelegateEquality_Positive_OpenStatic<10000000>() took: 355 ms
Bench_DelegateEquality_Positive_ClosedStatic<10000000>() took: 367 ms
Bench_DelegateEquality_Positive_ClosedInstance<10000000>() took: 371 ms

Bench_DelegateEquality_Positive_OpenStatic<10000000>() took: 121 ms
Bench_DelegateEquality_Positive_ClosedStatic<10000000>() took: 120 ms
Bench_DelegateEquality_Positive_ClosedInstance<10000000>() took: 122 ms
```
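
The harness behind these numbers is not part of the PR; a minimal reproduction of the measurement idea might look like the following, where the method name and iteration count come from the output above but the body is an assumption:

```csharp
using System;
using System.Diagnostics;

class DelegateEqualityBench
{
    // "Open static": delegates over a static method, with no bound target.
    static int Parse(string s) => int.Parse(s);

    static void Bench_DelegateEquality_Positive_OpenStatic(int iterations)
    {
        Func<string, int> a = Parse;
        Func<string, int> b = Parse;

        var sw = Stopwatch.StartNew();
        bool allEqual = true;
        for (int i = 0; i < iterations; i++)
            allEqual &= a == b; // "positive": the delegates compare equal
        sw.Stop();

        Console.WriteLine(
            $"Bench_DelegateEquality_Positive_OpenStatic<{iterations}>() " +
            $"took: {sw.ElapsedMilliseconds} ms (all equal: {allEqual})");
    }

    static void Main() => Bench_DelegateEquality_Positive_OpenStatic(10_000_000);
}
```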

Additionally, there is some desire to upstream changes for a portable RhGetCodeTarget implementation. Not having to
deal with it at this relatively low-level layer will make things more robust.
@jkotas jkotas added the area-NativeAOT-LLVM LLVM generation for Native AOT compilation (including Web Assembly) label Jul 16, 2023
@jkotas (Member) left a review:

Thank you!

@jkotas jkotas merged commit 8657c73 into dotnet:feature/NativeAOT-LLVM Jul 17, 2023