Add support for dehydrated runtime data structures #77884

Merged (6 commits, Nov 18, 2022)

Conversation

@MichalStrehovsky (Member) commented Nov 4, 2022

This adds support for dehydrating pointer-rich data structures at compile time and rehydrating them at runtime.

The NativeAOT compiler generates several pointer-heavy data structures (the worst offender being MethodTable). These data structures are emitted at compile time and used at runtime to e.g. support casting or virtual method dispatch.

We want these structures to contain real pointers because e.g. virtual method dispatch needs to be fast and we don't want to be doing extra math to compute the destination (we just dereference a pointer in the data structure the compiler generated and call it).

But pointers are big, and on top of that there's extra relocation metadata the OS needs in the executable file (2 bytes on Windows, 24 (!!) bytes on Linux/ELF).

This adds support for "dehydrating" the data structures with pointers at compile time (representing pointers more efficiently) and "rehydrating" them at runtime.
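
To make the scheme concrete, here is a minimal sketch of the general idea - the command values and layout below are made up and do not match the actual dehydration format - where the dehydrated stream is a sequence of small commands, literal bytes are copied verbatim, and pointers are encoded as small indices into a table of relocation targets instead of full pointers with OS relocation records:

    using System;

    // Minimal illustration of the idea only; not the real format.
    internal static unsafe class RehydrationSketch
    {
        private const byte CopyBytes = 0;        // payload = number of literal bytes that follow
        private const byte PointerViaTable = 1;  // payload = index into the relocation-target table

        public static void Rehydrate(ReadOnlySpan<byte> dehydrated, void*[] relocTargets, byte* destination)
        {
            int i = 0;
            while (i < dehydrated.Length)
            {
                byte command = dehydrated[i++];
                byte payload = dehydrated[i++];
                if (command == CopyBytes)
                {
                    // Literal data: copy the next `payload` bytes as-is.
                    for (int j = 0; j < payload; j++)
                        *destination++ = dehydrated[i++];
                }
                else if (command == PointerViaTable)
                {
                    // A pointer-sized slot: materialize a full pointer from a small table index,
                    // so the on-disk encoding is a couple of bytes instead of 8 bytes plus a
                    // relocation record.
                    *(void**)destination = relocTargets[payload];
                    destination += sizeof(void*);
                }
            }
        }
    }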

Rehydration is quite fast - I'm seeing 2.2 GB/s throughput on my machine. A hello world rehydrates in under a millisecond.

The size savings are significant: 7+% on Windows, 30+% on Linux.

Depends on #77972 getting through (with the update from the llvm-project repo).

Cc @dotnet/ilc-contrib

@VSadov (Member) commented Nov 7, 2022

So far I assumed we could have method table pointers emitted in the native code and relocations would take care of them when loaded. But how would that work with rehydration of the method tables?
For example, if I have if (o.GetType() == typeof(string)), would I have the address of string's method table baked into the native code? What happens when I new something? How does the code know where the rehydrated method table is located? Or is the rehydration done in place (i.e. we reserve the space and store enough data to rehydrate/patch it up later)? How do we guarantee the data fits (i.e. the compression rate is better than 1)?

I am obviously missing some important parts. I think some more explanation on how this works could be helpful.

@MichalStrehovsky (Member, Author) replied:

The comments are somewhat scattered across the code, but the key part is:

                        // If the object node is getting dehydrated, emit it into a zero-initialized
                        // data section along with all its symbols.
                        // Dehydrated data will be emitted elsewhere.

So the executable now defines an extra region within the .bss section. Code points to this region. It is zero filled at startup by the OS and we "decompress" the compressed data structures into it during very early startup. At runtime there's no difference in efficiency of accessing this area - it's still all direct pointers - the only difference is that previously it would be a pointer to the "const data" section, and after this it would be a pointer to "uninitialized data" that we initialized ourselves.
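
In terms of the RehydrationSketch shown in the description above, the startup sequence looks roughly like this - the region and symbol names here are made up, and the real rehydration happens in StartupCodeHelpers.RehydrateData during very early startup:

    // Conceptual only; names are hypothetical.
    internal static unsafe class StartupSketch
    {
        public static void InitializeHydratedData(byte* dehydratedSection, int dehydratedLength,
                                                  byte* hydratedRegionInBss, void*[] relocTargets)
        {
            // 1. hydratedRegionInBss lives in .bss, so the OS hands it to us zero-filled.
            // 2. Expand the compact representation into it once, during very early startup,
            //    using the Rehydrate helper from the sketch above.
            RehydrationSketch.Rehydrate(
                new System.ReadOnlySpan<byte>(dehydratedSection, dehydratedLength),
                relocTargets,
                hydratedRegionInBss);
            // 3. From this point on, generated code that compares or dereferences e.g.
            //    string's MethodTable just follows a direct pointer into this region.
        }
    }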

@VSadov (Member) commented Nov 7, 2022

Code points to this region. It is zero filled at startup by the OS and we "decompress" the compressed data structures into it during very early startup.

I see. Makes sense.
Alternatively, it could be just writeable data filled with the compressed representation plus padding to reserve the correct size, but then we would need to be sure the compressed representation actually fits, and the obj file would need to store the padding.

int command, payload;

byte[] buf = new byte[5];
Debug.Assert(Encode(1, 0, buf) == 1);
A member commented on this snippet:

this looks like IfFailGo type macro style. Why not just throw onerror instead of error codes?

@MichalStrehovsky (Member, Author) replied:

This is under #if false - it was my debugging helper to make sure I didn't mess up the bit packing (this file can be compiled on its own with the region enabled).

I'm still debating whether to check it in, delete it, or spend more time converting it to a unit test that is never going to fail because nobody will want to touch this anyway.

@sbomer (Member) left a comment:

Neat! LGTM as someone not super familiar with the code. :)

@LakshanF (Member) left a comment:

Thanks for walking through the code and doing this work to reduce the size!

I took a look at the Encode and Decode routines on the DehydratedDataCommand class and the testing there. The testing seems to cover the 1-byte boundary for EncodeShort well, and I couldn't think of any missing cases. I also took a look at DehydratedDataNode & StartupCodeHelpers where this is used, and while I wasn't fully able to follow the pointer manipulation in RehydrateData for the last two options, I assume the lookup table logic holds (and will blow up the program if it doesn't :-))

Hopefully the dependent update merges soon so the larger testing matrix can follow.
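
The one-byte boundary mentioned above can be illustrated with a toy encoder - the constants and bit layout here are made up and do not match the real DehydratedDataCommand format:

    using System;

    // Toy variable-length encoding for illustration only: 3 low bits hold the
    // command, 5 high bits hold small payloads; larger payloads escape into
    // four extra bytes.
    static class ShortEncodingSketch
    {
        private const int CommandMask = 0x7;
        private const int MaxShortPayload = 30;  // largest payload that still fits in one byte
        private const int ExtendedMarker = 31;   // payload field value meaning "4 more bytes follow"

        public static int Encode(int command, int payload, byte[] buffer)
        {
            if (payload <= MaxShortPayload)
            {
                buffer[0] = (byte)(command | (payload << 3));
                return 1;                        // short form: exactly one byte
            }
            buffer[0] = (byte)(command | (ExtendedMarker << 3));
            BitConverter.TryWriteBytes(buffer.AsSpan(1), payload);
            return 5;                            // long form: command byte + 4 payload bytes
        }

        public static int Decode(byte[] buffer, out int command, out int payload)
        {
            command = buffer[0] & CommandMask;
            int field = buffer[0] >> 3;
            if (field != ExtendedMarker)
            {
                payload = field;
                return 1;
            }
            payload = BitConverter.ToInt32(buffer, 1);
            return 5;
        }
    }

    // The interesting boundary in this toy scheme: payload 30 still encodes in one byte,
    // payload 31 does not.
    //   ShortEncodingSketch.Encode(1, 30, new byte[5]) == 1
    //   ShortEncodingSketch.Encode(1, 31, new byte[5]) == 5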

switch (command)
{
    case DehydratedDataCommand.Copy:
        // TODO: can we do any kind of memcpy here?
A member commented on this TODO:

Buffer.MemoryCopy ?

@MichalStrehovsky (Member, Author) replied:

We can't call that from Test.CoreLib. I'm also a little bit worried - this code runs during very early startup - casting, typeof, newobj and many other things are not available. Buffer.MemoryCopy feels high enough up the stack that I'm not sure it would be safe to call.

The same member replied:

Yes, this runs very early. I am not sure whether Buffer.MemoryCopy does anything complex, but since the data here is unmanaged, it would eventually just call something like InternalCalls.memmove.

Are these copied chunks long enough to involve memmove? If they are typically just 10-20 bytes, it may not get any faster.
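
A sketch of what the constraint allows - a hypothetical helper, not the actual StartupCodeHelpers code - is a bare byte loop with no dependencies on higher-level library code:

    internal static unsafe class EarlyStartupCopySketch
    {
        // Hypothetical helper, for illustration: at this point in startup, casting,
        // typeof and allocation are not available, so the copy avoids helpers like
        // Buffer.MemoryCopy entirely.
        public static void Copy(byte* source, byte* destination, int count)
        {
            // Copied chunks are typically small (tens of bytes), so a simple forward
            // loop is adequate and pulls in nothing that isn't safe this early.
            for (int i = 0; i < count; i++)
                destination[i] = source[i];
        }
    }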

@VSadov (Member) left a comment:

LGTM! Nice!

@lambdageek (Member) commented:

/cc @ivanpovazan we should rerun our sample after this is merged and see if we get more size reduction

@eerhardt (Member) commented:

I can confirm this change makes a major size reduction on Linux. Publishing a simple ASP.NET WebAPI app with NativeAOT:

Before: 29.31 MB
After: 19.78 MB

This is a 32.5% size reduction!

MichalStrehovsky added a commit that referenced this pull request Nov 20, 2022
This is a follow-up to #77884. In the original pull request, all relocation targets went into a lookup table. This is not a very efficient way to represent rarely used relocs. In this update, I'm extending the dehydration format to allow representing relocations inline - instead of indirecting through the lookup table, the target immediately follows the instruction. I'm changing the emitter to emit this form if there are fewer than 3 references to the reloc.

This produces a ~0.5% size saving. It likely also speeds up the decoding at runtime since there's less cache thrashing. On a hello world, the lookup table originally had about 11k entries. With this change, the lookup table only has 1700 entries.

If multiple relocations follow one another, generate a single command with the payload specifying the number of subsequent relocations. This saves an additional 0.1%.
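
A sketch of that emitter-side decision, with hypothetical types, names, and command values (only the "fewer than 3 references" threshold comes from the commit description above):

    using System;
    using System.Collections.Generic;

    static class RelocEmissionSketch
    {
        private const byte PointerViaLookupTable = 1;   // payload = table index
        private const byte InlinePointer = 2;           // target bytes follow the command
        private const int InlineThreshold = 3;          // fewer references than this => emit inline

        // Simplified single-byte command form; assumes the payload fits.
        private static void EmitCommand(List<byte> stream, byte command, int payload)
            => stream.Add((byte)(command | (payload << 3)));

        public static void EmitPointer(long targetSymbolId,
                                       IReadOnlyDictionary<long, int> referenceCounts,
                                       List<long> lookupTable,
                                       List<byte> stream)
        {
            if (referenceCounts[targetSymbolId] >= InlineThreshold)
            {
                // Frequently referenced target: worth a lookup-table slot shared by all references.
                int index = lookupTable.IndexOf(targetSymbolId);
                if (index < 0)
                {
                    index = lookupTable.Count;
                    lookupTable.Add(targetSymbolId);
                }
                EmitCommand(stream, PointerViaLookupTable, index);
            }
            else
            {
                // Rarely referenced target: write it inline right after the command
                // instead of spending a lookup-table entry on it.
                EmitCommand(stream, InlinePointer, 0);
                stream.AddRange(BitConverter.GetBytes(targetSymbolId));
            }
        }
    }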
@MichalStrehovsky (Member, Author) replied:

I can confirm this change makes a major size reduction on Linux. Publishing a simple ASP.NET WebAPI app with NativeAOT:

Nice! Thank you for measuring it! #78545 should bring maybe another ~1% saving, and I have a list of a couple more savings (not quite 32.5%, but maybe another 5%).

@eerhardt (Member) replied:

Just a quick clarification - I later discovered that these numbers also included #78198, which cut ~1 MB off that app's size. So the full 32.5% decrease was with both of these changes; #77884 contributed the vast majority of it, though.

@ghost locked as resolved and limited the conversation to collaborators on Dec 21, 2022