[wasm] Interpreter automatic PGO #92981
Conversation
Tagging subscribers to this area: @BrzVlad, @kotlarmilos
src/mono/mono/mini/interp/interpgo.c
Outdated
size_t size = sizeof(uint32_t) + 16;
uint32_t *inbuf = alloca (size);
// method tokens are globally unique within a given assembly
inbuf[0] = mono_method_get_token (method);
This is not true for generic instances.
OK, so I need to hash the generic arguments? Or is it more complex than that?
Some generic instances are very hot in the BCL, like the ones used in string/array searches.
It could hash the full method name, for example, but maybe it's good enough this way.
I think it's probably fine this way; if any instance of the method got tiered, we can probably tier all of them. This still won't generate code for other instances until they're actually used.
If it's a wrapper method, I think the token is either shared with the method that it is wrapping, or it is zero. Maybe we don't mind the collisions?
Can we share the AOT compiler's logic for uniquely identifying methods?
The AOT compiler/runtime has mono_aot_method_hash (), which returns a good hash.
mono_aot_method_hash looks comprehensive, but it also looks really expensive, and I'm not sure how well it will avoid collisions. Do you still want me to use it?
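(For illustration only, not the PR's actual code: the runtime side is C and, per the commit list, uses murmurhash. This TypeScript sketch just shows why hashing the full method name, which includes generic arguments, disambiguates instances that share a metadata token. fnv1a32 and methodKey are hypothetical names.)

```ts
// Sketch: FNV-1a used here for brevity; the real C code reportedly uses murmurhash.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;                // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime, wraps at 32 bits
  }
  return hash >>> 0;
}

// Hypothetical key: List<int>.Add and List<string>.Add share the token of the
// generic definition, but their full names differ, so the keys differ too.
function methodKey(token: number, fullName: string): number {
  return (token ^ fnv1a32(fullName)) >>> 0;
}

console.log(methodKey(0x06000001, "System.Collections.Generic.List<int>.Add"));
console.log(methodKey(0x06000001, "System.Collections.Generic.List<string>.Add"));
```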
WBT changes look good.
}

{
    _testOutput.WriteLine("/// Second run");
I'm not very familiar with Playwright; should we navigate the browser away before we load the page a second time?
Each page is basically a separate tab, as I understand it.
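(For context, a minimal Playwright sketch of that idea; TypeScript here, though the WBT test itself is C#, and the URL is a placeholder.)

```ts
import { chromium } from 'playwright';

async function runTwice() {
  const browser = await chromium.launch();

  // First run: no PGO table yet; tiered methods are recorded and saved.
  const firstRun = await browser.newPage();
  await firstRun.goto('http://localhost:8000/'); // placeholder URL

  // Second run in a fresh page (i.e. a fresh tab): the saved PGO table is
  // loaded at startup, so no explicit navigate-away is needed between runs.
  const secondRun = await browser.newPage();
  await secondRun.goto('http://localhost:8000/');

  await browser.close();
}

runTwice();
```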
Commits
- Checkpoint: Interpgo bloom filter works
- Workaround for 'reach managed cold' hanging forever
- Remove logging
- Checkpoint
- Checkpoint
- Configurable interpreter PGO
- Enable interpgo for browser bench
- Checkpoint: Refactor interpgo to share cache code with memory snapshots
- Use cache storage for Interpreter PGO
- Code cleanup + repair merge damage
- Repair merge damage
- Fix some bugs and troubleshoot WBT failure
- Fix build
- Move prototypes
- I love C
- Fix key collision between interpgo and memory snapshot
- Fix storeCacheEntry always writing to the memory snapshot key
- Checkpoint
- Move more of interpgo outside of platform guards
- Fix build
- Make the inline murmurhash functions also static to fix linker errors
- Remove unused signature argument
- Cleanup
- Address PR feedback
- Rearrange code
- Address PR feedback
- Fix assert when loading empty table
- Checkpoint
- Update dotnet.d.ts
- Move interp codegen timing to a runtime option + make it thread safe
- Move interp pgo logging to a runtime option
- Add a WBT test that verifies basic functionality of interpreter PGO
- Add missing file
- Await interp_pgo_save_data
- Apply suggestions from code review (Co-authored-by: Ankit Jain <[email protected]>)
- Cleanup / address PR feedback
- Address PR feedback
- Increase browser-bench waitFor timeout
- Fix syntax
This PR adds automatic PGO for interpreter tiering. The idea is that by maintaining a list of which interpreter methods were tiered during execution, we can skip directly to generating optimized code on future runs. This will get the application into a steady high-performance state faster on future runs, while also reducing the amount of time/memory we spend on code generation (since normally we generate unoptimized code, then generate optimized code).
The infrastructure for this feature is not platform gated, but the only platform that has an implementation for loading/saving the PGO data right now is the browser, and the runtime option controlling it is only default-on for the browser.
Web application code can either use INTERNAL.interpgo_load_data and INTERNAL.interpgo_save_data to manually control loading and saving of the data, or use the new withInterpreterPgo(true, autoSaveDelay) builder method to turn on automatic mode. Automatic mode will attempt to load the data during startup, and will automatically save it after a delay (at which point your application should be in something approximating a steady state, and critical code has already tiered). The current web implementation repurposes the cache storage code from Pavel's memory snapshot feature to store the PGO table in the cache next to the snapshot(s).
This PR also adds basic timing instrumentation for generate_code (controlled by a #define and off by default), because in my testing the browser's profiler was producing wildly incorrect timing data for interpreter codegen. This should make it easy to evaluate how well the feature is working, since you can just look at the output.
One interesting finding: the list of method hashes generated on a first run is bigger than the list generated on future runs. I believe this is because once a full set of methods is tiered, we no longer need the tiered version of some of them, because the hottest methods have gotten inlined into their callers. So on a second run the list of hashes stabilizes into a smaller list.
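(To make the options above concrete, a hedged sketch of what opting in might look like from application code. withInterpreterPgo, interpgo_load_data, and interpgo_save_data are named in this PR, but the import path, the builder chaining, how INTERNAL is reached, and the delay units are assumptions.)

```ts
import { dotnet } from './dotnet.js';  // import path is an assumption

// Automatic mode: load the PGO table at startup and auto-save after a delay,
// once the app has (hopefully) reached a steady state.
const runtime = await dotnet
    .withInterpreterPgo(true, 10000)   // (enable, autoSaveDelay); ms assumed
    .create();                         // builder chaining is an assumption

// Manual mode instead: drive loading/saving yourself via INTERNAL
// (exact way to reach INTERNAL from the runtime object is an assumption).
// await runtime.INTERNAL.interpgo_load_data();
// ... exercise the hot paths so methods tier ...
// await runtime.INTERNAL.interpgo_save_data();
```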
In a couple of local test runs of browser-bench with and without this active, the time spent in generate_code was 3600ms without interpgo and 3209ms with it, so the improvement over the course of a full run of an application with lots of generated code should be worthwhile. For smaller applications it's below the noise floor; I haven't been able to measure it successfully at that scale.