Test failure tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh #59296
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @tommcdon
Issue Details
Run: runtime 20210917.69
Failed test:
Error message: |
@mikem8361 there is an interesting error in the logs from when it tried to grab a dump:
Any idea what happened? There is no dump or artifact besides the logs indicating a timeout. I'll see if I can repro this locally and do a quick search in the Helix data to see if this is a one-off failure. |
I have no idea what happened here. This error is returned when DAC initialization fails, which can happen for various reasons, usually because it can't find/get/initialize the global DAC table in the runtime. |
Converted to a runfo live issue, happened 6 times this week. |
The 4 failures with core dumps are due to an assert where the runtime tries to load an AOT version of the assembly while running in full AOT mode. Since this is not a full AOT test lane, there are no AOT versions of the assemblies: Failed to load AOT module '/datadisks/disk1/work/A70408DA/w/BEDE09DE/e/baseservices/varargs/varargsupport_r/varargsupport_r.dll.so' ('/datadisks/disk1/work/A70408DA/w/BEDE09DE/e/baseservices/varargs/varargsupport_r/varargsupport_r.dll') in aot-only mode. This will fail several different kinds of tests, not just the test tracked by this issue; the message above is from a failure in varargsupport_r.dll, for example. For some reason the runtime gets executed in full AOT mode even on lanes that don't use that mode, so nothing has AOT-compiled the assemblies, which causes the assert.
1388663 seems to be unrelated and fails a lot of different tests due to failing to install the mobile app; it could be infra related.
1389780, 1390826 and 1391966 seem to be related to the same issue in rundownvalidation.sh. I think we could disable the test and create an issue. We would need to see if we can get a local repro, since we don't get any info from Mono on hung processes on CI, so there is currently no way to see where it hangs. |
@josalem consider disabling test while investigation is pending |
I can do that during the day. |
Investigating. I have a repro on OSX, but it only reproduces very infrequently (on release builds). I've been able to track it down to the file streaming stack hash map: duplicate entries are not correctly detected for this specific hash map (which uses custom hash key and compare functions). The hash map implementation is runtime specific, so this is likely a Mono-specific issue. |
As observed by dotnet#59296, the EventPipe streaming thread could infrequently enter an infinite loop on Mono when cleaning up the stack hash map (ep_rt_stack_hash_remove_all, called from ep_file_write_sequence_point while flushing buffer memory into the file stream). The issue only occurred on Release builds, has so far only been observed on OSX, and reproduced in about 1 of 100 runs of the test suite.
Debugging the assembly at the point of the hang showed that one item in the hash map had a hash key that did not correspond to its hash bucket. This should not be possible, since items are placed into buckets based on a hash key value that does not change for the lifetime of the item. It indicates that the key is being corrupted after it has been added to the hash map.
Further instrumentation showed that an insert into the hash map infrequently triggers a replace. The Mono hash table used in EventPipe is set up to insert without replace, meaning it keeps the old key but swaps in the new value and frees the old one. The stack hash map uses the same memory for its key and value, so freeing the old value also frees the key. Because the old key is kept, it then points into freed memory, and any later reuse of that memory region corrupts the hash table key.
This scenario should not be possible either, since EventPipe code only adds an item to the hash map if it is not already there. Further investigation showed that the call to ep_rt_stack_hash_lookup reports false, while the call to ep_rt_stack_hash_add for the same key hits the replace scenario in g_hash_table_insert_replace. g_hash_table_insert_replace finds items in the hash map using callbacks for hashing and comparing keys. The equal callback is declared to return gboolean, while the callback implementation used in EventPipe is defined to return bool. gboolean is typed as int32_t on Mono, and this is the root cause of the whole issue.
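To make the key/value aliasing problem above concrete, here is a minimal sketch in C. It is not the Mono hash table code: `entry_model_t`, `slot_model_t` and `slot_insert_keep_key` are hypothetical names, and the table is reduced to a single slot. It only models the "keep old key, free old value" behaviour and reports when the kept key pointed at the allocation that was just freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical entry type: like the stack hash map, the same allocation
 * serves as both key and value. */
typedef struct {
    int hash_key;
} entry_model_t;

/* Hypothetical single-slot table holding separate key and value pointers. */
typedef struct {
    entry_model_t *key;
    entry_model_t *value;
} slot_model_t;

/* Insert without replace: on a reported key match, keep the old key pointer,
 * free the old value and store the new value. Returns true when the kept key
 * pointed at the allocation that was just freed (i.e. the key now dangles). */
static bool
slot_insert_keep_key (slot_model_t *slot, entry_model_t *item, bool reported_equal)
{
    if (slot->key == NULL) {
        slot->key = item;
        slot->value = item;
        return false;
    }
    if (!reported_equal)
        return false; /* a real table would chain or probe here */

    /* Compare the pointers before freeing, so no freed pointer is inspected. */
    bool key_shares_memory = (slot->key == slot->value);
    free (slot->value);
    slot->value = item;
    return key_shares_memory;
}
```

In this model, a second insert with `reported_equal` set to true leaves `slot->key` pointing at freed memory, which is the state that later shows up as a key that no longer matches its bucket.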
On optimized OSX builds (and potentially on other platforms) the callback does a memcmp, which updates the full eax register. When returning, the callback only updates the first byte of eax to 0/1 and leaves the upper bits untouched. So if memcmp returns a negative value, or a positive value that does not fit in the first byte, eax contains garbage in bytes 2, 3 and 4. Since Mono's g_hash_table_insert_replace expects a gboolean, it looks at the complete eax content: if any bit in bytes 2, 3 or 4 is still set, the condition is true even when byte 1 is 0 (false). This incorrectly triggers the replace logic, which frees the old value and key and opens up for future corruption of the key, which now references freed memory.
The fix is to make sure the callback signatures used with the hash map callbacks match the signatures expected by the underlying container implementation. The fix also adds a checked-build assert to the hash map's add implementation on Mono, validating that the added key is not already in the hash map, which forces callers to check for existence before calling add.
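The register behaviour described above can be modelled in portable C without actually calling through a mismatched function pointer (which would be undefined behaviour). The sketch below is an illustration under assumptions, not the runtime code: `gboolean_model`, `key_model_t`, `model_bool_return`, `model_caller_sees_true` and `key_equal` are hypothetical names. It shows a callee that writes only the low byte of a 32-bit return register, a caller that tests the whole register, and a callback whose return type matches what the container expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: mirrors the 32-bit gboolean described in the text. */
typedef int32_t gboolean_model;

/* Hypothetical key layout, used only for this illustration. */
typedef struct {
    uint32_t hash;
    uint8_t bytes[8];
} key_model_t;

/* Models a callee declared to return bool: only the low byte of the 32-bit
 * return register is written, the upper three bytes keep whatever was there
 * (for example a leftover memcmp result). */
static uint32_t
model_bool_return (uint32_t reg_before, bool result)
{
    return (reg_before & 0xFFFFFF00u) | (result ? 1u : 0u);
}

/* Models a caller that expects a 32-bit gboolean and tests the full register. */
static bool
model_caller_sees_true (uint32_t reg)
{
    return reg != 0;
}

/* Callback shape with a return type matching the container's expectation,
 * so the full 32-bit value is well defined: 1 for equal, 0 otherwise. */
static gboolean_model
key_equal (const void *a, const void *b)
{
    const key_model_t *key_a = (const key_model_t *)a;
    const key_model_t *key_b = (const key_model_t *)b;
    return memcmp (key_a->bytes, key_b->bytes, sizeof (key_a->bytes)) == 0 ? 1 : 0;
}
```

With a leftover register value of 0xFFFFFF00 (a negative memcmp result whose low byte is 0), the modelled caller treats a `false` return as true, which corresponds to the spurious replace. A callback returning the 32-bit type directly avoids depending on what the upper bytes happen to contain.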
…ad. (#72517) * Fix infrequent infinite loop on Mono EventPipe streaming thread.
Runfo Tracking Issue: tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh
Build Result Summary