Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increased amount of ANRs after disabling concurrent GC #9365

Open
TimBurik opened this issue Oct 3, 2024 · 23 comments
Open

Increased amount of ANRs after disabling concurrent GC #9365

TimBurik opened this issue Oct 3, 2024 · 23 comments
Assignees
Labels
Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc.

Comments

@TimBurik
Copy link

TimBurik commented Oct 3, 2024

Android framework version

net8.0-android

Affected platform version

.NET 8.0.303

Description

After switching from Xamarin.Android to .Net8 we used to get a lot of native crashes in monosgen, for example 222003f52fa3496385d14a89c778a6e4-symbolicated.txt

After long investigation (and a hint from the issue dotnet/runtime#100311) It turns out that concurrent SGen is actually enabled by default in .net-android

<AndroidEnableSGenConcurrent Condition=" '$(AndroidEnableSGenConcurrent)' == '' ">true</AndroidEnableSGenConcurrent>

so we explicitly disable it - and now the amount of native crashes in monosgen is minimal, but instead we are getting a lot of ANR reports in Sentry and GooglePlay Console.

ANRs seems to be reported using Android's ApplicationExitInfo mechanism, according to stacktraces main thread seems to be blocked by awaiting native mutex somewhere in the monosgen (example: anr_stacktrace.txt)

Additional information, which might be relevant:

  • according to GooglePlay Console, majority of the ANRs related to the broadcast of either Intent { act=android.intent.action.SCREEN_OFF } or Intent { act=android.intent.action.SCREEN_ON };
  • majority of ANRs seems to be happening when application is in the background;

Steps to Reproduce

Unfortunately, we don't have exact steps to reproduce.
The only thing that is sure that it is happening when targeting .net-android34.0 (version for Xamarin.Android doesn't have this issue) and issue started happening after adding the following to the csproj:
<AndroidEnableSGenConcurrent>false</AndroidEnableSGenConcurrent>

Did you find any workaround?

No workaround found yet

Relevant log output

No response

@TimBurik TimBurik added Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 3, 2024
@grendello
Copy link
Contributor

@TimBurik I'm afraid we can't do anything for you here, this appears to be an issue split between MonoVM and the Sentry Native SDK. @lambdageek do you think someone on your side could take a look to see if anything can be done here?

There's an issue being worked on related to Sentry, #9055, perhaps this here issue is also caused by the same problems?

@supervacuus, sorry for tagging you out of the blue, but would you be able to look into the Sentry side of things here?

I realize we have precious little information, but perhaps someone will be able to spot something in their respective areas and help out.

@grendello grendello added Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc. and removed Area: App Runtime Issues in `libmonodroid.so`. needs-triage Issues that need to be assigned. labels Oct 4, 2024
@TimBurik
Copy link
Author

TimBurik commented Oct 4, 2024

@grendello thank you for the response.

Regarding #9055 - we are aware of this issue, and we have a separate native crash group, which is most likely related to this issue. But the ANRs doesn't seem to be related to this issue and to Sentry, because we are seeing exactly the same picture in the GooglePlay Console - increased amount of ANRs, all containing libmonosgen-2.0 in stack traces

@grendello
Copy link
Contributor

The GC ANRs might be related to something inside the GC bridge, hopefully @lambdageek will be able to help out here. However, this ANR looks familiar:

#00  pc 0x0000000000089cf0  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+32)
#01  pc 0x000000000008e6d8  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+148)
#02  pc 0x000000000009c3dc  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001eeb20  /data/app/~~x6iHlhee06D2KSrJ3_D8mw==/<app.bundle.id>-UM4qcDy_vV_DZ6DnDt2oCA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so

You are most likely not using marshal methods, since they weren't enabled by default in NET8, but we've fixed an issue recently which was related to them and one of the repros had the above stack trace as well. The problem was related to Java ("native") threads being incorrectly attached to the MonoVM runtime and putting the GC in a bad state. You can find more info about it in this and the following comments. @filipnavara, does this look familiar?

Perhaps something in your app causes a similar corruption of the GC state?

@TimBurik if you're able to reproduce the ANRs locally, would you be able to test with NET9 rc2?

@TimBurik
Copy link
Author

TimBurik commented Oct 4, 2024

@grendello thank you for the pointers!

We are planning to test the app on .NET 9 in the nearest future, we would make sure to use RC2 for those tests (when RC2 would be available).

Also, we have added debug symbols for libmonosgen-2.0 into GooglePlay Console and now we have more detailed logs. For example, this is the reported main thread stack trace from one of the ANRs:

  #00  pc 0x0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #01  pc 0x000000000007d7e0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
  #02  pc 0x000000000008ba8c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
  #03  pc 0x00000000001eeb20  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204) 
  #04  pc 0x00000000001ee844  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+323) 
  #05  pc 0x00000000002a50f4  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349) 
  #06  pc 0x00000000002a4e84  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159) 
  #07  pc 0x00000000002c5350  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3974) 
  #08  pc 0x00000000002c21bc  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637) 
  #09  pc 0x00000000002c20ec  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616) 
  #10  pc 0x00000000002be8b8  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279) 
  #11  pc 0x00000000002bed9c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454) 
  #12  pc 0x00000000002a686c  /data/app/~~BDdfU_FYGQL1DW1U7YrqhQ==/<app.bundle.id>-RkmEjhS8iZlVjNzV7bp1xA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690) 
  #13  pc 0x0000000000004438 

@grendello
Copy link
Contributor

@TimBurik this more and more looks like the issue fixed by @filipnavara, he declared he'd take a look next week - it would be great if you were able to test with NET9 RC2 before then :)

@grendello
Copy link
Contributor

@TimBurik if RC2 isn't available by next week, do try with RC1 - it's to see if the issues you're seeing still even exist in NET9

@supervacuus
Copy link

I've been following the thread since you tagged me. It seems this isn't sentry-related. I've informed sentry peeps, just in case.

@grendello
Copy link
Contributor

@supervacuus thanks, I tagged you just in case - this is all low-level stuff, it might be affecting Sentry as well in some ways.

@lambdageek
Copy link
Member

/cc @vitek-karas

@TimBurik
Copy link
Author

@grendello sorry for late response
we've just released .Net9-rc2 build to a small population in production (we still don't have a reliable way to reproduce it), and unfortunately we see the same ANR reports in the GooglePlay Console.
Here are some more examples of the main thread stack traces from those reports:

#00  pc 0x000000000004f85c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x0000000000054330  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x0000000000062d4c  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+108)
#03  pc 0x00000000001faecc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b2f84  /data/app/~~a38aiG3R6nUa1cKdk8VPjw==/<app.bundle.id>-mWkT091cROSJn00EYViQrQ==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_vector+1119)
#12  pc 0x000000000000cb00 
#00  pc 0x000000000001daec  /system/lib64/libc.so (syscall+28)
#01  pc 0x0000000000028a88  /system/lib64/libc.so (sem_wait+88)
#02  pc 0x00000000001fac20  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_thread_info_wait_for_resume+204)
#03  pc 0x000000000020035c  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_enter_gc_unsafe_region_unbalanced_with_info+503)
#04  pc 0x0000000000283944  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop_internal+4712)
#05  pc 0x0000000000283978  /data/app/<app.bundle.id>-fM0-AVfJ2sE77in08Tz1vA==/lib/arm64/libmonosgen-2.0.so (mono_threads_attach_coop+4731)
#06  pc 0x000000000000b5d8
#00  pc 0x00000000000902fc  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x00000000000950d0  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x00000000000a3670  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+112)
#03  pc 0x00000000001faecc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002c0afc  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj+454)
#12  pc 0x00000000002b2588  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_obj+690)
#13  pc 0x0000000000274ea0  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_object_new_specific_checked+5494)
#14  pc 0x00000000002746e4  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (extract_this_ptr+5341)
#15  pc 0x0000000000274b70  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_runtime_try_invoke_byrefs+5202)
#16  pc 0x000000000022fa28  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke+3629)
#17  pc 0x0000000000238e40  /data/app/~~A69RIDgDy0yqESTrZAqEOw==/<app.bundle.id>-d_4yE8at6_OCBu1Yazl1ww==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_InternalInvoke_raw+370)
#18  pc 0x0000000000004494
#00  pc 0x000000000008921c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
#01  pc 0x000000000008db00  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+144)
#02  pc 0x000000000009c5a8  /apex/com.android.runtime/lib64/bionic/libc.so (sem_wait+104)
#03  pc 0x00000000001faecc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_os_sem_timedwait+204)
#04  pc 0x00000000001fadf4  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_threads_wait_pending_operations+329)
#05  pc 0x00000000002b0e24  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (unified_suspend_stop_world+349)
#06  pc 0x00000000002b0bcc  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_client_stop_world+159)
#07  pc 0x00000000002c6ec8  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_stop_world+3988)
#08  pc 0x00000000002c3e44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_perform_collection+2637)
#09  pc 0x00000000002c3d74  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_ensure_free_space+2616)
#10  pc 0x00000000002c0618  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (sgen_alloc_obj_nolock+279)
#11  pc 0x00000000002b3168  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_gc_alloc_string+1208)
#12  pc 0x000000000026da44  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_checked+6308)
#13  pc 0x000000000027690c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (mono_string_new_size_handle+6282)
#14  pc 0x000000000023d97c  /data/app/~~4fJB5MlMs7MjKdsK0oDbeA==/<app.bundle.id>-ZBgF1F4scx3dFx_v9ZdwHA==/split_config.arm64_v8a.apk!libmonosgen-2.0.so (ves_icall_System_String_FastAllocateString_raw+550)
#15  pc 0x0000000000008024 

@BrzVlad
Copy link
Member

BrzVlad commented Oct 22, 2024

I don't know how the ANR system works, I would imagine they happen if the app is non responsive for a while. If this is the case, I'm not sure whether this issue can actually be considered to be a bug. We seem to have a problem with the concurrent GC in the runtime and this issue needs to be investigated somehow. Disabling concurrent GC will just lead to longer GC pause times which seem to trigger this ANR. Is a GC pause time of 0.5 - 1s supposed to trigger this ANR ? If this is true, it seems to me that the best way to move forward would be to share the application so we can see if the serial GC takes longer than expected for whatever reason and mainly to try some debug flags to see if we detect some problems with the concurrent collector.

@TimBurik
Copy link
Author

@BrzVlad according to the Android documentation, one of the reasons for ANR is when BroadcastReceiver has failed to process an event within 5 seconds - this seems to be our case, as there's an additional description in the GooglePlay Console like "Broadcast of Intent { act=android.intent.action.SCREEN_ON }"

By the way, let me share a full report from the GooglePlay Console, which contains stack traces from all the threads: anr_all_stacktraces.txt
Maybe it contains some hints to what the main thread may be waiting for.

I could also try to gather additional information regarding the GC performance, but in general we didn't notice any significant performance degradation after disabling concurrent GC

@grendello
Copy link
Contributor

@TimBurik is the BroadcastReceiver in your case running in an isolated process?

@BrzVlad if it's a problem with BroadcastReceiver, then perhaps it's "simply" a matter of a Java thread incorrectly attached to MonoVM?

@TimBurik
Copy link
Author

TimBurik commented Oct 22, 2024

@grendello in our case most of the ANRs are related to the SystemEventsBroadcastReceiver from Sentry, which I believe is run in the same process by default
https://github.com/getsentry/sentry-java/blob/4988d5bf9e0a5be708853cf53d07b8503e4c77be/sentry-android-core/src/main/java/io/sentry/android/core/SystemEventsBreadcrumbsIntegration.java#L124-L142

But this is not the only source for ANRs for us, as we also have reports:

All of those reports have very similar main thread stack traces, related to GC.

It's even seems that those broadcast events, service calls and input events are not really causing the issue, and the main thread is already blocked by the time these events occur.

@TimBurik
Copy link
Author

@grendello @BrzVlad is there anything else we could do to help you shed the light on this issue?
Unfortunately, I don't believe we would be able to share the app, and, as I've mentioned before, we don't really have a reliable way to reproduce the issue yet.. But we sure could check other theories or gather additional information, if you have any ideas.

@BrzVlad
Copy link
Member

BrzVlad commented Oct 23, 2024

Seing that this issue happens after disabling concurrent GC I was biased into assuming that the GCs simply take longer to run and this is causing issues. After a second look at the stacktraces, it indeed looks like the suspend machinery is hanging. It makes no sense to me why this issue would only happen with serial GC though. We have an env var MONO_SLEEP_ABORT_LIMIT which makes it such that the application will abort if the suspend takes longer than the provided limit in ms, while also printing each thread state to the console. I think you should set this var to something that wouldn't trigger ANRs, like 1000ms maybe (suspending threads shouldn't take more than a few milliseconds). If we get explicit crashes in the runtime then this should confirm that the issue is in our runtime and maybe also make it easier to reproduce if we set a more aggressive suspend limit locally.

@TimBurik
Copy link
Author

TimBurik commented Nov 7, 2024

@BrzVlad could you also give a hint to where the threads dump could be found?
We did manage to find a pretty consistent way to reproduce the "stop the world hang" (although with an extreme abort limit of 40ms). Hang is always happening after a few minutes when application is backgrounded, but never happens while the app is in use.

In logcat we see the message:
E suspend_thread suspend took 40 ms, which is more than the allowed 40 ms
but the threads dump, which should be printed before, is nowhere to be found.

We are using the following values as Android environment variables:

MONO_SLEEP_ABORT_LIMIT=40
MONO_LOG_LEVEL=info
MONO_LOG_MASK=gc

and also we are using the adb shell setprop debug.mono.log gc command.

@BrzVlad
Copy link
Member

BrzVlad commented Nov 7, 2024

I would have expected that the suspend state of each thread to be printed but it looks like that logging path doesn't go through the android log but rather to the console. It would be useful for us to try to reproduce this failure but given the abort limit is not excessive and it only happens while in background, it makes me wonder whether it just the os scheduling the app less frequently and the suspend process just taking a bit longer than usual, because threads reach GC safepoint location slower.

@TimBurik
Copy link
Author

TimBurik commented Nov 7, 2024

This issue doesn't seem to affect every GC in the background, in the log cat we also see a couple of GC-related logs:

2024-11-07 16:08:11.722 19003-19011 <app.id>  I  Background concurrent copying GC freed 5559KB AllocSpace bytes, 57(1168KB) LOS objects, 49% free, 11MB/23MB, paused 3.069ms,1.275ms total 242.239ms
2024-11-07 16:08:55.162 19003-19003 <app.id>  I  Explicit concurrent copying GC freed 12MB AllocSpace bytes, 20(400KB) LOS objects, 49% free, 8040KB/15MB, paused 442us,230us total 150.742ms
.......
2024-11-07 16:09:31.504 19003-19011 <app.id>  I  Background young concurrent copying GC freed 6048KB AllocSpace bytes, 0(0B) LOS objects, 38% free, 9960KB/15MB, paused 577us,289us total 106.764ms
2024-11-07 16:09:58.109 19003-19011 <app.id>  I  Background concurrent copying GC freed 4932KB AllocSpace bytes, 0(0B) LOS objects, 49% free, 10MB/21MB, paused 196us,144us total 100.312ms
.......
2024-11-07 16:10:18.513 19003-19003 <app.id>  E  suspend_thread suspend took 40 ms, which is more than the allowed 40 ms

I'm also gonna try to reproduce this issue with bigger values for MONO_SLEEP_ABORT_LIMIT, just to make sure that this is a proper "hanging", not just performance degradation spike of the background process. But it would be sure useful to collect as much information as possible when this reproduction happens. Do you know if there is any way to extract those threads dump messages from the Android device?

@TimBurik
Copy link
Author

TimBurik commented Nov 8, 2024

@BrzVlad Just managed to reproduce the issue with limit value of 1s:

E suspend_thread suspend took 1000 ms, which is more than the allowed 1000 ms

The reproduction steps are the same - put application to the background and wait, although this time it required also several tries (foregrounding/bacgrounding app back) and considerable amount of waiting (30m+ after each backgrounding)

@BrzVlad
Copy link
Member

BrzVlad commented Nov 12, 2024

Would it be possible to provide the sample app with some reproduction steps ?

@TimBurik
Copy link
Author

Let me try to reproduce the issue on the sample app, and also discuss with the legal team possibility to share our production code.

Meanwhile, I have found a similar issue with a deadlock in GC stopping the world mono/mono#21582, which ended up being fixed by changing the value of the MONO_THREADS_SUSPEND. @BrzVlad do you know what is the default thread suspension mode for .net-android? Could this mode be related to this issue?

@BrzVlad
Copy link
Member

BrzVlad commented Nov 12, 2024

That is a good observation. The default suspend mode should by hybrid. It definitely looks like it could be related, you should try using the preemptive mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Mono Runtime Mono-related issues: BCL bugs, AOT issues, etc.
Projects
None yet
Development

No branches or pull requests

6 participants