Fixed problems with how sampling is done and how we suspend to change heap count in DATAS (#91712)

+ Moved the sample recording to when we are suspended. Previously the throughput cost was calculated in check_heap_count, which is called right after we restart EE on heap0: there we recorded msl_wait_time and reset it to 0 for soh/uoh. This is not synchronized with the allocating threads, which are already running at that point. So the allocating threads may have already accumulated more wait time, which gets attributed to this GC even though it falls outside the period we are counting for this GC (and we lose that part for the next GC). For BGC the old behavior was incorrect. If an ephemeral GC did happen before the BGC started, we'd add a sample for that GC, which is basically correct for that eph GC. But if no eph GC happened, we'd just add a random sample that calculates the tcp as (msl wait + whatever GC finished before this BGC), which is obviously incorrect.
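The read-and-reset described above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the runtime's code: `SampleRecorder`, `add_wait`, and `record_sample` are hypothetical names, and the cost formula (wait time plus GC pause, as a percentage of elapsed time) is an assumption based on the description of tcp in this message.

```cpp
#include <cstdint>

// Hypothetical sketch; names and formula are assumptions, not gc.cpp code.
struct SampleRecorder {
    uint64_t msl_wait_time_us = 0;    // accumulated by allocating threads
    uint64_t last_sample_end_us = 0;  // end of the previous sampling period

    // Called by allocating threads while they are running.
    void add_wait(uint64_t us) { msl_wait_time_us += us; }

    // Intended to be called while allocating threads are suspended, so no
    // thread can add wait time between the read and the reset below.
    double record_sample(uint64_t now_us, uint64_t gc_pause_us) {
        uint64_t elapsed = now_us - last_sample_end_us;
        uint64_t wait = msl_wait_time_us;
        msl_wait_time_us = 0;         // reset for the next period
        last_sample_end_us = now_us;
        if (elapsed == 0) return 0.0; // avoid dividing by zero
        return 100.0 * static_cast<double>(wait + gc_pause_us) /
               static_cast<double>(elapsed);
    }
};
```

If `record_sample` ran after the restart instead, an `add_wait` call could land between the read and the reset and that wait time would be lost, which is the problem this change avoids.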

+ Added gen2 sampling, adapted from Peter's gen2 sampling changes. This serves as a backstop in case the existing sampling never picks up gen2 GC costs. I made the following fixes:

1) changed the way we calculated the median

2) moved where this is calculated to again avoid timing issues

3) made the gen2 samples actually count instead of losing that info if we happen to sample when a gen2 didn't just occur.
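One way to picture fixes 1) and 3) is a separate window that only receives a sample when a gen2 GC has just finished, with a median taken over whatever is in the window. The commit message does not say how the median calculation changed, so the sketch below is only an assumed shape; `Gen2Samples`, `kWindow`, and the window size of 3 are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical sketch; the window size and all names are assumptions.
constexpr std::size_t kWindow = 3;

struct Gen2Samples {
    std::array<double, kWindow> cost{};  // gen2 cost percentages
    std::size_t count = 0;               // total gen2 samples recorded

    // Called only when a gen2 GC has just finished, so a gen2 sample is
    // never dropped because the regular sampling point missed it.
    void add(double gen2_cost_percent) {
        cost[count % kWindow] = gen2_cost_percent;  // overwrite the oldest
        ++count;
    }

    // Median of the samples currently in the window (0 if there are none).
    double median() const {
        std::size_t n = std::min(count, kWindow);
        if (n == 0) return 0.0;
        std::array<double, kWindow> sorted = cost;  // sort a copy
        std::sort(sorted.begin(), sorted.begin() + n);
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
};
```

Keeping the gen2 samples in their own window means the backstop still has data even if the regular samples never happen to coincide with a gen2 GC.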

+ Changed when check_heap_count is called. The previous place was right after a suspension, which does not help with spacing the suspensions out (it was "suspend for GC" then "immediately suspend to change heap count"). It also caused a problem with BGC: it always tried to change the heap count when it couldn't, because a BGC was in progress. I changed this to be on a timeout to intentionally space the suspensions out. Now, most of the time, heap count changes happen due to this timeout. If GCs are happening so quickly that we return from waiting on the ee_suspend_event because a GC started, we change the heap count right before we do that GC. So this also helps with the BGC problem.
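The resulting decision after waking from the wait can be sketched as a small pure function. This only illustrates the control flow described above under assumed names (`Wake`, `Action`, `on_wake`, and the `change_pending` flag are hypothetical); the real code waits on ee_suspend_event with a timeout and is more involved.

```cpp
// Hypothetical sketch of the wake-up decision; not the runtime's code.
enum class Wake { Timeout, GcStarted };
enum class Action { None, ChangeHeapCount, ChangeHeapCountThenGc, GcOnly };

// reason:         why the wait on the suspend event returned
// change_pending: whether a heap count change is wanted at all
Action on_wake(Wake reason, bool change_pending) {
    if (reason == Wake::Timeout) {
        // Common case: the timeout spaces this suspension away from the
        // previous GC's suspension.
        return change_pending ? Action::ChangeHeapCount : Action::None;
    }
    // A GC started before the timeout expired: fold the heap count change
    // into the suspension that the GC needs anyway, right before the GC.
    return change_pending ? Action::ChangeHeapCountThenGc : Action::GcOnly;
}
```

In both branches the heap count change happens before a GC begins rather than right after one, which is why it no longer collides with a BGC that is already in progress.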
Maoni0 authored Sep 18, 2023
1 parent 0ad4e69 commit e1ca02f
Showing 2 changed files with 434 additions and 262 deletions.