Fix G1 GC default IHOP #46169
Conversation
G1 GC was set up to use an `InitiatingHeapOccupancyPercent` of 75. This could leave used memory at a very high level for an extended duration, triggering the real memory circuit breaker even at low activity levels. The value is a threshold for old-generation usage relative to total heap size, and thus it should leave room for the new generation. The default in G1 is to allow up to 60 percent for the new generation, which could mean that the threshold was effectively at 135% heap usage. GC would still kick in, of course, and eventually enough mixed collections would take place for adaptive adjustment of IHOP to kick in. The JVM can set the IHOP adaptively, but this does not kick in until it has sampled a few collections. A newly started, relatively quiet server with primarily new-generation activity could thus frequently see heap usage above 95% for a while. The changes here are two-fold: 1. Use a 30% default for IHOP (the JVM default of 45 could still mean a 105% heap usage threshold and did not fully ensure the circuit breaker would not be hit at low activity). 2. Set G1ReservePercent=25. This is used by the adaptive IHOP mechanism, meaning old/mixed GC should kick in no later than at 75% heap. This ensures IHOP stays compatible with the real memory circuit breaker also after being adjusted by adaptive IHOP.
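The 135%, 105%, and 90% figures in the description can be sanity-checked with a little arithmetic. A minimal sketch, assuming the G1 default of up to 60% of heap for the young generation (`G1MaxNewSizePercent=60`); the function name is illustrative, not from the PR:

```python
# G1's default cap on young-generation size, as a percent of total heap
# (the JVM default for -XX:G1MaxNewSizePercent is 60).
G1_MAX_NEW_SIZE_PERCENT = 60

def worst_case_heap_usage(ihop_percent: int) -> int:
    """IHOP measures old-gen occupancy against *total* heap, so a fully
    grown young generation can sit on top of it before marking starts."""
    return ihop_percent + G1_MAX_NEW_SIZE_PERCENT

print(worst_case_heap_usage(75))  # old ES default -> 135 (% of heap)
print(worst_case_heap_usage(45))  # JVM default    -> 105
print(worst_case_heap_usage(30))  # new ES default -> 90, below the 95% breaker
```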
Pinging @elastic/es-core-infra
The following esrally command results in circuit breaking more than half the time:
Using the default G1 settings (which have IHOP 45) still results in circuit breaker failures in around 1/3 of the runs. Using the changed JVM settings from this PR does not result in any circuit breaking.
LGTM.
To validate that performance does not degrade with this change, I ran the geonames append-no-conflicts track with the original G1 settings (IHOP 75) and the new settings from this PR (on my local machine). This was run twice. Following are pair-wise comparisons of the two runs (baseline is the original, contender is the new settings). The total run time of baseline 1 was 4894 s. The Java versions used were:
@dliappis here are the results of comparing G1 using original (baseline) to new settings (contender) when using the PMC rally track:
I ran the test with both settings twice and did a pair-wise comparison. Total runtimes were: baseline 1: 891 s.
@henningandersen Thank you for running these. Judging by the names of the operations listed in the results in your last comment, you were using the pmc track as discussed earlier, right? From the results it looks to me like there's a very small difference in median indexing throughput with the IHOP change (~
@dliappis yes, sorry for not including it; this was run against PMC with a 40 percent
Regarding median indexing throughput, it looks like the four numbers are: baseline 1: 661.413
The numbers seem close enough to be considered equal. The cumulative indexing times also add up to near-identical numbers.
With this commit we align the G1-related settings to the new ES defaults that will be introduced with elastic/elasticsearch#46169. Relates #30 Relates elastic/elasticsearch#46169
Summary
We see a slight decrease in throughput metrics for a 4GB heap. Overall I'd argue that this change improves the situation especially w.r.t. effectiveness of the real-memory circuit breaker. Thanks for the change. LGTM
Detailed results
I've benchmarked several combinations on our nightly benchmarking environment. Baseline is always G1GC with the previous settings, the contender is G1GC with the settings from this PR. All benchmarks were run on OpenJDK 12 (build 12+32).
nyc_taxis
heap size: 4GB
We see a decrease in median indexing throughput from 79195 docs/s to 77990 docs/s (1.5% decrease).
I also noticed that the `range` query returns in both configurations with a varying but non-zero error rate (less than 100%), and `distance_amount_agg` always leads to a 100% error rate in both configurations (with CMS this works just fine). Digging further, this is due to the parent circuit breaker tripping.
On the positive side, `autohisto_agg` and `date_histogram_agg` now finish without any errors and performance is also on par with CMS (the baseline configuration had a 100% error rate).
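For context, the parent breaker referred to here is Elasticsearch's real-memory circuit breaker, which by default trips when total heap usage exceeds 95% of the heap (`indices.breaker.total.limit`, with `indices.breaker.total.use_real_memory` enabled). A toy illustration of why a high IHOP interacts badly with it on the 4 GB heap used in this benchmark; this is a sketch of the arithmetic, not the actual breaker code:

```python
HEAP_BYTES = 4 * 1024**3   # 4 GB heap, as in this benchmark
BREAKER_LIMIT = 0.95       # default parent breaker limit (fraction of heap)

def breaker_trips(used_bytes: int) -> bool:
    """The real-memory parent breaker trips when measured heap usage
    exceeds the configured fraction of total heap."""
    return used_bytes > BREAKER_LIMIT * HEAP_BYTES

# With IHOP 75, the old generation alone may reach 75% of heap before
# concurrent marking even starts; a modest young generation on top
# (well under its 60% cap) already pushes usage past the 95% limit.
old_gen = int(0.75 * HEAP_BYTES)
young_gen = int(0.25 * HEAP_BYTES)
print(breaker_trips(old_gen + young_gen))  # True
```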
In the GC logs we see an increase in the number of GC pauses by 55% (from 3183 pauses up to 4946). Total time in GC went up from 37:43 minutes to 43:28 minutes. The maximum pause time went down from 133ms to 90ms.
heap size: 16GB
Median indexing throughput is roughly identical (83576 docs/s for baseline vs. 83841 docs/s with the new configuration). Overall there were no errors returned in any of the configurations. The 99th percentile latency for `range` drops from 754 ms to 576 ms, and the 99th percentile latency for `autohisto_agg` drops from 550 ms to 497 ms. The other query latencies are roughly identical.
In the GC logs we see an increase in the number of GC pauses by 38% (from 721 pauses up to 995). Total time in GC went down from 37:59 minutes to 35:56 minutes. The maximum pause time went up from 100 ms to 174 ms.
pmc
heap size: 4GB
We see a decrease in median indexing throughput from 1272 docs/s to 1246 docs/s (2% decrease). Query latencies are roughly identical, but the 99th percentile latency for `scroll` went up from 1070 ms to 1243 ms (roughly a 16% increase).
In the GC logs we see an increase in the number of pauses by 49% (from 560 pauses up to 834). Total time in GC was roughly constant (16:22 minutes before and 16:32 minutes after). The maximum pause time went down from 43ms to 36ms.
Thanks for reviewing @danielmitterdorfer, @jasontedor and @dliappis.
Does it make sense to have this setting too? It would basically disable the AdaptiveIHOP feature. Otherwise, what's the point of setting IHOP in the first place? I'm debugging a CircuitBreaker issue with my cluster. We do have the settings mentioned above in our cluster and things were fine for quite some time; however, the CB tripping is back. We suspect a couple of things, and it is entirely possible that we are over-allocating and need to tweak something else (we've observed humongous allocations happening in our logs).
@fhalde I think not as a general recommendation, but for very specific scenarios it might make sense after careful benchmarking and tuning. The configured IHOP is just the starting value until the adaptive algorithm of G1 kicks in. I would be interested in seeing more details on this, like GC logs and information on the workload, but I would like to redirect you to the discuss forums for this investigation; we prefer to use GitHub only for confirmed bugs and enhancements. Feel free to mention me in the discuss posting (HenningAndersen).
Sure @henningandersen |