Add safeguards to prevent file cache over-subscription #7713
Conversation
Force-pushed from 986528e to 2002a00
Codecov Report
@@ Coverage Diff @@
## main #7713 +/- ##
============================================
+ Coverage 70.67% 70.69% +0.01%
- Complexity 56095 56143 +48
============================================
Files 4680 4680
Lines 266079 266122 +43
Branches 39074 39084 +10
============================================
+ Hits 188062 188137 +75
+ Misses 62029 62022 -7
+ Partials 15988 15963 -25
Signed-off-by: Kunal Kotwani <[email protected]>
Force-pushed from 2002a00 to dee5300
@@ -63,9 +65,10 @@ public class ClusterInfo implements ToXContentFragment, Writeable {
     public static final ClusterInfo EMPTY = new ClusterInfo();
     final Map<ShardRouting, String> routingToDataPath;
     final Map<NodeAndPath, ReservedSpace> reservedSpace;
+    final Map<String, FileCacheStats> nodeFileCacheStats;
In theory, the cache stats collected across all nodes could be pretty large, but we don't need the whole FileCacheStats; we basically just need a single long total out of it. What if we introduce a much smaller FileCacheUsage instead?

public class FileCacheUsage {
    final long total;
}
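Fleshing that suggestion out a little, a minimal Writeable version might look like the sketch below. The class shape follows the suggestion above; the import paths and serialization wiring are assumptions based on how other ClusterInfo components are written and may differ between versions.

import java.io.IOException;

import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Lightweight alternative to FileCacheStats when only the total cache size is needed.
public class FileCacheUsage implements Writeable {

    private final long total;

    public FileCacheUsage(long total) {
        this.total = total;
    }

    public FileCacheUsage(StreamInput in) throws IOException {
        this.total = in.readVLong();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        // Only a single long crosses the wire per node.
        out.writeVLong(total);
    }

    public long getTotal() {
        return total;
    }
}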
Sorry @reta! I missed this comment. That's fair; I was concerned about the unnecessary data transfer between nodes.
Let me try to cook up something more optimized.
@@ -154,6 +160,7 @@ public class RestoreService implements ClusterStateApplier {
     // It's OK to change some settings, but we shouldn't allow simply removing them
     private static final Set<String> UNREMOVABLE_SETTINGS;
+    private static final int REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO = 5;
I think this needs to be a cluster setting, with the default being "no limit". Otherwise it is a backwards-incompatible behavior change.
The concern I had with that was that it could lead to a messy situation with arbitrary limits set, leading to performance impact. But given the BWC concern and the fact that it's a power-user knob, it makes sense.
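For reference, a sketch of how such a knob could be registered as a dynamic cluster setting. The setting key, default, and bounds shown here are illustrative assumptions, not necessarily what was merged; a value of 0 is interpreted as "no limit" to preserve the existing behavior.

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical setting: how many times larger than the file cache the remote data may be.
public static final Setting<Integer> REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING = Setting.intSetting(
    "cluster.filecache.remote_data_ratio",   // illustrative key
    0,                                       // default: 0, i.e. no limit (BWC)
    0,                                       // minimum allowed value
    Property.Dynamic,
    Property.NodeScope
);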
long totalRestoredRemoteIndexesSize = 0;
for (IndexService indexService : indicesService) {
    if (indexService.getIndexSettings().isRemoteSnapshot()) {
        for (IndexShard indexShard : indexService) {
            if (indexShard.routingEntry().primary()) {
Should we move this to RoutingTable to make it more reusable for fetching all shards with particular settings, such as remote snapshot, segrep, remote store, etc.?
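A rough sketch of what a reusable helper might look like. It mirrors the IndicesService iteration in the diff above rather than RoutingTable itself, and the method name and placement are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

import org.opensearch.index.IndexService;
import org.opensearch.index.IndexSettings;
import org.opensearch.index.shard.IndexShard;
import org.opensearch.indices.IndicesService;

// Hypothetical helper: collect primary shards whose index settings match a predicate,
// so remote snapshot, segrep, and remote store callers can share the same loop.
static List<IndexShard> primaryShardsMatching(IndicesService indicesService, Predicate<IndexSettings> predicate) {
    List<IndexShard> shards = new ArrayList<>();
    for (IndexService indexService : indicesService) {
        if (predicate.test(indexService.getIndexSettings())) {
            for (IndexShard indexShard : indexService) {
                if (indexShard.routingEntry().primary()) {
                    shards.add(indexShard);
                }
            }
        }
    }
    return shards;
}

// Usage mirroring the original loop:
// List<IndexShard> remoteSnapshotPrimaries = primaryShardsMatching(indicesService, IndexSettings::isRemoteSnapshot);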
Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.
We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the threshold is breached. With only the decider, we would have the scenario where the shard is unassigned, and the user would need an additional step of calling explain to figure out what went wrong, but it would help get rid of the transport calls across nodes.
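For context, a decider along those lines would roughly take the shape below. The class name, placeholder sizes, and decision messages are illustrative assumptions, not an existing decider; as noted above, the trade-off is that a decider alone leaves the shard unassigned instead of failing the restore request up front.

import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

// Hypothetical decider that vetoes placing a remote snapshot shard on a node whose
// file cache would become over-subscribed, in the spirit of DiskThresholdDecider.
public class FileCacheOverSubscriptionDecider extends AllocationDecider {

    private static final String NAME = "file_cache_over_subscription";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // Placeholders: a real implementation would read these from ClusterInfo / node stats.
        long projectedRemoteDataBytes = 0L;
        long fileCacheCapacityBytes = Long.MAX_VALUE;
        if (projectedRemoteDataBytes > fileCacheCapacityBytes) {
            return allocation.decision(Decision.NO, NAME, "remote data would over-subscribe the file cache");
        }
        return allocation.decision(Decision.YES, NAME, "file cache has sufficient headroom");
    }
}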
A few questions:
I am all in for pre-emptive blocking, but I am just worried whether that alone is good enough to prevent over-subscription and unassigned shards.
That's a good point. It is possible to trigger concurrent restores without accounting for the shard info of a parallel, already-accepted restore, since those shards might not have been retrieved yet.
The allocation logic does a simple balanced shard-count-based allocation, and in theory it is possible for some nodes to end up with a few more shards/hot shards than others.
Please correct me if I am wrong, but the
Thanks for the feedback @reta and @Bukhtawar.
@kotwanikunal I think the existing behavior (for normal indexes) for parallel restores is that there is no protection against filling up the disk, and eventually you'll end up with yellow or red indexes if there isn't enough disk space. Is that right? Assuming so, I'd hold off on making changes to the
The scope of this PR has become too large. Breaking it down into smaller chunks for easier reviews.
I will quickly follow up with another PR once the above gets merged. That will include:
Description
sum(shards to be restored) + sum(shards restored as searchable snapshot indices) < 5 * (total cache size)
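A hedged sketch of how that guard could be evaluated before accepting a restore is shown below. The method, exception type, and parameter names are illustrative stand-ins for whatever the PR actually wires up; the logic simply checks the inequality above.

// Hypothetical pre-restore validation of the inequality above.
static void validateFileCacheHeadroom(long shardsToRestoreBytes,
                                      long restoredSearchableSnapshotBytes,
                                      long totalFileCacheBytes,
                                      int remoteDataToFileCacheSizeRatio) {
    long projectedRemoteDataBytes = shardsToRestoreBytes + restoredSearchableSnapshotBytes;
    long allowedBytes = (long) remoteDataToFileCacheSizeRatio * totalFileCacheBytes;
    if (projectedRemoteDataBytes >= allowedBytes) {
        throw new IllegalArgumentException(
            "Restore would over-subscribe the file cache: projected remote data ["
                + projectedRemoteDataBytes + "] bytes is not below [" + remoteDataToFileCacheSizeRatio
                + "x] the total file cache size [" + totalFileCacheBytes + "] bytes"
        );
    }
}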
Related Issues
Resolves #7033
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.