
Add safeguards to prevent file cache over-subscription #7713

Conversation

kotwanikunal
Member

Description

  • Adds safeguards to prevent file cache over-subscription
  • It fetches the cached file cache stats from all the nodes to calculate the total file cache size across nodes, takes into account the shard sizes of the indexes currently restored as searchable snapshots, and adds the total size of the shards that will be restored by the current request
  • The request is accepted only if: sum(shards to be restored) + sum(shards restored as searchable snapshot indices) < 5 * (total cache size) (see the sketch below)
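
A minimal, hypothetical sketch of that validation follows; the class, method names, and the hard-coded ratio of 5 are illustrative assumptions, not the exact code in this PR:

// Hypothetical sketch of the over-subscription check described above; names
// and the hard-coded ratio are illustrative only.
final class FileCacheOversubscriptionCheck {

    // Remote data may exceed the total file cache size by at most this factor.
    private static final int REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO = 5;

    /**
     * @param totalFileCacheBytes       sum of the file cache sizes reported by all nodes
     * @param restoredRemoteIndexBytes  total size of shards already restored as searchable snapshots
     * @param requestedRestoreBytes     total size of shards in the current restore request
     * @return true if accepting the request would over-subscribe the file cache
     */
    static boolean wouldOversubscribe(long totalFileCacheBytes,
                                      long restoredRemoteIndexBytes,
                                      long requestedRestoreBytes) {
        long projectedRemoteDataBytes = restoredRemoteIndexBytes + requestedRestoreBytes;
        return projectedRemoteDataBytes >= REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO * totalFileCacheBytes;
    }
}

Applied on the request path, the restore would be rejected up front whenever this check returns true.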

Related Issues

Resolves #7033

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreRefreshListenerIT.testRemoteRefreshRetryOnFailure

@codecov

codecov bot commented May 24, 2023

Codecov Report

Merging #7713 (dee5300) into main (8e2d059) will increase coverage by 0.01%.
The diff coverage is 69.38%.

@@             Coverage Diff              @@
##               main    #7713      +/-   ##
============================================
+ Coverage     70.67%   70.69%   +0.01%     
- Complexity    56095    56143      +48     
============================================
  Files          4680     4680              
  Lines        266079   266122      +43     
  Branches      39074    39084      +10     
============================================
+ Hits         188062   188137      +75     
+ Misses        62029    62022       -7     
+ Partials      15988    15963      -25     
Impacted Files Coverage Δ
...ster/snapshots/restore/RestoreSnapshotRequest.java 71.74% <ø> (-1.35%) ⬇️
server/src/main/java/org/opensearch/node/Node.java 86.13% <ø> (-0.02%) ⬇️
.../main/java/org/opensearch/cluster/ClusterInfo.java 59.42% <62.50%> (-0.13%) ⬇️
.../java/org/opensearch/snapshots/RestoreService.java 57.47% <64.70%> (+2.78%) ⬆️
...opensearch/cluster/InternalClusterInfoService.java 77.68% <100.00%> (+2.01%) ⬆️

... and 465 files with indirect coverage changes

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighCpu

@@ -63,9 +65,10 @@ public class ClusterInfo implements ToXContentFragment, Writeable {
    public static final ClusterInfo EMPTY = new ClusterInfo();
    final Map<ShardRouting, String> routingToDataPath;
    final Map<NodeAndPath, ReservedSpace> reservedSpace;
    final Map<String, FileCacheStats> nodeFileCacheStats;
@reta
Collaborator

In theory, the cache size (across all nodes) could be pretty large, but we don't need the whole FileCacheStats; we basically just need a single long total out of it. What if we introduce a much smaller FileCacheUsage instead:

public class FileCacheUsage {
    final long total;
}

?

@kotwanikunal
Member Author

Sorry @reta! I missed this comment. That's fair. I was concerned about the unnecessary data transfer between nodes.
Let me try to cook up something more optimized.
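
For reference, a minimal sketch of what such a slimmed-down class could look like, assuming OpenSearch's Writeable/StreamInput/StreamOutput serialization API (package locations vary across versions, and this is not the code from this PR):

import java.io.IOException;
import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Sketch only: carries just the single long needed for the check instead of
// the full FileCacheStats, reducing what is serialized between nodes.
public class FileCacheUsage implements Writeable {

    private final long total;

    public FileCacheUsage(long total) {
        this.total = total;
    }

    public FileCacheUsage(StreamInput in) throws IOException {
        this.total = in.readVLong();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(total);
    }

    public long getTotal() {
        return total;
    }
}

On the wire this serializes a single variable-length long per node instead of the full FileCacheStats payload.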

@@ -154,6 +160,7 @@ public class RestoreService implements ClusterStateApplier {

    // It's OK to change some settings, but we shouldn't allow simply removing them
    private static final Set<String> UNREMOVABLE_SETTINGS;
    private static final int REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO = 5;
Member

I think this needs to be a cluster setting, with a default of "no limit". Otherwise it is a backwards-incompatible behavior change.

@kotwanikunal
Member Author

The concern I had with that was that it could lead to a messy situation with arbitrary limits set, leading to performance impact. But given the BWC concern and the fact that it's a power-user knob, it makes sense.
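
A rough sketch of how the ratio could be exposed as a dynamic cluster setting using OpenSearch's Setting API; the setting key, the default of 0 (read as "no limit" to stay backwards compatible), and the class name are assumptions for illustration, not the final implementation:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

public final class FileCacheSettings {
    // Illustrative: key name and default are assumptions, not the final setting.
    // A default of 0 is interpreted as "no limit", preserving existing behavior.
    public static final Setting<Double> DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING = Setting.doubleSetting(
        "cluster.filecache.remote_data_ratio",
        0.0,               // default: no limit (backwards compatible)
        0.0,               // minimum allowed value
        Property.NodeScope,
        Property.Dynamic
    );
}

Registered as a dynamic setting, the ratio could be adjusted at runtime, and a value of 0 would skip the check entirely.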

Comment on lines +846 to +850
long totalRestoredRemoteIndexesSize = 0;
for (IndexService indexService : indicesService) {
    if (indexService.getIndexSettings().isRemoteSnapshot()) {
        for (IndexShard indexShard : indexService) {
            if (indexShard.routingEntry().primary()) {
Collaborator

Should we move this to RoutingTable to ensure this is more re-usable for fetching all shards with a particular setting, like remote snapshot, segrep, remote store, etc.?

@Bukhtawar
Collaborator

Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.

@kotwanikunal
Member Author

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.

We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.

With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.

@andrross @reta thoughts?

@Bukhtawar
Collaborator

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.
>
> We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.
>
> With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.
>
> @andrross @reta thoughts?

Few questions

  1. Is it possible to have concurrent restore requests across indices? In that case, would in-flight restores bypass these checks?
  2. Is it possible for some nodes to be more over-subscribed than others? In that case, it might still be possible for shards to end up unassigned.

I am all in for pre-emptive blocking, but I am just worried whether that alone is good enough to prevent over-subscription and unassigned shards.

@kotwanikunal
Member Author

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.
>
> We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.
> With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.
> @andrross @reta thoughts?

> Few questions

> 1. Is it possible to have concurrent restore requests across indices? In that case, would in-flight restores bypass these checks?

That's a good point. It is possible to trigger concurrent restores where the check does not account for the shard info of a parallel, already-accepted restore, since those shards might not be retrievable yet.

> 2. Is it possible for some nodes to be more over-subscribed than others? In that case, it might still be possible for shards to end up unassigned.

The allocation logic does a simple balanced shard-count-based allocation, and in theory, it is possible for some nodes to end up with a few more shards/hot shards than others.

> I am all in for pre-emptive blocking, but I am just worried whether that alone is good enough to prevent over-subscription and unassigned shards.

@reta
Collaborator

reta commented Jun 12, 2023

> The allocation logic does a simple balanced shard-count-based allocation, and in theory, it is possible for some nodes to end up with a few more shards/hot shards than others.

Please correct me if I am wrong, but the DiskThresholdDecider we have now does not account for the REMOTE_STORE recovery type. If that is the case, I tend to agree with @Bukhtawar that it would make sense to factor that in. Maybe my thinking is not right, but remote snapshots are on-demand anyway; however, the amount of reserved disk space required would be bounded by min(file cache, sum of all remote snapshot shards). We could probably cap the over-subscription at 50% max, since not all remote shards might be requested, but keeping shards unassigned would clearly indicate that the cluster is heavily over-provisioned.

@kotwanikunal
Member Author

Thanks for the feedback @reta and @Bukhtawar.
I plan to update the solution with the following approach -

  • Add the shard(s) restore size to the cluster state using the RestoreInProgress entry and read it at restore time to take care of parallel restore requests.
  • Add logic within the allocator/decider (DiskThreshold or TBD) to bake an additional barrier into the shard allocation process to ensure the cluster cannot be over-subscribed.

@andrross
Member

> I plan to update the solution with the following approach -
>
> • Add the shard(s) restore size to the cluster state using the RestoreInProgress entry and read it at restore time to take care of parallel restore requests.
>
> • Add logic within the allocator/decider (DiskThreshold or TBD) to bake an additional barrier into the shard allocation process to ensure the cluster cannot be over-subscribed.

@kotwanikunal I think the existing behavior (for normal indexes) for parallel restores is that there is no protection against filling up the disk, and eventually you'll end up with yellow or red indexes if there isn't enough disk space. Is that right? Assuming so, I'd hold off on making changes to the RestoreInProgress entry. That does sound like a good thing to do, but we should probably do it as a separate PR/issue since it should be implemented as a more general protection for any type of snapshot restore. I would recommend starting with the second point (Add logic within the allocator/decider...) for this task.
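
For context, a decider along the lines discussed would follow OpenSearch's AllocationDecider contract (canAllocate(ShardRouting, RoutingNode, RoutingAllocation) returning a Decision). The sketch below is purely illustrative: the class name, constructor, and the helper methods for looking up file cache and shard sizes are hypothetical placeholders, not the implementation from this PR or #8208.

import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

// Sketch of the decider-based guard discussed above. How the node's file cache
// size and the size of already-assigned remote shards are obtained is assumed
// here (hypothetical helpers), not taken from this PR.
public class FileCacheThresholdDecider extends AllocationDecider {

    private static final String NAME = "file_cache_threshold";
    private final int remoteDataToFileCacheRatio;

    public FileCacheThresholdDecider(int remoteDataToFileCacheRatio) {
        this.remoteDataToFileCacheRatio = remoteDataToFileCacheRatio;
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        long fileCacheSize = fileCacheSizeOnNode(node, allocation);        // hypothetical lookup
        long remoteShardBytes = remoteShardBytesOnNode(node, allocation);  // hypothetical lookup
        long shardSize = expectedShardSize(shardRouting, allocation);      // hypothetical lookup

        if (remoteShardBytes + shardSize > (long) remoteDataToFileCacheRatio * fileCacheSize) {
            return allocation.decision(Decision.NO, NAME,
                "allocating this shard would over-subscribe the node's file cache");
        }
        return allocation.decision(Decision.YES, NAME, "file cache has sufficient headroom");
    }

    // Placeholders for whatever stats source the real implementation uses
    // (e.g. ClusterInfo / node file cache stats).
    private long fileCacheSizeOnNode(RoutingNode node, RoutingAllocation allocation) { return 0L; }
    private long remoteShardBytesOnNode(RoutingNode node, RoutingAllocation allocation) { return 0L; }
    private long expectedShardSize(ShardRouting shard, RoutingAllocation allocation) { return 0L; }
}

The trade-off discussed above still applies: a NO decision leaves the shard unassigned rather than rejecting the restore request up front.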

@kotwanikunal
Member Author

The scope of this PR has become too large. Breaking it down into smaller chunks for easier reviews.
The first half is at #8208 which takes care of the following -

  • Introduce file cache stats in the ClusterInfo response
  • Add logic for an individual decider to check the node-level threshold

I will quickly follow up with another PR once the above gets merged. That will have -

  • RestoreService related changes from this PR
  • Configuring the ratio value as a setting instead of a constant
  • Additional integ tests


Successfully merging this pull request may close these issues.

[Searchable Remote Index] Add safeguards to ensure a cluster cannot be over-subscribed