
Add safeguards to prevent file cache over-subscription #7713

Conversation

kotwanikunal
Member

Description

  • Adds safeguards to prevent file cache over-subscription
  • It fetches the cached file cache stats from all the nodes to calculate the total file cache size across nodes, takes into account the shard sizes of the indexes currently restored as searchable snapshots, and adds the total size of the shards that will be restored by the current request
  • The request is accepted only if: sum(shards to be restored) + sum(shards restored as searchable snapshot indices) < 5 * (total cache size) (see the sketch below)
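
A minimal, hypothetical sketch of that validation follows; the class, method names, and the hard-coded ratio of 5 are illustrative assumptions, not the exact code in this PR:

// Hypothetical sketch of the over-subscription check described above; names
// and the hard-coded ratio are illustrative only.
final class FileCacheOversubscriptionCheck {

    // Remote data may exceed the total file cache size by at most this factor.
    private static final int REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO = 5;

    /**
     * @param totalFileCacheBytes       sum of the file cache sizes reported by all nodes
     * @param restoredRemoteIndexBytes  total size of shards already restored as searchable snapshots
     * @param requestedRestoreBytes     total size of shards in the current restore request
     * @return true if accepting the request would over-subscribe the file cache
     */
    static boolean wouldOversubscribe(long totalFileCacheBytes,
                                      long restoredRemoteIndexBytes,
                                      long requestedRestoreBytes) {
        long projectedRemoteDataBytes = restoredRemoteIndexBytes + requestedRestoreBytes;
        return projectedRemoteDataBytes >= REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO * totalFileCacheBytes;
    }
}

Applied on the request path, the restore would be rejected up front whenever this check returns true.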

Related Issues

Resolves #7033

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteStoreRefreshListenerIT.testRemoteRefreshRetryOnFailure

@codecov

codecov bot commented May 24, 2023

Codecov Report

Merging #7713 (dee5300) into main (8e2d059) will increase coverage by 0.01%.
The diff coverage is 69.38%.

@@             Coverage Diff              @@
##               main    #7713      +/-   ##
============================================
+ Coverage     70.67%   70.69%   +0.01%     
- Complexity    56095    56143      +48     
============================================
  Files          4680     4680              
  Lines        266079   266122      +43     
  Branches      39074    39084      +10     
============================================
+ Hits         188062   188137      +75     
+ Misses        62029    62022       -7     
+ Partials      15988    15963      -25     
Impacted Files Coverage Δ
...ster/snapshots/restore/RestoreSnapshotRequest.java 71.74% <ø> (-1.35%) ⬇️
server/src/main/java/org/opensearch/node/Node.java 86.13% <ø> (-0.02%) ⬇️
.../main/java/org/opensearch/cluster/ClusterInfo.java 59.42% <62.50%> (-0.13%) ⬇️
.../java/org/opensearch/snapshots/RestoreService.java 57.47% <64.70%> (+2.78%) ⬆️
...opensearch/cluster/InternalClusterInfoService.java 77.68% <100.00%> (+2.01%) ⬆️

... and 465 files with indirect coverage changes

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighCpu

@@ -63,9 +65,10 @@ public class ClusterInfo implements ToXContentFragment, Writeable {
    public static final ClusterInfo EMPTY = new ClusterInfo();
    final Map<ShardRouting, String> routingToDataPath;
    final Map<NodeAndPath, ReservedSpace> reservedSpace;
    final Map<String, FileCacheStats> nodeFileCacheStats;
@reta
Collaborator

In theory, the cache size (across all nodes) could be pretty large, but we don't need the whole FileCacheStats; we basically just need a single long total out of it. What if we introduce a much smaller FileCacheUsage instead:

public class FileCacheUsage {
    final long total;
}

?

@kotwanikunal
Member Author

Sorry @reta! I missed this comment. That's fair. I was concerned about the unnecessary data transfer between nodes.
Let me try to cook up something more optimized.
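
For reference, a minimal sketch of what such a slimmed-down class could look like, assuming OpenSearch's Writeable/StreamInput/StreamOutput serialization API (package locations vary across versions, and this is not the code from this PR):

import java.io.IOException;
import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Sketch only: carries just the single long needed for the check instead of
// the full FileCacheStats, reducing what is serialized between nodes.
public class FileCacheUsage implements Writeable {

    private final long total;

    public FileCacheUsage(long total) {
        this.total = total;
    }

    public FileCacheUsage(StreamInput in) throws IOException {
        this.total = in.readVLong();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(total);
    }

    public long getTotal() {
        return total;
    }
}

On the wire this serializes a single variable-length long per node instead of the full FileCacheStats payload.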

@@ -154,6 +160,7 @@ public class RestoreService implements ClusterStateApplier {

    // It's OK to change some settings, but we shouldn't allow simply removing them
    private static final Set<String> UNREMOVABLE_SETTINGS;
    private static final int REMOTE_DATA_TO_FILE_CACHE_SIZE_RATIO = 5;
Member

I think this needs to be a cluster setting, with a default of "no limit". Otherwise it is a backwards-incompatible behavior change.

@kotwanikunal
Member Author

The concern I had with that was that it could lead to a messy situation with arbitrary limits set, leading to performance impact. But given the BWC concern and the fact that it's a power-user knob, it makes sense.
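
A rough sketch of how the ratio could be exposed as a dynamic cluster setting using OpenSearch's Setting API; the setting key, the default of 0 (read as "no limit" to stay backwards compatible), and the class name are assumptions for illustration, not the final implementation:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

public final class FileCacheSettings {
    // Illustrative: key name and default are assumptions, not the final setting.
    // A default of 0 is interpreted as "no limit", preserving existing behavior.
    public static final Setting<Double> DATA_TO_FILE_CACHE_SIZE_RATIO_SETTING = Setting.doubleSetting(
        "cluster.filecache.remote_data_ratio",
        0.0,               // default: no limit (backwards compatible)
        0.0,               // minimum allowed value
        Property.NodeScope,
        Property.Dynamic
    );
}

Registered as a dynamic setting, the ratio could be adjusted at runtime, and a value of 0 would skip the check entirely.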

Comment on lines +846 to +850
long totalRestoredRemoteIndexesSize = 0;
for (IndexService indexService : indicesService) {
    if (indexService.getIndexSettings().isRemoteSnapshot()) {
        for (IndexShard indexShard : indexService) {
            if (indexShard.routingEntry().primary()) {
Collaborator

Should we move this to RoutingTable to ensure this is more re-usable for fetching all shards with a particular setting, like remote snapshot, segrep, remote store, etc.?

@Bukhtawar
Collaborator

Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.

@kotwanikunal
Member Author

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.

We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.

With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.

@andrross @reta thoughts?

@Bukhtawar
Collaborator

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.
>
> We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.
>
> With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.
>
> @andrross @reta thoughts?

Few questions

  1. Is it possible to have concurrent restore requests across indices? In that case, would in-flight restores bypass these checks?
  2. Is it possible for some nodes to be more over-subscribed than others? In that case, it might still be possible for shards to end up unassigned.

I am all in for pre-emptive blocking, but I am just worried whether that alone is good enough to prevent over-subscription and unassigned shards.

@kotwanikunal
Member Author

> Is this something that should be placed in an allocation decider, more like DiskThresholdAllocationDecider, that factors in the stats of the local node on which the restore is to start? Not sure if one already exists.
>
> We do have some deciders specific to remote shards, and we can surely add another one for this check. The intention of this issue and the PR is to prevent restores altogether if the request breaches the threshold.
> With only the decider, we would end up with unassigned shards, and the user would need the additional step of calling allocation explain to figure out what went wrong; on the other hand, it would get rid of the transport calls across nodes.
> @andrross @reta thoughts?

> Few questions

> 1. Is it possible to have concurrent restore requests across indices? In that case, would in-flight restores bypass these checks?

That's a good point. It is possible to trigger concurrent restores where the check does not account for the shard info of a parallel, already-accepted restore, since those shards might not be retrievable yet.

> 2. Is it possible for some nodes to be more over-subscribed than others? In that case, it might still be possible for shards to end up unassigned.

The allocation logic does a simple balanced shard-count-based allocation, and in theory, it is possible for some nodes to end up with a few more shards/hot shards than others.

> I am all in for pre-emptive blocking, but I am just worried whether that alone is good enough to prevent over-subscription and unassigned shards.

@reta
Collaborator

reta commented Jun 12, 2023

> The allocation logic does a simple balanced shard-count-based allocation, and in theory, it is possible for some nodes to end up with a few more shards/hot shards than others.

Please correct me if I am wrong, but the DiskThresholdDecider we have now does not account for the REMOTE_STORE recovery type. If that is the case, I tend to agree with @Bukhtawar that it would make sense to factor that in. Maybe my thinking is not right, but remote snapshots are on-demand anyway; however, the amount of reserved disk space required would be bounded by min(file cache, sum of all remote snapshot shards). We could probably cap the over-subscription at 50% max, since not all remote shards might be requested, but keeping shards unassigned would clearly indicate that the cluster is heavily over-provisioned.

@kotwanikunal
Member Author

Thanks for the feedback @reta and @Bukhtawar.
I plan to update the solution with the following approach -

  • Add the shard(s) restore size to the cluster state using the RestoreInProgress entry and read it at restore time to take care of parallel restore requests.
  • Add logic within the allocator/decider (DiskThreshold or TBD) to bake an additional barrier into the shard allocation process to ensure the cluster cannot be over-subscribed.

@andrross
Member

> I plan to update the solution with the following approach -
>
> • Add the shard(s) restore size to the cluster state using the RestoreInProgress entry and read it at restore time to take care of parallel restore requests.
>
> • Add logic within the allocator/decider (DiskThreshold or TBD) to bake an additional barrier into the shard allocation process to ensure the cluster cannot be over-subscribed.

@kotwanikunal I think the existing behavior (for normal indexes) for parallel restores is that there is no protection against filling up the disk, and eventually you'll end up with yellow or red indexes if there isn't enough disk space. Is that right? Assuming so, I'd hold off on making changes to the RestoreInProgress entry. That does sound like a good thing to do, but we should probably do it as a separate PR/issue since it should be implemented as a more general protection for any type of snapshot restore. I would recommend starting with the second point (Add logic within the allocator/decider...) for this task.
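
For context, a decider along the lines discussed would follow OpenSearch's AllocationDecider contract (canAllocate(ShardRouting, RoutingNode, RoutingAllocation) returning a Decision). The sketch below is purely illustrative: the class name, constructor, and the helper methods for looking up file cache and shard sizes are hypothetical placeholders, not the implementation from this PR or #8208.

import org.opensearch.cluster.routing.RoutingNode;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;
import org.opensearch.cluster.routing.allocation.decider.AllocationDecider;
import org.opensearch.cluster.routing.allocation.decider.Decision;

// Sketch of the decider-based guard discussed above. How the node's file cache
// size and the size of already-assigned remote shards are obtained is assumed
// here (hypothetical helpers), not taken from this PR.
public class FileCacheThresholdDecider extends AllocationDecider {

    private static final String NAME = "file_cache_threshold";
    private final int remoteDataToFileCacheRatio;

    public FileCacheThresholdDecider(int remoteDataToFileCacheRatio) {
        this.remoteDataToFileCacheRatio = remoteDataToFileCacheRatio;
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        long fileCacheSize = fileCacheSizeOnNode(node, allocation);        // hypothetical lookup
        long remoteShardBytes = remoteShardBytesOnNode(node, allocation);  // hypothetical lookup
        long shardSize = expectedShardSize(shardRouting, allocation);      // hypothetical lookup

        if (remoteShardBytes + shardSize > (long) remoteDataToFileCacheRatio * fileCacheSize) {
            return allocation.decision(Decision.NO, NAME,
                "allocating this shard would over-subscribe the node's file cache");
        }
        return allocation.decision(Decision.YES, NAME, "file cache has sufficient headroom");
    }

    // Placeholders for whatever stats source the real implementation uses
    // (e.g. ClusterInfo / node file cache stats).
    private long fileCacheSizeOnNode(RoutingNode node, RoutingAllocation allocation) { return 0L; }
    private long remoteShardBytesOnNode(RoutingNode node, RoutingAllocation allocation) { return 0L; }
    private long expectedShardSize(ShardRouting shard, RoutingAllocation allocation) { return 0L; }
}

The trade-off discussed above still applies: a NO decision leaves the shard unassigned rather than rejecting the restore request up front.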

@kotwanikunal
Member Author

The scope of this PR has become too large. Breaking it down into smaller chunks for easier reviews.
The first half is at #8208 which takes care of the following -

  • Introduce file cache stats in the ClusterInfo response
  • Add logic for an individual decider to check the node-level threshold

I will quickly follow up with another PR once the above gets merged. That will have -

  • RestoreService related changes from this PR
  • Configuring the ratio value as a setting instead of a constant
  • Additional integ tests


Successfully merging this pull request may close these issues.

[Searchable Remote Index] Add safeguards to ensure a cluster cannot be over-subscribed