Sync up snapshot shard status on a master restart #11450

imotov · 2015-06-02T00:10:50Z

When a snapshot operation on a particular shard finishes, the data node where this shard resides sends an update shard status request to the master node to indicate that the operation on the shard is done. When the master node receives the request, it queues a cluster state update task and acknowledges receipt of the request to the data node.

The update snapshot shard status tasks have relatively low priority, so during cluster instability they tend to get stuck at the end of the queue. If the master node is restarted before processing these tasks, the information about the shards can be lost: the new master assumes that the shards are still in progress, while the data node thinks that they are already done.

This commit adds a retry mechanism that compares the cluster state of a newly elected master with the current state of snapshot shards and updates the cluster state on the master again if needed.
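The comparison described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class `SnapshotShardSync`, the enums `LocalStage` and `MasterState`, the `Update` type, and the use of plain strings as shard ids are all hypothetical stand-ins for the real Elasticsearch types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the retry logic; none of these names are Elasticsearch APIs.
final class SnapshotShardSync {

    /** Stage of a shard snapshot as tracked locally on the data node (assumed subset). */
    enum LocalStage { STARTED, DONE, FAILURE }

    /** State of a shard snapshot as recorded in the master's cluster state (assumed subset). */
    enum MasterState { STARTED, SUCCESS, FAILED }

    /** A shard status update that the data node would send to the master again. */
    static final class Update {
        final String shardId;
        final MasterState newState;

        Update(String shardId, MasterState newState) {
            this.shardId = shardId;
            this.newState = newState;
        }
    }

    /**
     * Compares the new master's view of the shards with the data node's local view
     * and returns the updates that appear to have been lost and need to be resent.
     */
    static List<Update> updatesToResend(Map<String, MasterState> masterView,
                                        Map<String, LocalStage> localView) {
        List<Update> updates = new ArrayList<>();
        for (Map.Entry<String, MasterState> entry : masterView.entrySet()) {
            if (entry.getValue() != MasterState.STARTED) {
                continue; // the master already knows the final state of this shard
            }
            LocalStage local = localView.get(entry.getKey());
            if (local == null) {
                continue; // this shard is not being snapshotted on this node
            }
            switch (local) {
                case DONE:
                    updates.add(new Update(entry.getKey(), MasterState.SUCCESS));
                    break;
                case FAILURE:
                    updates.add(new Update(entry.getKey(), MasterState.FAILED));
                    break;
                default:
                    break; // still running locally, nothing to resend yet
            }
        }
        return updates;
    }
}
```

In this sketch the data node would call `updatesToResend` whenever it observes a cluster state published by a newly elected master, then send each returned update through the normal update shard status request.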

Closes #11314

s1monw · 2015-06-03T10:08:45Z

src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

@@ -830,7 +836,17 @@ private void processIndexShardSnapshots(SnapshotMetaData snapshotMetaData) {
                        for (Map.Entry<ShardId, SnapshotMetaData.ShardSnapshotStatus> shard : entry.shards().entrySet()) {
                            IndexShardSnapshotStatus snapshotStatus = snapshotShards.shards.get(shard.getKey());
                            if (snapshotStatus != null) {
-                                snapshotStatus.abort();
+                                if (snapshotStatus.stage() == IndexShardSnapshotStatus.Stage.STARTED) {


Can we use a switch statement for this? It seems easier to read.
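A rough sketch of what the suggested switch could look like is below. The diff hunk only shows the `STARTED` check, so the other branches are assumptions made for illustration; the class `AbortedShardHandling`, the stand-in `Stage` enum, and the action strings are hypothetical and not part of the PR.

```java
// Hypothetical sketch of dispatching on the local shard snapshot stage with a switch.
final class AbortedShardHandling {

    /** Stand-in for IndexShardSnapshotStatus.Stage; the exact set of values is assumed. */
    enum Stage { INIT, STARTED, FINALIZE, DONE, FAILURE }

    /** Returns a label for the action to take when the master marks the snapshot as aborted. */
    static String actionFor(Stage stage) {
        switch (stage) {
            case STARTED:
                return "abort";          // still running: abort the local snapshot
            case DONE:
                return "resend-success"; // assumed: tell the master the shard has finished
            case FAILURE:
                return "resend-failure"; // assumed: tell the master the shard has failed
            default:
                return "none";           // other stages: nothing to do in this sketch
        }
    }
}
```

Compared with a chain of `if` / `else if` checks on `snapshotStatus.stage()`, the switch puts every stage and its handling in one place, which is presumably the readability gain the comment has in mind.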

s1monw · 2015-06-03T10:13:06Z

Left a minor comment; LGTM otherwise.

imotov added the >bug, v2.0.0-beta1, review, :Distributed/Snapshot/Restore, and v1.6.0 labels Jun 2, 2015

s1monw reviewed Jun 3, 2015

imotov force-pushed the issue-11314-update-snapshot-shards-on-master-change branch from 5286c18 to f0e6add Compare June 3, 2015 19:14

imotov merged commit f0e6add into elastic:master Jun 3, 2015

imotov removed the review label Jun 3, 2015

clintongormley changed the title ~~Snapshot/Restore: sync up snapshot shard status on a master restart~~ Sync up snapshot shard status on a master restart Jun 8, 2015

imotov mentioned this pull request Aug 20, 2015

Snapshot/Restore: snapshot during rolling restart of a 2 node cluster might get stuck #9924

Closed

allthedrones mentioned this pull request Jan 6, 2016

Snapshot operation stuck, delete command doesn't work #10564

Closed

imotov mentioned this pull request Jan 7, 2016

Broken test? imotov/elasticsearch-snapshot-cleanup#5

Closed

imotov mentioned this pull request Dec 4, 2016

Need this utility for ES 2.3.1 imotov/elasticsearch-snapshot-cleanup#6

Closed

desagar mentioned this pull request Dec 6, 2016

Snapshot in ABORTED state after rolling restart of nodes #22000

Closed

imotov deleted the issue-11314-update-snapshot-shards-on-master-change branch May 1, 2020 22:13
