
Snapshot in ABORTED state after rolling restart of nodes #22000

Closed

desagar opened this issue Dec 6, 2016 · 7 comments

desagar commented Dec 6, 2016

Elasticsearch version: 2.3.1

Plugins installed: [a custom repository plugin]

JVM version: 1.8.0_101

OS version: Oracle Enterprise Linux 6 with Redhat kernel

Description of the problem including expected versus actual behavior:
We have a 2-node Elasticsearch cluster with a custom repository plugin installed for storing Elasticsearch snapshots. The plugin has a bug that occasionally causes it to hang indefinitely while waiting for a connection to the back-end store for our snapshots. When this happened, we performed a rolling restart of the Elasticsearch cluster to clear the hanging thread. After the restart, we ended up in a state where the snapshot is in ABORTED status according to the ES cluster state. However, when querying the snapshot using the snapshot API, it reports that the snapshot is still in progress. As a result we are unable to take any further snapshots.
According to this link, snapshots in ABORTED status should be cleaned up when the master node is restarted.

Steps to reproduce:
Working on a reproducer - will provide once I have one.

Provide logs (if relevant):
Please see attached files of cluster state and snapshot status.
snapshot_status.txt : output of /_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true
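
For reference, the two views can be compared with something like this (a sketch assuming the node listens on the default localhost:9200; adjust host and port as needed):

```
# snapshot API view: reports the snapshot as still in progress
curl -s 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true'

# cluster state view: shows the same snapshot as ABORTED
curl -s 'http://localhost:9200/_cluster/state?pretty=true'
```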

abeyad commented Dec 6, 2016

When retrieving the snapshot status, that action looks in the cluster state and retrieves the current snapshots. So it's very strange that the snapshot status is showing an IN_PROGRESS snapshot while the cluster state is showing the snapshot as ABORTED. Did you get the cluster state from the non-master node? (i.e., did you run /_cluster/state?local=true?) Also, I'm assuming you retrieved the cluster state and the snapshot status after the rolling restart had completed?
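
For clarity, the difference between the two requests (a sketch, assuming the default endpoint):

```
# default: the request is resolved against the elected master's
# view of the cluster state
curl -s 'http://localhost:9200/_cluster/state?pretty=true'

# local=true: the node that receives the request answers from its
# own local copy of the cluster state instead
curl -s 'http://localhost:9200/_cluster/state?local=true&pretty=true'
```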

desagar commented Dec 6, 2016

I retrieved both cluster state and snapshot status after the rolling restart fully completed and the cluster health was back to green.
I unfortunately can't remember which node I got the cluster state from, but I did not use local=true.

abeyad commented Dec 6, 2016

If you didn't use local=true, then it would've retrieved the cluster state from the master node.

Did you explicitly try deleting the snapshot before (or after) the rolling restart?

abeyad commented Dec 6, 2016

@desagar BTW, you inadvertently pasted the entire cluster state, which included your repository credentials. I removed the link from the ticket, but you should also update your security settings immediately so that your repository account is not compromised.

desagar commented Dec 6, 2016

Thank you for removing the link.

desagar commented Dec 6, 2016

I attempted deleting the snapshot prior to the restart, and at that point it was just hanging. That could possibly have been due to the bug in the plugin - I did not take a thread dump at that point so I am unsure.

I just attempted deleting it again after the restart, and the delete fails with an error since the snapshot was never fully written to the repository. However, the delete apparently removed the aborted snapshot, and it is no longer present in the cluster state output. The snapshot status now reports that the snapshot is missing.
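
For the record, this was roughly the sequence (a sketch assuming the default localhost:9200 endpoint):

```
# delete the stuck snapshot; this returned an error because the
# snapshot was never fully written to the repository
curl -s -XDELETE 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001'

# but the aborted entry is gone from the cluster state, and the
# snapshot API now reports the snapshot as missing
curl -s 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true'
```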

abeyad commented Dec 6, 2016

You should be able to take snapshots now, correct?

I believe I know what is happening. When you issued the delete snapshot request, the master node marked the snapshot as ABORTED in the cluster state and propagated this cluster state to the other node, so that each node could abort the snapshot shards running on it. Because your plugin was stuck, no reads of snapshot data were taking place, so the abort (which takes effect on the snapshot's input stream) never kicked in to fail the shard. So your snapshot remained in this ABORTED state, without progressing to the FAILED state whereby the snapshot could be removed from the cluster state.

In this case, the full cluster restart is your main option. We opened #21759 to look at better options for aborting, and once that is completed, situations like the one you encountered would be properly handled.
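
Once the restart is done, something like the following can confirm the stuck entry is gone and snapshotting works again (a sketch; the snapshot name snapshot_test is hypothetical):

```
# confirm the aborted snapshot no longer appears in the cluster state
curl -s 'http://localhost:9200/_cluster/state?pretty=true' | grep snapshot_20161130_042001

# take a fresh snapshot and wait for it to complete
curl -s -XPUT 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_test?wait_for_completion=true'
```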

I'm closing this for now. If you encounter other issues, please feel free to reopen. Thank you for reporting this!

abeyad closed this as completed Dec 6, 2016
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Jan 20, 2017:

Elasticsearch 6.0 removes support for lenient booleans (see elastic#22000). With this commit we deprecate all usages of non-strict booleans in Elasticsearch 5.x so users can already spot improper usages.

Relates elastic#22000
Relates elastic#22696

danielmitterdorfer added a commit that referenced this issue Jan 23, 2017:

Elasticsearch 6.0 removes support for lenient booleans (see #22000). With this commit we deprecate all usages of non-strict booleans in Elasticsearch 5.x so users can already spot improper usages.

Relates #22000
Relates #22696