
Snapshot in ABORTED state after rolling restart of nodes #22000

Closed

desagar opened this issue Dec 6, 2016 · 7 comments

desagar commented Dec 6, 2016

Elasticsearch version: 2.3.1

Plugins installed: [a custom repository plugin]

JVM version: 1.8.0_101

OS version: Oracle Enterprise Linux 6 with Redhat kernel

Description of the problem including expected versus actual behavior:
We have a 2-node Elasticsearch cluster with a custom repository plugin installed for storing Elasticsearch snapshots. The plugin has a bug that occasionally causes it to hang indefinitely while waiting for a connection to the back-end store for our snapshots. When this happened, we performed a rolling restart of the Elasticsearch cluster to clear the hanging thread. After the restart, we ended up in a state where the snapshot is in ABORTED status according to the ES cluster state. However, when querying the snapshot using the snapshot API, it reports that the snapshot is still in progress. As a result we are unable to take any further snapshots.
According to this link, snapshots in ABORTED status should be cleaned up when the master node is restarted.

Steps to reproduce:
Working on a reproducer - will provide once I have one.

Provide logs (if relevant):
Please see attached files of cluster state and snapshot status.
snapshot_status.txt : output of /_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true
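
For reference, the two views can be compared with something like this (a sketch assuming the node listens on the default localhost:9200; adjust host and port as needed):

```
# snapshot API view: reports the snapshot as still in progress
curl -s 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true'

# cluster state view: shows the same snapshot as ABORTED
curl -s 'http://localhost:9200/_cluster/state?pretty=true'
```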

abeyad commented Dec 6, 2016

When retrieving the snapshot status, that action looks in the cluster state and retrieves the current snapshots. So it's very strange that the snapshot status is showing an IN_PROGRESS snapshot while the cluster state is showing the snapshot as ABORTED. Did you get the cluster state from the non-master node? (i.e., did you run /_cluster/state?local=true?) Also, I'm assuming you retrieved the cluster state and the snapshot status after the rolling restart had completed?
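
For clarity, the difference between the two requests (a sketch, assuming the default endpoint):

```
# default: the request is resolved against the elected master's
# view of the cluster state
curl -s 'http://localhost:9200/_cluster/state?pretty=true'

# local=true: the node that receives the request answers from its
# own local copy of the cluster state instead
curl -s 'http://localhost:9200/_cluster/state?local=true&pretty=true'
```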

desagar commented Dec 6, 2016

I retrieved both cluster state and snapshot status after the rolling restart fully completed and the cluster health was back to green.
I unfortunately can't remember which node I got the cluster state from, but I did not use local=true.

abeyad commented Dec 6, 2016

If you didn't use local=true, then it would've retrieved the cluster state from the master node.

Did you explicitly try deleting the snapshot before (or after) the rolling restart?

abeyad commented Dec 6, 2016

@desagar BTW, you inadvertently pasted the entire cluster state, which included your repository credentials. I removed the link from the ticket, but you should also update your security settings immediately so that your repository account is not compromised.

desagar commented Dec 6, 2016

Thank you for removing the link.

desagar commented Dec 6, 2016

I attempted deleting the snapshot prior to the restart, and at that point it was just hanging. That could possibly have been due to the bug in the plugin - I did not take a thread dump at that point so I am unsure.

I just attempted deleting it again after the restart, and the delete fails with an error since the snapshot was never fully written to the repository. However, the delete apparently removed the aborted snapshot, and it is no longer present in the cluster state output. The snapshot status now reports that the snapshot is missing.
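
For the record, this was roughly the sequence (a sketch assuming the default localhost:9200 endpoint):

```
# delete the stuck snapshot; this returned an error because the
# snapshot was never fully written to the repository
curl -s -XDELETE 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001'

# but the aborted entry is gone from the cluster state, and the
# snapshot API now reports the snapshot as missing
curl -s 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_20161130_042001?pretty=true'
```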

abeyad commented Dec 6, 2016

You should be able to take snapshots now, correct?

I believe I know what is happening. When you issued the delete snapshot request, the master node marked the snapshot as ABORTED in the cluster state and propagated this cluster state to the other node, so that each node could abort the snapshot shards running on it. Because your plugin was stuck, no reads of snapshot data were taking place, so the abort (which takes effect on the snapshot's input stream) never kicked in to fail the shard. So your snapshot remained in this ABORTED state, without progressing to the FAILED state whereby the snapshot could be removed from the cluster state.

In this case, the full cluster restart is your main option. We opened #21759 to look at better options for aborting, and once that is completed, situations like the one you encountered would be properly handled.
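
Once the restart is done, something like the following can confirm the stuck entry is gone and snapshotting works again (a sketch; the snapshot name snapshot_test is hypothetical):

```
# confirm the aborted snapshot no longer appears in the cluster state
curl -s 'http://localhost:9200/_cluster/state?pretty=true' | grep snapshot_20161130_042001

# take a fresh snapshot and wait for it to complete
curl -s -XPUT 'http://localhost:9200/_snapshot/ppmgmt1645/snapshot_test?wait_for_completion=true'
```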

I'm closing this for now. If you encounter other issues, please feel free to reopen. Thank you for reporting this!

abeyad closed this as completed Dec 6, 2016
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Jan 20, 2017:

Elasticsearch 6.0 removes support for lenient booleans (see elastic#22000). With this commit we deprecate all usages of non-strict booleans in Elasticsearch 5.x so users can already spot improper usages.

Relates elastic#22000
Relates elastic#22696

danielmitterdorfer added a commit that referenced this issue Jan 23, 2017:

Elasticsearch 6.0 removes support for lenient booleans (see #22000). With this commit we deprecate all usages of non-strict booleans in Elasticsearch 5.x so users can already spot improper usages.

Relates #22000
Relates #22696