Snapshot and restore queue #27353
This is an interesting idea. We should definitely consider it for the snapshot redesign, unless we go with the continuous backup idea, in which case this won't be applicable. On the other hand, a restore queue doesn't make much sense to me. So we should probably rename the issue to Snapshot Queue.
We spoke about this and decided that we don't want to add a queue, since it's nice to give the user immediate feedback that we can't run more than one snapshot. There is also the fact that we don't queue long-running processes, and this one would be an exception.
There is still interest in this issue from time to time, particularly from users wanting to enqueue restore jobs. Snapshots are normally a periodic cluster-wide activity, so it doesn't make much sense to queue them up, but restores are more ad-hoc and typically focus on only a few indices, so a bit of asynchrony might be useful. As of 6.6.0 Elasticsearch supports concurrent restores: a user can trigger multiple restores at once and Elasticsearch will throttle the corresponding recoveries, effectively enqueueing the restoration of each shard until it has the capacity to handle it. However, Elasticsearch still forbids restores occurring concurrently with either taking or deleting a snapshot. Since we recommend taking frequent incremental snapshots, this can make it challenging to find a good time to perform a restore.
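Since Elasticsearch throttles the resulting shard recoveries itself, a client can simply fire one restore request per group of indices. A minimal sketch of building those REST calls against the restore API (`POST /_snapshot/<repo>/<snapshot>/_restore`) — the repository, snapshot, and index names below are hypothetical:

```python
def restore_request(repo, snapshot, indices):
    """Build the (method, path, body) triple for one restore API call."""
    return (
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore",
        {"indices": ",".join(indices)},
    )

# One request per index group; Elasticsearch queues the shard
# recoveries internally, so these can all be submitted at once.
requests = [
    restore_request("backups", "nightly-2024-01-01", group)
    for group in (["logs-web"], ["logs-app"], ["metrics"])
]
```

Each triple would then be sent with whatever HTTP client you use; the point is only that nothing client-side needs to serialize these requests.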
@DaveCTurner this is good news. I do have to ask, though: I'd expect only the relevant shards to be locked for concurrent snapshot actions. Is there a technical limitation to performing snapshots/restores which do not touch the same shards, or is it just a matter of code?
"Just" is just the worst word 😁 There are some quite significant obstacles.
@DaveCTurner has there been any new discussion on the theme of a snapshot queue? I can see several people on the internet argue that it is better to simply take one large snapshot. However, one might as well argue that one large snapshot of the whole cluster leaves you vulnerable to user errors (for example, someone manually deleting the container with snapshots) or to failures such as the snapshot or container being corrupt. Implementing different snapshots for different groups of indices would help mitigate these risks further. One could of course avoid this issue by simply setting up the policies with different timeslots, but this grows hard to manage as the number of desired snapshot policies grows.
@johanmha I think the implementation is complete now, so there's nothing left to discuss. It's generally best to put all your snapshots in one repository rather than spreading them around, since it's going to be painful to reconstruct your cluster from multiple snapshots in different repositories after a disaster. There's no technical reason you can't spread them around; it's just a bad idea. We can't really offer much advice on protecting a repository from damage (whether malicious or accidental): it's up to you to apply appropriate access controls, monitor your disks for errors, and so on.
The lack of some sort of snapshot queue is an ongoing problem for us, to the point that if there were a third-party plugin available we would buy it. A single big snapshot job takes about 7 hours on average, running at about 2TB per hour. Different sets of data carry different regulatory or legal requirements, which forces us to run different snapshot jobs, and in practice we are running snapshot jobs nearly 24x7. Keeping several sets of jobs scheduled so they don't stomp on each other is not fun: it costs about 2 hours of staff time per day and a big Excel sheet to keep it all sorted, plus automated scripting to retry failed jobs. A time-consuming and poorly working hack, but there aren't many other options.
... or necessary? You can run snapshots in parallel these days. edit: Also, if you want to keep different indices for different lengths of time to satisfy regulatory requirements, it's probably simplest to take frequent whole-cluster snapshots and then use the clone snapshot API to make snapshots containing just the indices you want to retain for longer. Cloning is a zero-copy operation so it's pretty cheap. |
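A hedged sketch of the retention pattern described above, expressed as the underlying REST call for the clone snapshot API (`PUT /_snapshot/<repo>/<snapshot>/_clone/<target>`, available since 7.10). The repository, snapshot, and index names are made up for illustration:

```python
def clone_request(repo, source_snapshot, target_snapshot, indices):
    """Build the (method, path, body) triple for one clone API call.

    Cloning is zero-copy, so carving a long-retention snapshot out of
    a frequent whole-cluster snapshot is cheap.
    """
    return (
        "PUT",
        f"/_snapshot/{repo}/{source_snapshot}/_clone/{target_snapshot}",
        {"indices": ",".join(indices)},
    )

# Keep only the audit indices from tonight's full snapshot,
# under a separate name that a longer retention policy can manage:
method, path, body = clone_request(
    "backups", "nightly-2024-01-01", "audit-2024-01-01", ["audit-*"]
)
```

The full nightly snapshot can then be deleted on a short schedule while the clone survives for as long as the regulation demands.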
@DaveCTurner I wasn't aware you could run snapshots in parallel! How is this possible? And thanks for the prompt reply! |
This conversation isn't really on-topic for a GitHub issue, so I suggest we continue it over on the discussion forum. I won't be replying here any more, but feel free to link to your forum thread below.
Link to discussion forum as suggested:
Hey
We've been trying to automate index snapshot creation through the Elasticsearch snapshot API. However, we get an exception when trying to run more than one snapshot in parallel:
elasticsearch.exceptions.TransportError: TransportError(503, 'concurrent_snapshot_execution_exception', 'a snapshot is already running')
This is odd behavior, as Elasticsearch already has queues for many of its operations, and I'd expect the same for snapshots and restores. I understand there is a technical limitation to actually running two snapshots concurrently, but I think it would be a good idea for Elasticsearch to add snapshot requests to a queue, automatically executing each snapshot as the previous one finishes.
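In the meantime a crude client-side queue is to retry on the 503. A minimal sketch, assuming the caller wraps the real snapshot-create call (with elasticsearch-py that would be `es.snapshot.create(...)`); the `SnapshotBusyError` class here is a stand-in for catching the `concurrent_snapshot_execution_exception` response:

```python
import time


class SnapshotBusyError(Exception):
    """Stand-in for the 503 concurrent_snapshot_execution_exception."""


def submit_when_free(start_snapshot, poll_seconds=0, max_attempts=100):
    """Retry a snapshot request until no other snapshot is running.

    `start_snapshot` is any callable that raises SnapshotBusyError when
    the cluster rejects the request because a snapshot is in progress.
    """
    for _ in range(max_attempts):
        try:
            return start_snapshot()
        except SnapshotBusyError:
            time.sleep(poll_seconds)
    raise RuntimeError("gave up waiting for the running snapshot to finish")


# Demo with a stub that is "busy" for the first two attempts:
state = {"busy": 2}

def fake_start():
    if state["busy"] > 0:
        state["busy"] -= 1
        raise SnapshotBusyError()
    return "accepted"

print(submit_when_free(fake_start))  # → accepted
```

A real script would use a non-zero `poll_seconds` and run one such loop per pending snapshot, which gives you first-come-first-served queueing without server-side support.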