Add `_source`-only snapshot repository #32844
This change adds a `_source`-only snapshot repository that can wrap any existing repository as a _backend_ and snapshot only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any index structures. The snapshot is reduced in size and functionality, so it requires in-place reindexing during restore. The restore process copies the `_source` data locally and reindexes all data during the recovery-from-snapshot phase.

Users have two options for reindexing:
* full reindex: the data is reindexed with the original mapping
* minimal reindex: the data is reindexed with a disabled mapping, which results in an index that can only be accessed via `_id`

Both options allow using and updating the index; the latter is mainly for scan/scroll purposes and for reindexing after the fact. This feature is aimed mainly at disaster recovery use cases where snapshot size is a concern or where time to restore is less of an issue.
@@ -0,0 +1,39 @@
[[repository-src-only]]
@clintongormley @debadair I'd love to get input on where this should be linked from and where it should be located. At this point it's stand-alone.
import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_EXTENSION;
import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_INDEX_EXTENSION;

public final class SourceOnlySnapshot {
This could be generally useful and moved to Lucene land. I will do this after the fact.
public static final Setting<Boolean> RESTORE_MINIMAL = Setting.boolSetting("restore_minimal",
    false, Setting.Property.NodeScope);

public static final String SNAPSHOT_DIR_NAME = "_snapshot";
@bleskes we use tmp dirs for restore and snapshot. I wonder if that is ok or if there are any concerns. Our shard deletion mechanism should take care of cleaning up.
BytesReference source = rootFieldsVisitor.source();
if (source != null) { // nested fields don't have source. in this case we should be fine.
    // TODO we should have a dedicated origin for this LOCAL_TRANSLOG_RECOVERY is misleading.
    Engine.Result result = shard.applyTranslogOperation(new Translog.Index(uid.type(), uid.id(),
I did some initial benchmarks using our
The snapshots are all taken to a local disk, i.e. no network is involved here. I will follow up with restore times, which I expect to be much better for full backups (
…ssary, skip translog and use append only optimization
here are some updated numbers:
LGTM
* Queries other than `match_all` will return no results.

* `_get` requests are not supported.
* Queries other than `match_all` and `_get` requests are not supported.
This reads in two ways: you can also read it as if `_get` works (if you don't understand that `_get` is not what we see as a query).
@Override
public Terms terms(String field) {
    throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
++
This change adds a `_source`-only snapshot repository that can wrap any existing repository as a _backend_ and snapshot only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any indices, doc-values or points. The snapshot is reduced in size and functionality, so it requires a full reindex after it has been successfully restored.

The restore process copies the `_source` data locally and starts a special shard and engine that allow `match_all` scrolls and searches. Any other query, or a get call, will fail with an unsupported operation exception. The restored index is also marked as read-only.

This feature is aimed mainly at disaster recovery use cases where snapshot size is a concern or where time to restore is less of an issue.

**NOTE**: The snapshot produced by this repository is still a valid Lucene index. This change does not enable longer retention policies; that is out of scope here.
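The restore-time behavior described in this change can be sketched as a toy model. This is only an illustration under stated assumptions: `SourceOnlyShardSketch`, `Doc`, `matchAll`, `termQuery` and `sample` are hypothetical names invented for the sketch, not classes or methods from this PR, Elasticsearch, or Lucene. It shows why `match_all` can still be served (it only walks stored `_source` and live docs markers) while anything that needs index structures is rejected.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every type here is hypothetical, not Elasticsearch or Lucene API.
final class SourceOnlyShardSketch {

    // One stored document: its _id, its raw _source, and a live-docs marker.
    static final class Doc {
        final String id;
        final String source;
        final boolean live;

        Doc(String id, String source, boolean live) {
            this.id = id;
            this.source = source;
            this.live = live;
        }
    }

    private final List<Doc> docs;

    SourceOnlyShardSketch(List<Doc> docs) {
        this.docs = new ArrayList<>(docs);
    }

    // match_all needs no index structures: walk the stored docs and skip deleted ones.
    List<String> matchAll() {
        List<String> sources = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.live) {
                sources.add(doc.source);
            }
        }
        return sources;
    }

    // Anything that would need terms, doc values, or points is rejected.
    List<String> termQuery(String field, String value) {
        throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
    }

    // Small fixture: two live documents and one deleted document.
    static SourceOnlyShardSketch sample() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("1", "{\"user\":\"a\"}", true));
        docs.add(new Doc("2", "{\"user\":\"b\"}", false));
        docs.add(new Doc("3", "{\"user\":\"c\"}", true));
        return new SourceOnlyShardSketch(docs);
    }

    public static void main(String[] args) {
        SourceOnlyShardSketch shard = sample();
        System.out.println(shard.matchAll());
        try {
            shard.termQuery("user", "a");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model, a full reindex after restore would amount to reading every result of `matchAll()` and indexing it into a fresh, fully functional index.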
@s1monw What's the meaning of minimal vs full reindex in this comment?
@jhalterman that was an early version of the change that I reverted. These numbers are meaningless now.
We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo.
Relates to elastic#32844
Closes elastic#33689
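The ordinal mismatch mentioned in that commit message can be illustrated with a toy model. This is a sketch under assumptions, not the actual fix: plain strings stand in for segments, and `LeafOrdinalSketch` with its methods is a hypothetical name, not Lucene API. It assumes a wrapper that drops one segment (for example, one with no live documents) and renumbers the remaining leaves from 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: plain strings stand in for segments; nothing here is Lucene API.
final class LeafOrdinalSketch {

    // Leaves a wrapper exposes after dropping one segment; remaining leaves are renumbered from 0.
    static List<String> wrappedLeaves(List<String> commitSegments, String droppedSegment) {
        List<String> leaves = new ArrayList<>();
        for (String name : commitSegments) {
            if (!name.equals(droppedSegment)) {
                leaves.add(name);
            }
        }
        return leaves;
    }

    // Fragile: assumes the leaf ordinal equals the position in the commit.
    static String segmentByLeafOrdinal(List<String> commitSegments, int leafOrdinal) {
        return commitSegments.get(leafOrdinal);
    }

    // Sturdier: resolve the commit entry through the name the leaf itself carries.
    static int commitPositionByName(List<String> commitSegments, String leafSegmentName) {
        return commitSegments.indexOf(leafSegmentName);
    }

    public static void main(String[] args) {
        List<String> commit = Arrays.asList("_0", "_1", "_2");
        List<String> leaves = wrappedLeaves(commit, "_1");
        int ord = 1;
        System.out.println("leaf " + ord + " is " + leaves.get(ord));
        System.out.println("by ordinal: " + segmentByLeafOrdinal(commit, ord));
        System.out.println("by name: position " + commitPositionByName(commit, leaves.get(ord)));
    }
}
```

Here the second leaf of the wrapped reader is segment `_2`, but indexing the commit by that leaf's ordinal returns `_1`; looking the segment up by its own identity avoids the mismatch.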