Add _source-only snapshot repository #32844

Merged 52 commits on Sep 12, 2018

s1monw (Contributor) commented Aug 14, 2018

This change adds a _source-only snapshot repository that can wrap
any existing repository as a backend and snapshot only the _source part,
including live docs markers. Snapshots taken with the source repository
won't include any indices, doc values, or points. The snapshot is reduced in size and
functionality, so it requires a full reindex after it has been successfully restored.

The restore process copies the _source data locally and starts a special shard and engine
that allow match_all scrolls and searches. Any other query, or a get call, fails with an unsupported operation exception. The restored index is also marked as read-only.

This feature is aimed mainly at disaster recovery use cases where snapshot size is
a concern or where time to restore is less of an issue.

NOTE: The snapshot produced by this repository is still a valid Lucene index. This change doesn't allow for longer retention policies, which are out of scope for this change.
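
To make the restored index's access rules concrete, here is a minimal sketch in plain Java. It is not the Elasticsearch implementation: the class `SourceOnlyShardSketch` and all of its method names are hypothetical, and it models each document as a raw _source string. It only illustrates the contract described above: match_all and scroll-style paging work, while term queries, get calls, and writes are rejected.

```java
import java.util.List;

/** Hypothetical stand-in for a restored source-only shard; not Elasticsearch code. */
final class SourceOnlyShardSketch {
    // One raw _source JSON string per live document, in storage order.
    private final List<String> sources;

    SourceOnlyShardSketch(List<String> sources) {
        this.sources = List.copyOf(sources);
    }

    /** match_all: every live document is returned. */
    List<String> matchAll() {
        return sources;
    }

    /** Scroll-style paging over the match_all result. */
    List<String> scroll(int from, int size) {
        int start = Math.min(Math.max(from, 0), sources.size());
        int end = (int) Math.min((long) start + Math.max(size, 0), sources.size());
        return sources.subList(start, end);
    }

    /** Term-based queries need an inverted index, which the snapshot omits. */
    List<String> termQuery(String field, String value) {
        throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
    }

    /** A get by id also needs a term lookup, so it is rejected as well. */
    String get(String id) {
        throw new UnsupportedOperationException("get is not supported on a _source only index");
    }

    /** The restored index is read-only, so writes are rejected. */
    void index(String source) {
        throw new UnsupportedOperationException("restored _source only index is read-only");
    }
}
```

In this model the only way to get documents out is to page through everything, which matches the intended workflow of scrolling the restored index and reindexing its contents elsewhere.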

This change adds a `_source`-only snapshot repository that can wrap
any existing repository as a _backend_ and snapshot only the `_source` part,
including live docs markers. Snapshots taken with the `source` repository
won't include any index structures. The snapshot is reduced in size and
functionality, so it requires in-place reindexing during restore.
The restore process copies the `_source` data locally and reindexes all
data during the recovery-from-snapshot phase. Users have two options for reindexing:
 * full reindex: the data is reindexed with the original mapping
 * minimal reindex: the data is reindexed with a disabled mapping, which
   results in an index that can only be accessed via `_id`.

Both options allow using and updating the index; the latter is mainly for
scan/scroll purposes and for reindexing after the fact.

This feature is aimed mainly at disaster recovery use cases where snapshot size is
a concern or where time to restore is less of an issue.
s1monw added the >enhancement, WIP, :Distributed/Snapshot/Restore, v7.0.0, and v6.5.0 labels Aug 14, 2018
@s1monw s1monw requested a review from bleskes August 14, 2018 13:10
@@ -0,0 +1,39 @@
[[repository-src-only]]
Contributor Author

@clintongormley @debadair I'd love to get input on where this should be linked from and where it should be located. At this point it's stand-alone.

import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_EXTENSION;
import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_INDEX_EXTENSION;

public final class SourceOnlySnapshot {
Contributor Author

This could be generally useful and could be moved to Lucene. I will do that after the fact.

public static final Setting<Boolean> RESTORE_MINIMAL = Setting.boolSetting("restore_minimal",
    false, Setting.Property.NodeScope);

public static final String SNAPSHOT_DIR_NAME = "_snapshot";
Contributor Author

@bleskes we use tmp dirs for restore and snapshot. I wonder if that is ok or if there are any concerns. Our shard deletion mechanism should take care of cleaning up.

BytesReference source = rootFieldsVisitor.source();
if (source != null) { // nested fields don't have source. in this case we should be fine.
    // TODO we should have a dedicated origin for this LOCAL_TRANSLOG_RECOVERY is misleading.
    Engine.Result result = shard.applyTranslogOperation(new Translog.Index(uid.type(), uid.id(),
Contributor Author

This is a pretty terrible abuse of the Translog phase. We should rename it to Ops phase or so. I spoke with @bleskes about this. This is a shortcut and needs some discussion, also with @ywelsch.

s1monw (Contributor Author) commented Aug 15, 2018

I ran some initial benchmarks using the geonames and http_logs datasets that we use for our benchmarks:

| repository type | dataset | snapshot time taken | snapshot size |
|---|---|---|---|
| fs | geonames | 82 sec | 3.2 GB |
| source delegate to fs | geonames | 28 sec | 922 MB |
| fs | http_logs | 7 min 6 sec | 16 GB |
| source delegate to fs | http_logs | 3 min 44 sec | 7.5 GB |

The snapshots were all taken to a local disk, i.e., no network was involved. I will follow up with restore times, which I expect to be much better for full backups (fs) since source needs to reindex. I already have some numbers for the geonames dataset:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| fs | geonames | 1 min 10 sec | 3.2 GB | 0 | 5 |
| source full reindex | geonames | 3 min 15 sec | 922 MB | 11396505 | 5 |
| source minimal reindex | geonames | 1 min 24 sec | 922 MB | 11396505 | 5 |

s1monw (Contributor Author) commented Aug 16, 2018

here are some updated numbers:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| fs | http_logs | 6.7 min | 16 GB | 0 | 4 |
| source full reindex | http_logs | 33.6 min | 7.1 GB | 181463624 | 4 |
| source minimal reindex | http_logs | 16.9 min | 7.1 GB | 181463624 | 4 |

s1monw (Contributor Author) commented Sep 12, 2018

@bleskes @debadair I pushed changes. Thanks for the reviews.

bleskes (Contributor) left a comment

LGTM

* Queries other than `match_all` will return no results.

* `_get` requests are not supported.
* Queries other than `match_all` and `_get` requests are not supported.
Contributor

This reads in two ways: you can also read it as if _get works (if you don't understand that _get is not what we see as a query).


@Override
public Terms terms(String field) {
    throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
Contributor

++

@s1monw s1monw merged commit c783488 into elastic:master Sep 12, 2018
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Sep 13, 2018
This change adds a `_source`-only snapshot repository that can wrap
any existing repository as a _backend_ and snapshot only the `_source` part,
including live docs markers. Snapshots taken with the `source` repository
won't include any indices, doc values, or points. The snapshot is reduced in size and
functionality, so it requires a full reindex after it has been successfully restored.

The restore process copies the `_source` data locally and starts a special shard and engine
that allow `match_all` scrolls and searches. Any other query, or a get call, fails with an unsupported operation exception. The restored index is also marked as read-only.

This feature is aimed mainly at disaster recovery use cases where snapshot size is
a concern or where time to restore is less of an issue.

**NOTE**: The snapshot produced by this repository is still a valid Lucene index. This change doesn't allow for longer retention policies, which are out of scope for this change.
s1monw added a commit that referenced this pull request Sep 13, 2018
jhalterman (Contributor) commented

@s1monw What's the meaning of minimal vs full reindex in this comment?

s1monw (Contributor Author) commented Sep 13, 2018

@jhalterman that was an early version of the change that I reverted. These numbers are meaningless now.

s1monw added a commit to s1monw/elasticsearch that referenced this pull request Sep 17, 2018
We can't rely on the leaf reader ordinal in a wrapped reader since
it might not correspond to the ordinal in the SegmentInfos for its
SegmentCommitInfo.

Relates to elastic#32844
Closes elastic#33689
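
The ordinal mismatch described in this commit message can be illustrated with a small sketch. It uses plain Java collections rather than Lucene types, and every name in it (`LeafOrdinalSketch`, `byOrdinal`, `commitIndexOf`) is hypothetical. It assumes, purely for illustration, a wrapped reader that exposes fewer leaves than the commit has segments; that is only one way the two orderings could diverge, and this is not the code from the actual fix.

```java
import java.util.List;

/** Hypothetical illustration of leaf ordinal vs. commit ordinal; not Lucene code. */
final class LeafOrdinalSketch {

    /** Fragile: treats the leaf's position in the wrapped reader as an index into the commit. */
    static String byOrdinal(List<String> commitSegments, int leafOrd) {
        return commitSegments.get(leafOrd);
    }

    /** Safer: resolves the commit position from the segment name the leaf itself reports. */
    static int commitIndexOf(List<String> commitSegments, String leafSegmentName) {
        int idx = commitSegments.indexOf(leafSegmentName);
        if (idx < 0) {
            throw new IllegalArgumentException("segment not in commit: " + leafSegmentName);
        }
        return idx;
    }
}
```

For example, with commit segments `_0`, `_1`, `_2` and a wrapped reader whose leaves are `_0` and `_2`, the leaf at ordinal 1 is `_2`, but `byOrdinal` would return `_1`. Looking the segment up by the name the leaf reports gives commit position 2.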
s1monw added a commit that referenced this pull request Sep 17, 2018
We can't rely on the leaf reader ordinal in a wrapped reader since
it might not correspond to the ordinal in the SegmentInfos for its
SegmentCommitInfo.

Relates to #32844
Closes #33689
Closes #33755
s1monw added a commit that referenced this pull request Sep 18, 2018
s1monw added a commit that referenced this pull request Sep 18, 2018