Add `_source`-only snapshot repository #32844

s1monw · 2018-08-14T13:10:53Z

This change adds a `_source`-only snapshot repository that wraps
any existing repository as a backend and snapshots only the `_source` part,
including live docs markers. Snapshots taken with the source repository
won't include any indices, doc-values or points. The snapshot is reduced in size and
functionality, so it requires a full re-index after it has been successfully restored.

The restore process copies the `_source` data locally and starts a special shard and engine
that allow `match_all` scrolls and searches. Any other query or get call fails with an
unsupported operation exception. The restored index is also marked as read-only.

This feature is aimed mainly at disaster recovery use cases where snapshot size is
a concern or where time to restore is less of an issue.

NOTE: The snapshot produced by this repository is still a valid Lucene index. This change doesn't allow for any longer retention policies, which are out of scope for this change.
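
To make "only the `_source` part including live docs markers" concrete, here is a minimal sketch that filters a segment's file names down to stored-fields and live-docs files. It assumes that the Lucene constants `FIELDS_EXTENSION` and `FIELDS_INDEX_EXTENSION` imported by `SourceOnlySnapshot` resolve to `fdt` and `fdx`, and that live docs files use the `liv` extension. The class `SourceOnlyFileFilter` and its method are hypothetical names for illustration only; the actual implementation in this PR is more involved than a file-name filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not code from this PR.
public final class SourceOnlyFileFilter {

    // Assumed extensions: stored fields data (fdt), stored fields index (fdx), live docs (liv).
    private static final Set<String> KEPT_EXTENSIONS =
        new HashSet<>(Arrays.asList("fdt", "fdx", "liv"));

    // Returns the files whose extension is in KEPT_EXTENSIONS, preserving input order.
    public static List<String> keepSourceFiles(List<String> segmentFiles) {
        List<String> kept = new ArrayList<>();
        for (String file : segmentFiles) {
            int dot = file.lastIndexOf('.');
            if (dot >= 0 && KEPT_EXTENSIONS.contains(file.substring(dot + 1))) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up file names: doc values (dvd), terms (tim) and points (dim) are dropped.
        List<String> files = Arrays.asList(
            "_0.fdt", "_0.fdx", "_0_1.liv", "_0.dvd", "_0.tim", "_0.dim");
        System.out.println(keepSourceFiles(files)); // prints [_0.fdt, _0.fdx, _0_1.liv]
    }
}
```

The filter only shows which kinds of files carry the data the description says is kept; it says nothing about how segment metadata is handled.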

This change adds a `_source` only snapshot repository that wraps any existing repository as a _backend_ to snapshot only the `_source` part, including live docs markers. Snapshots taken with the `source` repository won't include any index structures. The snapshot is reduced in size and functionality, so it requires in-place reindexing during restore. The restore process copies the `_source` data locally and reindexes all data during the recovery-from-snapshot phase. Users have 2 options for re-indexing:

* full reindex: the data is reindexed with the original mapping
* minimal reindex: the data is reindexed with a disabled mapping, which results in an index that can only be accessed via `_id`

Both options allow using and updating the index, while the latter is mainly for scan/scroll purposes and re-indexing after the fact. This feature is aimed mainly at disaster recovery use cases where snapshot size is a concern or where time to restore is less of an issue.

s1monw · 2018-08-14T13:29:22Z

docs/plugins/repository-source-only.asciidoc

@@ -0,0 +1,39 @@
+[[repository-src-only]]


@clintongormley @debadair I'd love to get input on where this should be linked from and where it should be located. At this point it's stand-alone.

s1monw · 2018-08-14T13:30:13Z

x-pack/plugin/core/src/main/java/org/elasticsearch/snapshots/SourceOnlySnapshot.java

+import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_EXTENSION;
+import static org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.FIELDS_INDEX_EXTENSION;
+
+public final class SourceOnlySnapshot {


this could be generally useful and moved to lucene land. I will do this after the fact.

s1monw · 2018-08-14T13:31:20Z

x-pack/plugin/core/src/main/java/org/elasticsearch/snapshots/SourceOnlySnapshotRepository.java

+    public static final Setting<Boolean> RESTORE_MINIMAL = Setting.boolSetting("restore_minimal",
+        false, Setting.Property.NodeScope);
+
+    public static final String SNAPSHOT_DIR_NAME = "_snapshot";


@bleskes we use tmp dirs for restore and snapshot. I wonder if that is ok or if there are any concerns. Our shard deletion mechanism should take care of cleaning up.

s1monw · 2018-08-14T13:33:01Z

x-pack/plugin/core/src/main/java/org/elasticsearch/snapshots/SourceOnlySnapshotRepository.java

+                            BytesReference source = rootFieldsVisitor.source();
+                            if (source != null) { // nested fields don't have source. in this case we should be fine.
+                                // TODO we should have a dedicated origin for this LOCAL_TRANSLOG_RECOVERY is misleading.
+                                Engine.Result result = shard.applyTranslogOperation(new Translog.Index(uid.type(), uid.id(),


This is a pretty terrible abuse of the Translog phase; we should rename it to Ops phase or so. I spoke with @bleskes about this. This is a shortcut and needs some discussion, also with @ywelsch.

s1monw · 2018-08-15T08:23:31Z

I did some initial benchmarks using our geonames and http_logs dataset we use for our benchmarks:

| repository type | dataset | snapshot time taken | snapshot size |
|---|---|---|---|
| `fs` | geonames | 82 sec | 3.2 GB |
| `source` delegate to `fs` | geonames | 28 sec | 922 MB |
| `fs` | http_logs | 7 min 6 sec | 16 GB |
| `source` delegate to `fs` | http_logs | 3 min 44 sec | 7.5 GB |

The snapshots are all taken to a local disk, i.e. no network is involved here. I will follow up with restore times, which I expect to be much better for full backups (`fs`) since `source` needs to reindex. Yet, I already have some numbers for the geonames dataset:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| `fs` | geonames | 1 min 10 sec | 3.2 GB | 0 | 5 |
| `source` full reindex | geonames | 3 min 15 sec | 922 MB | 11396505 | 5 |
| `source` minimal reindex | geonames | 1 min 24 sec | 922 MB | 11396505 | 5 |
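
For readers who want the relative savings rather than absolute numbers, the snapshot figures reported above (82 sec vs 28 sec and 3.2 GB vs 922 MB for geonames; 7 min 6 sec vs 3 min 44 sec and 16 GB vs 7.5 GB for http_logs) can be turned into percentages with a small helper. This is only a sketch: the class `SnapshotSavings` is a hypothetical name, and the conversion assumes 1 GB = 1000 MB because the comment does not say which convention was used.

```java
// Hypothetical helper, not part of this PR.
public final class SnapshotSavings {

    // Fraction saved relative to the baseline, e.g. 0.25 means 25% smaller or faster.
    public static double reduction(double baseline, double value) {
        return 1.0 - value / baseline;
    }

    public static void main(String[] args) {
        // Sizes in MB (assumption: 1 GB = 1000 MB).
        System.out.printf("geonames size reduction:  %.0f%%%n", 100 * reduction(3200, 922));
        System.out.printf("http_logs size reduction: %.0f%%%n", 100 * reduction(16000, 7500));
        // Snapshot times in seconds: 7 min 6 sec = 426 s, 3 min 44 sec = 224 s.
        System.out.printf("geonames time reduction:  %.0f%%%n", 100 * reduction(82, 28));
        System.out.printf("http_logs time reduction: %.0f%%%n", 100 * reduction(426, 224));
    }
}
```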

s1monw · 2018-08-16T13:28:39Z

here are some updated numbers:

| repository type | dataset | restore time taken | snapshot size | num docs reindexed | num shards |
|---|---|---|---|---|---|
| `fs` | http_logs | 6.7 min | 16 GB | 0 | 4 |
| `source` full reindex | http_logs | 33.6 min | 7.1 GB | 181463624 | 4 |
| `source` minimal reindex | http_logs | 16.9 min | 7.1 GB | 181463624 | 4 |

s1monw · 2018-09-12T08:10:41Z

@bleskes @debadair I pushed changes. Thanks for the reviews.

bleskes

LGTM

bleskes · 2018-09-12T14:12:42Z

docs/reference/modules/snapshots.asciidoc

- * Queries other than `match_all` will return no results.
-
- * `_get` requests are not supported.
+ * Queries other than `match_all` and `_get` requests are not supported.


This reads two ways: you could also read it as saying that `_get` works (if you don't realize that `_get` is not what we consider a query).

bleskes · 2018-09-12T14:14:25Z

x-pack/plugin/core/src/main/java/org/elasticsearch/snapshots/SeqIdGeneratingFilterReader.java

+
+                @Override
+                public Terms terms(String field) {
+                    throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
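
The behaviour in this hunk, where term lookups throw while a `match_all`-style scan keeps working, can be sketched without any Lucene types. Everything below is a simplified assumption for illustration: `SourceOnlyReaderSketch`, its list of `_source` strings and its `BitSet` of live docs are hypothetical stand-ins, not the classes used in this PR. Only the exception message is taken from the hunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a reader over a restored source-only shard.
public final class SourceOnlyReaderSketch {
    private final List<String> sources; // _source per doc id
    private final BitSet liveDocs;      // set bit = document is not deleted

    public SourceOnlyReaderSketch(List<String> sources, BitSet liveDocs) {
        this.sources = sources;
        this.liveDocs = liveDocs;
    }

    // match_all-style scan: returns the _source of every live document in doc-id order.
    public List<String> matchAll() {
        List<String> hits = new ArrayList<>();
        for (int doc = liveDocs.nextSetBit(0);
             doc >= 0 && doc < sources.size();
             doc = liveDocs.nextSetBit(doc + 1)) {
            hits.add(sources.get(doc));
        }
        return hits;
    }

    // There is no inverted index to consult, so any term lookup is rejected.
    public Object terms(String field) {
        throw new UnsupportedOperationException("_source only indices can't be searched or filtered");
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0);
        live.set(2); // doc 1 is treated as deleted
        SourceOnlyReaderSketch reader = new SourceOnlyReaderSketch(
            Arrays.asList("{\"a\":1}", "{\"a\":2}", "{\"a\":3}"), live);
        System.out.println(reader.matchAll()); // prints [{"a":1}, {"a":3}]
        try {
            reader.terms("a");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```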


jhalterman · 2018-09-13T19:09:21Z

@s1monw What's the meaning of minimal vs full reindex in this comment?

s1monw · 2018-09-13T21:00:32Z

@jhalterman that was an early version of the change that I reverted. These numbers are meaningless now.

We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo. Relates to elastic#32844 Closes elastic#33689

We can't rely on the leaf reader ordinal in a wrapped reader since it might not correspond to the ordinal in the SegmentInfos for its SegmentCommitInfo. Relates to #32844 Closes #33689 Closes #33755

s1monw added >enhancement WIP :Distributed/Snapshot/Restore v7.0.0 v6.5.0 labels Aug 14, 2018

s1monw requested a review from bleskes August 14, 2018 13:10

s1monw commented Aug 14, 2018

s1monw added 2 commits August 14, 2018 16:04

add license

3c5a0ed

fix docs test setup

0645159

s1monw added 5 commits August 15, 2018 16:41

fix imports

b28407f

make sure on local reindex we don't parse the source if it's not necessary, skip translog and use append only optimization

5cca4c2

remove dead code

6fc9664

Merge branch 'master' into source_only_snap

d32bc05

fix imports

ea806bc

s1monw added 10 commits August 20, 2018 13:42

status quo

02aecd7

Merge branch 'master' into source_only_snap

8e612f6

iteration

7dcc25d

Restore from a soruce only snap by copying only the source

81c8127

Merge branch 'master' into source_only_snap

6dcf9e5

fix imports and docs

ee8a9d5

fix constant

40cd45d

add license headers

deff31d

fix javadocs

eea9f6f

Merge branch 'master' into source_only_snap

61c2ff2

s1monw added 5 commits September 12, 2018 09:26

reword docs

12473d5

fix nit

71d2e58

apply feedback

5be8844

Make sure all queries and get requests other than match_all fail

18c4680

fix comment

f384028

s1monw added 4 commits September 12, 2018 13:10

Merge branch 'master' into source_only_snap

9930e08

fix tests to expect exception on query

8877497

fix imports

78de6b5

add test that slices work too

5f6529f

bleskes approved these changes Sep 12, 2018

s1monw added the release highlight label Sep 12, 2018

s1monw merged commit c783488 into elastic:master Sep 12, 2018

s1monw added the backport pending label Sep 12, 2018

s1monw mentioned this pull request Sep 12, 2018

Add _source-only snapshot repository (#32844) #33652

Merged

s1monw removed the backport pending label Sep 13, 2018

s1monw mentioned this pull request Sep 17, 2018

Ensure fully deleted segments are accounted for correctly #33757

Merged

Mpdreamz mentioned this pull request Dec 13, 2018

[meta] 6.5.0 Release elastic/elasticsearch-net#3457

Closed

Mpdreamz mentioned this pull request Jan 2, 2019

Add support for source only snapshot repository elastic/elasticsearch-net#3531

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
