
Add CcrRestoreSourceService to track sessions #36578

Merged (18 commits) Dec 18, 2018

Conversation

Tim-Brooks
Contributor

This commit is related to #36127. It adds a CcrRestoreSourceService to
track the Engine.IndexCommitRef instances needed for in-process file
restores. When a follower starts restoring a shard through the
CcrRepository, it opens a session with the leader through the
PutCcrRestoreSessionAction. The leader responds to the request by telling
the follower what files it needs to fetch for the restore (this part is
not yet implemented).

Once the restore is complete, the follower closes the session with the
DeleteCcrRestoreSessionAction.

@Tim-Brooks Tim-Brooks added >non-issue v7.0.0 :Distributed/CCR Issues around the Cross Cluster State Replication features v6.6.0 labels Dec 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@Tim-Brooks
Contributor Author

The CcrRestoreSourceService implements IndexEventListener, but it is not currently registered with the IndicesClusterStateService. It does not look like there is a way to register listeners with that service; the built-in listeners are simply constructed in place:

        this.buildInIndexListener =
                Arrays.asList(
                        peerRecoverySourceService,
                        recoveryTargetService,
                        searchService,
                        syncedFlushService,
                        snapshotShardsService);

You can register listeners with indexes in an ad-hoc manner, but there may be some catches with that (for example, it looks like the listener has to be registered when the IndexService is created). So I guess I'm saying we might need to talk about how to get the CcrRestoreSourceService into the IndicesClusterStateService.

@ywelsch
Contributor

ywelsch commented Dec 13, 2018

The CcrRestoreSourceService implements IndexEventListener, however it is not currently registered with the IndicesClusterStateService. It looks like there is not a way to register listeners with that

The Plugin class offers a hook to register IndexEventListeners on index creation, using the following method:

@Override
public void onIndexModule(IndexModule indexModule) {
  indexModule.addIndexEventListener(yourSingletonListenerInstance);
}

The MockIndexEventListener.TestPlugin class should provide a good example.

Contributor

@ywelsch ywelsch left a comment


I've added some initial thoughts.

public static class TransportDeleteCcrRestoreSessionAction
extends TransportSingleShardAction<DeleteCcrRestoreSessionRequest, DeleteCcrRestoreSessionResponse> {

private final IndicesService indicesService;
Contributor

perhaps it's nicer to have CcrRestoreSourceService have a reference to IndicesService instead of having it here in the TransportAction class.

Contributor Author

I'm not sure how to do this? CcrRestoreSourceService is created in createComponents. And we do not have IndicesService there.

Contributor

I see two other options to the current one:

  1. Pass IndicesService to createComponents.
  2. Create CcrRestoreSourceService using Guice, by overriding Collection<Module> createGuiceModules().

Neither sounds really great so let's keep the current model for now.

Client remoteClient = client.getRemoteClusterClient(remoteClusterAlias);
String sessionUUID = UUIDs.randomBase64UUID();
PutCcrRestoreSessionAction.PutCcrRestoreSessionResponse response = remoteClient.execute(PutCcrRestoreSessionAction.INSTANCE,
new PutCcrRestoreSessionRequest(sessionUUID, shardId, recoveryMetadata)).actionGet();
Contributor

should we have timeouts on these calls (similar as we do for peer recovery within a cluster)? Perhaps something to mark as a follow-up item?

Contributor Author

I added a todo. I will also add timeout tasks to the meta issue.

Client remoteClient = client.getRemoteClusterClient(remoteClusterAlias);
String sessionUUID = UUIDs.randomBase64UUID();
PutCcrRestoreSessionAction.PutCcrRestoreSessionResponse response = remoteClient.execute(PutCcrRestoreSessionAction.INSTANCE,
new PutCcrRestoreSessionRequest(sessionUUID, shardId, recoveryMetadata)).actionGet();
Contributor

we can derive the correct remote shard id by using indexShard.indexSettings().getIndexMetaData().getCustomData(Ccr.CCR_CUSTOM_METADATA_KEY)

Contributor Author

Done. Thanks.

Engine.IndexCommitRef commit;
if (onGoingRestores.containsKey(sessionUUID)) {
logger.debug("session [{}] already exists", sessionUUID);
commit = onGoingRestores.get(sessionUUID);
Contributor

should we be so lenient here, or rather reject opening a session which is already supposed to exist? It depends on how we want to handle failures / retries

Contributor Author

I would kind of prefer the put to be idempotent. This is also why the session UUID is generated on the follower node. I'll explain more about my design in a top-level comment. Maybe we should add validation to prevent (unlikely) UUID conflicts (ensure that the put session request comes from the same follower node)?
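
A minimal sketch of what idempotent put-session semantics with a follower-node check could look like (SessionRegistry, openSession, and the node-id validation are illustrative names, not the PR's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative model of an idempotent "put session" with UUID-conflict
// validation; not the actual CcrRestoreSourceService implementation.
class SessionRegistry {
    // sessionUUID -> id of the follower node that opened the session
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    /**
     * Returns true if a new session was opened, false for an idempotent
     * retry from the same follower. Throws if a different node presents
     * the same UUID (the unlikely conflict mentioned above).
     */
    boolean openSession(String sessionUUID, String followerNodeId) {
        String existing = sessions.putIfAbsent(sessionUUID, followerNodeId);
        if (existing == null) {
            return true;  // newly opened
        }
        if (existing.equals(followerNodeId)) {
            return false; // retry of an already-open session: no-op
        }
        throw new IllegalStateException(
            "session [" + sessionUUID + "] already opened by node [" + existing + "]");
    }
}
```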

logger.debug("session [{}] already exists", sessionUUID);
commit = onGoingRestores.get(sessionUUID);
} else {
commit = indexShard.acquireSafeIndexCommit();
Contributor

if anything goes wrong in this method later, should we release the index commit?

Contributor Author

I made some changes to release the commit. However, I think local timeouts for held index commits should also be a future meta task, which would help here as well.
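
The release-on-failure pattern being discussed can be sketched generically; CommitRef below is a stand-in for Engine.IndexCommitRef and openWithCleanup for the session-opening code, so all names here are hypothetical:

```java
// Generic sketch of "release the commit if anything later in the method
// throws"; CommitRef stands in for Engine.IndexCommitRef.
class CommitGuard {
    static class CommitRef implements AutoCloseable {
        boolean released = false;
        @Override
        public void close() { released = true; }
    }

    static CommitRef lastAcquired; // exposed so the cleanup can be observed

    /**
     * Acquires a commit reference, runs the remaining session setup, and
     * releases the reference if that setup throws, so it cannot leak.
     */
    static CommitRef openWithCleanup(Runnable restOfSetup) {
        CommitRef commit = new CommitRef(); // stand-in for acquireSafeIndexCommit()
        lastAcquired = commit;
        try {
            restOfSetup.run();
            return commit;
        } catch (RuntimeException e) {
            commit.close(); // release on failure
            throw e;
        }
    }
}
```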

@Tim-Brooks
Contributor Author

Thanks @ywelsch, I've made changes. Here are my high-level design thoughts:

  1. A put session request opens a session using a UUID generated on the follower node. It includes the follower's store metadata. This is a transport shard request, so it is handled on the node with the shard.
  2. When the session is opened, the leader node initiates a local timeout for the index commit ref (to prevent it from leaking if this process is disrupted).
  3. The leader responds with the identical files (which the follower keeps) and the recovery files (which the follower fetches).
  4. The follower starts fetching files using a TransportNodesAction type of request.
  5. If the index commit times out on the leader side, or afterIndexShardClosed is called, the leader will respond to the file chunk request with something like SESSION_NOT_FOUND.
  6. If the follower's local timeout (not yet implemented) has not yet expired, it can go back to step 1.
  7. Once the follower has recovered all of the files through file chunk requests, it sends a ClearCcrRestoreSessionAction request.

In this model the CcrRestoreSourceService acts as a kind of cache for the Engine.IndexCommitRef. Obviously, if a new Engine.IndexCommitRef is acquired (because the "cache" timed out on the leader), there might be new files the follower needs to recover or delete, but the leader sends this information in the response to the put session request. The follower can make new put session requests for as long as it wants; the process continues until the follower locally times out or some other error is encountered.
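
Steps 2 and 5 above amount to a small expiring cache on the leader. A toy model of that behaviour (SessionCache, lookup, and the millisecond timestamps are illustrative, not the PR's API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the leader-side session "cache": sessions expire after a
// deadline, and requests against an expired or unknown session see the
// SESSION_NOT_FOUND case (empty result), prompting the follower to re-open.
class SessionCache {
    private final Map<String, Long> expiryBySession = new ConcurrentHashMap<>();

    void open(String sessionUUID, long nowMillis, long timeoutMillis) {
        expiryBySession.put(sessionUUID, nowMillis + timeoutMillis);
    }

    /**
     * Returns the session UUID if the session is still live, or empty
     * (the SESSION_NOT_FOUND case) if it expired or was never opened.
     * Expired sessions are evicted; the real service would release the
     * Engine.IndexCommitRef at that point.
     */
    Optional<String> lookup(String sessionUUID, long nowMillis) {
        Long expiry = expiryBySession.get(sessionUUID);
        if (expiry == null) {
            return Optional.empty();
        }
        if (nowMillis >= expiry) {
            expiryBySession.remove(sessionUUID);
            return Optional.empty();
        }
        return Optional.of(sessionUUID);
    }
}
```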

Contributor

@ywelsch ywelsch left a comment

I've left mostly smaller comments. Thanks for the high-level design description. Can you also outline how you want to handle (temporary) network disconnects?

return new ClearCcrRestoreSessionResponse();
}

public static class TransportDeleteCcrRestoreSessionAction extends TransportNodesAction<ClearCcrRestoreSessionRequest,
Contributor

TransportNodesAction is only truly useful if you intend on sending something to multiple nodes. I think it might be simpler here to directly use HandledTransportAction?

Contributor Author

I think I would like to implement a specific node request for the file chunks and delete session in a follow-up. I added a meta task.


public static class PutCcrRestoreSessionResponse extends ActionResponse {

private String nodeId;
Contributor

can this be made final? I see that you implemented both a constructor with StreamInput and the readFrom method?

Contributor Author

I don't think so. Unfortunately you must implement this:

        @Override
        protected PutCcrRestoreSessionResponse newResponse() {
            return new PutCcrRestoreSessionResponse();
        }

on TransportSingleShardAction.


Map<String, String> ccrMetaData = indexShard.indexSettings().getIndexMetaData().getCustomData(Ccr.CCR_CUSTOM_METADATA_KEY);
String leaderUUID = ccrMetaData.get(Ccr.CCR_CUSTOM_METADATA_LEADER_INDEX_UUID_KEY);
ShardId leaderShardId = new ShardId(shardId.getIndexName(), leaderUUID, shardId.getId());
Contributor

do we need to get the leader index name from Ccr.CCR_CUSTOM_METADATA_LEADER_INDEX_NAME_KEY?

Contributor Author

No, we don't need that, because the index name provided by the args to restoreShard is correct. It is only the UUID that is not.

ClearCcrRestoreSessionAction.ClearCcrRestoreSessionResponse response =
remoteClient.execute(ClearCcrRestoreSessionAction.INSTANCE, clearRequest).actionGet();
if (response.hasFailures()) {
throw response.failures().get(0);
Contributor

by not making this a BaseNodesResponse, we will not need this weird unwrapping.

Contributor Author

I think I would like to implement a specific node request for the file chunks and delete session in a follow-up. I added a meta task.

@Tim-Brooks
Contributor Author

@ywelsch - I think I would like to implement the mechanism to direct a request to a specific node on the remote cluster in a follow-up. I added a task for that on the meta issue.

}

private void removeSessionForShard(String sessionUUID, IndexShard indexShard) {
logger.debug("closing session [{}] for shard [{}]", sessionUUID, indexShard);
Contributor

IndexShard does not have a toString implementation AFAICS

} else {
logger.debug("opening session [{}] for shard [{}]", sessionUUID, indexShard);
if (indexShard.state() == IndexShardState.CLOSED) {
throw new IllegalIndexShardStateException(indexShard.shardId(), IndexShardState.CLOSED,
Contributor

preferably throw IndexShardClosedException

@Tim-Brooks Tim-Brooks merged commit 1fa1056 into elastic:master Dec 18, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Dec 18, 2018
* elastic/master: (31 commits)
  enable bwc tests and switch transport serialization version to 6.6.0 for CAS features
  [DOCs] Adds ml-cpp PRs to alpha release notes (elastic#36790)
  Synchronize WriteReplicaResult callbacks (elastic#36770)
  Add CcrRestoreSourceService to track sessions (elastic#36578)
  [Painless] Add tests for boxed return types (elastic#36747)
  Internal: Remove originalSettings from Node (elastic#36569)
  [ILM][DOCS] Update ILM API authorization docs (elastic#36749)
  Core: Deprecate use of scientific notation in epoch time parsing (elastic#36691)
  [ML] Merge the Jindex master feature branch (elastic#36702)
  Tests: Mute SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization
  Update versions in SearchSortValues transport serialization
  Update version in SearchHits transport serialization
  [Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (elastic#36751)
  [Docs] Fix error in Common Grams Token Filter (elastic#36774)
  Fix rollup search statistics (elastic#36674)
  SQL: Fix wrong appliance of StackOverflow limit for IN (elastic#36724)
  [TEST] Added more logging
  Invalidate Token API enhancements - HLRC (elastic#36362)
  Deprecate types in index API (elastic#36575)
  Disable bwc tests until elastic#36555 backport is complete (elastic#36737)
  ...
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request Dec 20, 2018
This commit is related to elastic#36127. It adds a CcrRestoreSourceService to
track the Engine.IndexCommitRef instances needed for in-process file
restores. When a follower starts restoring a shard through the
CcrRepository, it opens a session with the leader through the
PutCcrRestoreSessionAction. The leader responds to the request by telling
the follower what files it needs to fetch for the restore (this part is
not yet implemented).

Once the restore is complete, the follower closes the session with the
DeleteCcrRestoreSessionAction.
Tim-Brooks added a commit that referenced this pull request Dec 20, 2018
This commit is related to #36127. It adds a CcrRestoreSourceService to
track the Engine.IndexCommitRef instances needed for in-process file
restores. When a follower starts restoring a shard through the
CcrRepository, it opens a session with the leader through the
PutCcrRestoreSessionAction. The leader responds to the request by telling
the follower what files it needs to fetch for the restore (this part is
not yet implemented).

Once the restore is complete, the follower closes the session with the
DeleteCcrRestoreSessionAction.
@Tim-Brooks Tim-Brooks deleted the add_ccr_restore_source_service branch December 18, 2019 14:46