Propagate max_auto_id_timestamp in peer recovery #33693
Conversation
Today we don't store the auto-generated timestamp of append-only operations in Lucene, and we assign -1 to every index operation constructed from LuceneChangesSnapshot. This looks innocent, but it generates duplicate documents on a replica if a retry append-only arrives first via peer recovery and then the original append-only arrives via replication. Since the retry append-only (delivered via recovery) does not have a timestamp, the replica will happily optimize the original request when it should not. This change transmits the max auto-generated timestamp from the primary to replicas before the translog phase of peer recovery. This timestamp prevents replicas from optimizing append-only requests whose retry counterparts have already been processed.
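The replica-side guard this PR relies on can be sketched roughly as follows. This is an illustrative model, not the actual InternalEngine code; the class and method names (`AppendOnlyGuardSketch`, `mayOptimize`, `advance`) are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch (hypothetical names; the real logic lives in
// InternalEngine). An append-only op may skip the duplicate lookup only if
// its auto-generated timestamp is above the "unsafe" marker.
class AppendOnlyGuardSketch {
    private final AtomicLong maxUnsafeAutoIdTimestamp = new AtomicLong(-1);

    // Called with the max timestamp the primary ships before the translog
    // phase of peer recovery, and for every retry op processed locally.
    void advance(long timestamp) {
        maxUnsafeAutoIdTimestamp.updateAndGet(curr -> Math.max(curr, timestamp));
    }

    // true = the op can be indexed as a plain "add" without a dedup lookup
    boolean mayOptimize(long opAutoIdTimestamp, boolean isRetry) {
        if (isRetry) {
            advance(opAutoIdTimestamp);
            return false;
        }
        return opAutoIdTimestamp > maxUnsafeAutoIdTimestamp.get();
    }
}
```

The key consequence: once the primary's max timestamp has been shipped and applied, any original op with a timestamp at or below that value falls back to the safe dedup path.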
Pinging @elastic/es-distributed
We also need to transmit the timestamp to replicas in the primary-replica resync. I will do that in a separate PR.
@@ -201,6 +201,8 @@ public RecoveryResponse recoverToTarget() throws IOException {
    runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()),
        shardId + " initiating tracking of " + request.targetAllocationId(), shard, cancellableThreads, logger);

// DISCUSS: Is it possible for an operation to be delivered via recovery first, then delivered via replication?
Yes, I think this is possible. We add the target to the replication group, and then collect the operations from the translog (or Lucene with soft deletes) to send to the target. It's possible that an operation can arrive on the primary, enter the translog on the primary, and then an evil OS scheduler puts the thread handling the replication to the target to sleep. At this moment recovery can execute, copying the operation to the target. Then our thread can wake up and the operation arrives by replication.
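The interleaving described above can be replayed sequentially in a small sketch. This is a hypothetical, simplified simulation (not Elasticsearch code): the same op reaches the replica first via recovery with its stored timestamp lost (-1), and then via replication as the original:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sequential replay of the race: returns true if the replica
// would (incorrectly) apply the append-only optimization to an original op
// whose copy it already received via recovery.
class RecoveryRaceSketch {
    static boolean replicaOptimizesDuplicate(boolean primaryShipsMaxTimestamp) {
        AtomicLong maxUnsafeOnReplica = new AtomicLong(-1);
        long originalTs = 42;  // timestamp assigned on the primary
        // Recovery delivers the op; its stored timestamp is -1 (lost in
        // Lucene), so this delivery alone cannot raise the unsafe marker.
        long recoveredTs = -1;
        maxUnsafeOnReplica.updateAndGet(c -> Math.max(c, recoveredTs));
        if (primaryShipsMaxTimestamp) {
            // The fix: primary sends its max seen timestamp before the
            // translog phase, capping what the replica may optimize.
            maxUnsafeOnReplica.updateAndGet(c -> Math.max(c, originalTs));
        }
        // Replication now delivers the original op.
        return originalTs > maxUnsafeOnReplica.get();
    }
}
```

Without the shipped timestamp the original op looks optimizable and a duplicate document is created; with it, the replica falls back to the dedup check.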
Thanks @jasontedor. I had the same thought but was not sure because I failed to come up with a test. Now I have an idea for how to write that test.
We have a few sinister tests along these lines, where we latch operations in the engine to stall them for nefarious purposes. 😇
Yeah, I have a test now. However, this corner case is protected by SeqNo. We are all good now.
elasticsearch/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Line 871 in 0b4960f
if (appendOnlyRequest && mayHaveBeenIndexedBefore(index) == false && index.seqNo() > maxSeqNoOfNonAppendOnlyOperations.get()) {
LGTM, left a suggestion.
@@ -2531,4 +2530,16 @@ void updateRefreshedCheckpoint(long checkpoint) {
        assert refreshedCheckpoint.get() >= checkpoint : refreshedCheckpoint.get() + " < " + checkpoint;
    }
}

@Override
public long getMaxAutoIdTimestamp() {
maybe just push the maxUnsafeAutoIdTimestamp up to engine and make the methods final
I wonder, if related to this change, we should also stop storing the autoid timestamp on a per-operation basis in the translog and instead just put the max timestamp into the checkpoint file. This would somehow realign Lucene and translog, where translog is just the stuff that's not fsynced to Lucene yet.
final CancellableThreads.IOInterruptable sendBatch =
    () -> targetLocalCheckpoint.set(recoveryTarget.indexTranslogOperations(operations, expectedTotalOps));
final CancellableThreads.IOInterruptable sendBatch = () ->
    targetLocalCheckpoint.set(recoveryTarget.indexTranslogOperations(operations, expectedTotalOps, shard.getMaxAutoIdTimestamp()));
Instead of using a new one for every batch that is to be sent, I would prefer to capture this after we call cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo)); in RecoverySourceHandler, and then only pass that same value. You could also add a comment then and there saying why we do it.
We have to do this after the snapshot was captured. That said, I'm +1 on explicitly capturing it once at the right moment and use the same value.
I think we have to be super careful with naming, javadocs and comments to explain exactly what we are doing and why. If we don't do this, it will easily lead to bugs, as this is super tricky.

For example - and I might be missing something here - I don't think the implementation does what we intend it to do. The semantics we're after is: since any of the ops we ship may collide with an optimized write, we need to mark them as retry and make sure that after an op was indexed, the maxUnsafeTimestamp marker on the target engine is at least as high as the original timestamp of the operation we're indexing. Since we don't know what the latter is (as we don't store it), we planned to use the maximum timestamp of any append-only operation indexed by the source engine when the snapshot was captured. That doesn't match the maxUnsafeAutoIdTimestamp.get() returned by getMaxAutoIdTimestamp(): maxUnsafeAutoIdTimestamp only tracks append-only ops that were marked as retry.

I also agree with @ywelsch that if we change semantics for the ops coming from Lucene, we should also change semantics for the ops from the translog (i.e., store recovery should also just set a maximum value as unsafe when it starts) and change the translog to not store individual ops' timestamps. This means that all recoveries (local or remote) work the same and we don't have to keep two models in our heads.
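The distinction drawn above can be made concrete with a minimal sketch (hypothetical, simplified field names mirroring the discussion): one marker advances on every append-only op the engine sees, the other only on ops that may be duplicates, and it is the former that must be shipped to the target:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the two markers under discussion (not the real engine code):
// maxSeen advances on EVERY append-only op; maxUnsafe only on ops that may
// be duplicates (retries). The value shipped in recovery must be maxSeen,
// since any shipped op may collide with an already-optimized write.
class TimestampMarkersSketch {
    final AtomicLong maxSeenAutoIdTimestamp = new AtomicLong(-1);
    final AtomicLong maxUnsafeAutoIdTimestamp = new AtomicLong(-1);

    void onAppendOnly(long timestamp, boolean mayBeDuplicate) {
        maxSeenAutoIdTimestamp.updateAndGet(c -> Math.max(c, timestamp));
        if (mayBeDuplicate) {
            maxUnsafeAutoIdTimestamp.updateAndGet(c -> Math.max(c, timestamp));
        }
    }
}
```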
}

@Override
public void updateMaxAutoIdTimestamp(long newTimestamp) {
nit - updateMaxUnsafeAutoIdTimestamp
public abstract long getMaxAutoIdTimestamp();

/**
 * Sets the maximum auto-generated timestamp of append-only requests tracked by this engine to {@code newTimestamp}.
I think we want to speak about updating the unsafe marker here?
I will make a follow-up to remove auto_id_timestamp from translog's operations.
Looking good. I left some minor comments.
@@ -1690,6 +1691,21 @@ public boolean isRecovering() {
 */
public abstract void maybePruneDeletes();

/**
 * Returns the maximum auto_id_timestamp of all append-only have been processed (or force-updated) by this engine.
what is force updated?
I see now - can you link to the method in the java docs?
/**
 * Tracks auto_id_timestamp of append-only index requests have been processed in an {@link Engine}.
 */
final class AutoIdTimestamp {
do we really need a component for this? it only hides very trivial behavior without much added value?
There are two reasons I added this class:
- Make sure that we always update both markers when we have a new timestamp
- Having two bare markers in an Engine may be confusing

I am fine with removing it (e6a929a).
 * Forces this engine to advance its max_unsafe_auto_id_timestamp marker to at least the given timestamp.
 * The engine will disable optimization for all append-only whose timestamp at most {@code newTimestamp}.
 */
public abstract void forceUpdateMaxUnsafeAutoIdTimestamp(long newTimestamp);
force suggests to me that whatever value you give this method will be set as the new value. It seems that the intended semantics are different. Maybe just call this updateMaxUnsafe...
@@ -141,10 +142,80 @@ public void cleanFiles(int totalTranslogOps, Store.MetadataSnapshot sourceMetaDa
    }
}

public void testRetryAppendOnlyWhileRecovering() throws Exception {
I think you mean after recovering here.
@@ -428,6 +441,12 @@ public synchronized void flush() {
public synchronized void close() throws Exception {
    if (closed == false) {
        closed = true;
        for (IndexShard replica : replicas) {
            try {
                assertThat(replica.getMaxSeenAutoIdTimestamp(), equalTo(primary.getMaxSeenAutoIdTimestamp()));
💯
# Conflicts: # server/src/main/java/org/elasticsearch/index/engine/ReadOnlyEngine.java
@bleskes I've addressed your comments. Could you please have another look? Thank you!
LGTM. I sadly think this is not enough, and we need a follow-up to put this information into Lucene so it will survive a restart of the primary.
private void updateAutoIdTimestamp(long newTimestamp, boolean unsafe) {
    assert newTimestamp >= -1 : "invalid timestamp [" + newTimestamp + "]";
    maxSeenAutoIdTimestamp.updateAndGet(curr -> Math.max(curr, newTimestamp));
    assert maxSeenAutoIdTimestamp.get() >= newTimestamp;
I don't think this adds much value; we just did a max operation with it.
final RecoveryState.Translog translog = state().getTranslog();
translog.totalOperations(totalTranslogOps);
assert indexShard().recoveryState() == state();
if (indexShard().state() != IndexShardState.RECOVERING) {
    throw new IndexShardNotRecoveringException(shardId, indexShard().state());
}
indexShard().updateMaxUnsafeAutoIdTimestamp(maxSeenAutoIdTimestampOnPrimary);
Can you add a comment as to why we set this, and how we don't know what timestamp is associated with the operation, so we use an upper bound?
Thank you so much for reviewing @bleskes, @s1monw, @ywelsch and @jasontedor.
A follow-up of elastic#33693 to propagate max_seen_auto_id_timestamp in a primary-replica resync.
Today we don't store the auto-generated timestamp of append-only
operations in Lucene; and assign -1 to every index operations
constructed from LuceneChangesSnapshot. This looks innocent but it
generates duplicate documents on a replica if a retry append-only
arrives first via peer-recovery; then an original append-only arrives
via replication. Since the retry append-only (delivered via recovery)
does not have timestamp, the replica will happily optimize the original
request while it should not.
This change transmits the max auto-generated timestamp from the primary
to replicas before translog phase in peer recovery. This timestamp will
prevent replicas from optimizing append-only requests if retry
counterparts have been processed.
Relates #33656
Relates #33222