
Prune only gc deletes below the local checkpoint #28790

Merged
merged 15 commits into elastic:master from gc-delete on Mar 26, 2018
Conversation

@dnhatn dnhatn (Member) commented Feb 22, 2018

Once a document is deleted and Lucene is refreshed, we will not be able to look up the version/seq# associated with that delete in Lucene. As conflicting operations can still be indexed, we need another mechanism to remember these deletes. Therefore deletes should still be stored in the Version Map, even after Lucene is refreshed. Obviously, we can't remember all deletes forever, so a trimming mechanism is needed. Currently, we remember deletes for at least 1 minute (the default GC deletes cycle) and clean them up periodically. This is, at the moment, the best we can do on the primary for user-facing APIs, but this arbitrary time limit is problematic for replicas. Furthermore, we can't rely on the primary and replicas doing the trimming in a synchronized manner; failing to do so results in the replica and primary making different decisions. The following scenario can cause an inconsistency between primary and replica:

  1. Primary index doc (index, id=1, v2)
  2. Network packet issue causes index operation to back off and wait
  3. Primary deletes doc (delete, id=1, v3)
  4. Replica processes delete (delete, id=1, v3)
  5. 1+ minute passes (GC deletes runs on the replica)
  6. The indexing op is finally sent to the replica, which now processes it because it forgot about the delete.

We can rely on sequence numbers to prevent this issue. If we prune only deletes whose seqno is at most the local checkpoint, a replica will correctly remember what it needs. The correctness argument is as follows:

Suppose o1 and o2 are two operations on the same document with seq#(o1) < seq#(o2), and o2 arrives before o1 on the replica. o2 is processed normally since it arrives first; when o1 arrives it should be discarded:

  • If seq#(o1) <= LCP, then it will not be added to Lucene, as it was already previously added.
  • If seq#(o1) > LCP, then it depends on the nature of o2:
    • If o2 is a delete, then its seq# is recorded in the VersionMap, since seq#(o2) > seq#(o1) > LCP,
      so a lookup can find it and determine that o1 is stale.
    • If o2 is an indexing operation, then its seq# is either in Lucene (if refreshed) or in the VersionMap (if not refreshed yet), so a real-time lookup can find it and determine that o1 is stale.
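Put together, a tombstone may be pruned only when both the time-based and the seqno-based conditions hold. A minimal sketch of that combined rule (the helper name canPrune and its parameters are illustrative, not the exact LiveVersionMap API; the time and seqNo fields match the DeleteVersionValue usage quoted further down):

  // Sketch only: a delete tombstone is safe to prune when BOTH conditions hold:
  //  1. it is older than one full GC-deletes cycle (preserves the user-facing
  //     guarantee on the primary), and
  //  2. its seq# is at or below the local checkpoint, so no conflicting
  //     out-of-order operation can still arrive on a replica.
  static boolean canPrune(DeleteVersionValue tombstone, long nowMillis,
                          long gcDeletesMillis, long localCheckpoint) {
      final boolean oldEnough = tombstone.time < nowMillis - gcDeletesMillis;
      final boolean belowLocalCheckpoint = tombstone.seqNo <= localCheckpoint;
      return oldEnough && belowLocalCheckpoint;
  }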

In this PR, we prefer to deploy a single trimming strategy, which satisfies both requirements, on the primary and replicas because:

  • It's simpler - no need to distinguish whether an engine is running in primary mode or replica mode, or is being promoted.
  • If a replica is subsequently promoted, the user experience is fully maintained, as that replica remembers deletes for the last GC cycle.

However, the version map may consume less memory if we deploy two different trimming strategies for primary and replicas.

@dnhatn dnhatn added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.3.0 :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 22, 2018
@dnhatn dnhatn removed the :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. label Feb 22, 2018
@bleskes bleskes (Contributor) left a comment

Thx @dnhatn. I left some questions on the tests. Otherwise looks good. Like #28787, let's wait for @DaveCTurner.

Also - I'm not comfortable with the fact that we skip the local checkpoint check (i.e., not adding to Lucene if seq# <= local checkpoint) for optimized operations. It's technically correct, but I'd rather have that as a hard rule that's visible in the code. Can you do a follow-up for that?

equalTo(Sets.union(trimmedDeletes, rememberedDeletes)));
engine.refresh("test");
// Only prune deletes below the local checkpoint.
engine.maybePruneDeletes();
Contributor

how does this relate to the clock? also, refresh already maybe-prunes deletes

@dnhatn dnhatn (Member, Author) commented Mar 4, 2018

@bleskes You're correct. We don't need a manual clock here. I've removed the clock and also the prune call. I will make a follow-up as you suggested.

@dnhatn dnhatn (Member, Author) commented Mar 9, 2018

@bleskes I've combined the time-based and sequence-based conditions in testPruneOnlyDeletesAtMostLocalCheckpoint as we discussed. Can you take a look? Thank you.

@bleskes bleskes (Contributor) left a comment

Barring discussions with @DaveCTurner about the TLA+ model, this LGTM. I left some nits that don't need another review.

@@ -4572,4 +4577,67 @@ public void testStressUpdateSameDocWhileGettingIt() throws IOException, Interrup
}
}
}

public void testPruneOnlyDeletesAtMostLocalCheckpoint() throws Exception {
IOUtils.close(engine, store);
Contributor

Can we try not to mess with the class's engine? I'd rather have a local one enclosed in a try-with-resources.

Member Author

Done

clock.set(randomLongBetween(gcInterval, deleteBatch + gcInterval));
engine.refresh("test");
tombstones.removeIf(v -> v.seqNo < gapSeqNo && v.time < clock.get() - gcInterval);
assertThat(engine.getDeletedTombstones(), containsInAnyOrder(tombstones.toArray()));
Contributor

don't we need to check the size too?

Member Author

We don't have to, as containsInAnyOrder also does a size check.
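For illustration, a tiny standalone example of that Hamcrest behaviour (assumed demo class, not from the PR):

  import static org.hamcrest.MatcherAssert.assertThat;
  import static org.hamcrest.Matchers.containsInAnyOrder;

  import java.util.Arrays;

  class ContainsInAnyOrderDemo {
      public static void main(String[] args) {
          // containsInAnyOrder fails on missing OR extra elements,
          // so it implies an exact size check:
          assertThat(Arrays.asList(1, 2), containsInAnyOrder(2, 1));       // passes
          // assertThat(Arrays.asList(1, 2, 3), containsInAnyOrder(2, 1)); // fails: extra element
      }
  }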

shards.startAll();
final IndexShard primary = shards.getPrimary();
final IndexShard replica = shards.getReplicas().get(0);
final TimeValue gcInterval = TimeValue.timeValueMillis(scaledRandomIntBetween(1, 1000));
Contributor

I think we can just set this to something very small (10ms?) and also set ThreadPool#ESTIMATED_TIME_INTERVAL_SETTING to 0?

Member Author

done
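For reference, a sketch of the suggested wiring (both setting constants exist in the Elasticsearch codebase; the exact builder usage here is an assumption, not the merged test code):

  // Shrink the GC-deletes cycle and disable cached-clock granularity so
  // time-based pruning becomes observable within a fast test.
  Settings settings = Settings.builder()
      .put(IndexSettings.INDEX_GC_DELETES_SETTING.getKey(), "10ms")     // tiny GC cycle
      .put(ThreadPool.ESTIMATED_TIME_INTERVAL_SETTING.getKey(), "0ms")  // exact clock reads
      .build();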

final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune, DeleteVersionValue versionValue) {
// check if the value is old enough and safe to be removed
final boolean isTooOld = versionValue.time < maxTimestampToPrune;
Contributor

Why not <=?

Member Author

Because we would like to keep delete tombstones for at least one GC cycle.

-        final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+        final boolean isTooOld = versionValue.time < maxTimestampToPrune;
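For context, a sketch of how the full check plausibly reads with the seqno clause included (the time clause is quoted above; the seqno clause is an assumption inferred from the method's maxSeqNoToPrune parameter):

  private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune,
                                     DeleteVersionValue versionValue) {
      // Strict '<' rather than '<=': a tombstone stamped exactly at the cut-off
      // survives, so every delete is remembered for at least one full GC cycle.
      final boolean isTooOld = versionValue.time < maxTimestampToPrune;
      // Assumed clause: prune only once the local checkpoint has advanced past
      // the delete, i.e. no conflicting out-of-order operation can still arrive.
      final boolean isSafeToPrune = versionValue.seqNo <= maxSeqNoToPrune;
      return isTooOld && isSafeToPrune;
  }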

@s1monw s1monw (Contributor) left a comment

LGTM too!

DaveCTurner added a commit to DaveCTurner/elasticsearch-formal-models that referenced this pull request Mar 26, 2018
@DaveCTurner DaveCTurner (Contributor) left a comment

LGTM.

@dnhatn dnhatn (Member, Author) commented Mar 26, 2018

Thanks @bleskes, @s1monw and @DaveCTurner for reviewing.

@dnhatn dnhatn merged commit 8795760 into elastic:master Mar 26, 2018
@dnhatn dnhatn deleted the gc-delete branch March 26, 2018 17:42
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
martijnvg added a commit that referenced this pull request Mar 28, 2018
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
dnhatn added a commit that referenced this pull request Mar 28, 2018