
Prune only gc deletes below the local checkpoint #28790

Merged
merged 15 commits into elastic:master from gc-delete on Mar 26, 2018
Conversation

@dnhatn dnhatn (Member) commented Feb 22, 2018

Once a document is deleted and Lucene is refreshed, we will not be able to look up the version/seq# associated with that delete in Lucene. As conflicting operations can still be indexed, we need another mechanism to remember these deletes. Therefore deletes should still be stored in the Version Map, even after Lucene is refreshed. Obviously, we can't remember all deletes forever, so a trimming mechanism is needed. Currently, we remember deletes for at least 1 minute (the default GC deletes cycle) and clean them up periodically. This is, at the moment, the best we can do on the primary for user-facing APIs, but this arbitrary time limit is problematic for replicas. Furthermore, we can't rely on the primary and replicas doing the trimming in a synchronized manner; failing to do so results in the replica and primary making different decisions. The following scenario can cause an inconsistency between primary and replica:

  1. Primary index doc (index, id=1, v2)
  2. Network packet issue causes index operation to back off and wait
  3. Primary deletes doc (delete, id=1, v3)
  4. Replica processes delete (delete, id=1, v3)
  5. 1+ minute passes (GC deletes runs on the replica)
  6. The indexing op is finally sent to the replica, which now processes it because it forgot about the delete.

We can rely on sequence numbers to prevent this issue. If we prune only deletes whose seqno is at most the local checkpoint, a replica will correctly remember what it needs. The correctness argument is as follows:

Suppose o1 and o2 are two operations on the same document with seq#(o1) < seq#(o2), and o2 arrives before o1 on the replica. o2 is processed normally since it arrives first; when o1 arrives it should be discarded:

  • If seq#(o1) <= LCP, then it will not be added to Lucene, as it was already previously added.
  • If seq#(o1) > LCP, then it depends on the nature of o2:
    • If o2 is a delete, then its seq# is recorded in the VersionMap, since seq#(o2) > seq#(o1) > LCP,
      so a lookup can find it and determine that o1 is stale.
    • If o2 is an indexing operation, then its seq# is either in Lucene (if refreshed) or in the VersionMap (if not refreshed yet), so a real-time lookup can find it and determine that o1 is stale.
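Put together, a tombstone may be pruned only when both the time-based and the seqno-based conditions hold. A minimal sketch of that combined rule (the helper name canPrune and its parameters are illustrative, not the exact LiveVersionMap API; the time and seqNo fields match the DeleteVersionValue usage quoted further down):

  // Sketch only: a delete tombstone is safe to prune when BOTH conditions hold:
  //  1. it is older than one full GC-deletes cycle (preserves the user-facing
  //     guarantee on the primary), and
  //  2. its seq# is at or below the local checkpoint, so no conflicting
  //     out-of-order operation can still arrive on a replica.
  static boolean canPrune(DeleteVersionValue tombstone, long nowMillis,
                          long gcDeletesMillis, long localCheckpoint) {
      final boolean oldEnough = tombstone.time < nowMillis - gcDeletesMillis;
      final boolean belowLocalCheckpoint = tombstone.seqNo <= localCheckpoint;
      return oldEnough && belowLocalCheckpoint;
  }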

In this PR, we prefer to deploy a single trimming strategy, which satisfies both requirements, on the primary and replicas because:

  • It's simpler - no need to distinguish whether an engine is running in primary mode or replica mode, or is being promoted.
  • If a replica is subsequently promoted, the user experience is fully maintained, as that replica remembers deletes for the last GC cycle.

However, the version map may consume less memory if we deploy two different trimming strategies for primary and replicas.

@dnhatn dnhatn added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.3.0 :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 22, 2018
@dnhatn dnhatn removed the :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. label Feb 22, 2018
@bleskes bleskes (Contributor) left a comment

Thx @dnhatn. I left some questions on the tests. Otherwise looks good. Like #28787, let's wait for @DaveCTurner.

Also - I'm not comfortable with the fact that we skip the local checkpoint check (i.e., not adding to Lucene if seq# <= local checkpoint) for optimized operations. It's technically correct, but I'd rather have that as a hard rule that's visible in the code. Can you do a follow-up for that?

equalTo(Sets.union(trimmedDeletes, rememberedDeletes)));
engine.refresh("test");
// Only prune deletes below the local checkpoint.
engine.maybePruneDeletes();
Contributor

how does this relate to the clock? also, refresh already maybe-prunes deletes

@dnhatn dnhatn (Member, Author) commented Mar 4, 2018

@bleskes You're correct. We don't need a manual clock here. I've removed the clock and also the prune call. I will make a follow-up as you suggested.

@dnhatn dnhatn (Member, Author) commented Mar 9, 2018

@bleskes I've combined the time-based and sequence-based conditions in testPruneOnlyDeletesAtMostLocalCheckpoint as we discussed. Can you take a look? Thank you.

@bleskes bleskes (Contributor) left a comment

Barring discussions with @DaveCTurner about the TLA+ model, this LGTM. I left some nits that don't need another review.

@@ -4572,4 +4577,67 @@ public void testStressUpdateSameDocWhileGettingIt() throws IOException, Interrup
}
}
}

public void testPruneOnlyDeletesAtMostLocalCheckpoint() throws Exception {
IOUtils.close(engine, store);
Contributor

Can we try not to mess with the class's engine? I'd rather have a local one enclosed in a try-with-resources.

Member Author

Done

clock.set(randomLongBetween(gcInterval, deleteBatch + gcInterval));
engine.refresh("test");
tombstones.removeIf(v -> v.seqNo < gapSeqNo && v.time < clock.get() - gcInterval);
assertThat(engine.getDeletedTombstones(), containsInAnyOrder(tombstones.toArray()));
Contributor

don't we need to check the size too?

Member Author

We don't have to, as containsInAnyOrder also does a size check.
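For illustration, a tiny standalone example of that Hamcrest behaviour (assumed demo class, not from the PR):

  import static org.hamcrest.MatcherAssert.assertThat;
  import static org.hamcrest.Matchers.containsInAnyOrder;

  import java.util.Arrays;

  class ContainsInAnyOrderDemo {
      public static void main(String[] args) {
          // containsInAnyOrder fails on missing OR extra elements,
          // so it implies an exact size check:
          assertThat(Arrays.asList(1, 2), containsInAnyOrder(2, 1));       // passes
          // assertThat(Arrays.asList(1, 2, 3), containsInAnyOrder(2, 1)); // fails: extra element
      }
  }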

shards.startAll();
final IndexShard primary = shards.getPrimary();
final IndexShard replica = shards.getReplicas().get(0);
final TimeValue gcInterval = TimeValue.timeValueMillis(scaledRandomIntBetween(1, 1000));
Contributor

I think we can just set this to something very small (10ms?) and also set ThreadPool#ESTIMATED_TIME_INTERVAL_SETTING to 0?

Member Author

done
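For reference, a sketch of the suggested wiring (both setting constants exist in the Elasticsearch codebase; the exact builder usage here is an assumption, not the merged test code):

  // Shrink the GC-deletes cycle and disable cached-clock granularity so
  // time-based pruning becomes observable within a fast test.
  Settings settings = Settings.builder()
      .put(IndexSettings.INDEX_GC_DELETES_SETTING.getKey(), "10ms")     // tiny GC cycle
      .put(ThreadPool.ESTIMATED_TIME_INTERVAL_SETTING.getKey(), "0ms")  // exact clock reads
      .build();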

final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune, DeleteVersionValue versionValue) {
// check if the value is old enough and safe to be removed
final boolean isTooOld = versionValue.time < maxTimestampToPrune;
Contributor

Why not <=?

Member Author

Because we would like to keep delete tombstones for at least one GC cycle.

-        final boolean isTooOld = currentTime - versionValue.time > pruneInterval;
+        final boolean isTooOld = versionValue.time < maxTimestampToPrune;
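For context, a sketch of how the full check plausibly reads with the seqno clause included (the time clause is quoted above; the seqno clause is an assumption inferred from the method's maxSeqNoToPrune parameter):

  private boolean canRemoveTombstone(long maxTimestampToPrune, long maxSeqNoToPrune,
                                     DeleteVersionValue versionValue) {
      // Strict '<' rather than '<=': a tombstone stamped exactly at the cut-off
      // survives, so every delete is remembered for at least one full GC cycle.
      final boolean isTooOld = versionValue.time < maxTimestampToPrune;
      // Assumed clause: prune only once the local checkpoint has advanced past
      // the delete, i.e. no conflicting out-of-order operation can still arrive.
      final boolean isSafeToPrune = versionValue.seqNo <= maxSeqNoToPrune;
      return isTooOld && isSafeToPrune;
  }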

@s1monw s1monw (Contributor) left a comment

LGTM too!

DaveCTurner added a commit to DaveCTurner/elasticsearch-formal-models that referenced this pull request Mar 26, 2018
@DaveCTurner DaveCTurner (Contributor) left a comment

LGTM.

@dnhatn dnhatn (Member, Author) commented Mar 26, 2018

Thanks @bleskes, @s1monw and @DaveCTurner for reviewing.

@dnhatn dnhatn merged commit 8795760 into elastic:master Mar 26, 2018
@dnhatn dnhatn deleted the gc-delete branch March 26, 2018 17:42
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master:
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 27, 2018
* master: (40 commits)
  Do not optimize append-only if seen normal op with higher seqno (elastic#28787)
  [test] packaging: gradle tasks for groovy tests (elastic#29046)
  Prune only gc deletes below local checkpoint (elastic#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  elastic#28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (elastic#29156)
  Add file permissions checks to precommit task
  Remove execute mode bit from source files
  Optimize the composite aggregation for match_all and range queries (elastic#28745)
  [Docs] Add rank_eval size parameter k (elastic#29218)
  [DOCS] Remove ignore_z_value parameter link
  Docs: Update docs/index_.asciidoc (elastic#29172)
  Docs: Link C++ client lib elasticlient (elastic#28949)
  [DOCS] Unregister repository instead of deleting it (elastic#29206)
  Docs: HighLevelRestClient#multiSearch (elastic#29144)
  Add Z value support to geo_shape
  Remove type casts in logging in server component (elastic#28807)
  Change BroadcastResponse from ToXContentFragment to ToXContentObject (elastic#28878)
  REST : Split `RestUpgradeAction` into two actions (elastic#29124)
  Add error file docs to important settings
  ...
martijnvg added a commit that referenced this pull request Mar 28, 2018
* es/master: (22 commits)
  Fix building Javadoc JARs on JDK for client JARs (#29274)
  Require JDK 10 to build Elasticsearch (#29174)
  Decouple NamedXContentRegistry from ElasticsearchException (#29253)
  Docs: Update generating test coverage reports (#29255)
  [TEST] Fix issue with HttpInfo passed invalid parameter
  Remove all dependencies from XContentBuilder (#29225)
  Fix sporadic failure in CompositeValuesCollectorQueueTests
  Propagate ignore_unmapped to inner_hits (#29261)
  TEST: Increase timeout for testPrimaryReplicaResyncFailed
  REST client: hosts marked dead for the first time should not be immediately retried (#29230)
  TEST: Use different translog dir for a new engine
  Make SearchStats implement Writeable (#29258)
  [Docs] Spelling and grammar changes to reindex.asciidoc (#29232)
  Do not optimize append-only if seen normal op with higher seqno (#28787)
  [test] packaging: gradle tasks for groovy tests (#29046)
  Prune only gc deletes below local checkpoint (#28790)
  remove testUnassignedShardAndEmptyNodesInRoutingTable
  #28745: remove extra option in the composite rest tests
  Fold EngineDiskUtils into Store, for better lock semantics (#29156)
  Add file permissions checks to precommit task
  ...
DaveCTurner added a commit to elastic/elasticsearch-formal-models that referenced this pull request Mar 28, 2018
This models how indexing and deletion operations are handled on the replica,
including the optimisations for append-only operations and the interaction with
Lucene commits and the version map.

It incorporates

- elastic/elasticsearch#28787
- elastic/elasticsearch#28790
- elastic/elasticsearch#29276
- a proposal to always prune tombstones
dnhatn added a commit that referenced this pull request Mar 28, 2018