Non-zero instance start time #3901
Conversation
force-pushed from 6b4e5e5 to 4ebecda
lgtm overall, I just have a question about a thing that I'm not too familiar with: since you are updating a record, do we have to worry about mixed clusters running one version vs the other? I would assume not, since shards are mostly node local, but there might be some edge case.

I don't think shard records are transmitted between nodes, but it's a good question. @nickva thoughts?
force-pushed from 4ebecda to 47df3ca
I couldn't find any place where we pass the shard record between nodes. However, I had noticed that the suffix is already part of the shard name and is a tad redundant.

And there is a function, mem3:shard_suffix/1, which returns the suffix; perhaps we could use it to get the suffix. We already call it when building an etag in chttpd_db and in a few other places.
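As a rough illustration of that idea, here is a minimal sketch of a suffix-based helper. mem3:shard_suffix/1 is the existing function mentioned above, but shard_creation_time/1, the exact suffix shape, and the return type are assumptions here, not the actual patch:

```erlang
%% Hedged sketch: shard names carry a creation-time suffix such as
%% ".1636741471" (as in shards/00000000-7fffffff/mydb.1636741471).
%% Assuming mem3:shard_suffix/1 returns that suffix as a string,
%% stripping the leading "." yields the timestamp the db was created
%% with, suitable for reporting as instance_start_time.
shard_creation_time(DbName) ->
    [$. | Timestamp] = mem3:shard_suffix(DbName),
    list_to_binary(Timestamp).
```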
Looks good. It's kind of a hack, but 1) it's backward compatible with existing replicators and 2) it's compatible with dbs already created in 2.x/3.x.

Ideally, we'd implement a db uuid field like we did in main, return that in the db info / _ensure_full_commit responses, then make the replicator check for source/target uuid changes as described in #3897. However, we can still do that in the future, and in the meantime this PR should significantly reduce [*] the chance of the target db getting re-created without the replicator noticing.
A few notes:

- Since we're re-using the old instance_start_time field and it's still a timestamp, we might want to remove the "." in front of the suffix, just in case there is some type checking somewhere in a client library which asserts a numeric format here. There is already a mem3:shard_suffix/1, so we could add a mem3:shard_creation_time/1.
- Even though the old (<2.x) and the new instance_start_time are both timestamps, the old ones were updated whenever the db was opened (when the endpoint's server was restarted, for example), while the unix time suffix in our case stays the same for the life of that database instance until it's re-created. We should verify that the replicator will do the right thing in light of that difference. I think it would; just leaving it here as a note for any reviewers. It would be this replicator section (see the sketch after these notes).
- We should probably update the replicator's instance start time mismatch logs to suggest a better reason for the failure. The "Try to increase max_dbs_open at the target's server." hint might not make sense any longer.
[*] We are not completely eliminating the chance, as it's still possible for the db to be recreated in the small (under a second) interval between _ensure_full_commit and writing the checkpoint _local document.
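To make the second note concrete, here is a hedged sketch of the comparison that replicator section needs to keep making correctly. The function and field names are invented for illustration, not couch_replicator's actual code:

```erlang
%% Hedged sketch: the recorded instance_start_time values come from the
%% _local checkpoint doc, the current ones from the endpoints' db info
%% responses. Under the old (<2.x) semantics a server restart changed
%% the current values; with this PR they only change when the db is
%% deleted and recreated, so a mismatch really means "start over".
checkpoint_still_valid(#{source_start_time := RecS,
                         target_start_time := RecT}, CurS, CurT) ->
    RecS =:= CurS andalso RecT =:= CurT.
```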
force-pushed from 3882a61 to 475de35
Verified by re-creating the replication target. The logs look a bit messy; got a few of these in a row:

Otherwise it works as intended. All the docs were re-replicated to the new target instance. Noticed a failure for get_dbs_info in CI, and locally as well, for the _dbs_info endpoint.

Edit: probably something to do with this https://github.com/apache/couchdb/blob/3.x/src/mem3/src/mem3.erl#L125-L149
force-pushed from 31b36fb to caae989
force-pushed from caae989 to 39e9c83
force-pushed from 39e9c83 to fedf37d
+1, well done!

We should check and see if the docs need updating as well. There was at least a mention in the replication protocol description that instance_start_time is always <<"0">>.
This has just appeared on the latest 3.3.0 build (3.3.2, actually) when we upgraded from 3.2.2-2~jammy. Unfortunately, this caused ALL the client replicators to abort with an error:

As 3.2.2-2 returned zero and 3.3.2 returns the correct time, I'm not really sure how to fix this without triggering a complete re-replication, which in a database with several hundred million entries is an annoyingly complex and slow process.

Edit: If anyone ended up here by googling the error: to temporarily work around it, roll back to 3.2.2-2 using the command
We might want to add a special case for comparing <<"0">> to a non-zero value and let that "pass" as a valid comparison, handling cases where a cluster is upgraded...
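A minimal sketch of what that special case might look like, assuming a hypothetical helper (times_match/2 is an invented name, not an existing couch_replicator function) that the replicator would consult before rejecting a checkpoint:

```erlang
%% Hedged sketch: treat the pre-3.3 placeholder <<"0">> as matching
%% anything, so an upgrade alone does not invalidate checkpoints;
%% otherwise require exact equality of recorded and current values.
times_match(<<"0">>, _Current) -> true;
times_match(_Recorded, <<"0">>) -> true;
times_match(Same, Same) -> true;
times_match(_Recorded, _Current) -> false.
```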
instance_start_time changing only causes the replication job to exit. When it's restarted, it will look for replication checkpoint docs as usual; in this circumstance they will exist, so replication resumes from there. The "normal" (i.e., CouchDB 1.0) reason for instance_start_time to change was that the database was closed and reopened (or, more extremely, deleted and re-created). So it is only a mechanism to ensure replication includes everything; it's not a reset.
Set instance_start_time to the creation time of the database to restore the ability for the replicator to detect a db recreation event during a replication. Without this, a replication can fail to notice the db was deleted and recreated, write the checkpoint document for the in-progress replication, and keep going, whereas what should happen is that the replication resets from sequence 0.

Resolves #3897