
Non-zero instance start time #3901

Merged
rnewson merged 1 commit into 3.x from nonzero_instance_start_time on Jan 19, 2022

Conversation

rnewson (Member) commented Jan 16, 2022

Set instance_start_time to the creation time of the database to
restore the ability for the replicator to detect a db recreation event
during a replication. Without this, a replication can fail to notice
that the db was deleted and recreated, write the checkpoint document
for the in-progress replication and keep going, when it should instead
reset and start again from sequence 0.

Resolves #3897

janl (Member) commented Jan 18, 2022

lgtm overall, I just have a question about something I'm not too familiar with: since you are updating a record, do we have to worry about mixed clusters running one version vs the other? I would assume not, since shards are mostly node-local, but there might be some edge case?

rnewson (Member, Author) commented Jan 18, 2022

I don't think shard records are transmitted between nodes but it's a good question. @nickva thoughts?

nickva (Contributor) commented Jan 18, 2022

I couldn't find any place where we pass the #shard record to the workers. I looked mainly in fabric and mem3_rep. The typical pattern is that we spawn the workers based on the local shard map in submit_jobs but don't pass the actual shard records. It's not a 100% guarantee, though; I only skimmed through.

However, I had noticed that the suffix is already part of the shard name and is a tad redundant:

 mem3:shards(<<"db1">>).

[#shard{name = <<"shards/00000000-7fffffff/db1.1642540491">>,
        node = '[email protected]',dbname = <<"db1">>,
        range = [0,2147483647],
        ref = undefined,suffix = <<".1642540491">>,
        opts = [{props,[]}]},
        ...

And there is already a function in mem3 which returns the suffix. Perhaps we could use it:

 mem3:shard_suffix(<<"db1">>).
".1642540491"

We already call it when building an ETag in chttpd_db and in a few other places.

nickva (Contributor) left a review comment:

Looks good. It's kind of a hack, but 1) it's backward compatible with existing replicators and 2) it's compatible with the already-created dbs in 2.x/3.x.

Ideally, we'd implement a db uuid field like we did in main, return it in the db info / _ensure_full_commit responses, and then make the replicator check for source/target uuid changes as described in #3897. However, we can still do that in the future, and in the meantime this PR should significantly reduce [*] the chance of the target db getting re-created without the replicator noticing.

A few notes:

  • Since we're re-using the old instance_start_time field and it's still a timestamp, we might want to remove the "." in front of the suffix, just in case there is some type checking somewhere in a client library which asserts a numeric format here. There is already a mem3:shard_suffix/1, so we could add a mem3:shard_creation_time/1 (a rough sketch follows after these notes).

  • Even though both the old (<2.x) and the new instance_start_time are timestamps, the old ones were updated whenever the db was opened (when the endpoint server was restarted, for example). The unix time suffix in our case stays the same for the life of that database instance until it's re-created. We should verify that the replicator will do the right thing in light of that difference. I think it would; just leaving it here as a note for any reviewers. The relevant code is this replicator section.

  • We should probably update the replicator endpoint instance start time mismatch logs to suggest a better reason for the failure. The "Try to increase max_dbs_open at the target's server." message might not make sense any longer.

[*] We are not completely eliminating the chance as it's still possible that the db would be recreated in the small (under a second) interval between _ensure_full_commit and writing the checkpoint _local document.
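
As a rough sketch of the first note (assumed name, not the merged implementation, and assuming mem3:shard_suffix/1 returns a string such as ".1642540491" as in the shell output above), the helper could simply strip the leading dot:

 %% Hypothetical helper, not the actual patch: derive a numeric-looking
 %% instance_start_time value from the shard suffix by dropping the leading ".".
 shard_creation_time(DbName) ->
     case mem3:shard_suffix(DbName) of
         [$. | Timestamp] -> list_to_binary(Timestamp);  % ".1642540491" -> <<"1642540491">>
         _ -> <<"0">>                                    % fall back to the old value
     end.

That would keep the field a plain decimal string for any client library that parses it as a number.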

rnewson force-pushed the nonzero_instance_start_time branch 3 times, most recently from 3882a61 to 475de35 on January 18, 2022 22:45
nickva (Contributor) commented Jan 18, 2022

Verified by re-creating the replication target. The logs look a bit messy; I got a few of these in a row:

[error] 2022-01-18T22:59:19.818867Z [email protected] <0.9133.0> -------- CRASH REPORT Process (<0.9133.0>) with 1 neighbors exited with reason:
{checkpoint_commit_failure,<<"Target database out of sync. Try to increase max_dbs_open at the target's server.">>} at gen_server:decode_msg/9(line:475) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_replicator_worker,init,['Argument__1']}, ancestors: [<0.9039.0>,couch_replicator_scheduler_sup,couch_replicator_sup,...], message_queue_len: 0, links: [<0.9134.0>], dictionary: [{last_stats_report,{1642,546729,944177}}], trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 8883

Otherwise it works as intended. All the docs were re-replicated to the new target instance.

Noticed a failure for get_dbs_info in CI, and locally as well for the _dbs_info endpoint:

http $DB/_dbs_info
HTTP/1.1 500 Internal Server Error


{
    "error": "badmatch",
    "reason": "_dbs",
    "ref": 4194149092
}

Edit: probably something to do with this: https://github.com/apache/couchdb/blob/3.x/src/mem3/src/mem3.erl#L125-L149

rnewson force-pushed the nonzero_instance_start_time branch 3 times, most recently from 31b36fb to caae989 on January 19, 2022 17:04
Review comment on src/mem3/src/mem3.erl (outdated, resolved).
nickva (Contributor) left a review comment:

+1 well done!

We should check and see if the docs need updating as well. There was at least one mention in the replication protocol description that instance_start_time is always <<"0">>.

rnewson merged commit 3abcc42 into 3.x on Jan 19, 2022
rnewson deleted the nonzero_instance_start_time branch on January 19, 2022 19:04
xrobau (Contributor) commented May 24, 2023

This just appeared on the latest 3.3 build (3.3.2, to be exact) when we upgraded from 3.2.2-2~jammy.

Unfortunately, this caused ALL the client replicators to abort with an error:

  "state": "crashing",
  "info": {
    "error": "instance_start_time on source database has changed since last checkpoint."
  },

As 3.2.2-2 returned zero and 3.3.2 returns the correct time, I'm not really sure how to fix this without triggering a complete re-replication, which in a database with several hundred million entries is an annoyingly complex and slow process.

Edit: If anyone ended up here by googling the error, you can temporarily work around it by rolling back to 3.2.2-2 using the command apt install couch=3.2.2-2~jammy, and follow bug #4621.

nickva (Contributor) commented May 24, 2023

We might want to add a special case that compares <<"0">> against a non-zero value and lets that "pass" as a valid comparison, to handle the case where a cluster is being upgraded...

rnewson (Member, Author) commented May 25, 2023

instance_start_time changing only causes the replication job to exit. When it's restarted it will look for replication checkpoint docs as usual. In this circumstance they will exist, so replication resumes from there.

The "normal" (i.e, couchdb 1.0) reason for instance_start_time to change is the database was closed and reopened (or, more extremely, was deleted and re-created). So it is only a mechanism to ensure replication includes everything, it's not a reset.

nickva (Contributor) commented May 25, 2023

Agree with @rnewson, and I verified in #4621 that it doesn't reset the replication.

During cluster upgrades from 3.2 to 3.3 it does make a bit of a mess in the logs, so I created a PR to reduce that. Basically, it accepts 0 as a valid timestamp instead of crashing and restarting the job.
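
A minimal sketch of that special case (assumed function name, not necessarily the PR as written):

 %% Treat a <<"0">> reported by an older (pre-3.3) node as compatible with any
 %% current start time, so a rolling upgrade does not crash and restart the job.
 start_time_changed(Recorded, Current) ->
     case {Recorded, Current} of
         {Same, Same} -> false;   % unchanged: keep going
         {<<"0">>, _} -> false;   % older node recorded 0: accept it
         {_, <<"0">>} -> false;   % endpoint still reports 0: accept it
         {_, _}       -> true     % genuinely different: treat the db as recreated
     end.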
