Skip to content
This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

fix(dup): don't GC by valid_start_offset during duplication & add app_name to replica_base #448

Merged
merged 5 commits into from
Apr 28, 2020

Conversation

neverchanje
Copy link
Contributor

@neverchanje neverchanje commented Apr 26, 2020

What problem this PR solved

  • In some situations, we need metrics of a certain table on a certain pegasus node. Like that table-level get-latency does. The metric name usually uses "app_name" rather than "gpid" for readability to indicate the table it belongs to. So in this PR, I added "app_name" to replica_base. Therefore classes like 'pegasus_server_impl' can have knowledge of its corresponding app_name.
struct replica_base
{
-   replica_base(gpid id, string_view name) : _gpid(id), _name(name) {}
+   replica_base(gpid id, string_view name, string_view app_name)
+      : _gpid(id), _name(name), _app_name(app_name)
+   {
+   }
  • There is also a data durability issue in plog-GC during duplication. Because of historical issues, we use "app_init_offset" to trigger GC for the logs with file offset before "app_init_offset". Some not-confirmed mutations will be deleted due to this reason. So here we set valid_start_offset to 0 to avoid unexpected GC.
        int64_t valid_start_offset = _app->init_info().init_offset_in_private_log;
        if (min_confirmed_decree >= 0) {
+           // Do not rely on valid_start_offset for GC during duplication.
+           // cleanable_decree is the only GC trigger.
+           valid_start_offset = 0;
            if (min_confirmed_decree < last_durable_decree) {
                cleanable_decree = min_confirmed_decree;
            }
        } else if (is_duplicating()) {
            // unsure if the logs can be dropped, because min_confirmed_decree
            // is currently unavailable
            return;
        }
        plog->garbage_collection(get_gpid(), cleanable_decree, valid_start_offset);

Manual Test:

  1. Start two onebox clusters, A and B, then set up duplication A->B.
  2. Loading writes to A, ensure A is duplicating to B.
  3. Shut down B's replica-server. Ensure A is failing to duplicate (pending_mutations_count increases)
    image
  4. Ensure no GC of the private log after a few minutes.
    grep gc_private onebox/replica* | grep removed

@neverchanje neverchanje marked this pull request as ready for review April 26, 2020 15:47
@hycdong hycdong merged commit 9a1a997 into XiaoMi:master Apr 28, 2020
@foreverneverer
Copy link
Contributor

foreverneverer commented Apr 28, 2020

add app_name to replica_base let the pegasus some code compile error(pegasus and rdsn all latest branch), such as pegasus_mutation_duplicator_test.cpp, it call old version replica_base constructor:

void test_duplicate()
    {
        replica_base replica(dsn::gpid(1, 1), "fake_replica");
        auto duplicator = new_mutation_duplicator(&replica, "onebox2", "temp");
        duplicator->set_task_environment(&_env);
....
    }

https://github.com/XiaoMi/pegasus/blob/fad0c4f44534ca8daa991b10fee998819974b8d1/src/server/test/pegasus_mutation_duplicator_test.cpp#L33-L36

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants