From 2fb55da421434f4f85f61d3d13343d61403a7234 Mon Sep 17 00:00:00 2001 From: you06 Date: Tue, 23 May 2023 20:47:29 +0800 Subject: [PATCH 1/5] add RPC for reducing traffic for resolved ts Signed-off-by: you06 --- text/0000-reduce-traffic-resolved-ts.md | 84 +++++++++++++++++++++++++ 1 file changed, 84 insertions(+) create mode 100644 text/0000-reduce-traffic-resolved-ts.md diff --git a/text/0000-reduce-traffic-resolved-ts.md b/text/0000-reduce-traffic-resolved-ts.md new file mode 100644 index 00000000..56c96a01 --- /dev/null +++ b/text/0000-reduce-traffic-resolved-ts.md @@ -0,0 +1,84 @@ +## Summary + +The proposed method for reducing the traffic of pushing safe ts is hibernated region optimization. This method can be especially beneficial for large clusters since it can save a significant amount of traffic. + +The motivation behind this approach is that TiKV currently uses the `CheckLeader` request to push resolved ts, and the traffic it costs grows linearly with the number of regions. By utilizing hibernated region optimization, it is possible to reduce this traffic and improve overall performance. + +## Motivation + +TiKV pushes forward resolved ts by `CheckLeader` request, the traffic it costs grows linearly with the number of regions. When dealing with a large cluster that requires frequent safe TS pushes, it may result in a significant amount of traffic. Moreover, this is probably cross-AZ traffic, which is not free. + +By optimizing hibernated regions, it is possible to reduce traffic significantly. In a large cluster, many regions are not accessed and remain in a hibernated state. + +## Detailed design + +The design will be described in two sections: the improvement of awake regions and a new mechanism for hibernated regions. + +### Awake Regions + +```diff +message RegionEpoch { + uint64 conf_ver = 1; + uint64 version = 2; +} + +message LeaderInfo { + uint64 region_id = 1; +- uint64 peer_id = 2; + uint64 term = 3; + metapb.RegionEpoch region_epoch = 4; + ReadState read_state = 5; +} + +message ReadState { + uint64 applied_index = 1; + uint64 safe_ts = 2; +} + +message CheckLeaderRequest { + repeated LeaderInfo regions = 1; + uint64 ts = 2; ++ uint64 peer_id = 3; +} + +message CheckLeaderResponse { +- repeated uint64 regions = 1; ++ repeated uint64 failed_regions = 1; + uint64 ts = 2; +} +``` + +The above code is the suggested changes for `CheckLeader` protobuf. + +The `peer_id` field is used to check whether the request leader is the same as the leader in the follower peer. However, `LeaderInfo.peer_id` is redundant, as the leaders in the same peer share the same `peer_id`. Therefore, we can move it to `CheckLeaderRequest.peer_id` to avoid duplicated traffic. **Unfortunately, we must keep this field even if it is unused, to maintain compatibility with protobuf.** + +The `CheckLeaderResponse` will respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check. + +### Hibernated Regions + +To save traffic, we can push the safe timestamps of hibernated regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to build the relationship between the request and response, although it's fetched from PD. Ideally, we can push the safe timestamps of hibernated regions using this `ts` value. Additionally, we can remove hibernated regions from `CheckLeaderRequest.regions`. Modify `CheckLeaderRequest` as follows. + +```diff +message CheckLeaderRequest { + repeated LeaderInfo regions = 1; + uint64 ts = 2; + uint64 peer_id = 3; ++ repeated uint64 hibernated_regions = 4; +} +``` + +We only send the IDs for hibernated regions. In the most ideal situation, both `LeaderInfo` and the region ID in the response are skipped, reducing the traffic from 8 bytes to 1 byte per region. + +#### Safety + +As long as the leader remains unchanged, safety is guaranteed. Specifically, if the terms of the quorum peers are not changed, we can assume that the leader is also unchanged. The `CheckLeaderRequest` is sent from the leader to the follower. If the terms of both the leader and the follower are unchanged, the condition "terms of the quorum peers are not changed" is met, and we believe there is no new leader. + +#### Implementation + +To implement safety checks for hibernated regions, we record the term when we receive leader information. We then compare the recorded term with the latest term when we receive a region ID without leader information, which indicates that it is hibernating in the leader. If the terms do not match, we must send the `LeaderInfo` for this region the next time. + +## Drawbacks + +This RFC make the resolved ts management more complex. + +## Unresolved questions From 48f1f786b0609e70363c49e9e0c9b546cb87bc1f Mon Sep 17 00:00:00 2001 From: you06 Date: Tue, 30 May 2023 19:43:24 +0800 Subject: [PATCH 2/5] update a safe version Signed-off-by: you06 --- text/0000-reduce-traffic-resolved-ts.md | 36 +++++++++++++++++-------- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/text/0000-reduce-traffic-resolved-ts.md b/text/0000-reduce-traffic-resolved-ts.md index 56c96a01..e2dca9a5 100644 --- a/text/0000-reduce-traffic-resolved-ts.md +++ b/text/0000-reduce-traffic-resolved-ts.md @@ -12,9 +12,9 @@ By optimizing hibernated regions, it is possible to reduce traffic significantly ## Detailed design -The design will be described in two sections: the improvement of awake regions and a new mechanism for hibernated regions. +The design will be described in two sections: the improvement of active regions and a new mechanism for hibernated regions. -### Awake Regions +### Active Regions ```diff message RegionEpoch { @@ -24,7 +24,7 @@ message RegionEpoch { message LeaderInfo { uint64 region_id = 1; -- uint64 peer_id = 2; + uint64 peer_id = 2; uint64 term = 3; metapb.RegionEpoch region_epoch = 4; ReadState read_state = 5; @@ -38,7 +38,6 @@ message ReadState { message CheckLeaderRequest { repeated LeaderInfo regions = 1; uint64 ts = 2; -+ uint64 peer_id = 3; } message CheckLeaderResponse { @@ -50,28 +49,43 @@ message CheckLeaderResponse { The above code is the suggested changes for `CheckLeader` protobuf. -The `peer_id` field is used to check whether the request leader is the same as the leader in the follower peer. However, `LeaderInfo.peer_id` is redundant, as the leaders in the same peer share the same `peer_id`. Therefore, we can move it to `CheckLeaderRequest.peer_id` to avoid duplicated traffic. **Unfortunately, we must keep this field even if it is unused, to maintain compatibility with protobuf.** - The `CheckLeaderResponse` will respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check. -### Hibernated Regions +### Inactive Regions + +Here we name the regions without writing to inactive regions. In the future TiKV will deprecate hibernated regions and merge the small regions into dymanic regions, if so the inactive regions won't be a problem, but for users that disable dynamic regions, this optimization is still required. + +To save traffic, we can push the safe timestamps of inactive regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to build the relationship between the request and response, although it's fetched from PD. Ideally, we can push the safe timestamps of inactive regions using this `ts` value. Additionally, we can remove inactive regions from `CheckLeaderRequest.regions`. Modify `CheckLeaderRequest` as follows. -To save traffic, we can push the safe timestamps of hibernated regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to build the relationship between the request and response, although it's fetched from PD. Ideally, we can push the safe timestamps of hibernated regions using this `ts` value. Additionally, we can remove hibernated regions from `CheckLeaderRequest.regions`. Modify `CheckLeaderRequest` as follows. +We only send the IDs for inactive regions. In the most ideal situation, both `LeaderInfo` and the region ID in the response are skipped, reducing the traffic from 64 bytes to 8 bytes per region. ```diff message CheckLeaderRequest { repeated LeaderInfo regions = 1; uint64 ts = 2; uint64 peer_id = 3; -+ repeated uint64 hibernated_regions = 4; ++ repeated uint64 inactive_regions = 4; +} +``` + +One more phase is required to apply the safe ts, because in check leader process, the follower cannot decide whether the request is from a valid leader, so it keep the safe ts in it's memory and wait for apply safe ts request. + +```protobuf +message ApplySafeTsRequest { + uint64 ts = 1; + repeated uint64 unsafe_regions = 2; +} + +message ApplySafeTsResponse { + uint64 ts = 1; } ``` -We only send the IDs for hibernated regions. In the most ideal situation, both `LeaderInfo` and the region ID in the response are skipped, reducing the traffic from 8 bytes to 1 byte per region. +To save traffic, `ApplySafeTsRequest.unsafe_regions` only contains the regions whose leader may be changed. In the ideal case, this request is small because there is almost no unsafe regions. #### Safety -As long as the leader remains unchanged, safety is guaranteed. Specifically, if the terms of the quorum peers are not changed, we can assume that the leader is also unchanged. The `CheckLeaderRequest` is sent from the leader to the follower. If the terms of both the leader and the follower are unchanged, the condition "terms of the quorum peers are not changed" is met, and we believe there is no new leader. +Safety is guaranteed as long as the leader remains unchanged. By decoupling the pushing of safe ts into two phases, the leader peer can ensure that the leadership is still unchanged before safe ts is generated. Only the region IDs of inactive regions are sent, but we can still confirm that the leader has not changed for a follower if both the leader's term and the follower's term remain unchanged. Because the safe ts is applied only after the leadership is confirmed, correctness will not be compromised. #### Implementation From 03c21d8dd3bc3a45d96903914b3d43d1bd5c1b5d Mon Sep 17 00:00:00 2001 From: you06 Date: Mon, 5 Jun 2023 17:45:45 +0800 Subject: [PATCH 3/5] update the issue of safe ts lag in followers Signed-off-by: you06 --- text/0000-reduce-traffic-resolved-ts.md | 101 ++++++++++++++---------- 1 file changed, 59 insertions(+), 42 deletions(-) diff --git a/text/0000-reduce-traffic-resolved-ts.md b/text/0000-reduce-traffic-resolved-ts.md index e2dca9a5..57d10554 100644 --- a/text/0000-reduce-traffic-resolved-ts.md +++ b/text/0000-reduce-traffic-resolved-ts.md @@ -8,91 +8,108 @@ The motivation behind this approach is that TiKV currently uses the `CheckLeader TiKV pushes forward resolved ts by `CheckLeader` request, the traffic it costs grows linearly with the number of regions. When dealing with a large cluster that requires frequent safe TS pushes, it may result in a significant amount of traffic. Moreover, this is probably cross-AZ traffic, which is not free. -By optimizing hibernated regions, it is possible to reduce traffic significantly. In a large cluster, many regions are not accessed and remain in a hibernated state. +By optimizing inactive regions, it is possible to reduce traffic significantly. In a large cluster, many regions are not accessed and remain in a hibernated state. + +Let’s review the safe ts push mechanism. + +1. The leader sends a `check_leader` request to the followers with a timestamp from PD, which carries the resolved timestamp pushed in previous step 3. +2. Follower response whether the leader matched. +3. If quorum of the voters are still in the current leader’s term, update leader’s `safe_ts`. + +In step 1, the leader generates safe ts, and in the next round, the followers apply those timestamps. However, there is an "advance-ts-interval" gap between two step 1s, which results in a safe timestamp lag for the followers. + +Regarding our test on the hit rate for 5-second staleness, we finally achieved a nearly 100% hit rate by setting the `resolved-ts.advance-ts-interval` to 2.5 seconds. This means that if we want to achieve 100-millisecond staleness in the current mechanism, we should set `resolved-ts.advance-ts-interval` to 50 milliseconds, which doubles the traffic of pushing safe ts. + +**We need a solution that applies the safe TS more efficiently.** ## Detailed design The design will be described in two sections: the improvement of active regions and a new mechanism for hibernated regions. -### Active Regions +### Protobuf + +We still need `CheckLeader` request to confirm the leadership. But with some additional fields. ```diff -message RegionEpoch { - uint64 conf_ver = 1; - uint64 version = 2; +message CheckLeaderRequest { + repeated LeaderInfo regions = 1; + uint64 ts = 2; ++ uint64 store_id = 3; ++ repeated uint64 hibernated_regions = 4; } -message LeaderInfo { - uint64 region_id = 1; - uint64 peer_id = 2; - uint64 term = 3; - metapb.RegionEpoch region_epoch = 4; - ReadState read_state = 5; +message CheckLeaderResponse { + repeated uint64 regions = 1; + uint64 ts = 2; ++ repeated uint64 failed_regions = 3; } +``` -message ReadState { - uint64 applied_index = 1; - uint64 safe_ts = 2; +To apply safe ts for valid leaders as soon as possible, instead of waiting for the next round of advancing resolved timestamps, we need to send another `ApplySafeTS` request. The `ApplySafeTS` request is usually small, the traffic caused by it can be ignored. + +```protobuf +message CheckedLeader { + uint64 region_id = 1; + ReadState read_state = 2; } -message CheckLeaderRequest { - repeated LeaderInfo regions = 1; - uint64 ts = 2; +message ApplySafeTsRequest { + uint64 ts = 1; + uint64 store_id = 2; + repeated uint64 unsafe_regions = 3; + repeated CheckedLeader checked_leaders = 4; } -message CheckLeaderResponse { -- repeated uint64 regions = 1; -+ repeated uint64 failed_regions = 1; - uint64 ts = 2; +message ApplySafeTsResponse { + uint64 ts = 1; } ``` -The above code is the suggested changes for `CheckLeader` protobuf. +### Active Regions + +The `CheckLeaderResponse` respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check. -The `CheckLeaderResponse` will respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check. +Another optimization is that we can confirm the leadership if the leader lease is hold, by calling Raft’s read-index command. But this will involve in the Raftstore thread pool, more CPU will be used by this. ### Inactive Regions -Here we name the regions without writing to inactive regions. In the future TiKV will deprecate hibernated regions and merge the small regions into dymanic regions, if so the inactive regions won't be a problem, but for users that disable dynamic regions, this optimization is still required. +Here we list the active regions without writing to inactive regions. In the future, TiKV will deprecate hibernated regions and merge small regions into dynamic ones. If this happens, inactive regions will not be a problem. However, for users who do not use dynamic regions, this optimization is still required. To save traffic, we can push the safe timestamps of inactive regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to build the relationship between the request and response, although it's fetched from PD. Ideally, we can push the safe timestamps of inactive regions using this `ts` value. Additionally, we can remove inactive regions from `CheckLeaderRequest.regions`. Modify `CheckLeaderRequest` as follows. We only send the IDs for inactive regions. In the most ideal situation, both `LeaderInfo` and the region ID in the response are skipped, reducing the traffic from 64 bytes to 8 bytes per region. -```diff -message CheckLeaderRequest { - repeated LeaderInfo regions = 1; - uint64 ts = 2; - uint64 peer_id = 3; -+ repeated uint64 inactive_regions = 4; -} -``` - One more phase is required to apply the safe ts, because in check leader process, the follower cannot decide whether the request is from a valid leader, so it keep the safe ts in it's memory and wait for apply safe ts request. -```protobuf -message ApplySafeTsRequest { - uint64 ts = 1; - repeated uint64 unsafe_regions = 2; +```diff +#[derive(Clone)] +pub struct RegionReadProgressRegistry { + registry: Arc>>>, ++ checked_states: Arc>>, } -message ApplySafeTsResponse { - uint64 ts = 1; -} ++ struct CheckedState { ++ safe_ts: u64, ++ valid_regions: Vec, ++ } ``` To save traffic, `ApplySafeTsRequest.unsafe_regions` only contains the regions whose leader may be changed. In the ideal case, this request is small because there is almost no unsafe regions. #### Safety -Safety is guaranteed as long as the leader remains unchanged. By decoupling the pushing of safe ts into two phases, the leader peer can ensure that the leadership is still unchanged before safe ts is generated. Only the region IDs of inactive regions are sent, but we can still confirm that the leader has not changed for a follower if both the leader's term and the follower's term remain unchanged. Because the safe ts is applied only after the leadership is confirmed, correctness will not be compromised. +Safety is guaranteed as long as the safe ts is generated from a **valid leader**. By decoupling the pushing of safe ts into two phases, the leader peer can ensure that the leadership is still unchanged before safe ts is generated. + +For inactive regions, only the region IDs are sent. If the term has changed since the last active region check, the follower will respond with a check failure. When the leader receives the `CheckLeaderResponse`, and the inactive region ID is not in `CheckLeaderResponse.failed_regions`, it means that the terms of the leader and follower are both unchanged. After checking leadership for inactive regions, it's safe to push the safe timestamps from the leader. Additionally, since there are no following writes in inactive regions, the `applied_index` check is unnecessary. #### Implementation -To implement safety checks for hibernated regions, we record the term when we receive leader information. We then compare the recorded term with the latest term when we receive a region ID without leader information, which indicates that it is hibernating in the leader. If the terms do not match, we must send the `LeaderInfo` for this region the next time. +To implement safety checks for inactive regions, we record the term when we receive active region checks. We then compare the recorded term with the latest term when we receive a region ID without leader information, which indicates that the region is inactive. If the terms do not match, the leader must treat it as an active region next time. ## Drawbacks This RFC make the resolved ts management more complex. +If there are too many active regions, more CPU will be consumed. + ## Unresolved questions From 5890509e2e693961720e2dc1a620585f99884faf Mon Sep 17 00:00:00 2001 From: you06 Date: Mon, 5 Jun 2023 19:09:46 +0800 Subject: [PATCH 4/5] rename file Signed-off-by: you06 --- ...olved-ts.md => 0108-reduce-traffic-resolved-ts.md} | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) rename text/{0000-reduce-traffic-resolved-ts.md => 0108-reduce-traffic-resolved-ts.md} (87%) diff --git a/text/0000-reduce-traffic-resolved-ts.md b/text/0108-reduce-traffic-resolved-ts.md similarity index 87% rename from text/0000-reduce-traffic-resolved-ts.md rename to text/0108-reduce-traffic-resolved-ts.md index 57d10554..ac2f730f 100644 --- a/text/0000-reduce-traffic-resolved-ts.md +++ b/text/0108-reduce-traffic-resolved-ts.md @@ -10,15 +10,15 @@ TiKV pushes forward resolved ts by `CheckLeader` request, the traffic it costs g By optimizing inactive regions, it is possible to reduce traffic significantly. In a large cluster, many regions are not accessed and remain in a hibernated state. -Let’s review the safe ts push mechanism. +Let's review the safe ts push mechanism. -1. The leader sends a `check_leader` request to the followers with a timestamp from PD, which carries the resolved timestamp pushed in previous step 3. +1. The leader sends a `CheckLeader` request to the followers with a timestamp from PD, which carries the resolved timestamp pushed in previous step 3. 2. Follower response whether the leader matched. -3. If quorum of the voters are still in the current leader’s term, update leader’s `safe_ts`. +3. If quorum of the voters are still in the current leader's term, update leader's `safe_ts`. In step 1, the leader generates safe ts, and in the next round, the followers apply those timestamps. However, there is an "advance-ts-interval" gap between two step 1s, which results in a safe timestamp lag for the followers. -Regarding our test on the hit rate for 5-second staleness, we finally achieved a nearly 100% hit rate by setting the `resolved-ts.advance-ts-interval` to 2.5 seconds. This means that if we want to achieve 100-millisecond staleness in the current mechanism, we should set `resolved-ts.advance-ts-interval` to 50 milliseconds, which doubles the traffic of pushing safe ts. +In other words, to ensure that a read operation with a 1-second staleness works well with a high hit ratio among followers, you need to set `resolved-ts.advance-ts-interval` to 0.5 seconds. This will double the traffic of pushing safe ts. **We need a solution that applies the safe TS more efficiently.** @@ -69,7 +69,7 @@ message ApplySafeTsResponse { The `CheckLeaderResponse` respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check. -Another optimization is that we can confirm the leadership if the leader lease is hold, by calling Raft’s read-index command. But this will involve in the Raftstore thread pool, more CPU will be used by this. +Another optimization is that we can confirm the leadership if the leader lease is hold, by calling Raft's read-index command. But this will involve in the Raftstore thread pool, more CPU will be used by this. ### Inactive Regions @@ -82,7 +82,6 @@ We only send the IDs for inactive regions. In the most ideal situation, both `Le One more phase is required to apply the safe ts, because in check leader process, the follower cannot decide whether the request is from a valid leader, so it keep the safe ts in it's memory and wait for apply safe ts request. ```diff -#[derive(Clone)] pub struct RegionReadProgressRegistry { registry: Arc>>>, + checked_states: Arc>>, From 706458b409d10f70043948fee4a043ab27b890f6 Mon Sep 17 00:00:00 2001 From: you06 Date: Sun, 25 Jun 2023 19:14:30 +0800 Subject: [PATCH 5/5] update transition state Signed-off-by: you06 --- text/0108-reduce-traffic-resolved-ts.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/text/0108-reduce-traffic-resolved-ts.md b/text/0108-reduce-traffic-resolved-ts.md index ac2f730f..e20eaafc 100644 --- a/text/0108-reduce-traffic-resolved-ts.md +++ b/text/0108-reduce-traffic-resolved-ts.md @@ -73,7 +73,7 @@ Another optimization is that we can confirm the leadership if the leader lease i ### Inactive Regions -Here we list the active regions without writing to inactive regions. In the future, TiKV will deprecate hibernated regions and merge small regions into dynamic ones. If this happens, inactive regions will not be a problem. However, for users who do not use dynamic regions, this optimization is still required. +Here we list the active regions without writing to inactive regions. In the future, TiKV will deprecate hibernated regions and merge small regions into dynamic ones. If this happens, the traffic produced inactive regions will not be a problem. However, for users who do not use dynamic regions, this optimization is still required. To save traffic, we can push the safe timestamps of inactive regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to build the relationship between the request and response, although it's fetched from PD. Ideally, we can push the safe timestamps of inactive regions using this `ts` value. Additionally, we can remove inactive regions from `CheckLeaderRequest.regions`. Modify `CheckLeaderRequest` as follows. @@ -95,6 +95,22 @@ pub struct RegionReadProgressRegistry { To save traffic, `ApplySafeTsRequest.unsafe_regions` only contains the regions whose leader may be changed. In the ideal case, this request is small because there is almost no unsafe regions. +### Transition State + +Note if the follower receive an inactive region info which contains region ID only, it believes it's safe to push the safe ts from the following apply safe ts request. But the applied index in followers may delay. Thus we need a transition state to ensure the safety. + +```mermaid +stateDiagram-v2 + Active --> Transition + Transition --> Inactive + Transition --> Active + Inactive --> Active +``` + +We introduce a state named `transition`, in proto, it's represented as `LeaderInfo` without `read_state`. When the follower receive a `LeaderInfo` without `read_state`, it will check the term of the region, if the term is changed, it will respond with a `CheckLeaderResponse` with the region ID in `failed_regions`, else it cache the `LeaderInfo` and wait for `ApplySafeTsRequest`. + +Only if the all terms of followers are not changed, the leader turn the state into `inactive`. When the follower receives a `ApplySafeTsRequest` it applies the safe ts, and will use the cached `LeaderInfo` to check the validation of inactive state. + #### Safety Safety is guaranteed as long as the safe ts is generated from a **valid leader**. By decoupling the pushing of safe ts into two phases, the leader peer can ensure that the leadership is still unchanged before safe ts is generated.