raft: Propose in raft node wait the proposal result so we can fail fast while dropping proposal #9137

absolute8511 · 2018-01-12T07:45:14Z

While the leader is transferring or some other states which the raft node may drop the proposal, we should fail fast to notify the proposal canceled. (a separate pr followed by #9067)

This will be used to fix issue #8975 and #8977

codecov-io · 2018-01-12T10:30:59Z

Codecov Report

❗ No coverage uploaded for pull request base (master@2aa3dec). Click here to learn what that means.
The diff coverage is 94.87%.

@@            Coverage Diff            @@
##             master    #9137   +/-   ##
=========================================
  Coverage          ?   72.47%           
=========================================
  Files             ?      367           
  Lines             ?    31257           
  Branches          ?        0           
=========================================
  Hits              ?    22652           
  Misses            ?     6967           
  Partials          ?     1638

Impacted Files	Coverage Δ
raft/node.go	`89.24% <94.87%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2aa3dec...f0dffb4. Read the comment docs.

xiang90 · 2018-01-12T21:59:02Z

raft/node.go

+	// ProposeWithCancel proposes that data be appended to the log
+	// and will invoke the cancel() if the proposal is dropped.
+	// Application can use this to fail fast when the proposal is dropped or cancelled.
+	ProposeWithCancel(ctx context.Context, cancel context.CancelFunc, data []byte) error


why do we need to add this? can we improve Propose func so that it would return an error if proposal is dropped?

Currently, the Propose just send the proposal to chan, and will not wait. If we change the behavior for Propose to block and return an error it may affect others.

I think that it may be better to add a new method. Any suggestion?

Mostly, cancel function is used to cancels the child outside, like:

ctx, cancel := WithCancel(parent) pass(ctx) cancel()

So it is strange for me to see this style here, maybe a callback is better.

Yeah, the method name needs to be improved. A new propose method name should make the caller known that a cancel callback will be used to get notified when the proposal is canceled by dropping. (I found there is no way to cancel in context.Context, so I use the function context.CancelFunc as callback since It is actually canceled the context.Context. )

Currently, the Propose just send the proposal to chan, and will not wait.

that is true. however, step func should be fast and non-blocking most of the time. changing Proposal to wait a bit longer for stepping to finish should have very a little impact if any.

Maybe I missed something. What is your concern specifically?

@siddontang

exactly. if we want to make it work, it will return either a callback or a chan, which will be closes if canceled.

siddontang · 2018-01-22T01:00:31Z

Any update? @absolute8511

We need this feature to improve the timeout for Read operation when leader changed.

/cc @xiang90

absolute8511 · 2018-01-23T11:27:48Z

I changed the Propose to wait on the channel to allow it return the error while dropped by raft.

@xiang90

xiang90 · 2018-01-23T22:50:18Z

raft/node.go

@@ -224,10 +224,15 @@ func RestartNode(c *Config) Node {
 	return &n
 }

+type proposingMsg struct {


the message type inside this will not always be proposal.

renaming this to nodeMsg or msgWithResult is better.

xiang90 · 2018-01-23T22:51:22Z

raft/node.go

-		case m := <-n.recvc:
+			err := r.Step(m)
+			if pm.result != nil {
+				if err == ErrProposalDropped {


why do we only propagate ErrProposalDropped error? we should propagate back all types of errors.

xiang90 · 2018-01-23T22:51:48Z

raft/node.go

-				r.Step(m) // raft never returns an error
+				err := r.Step(m) // raft never returns an error
+				if pm.result != nil {
+					if err == ErrProposalDropped {


xiang90 · 2018-01-23T22:52:00Z

raft/node.go

 			// filter out response message from unknown From.
 			if pr := r.getProgress(m.From); pr != nil || !IsResponseMsg(m.Type) {
-				r.Step(m) // raft never returns an error
+				err := r.Step(m) // raft never returns an error


remove the wrong comment

xiang90 · 2018-01-23T22:53:03Z

raft/node.go

 	ch := n.recvc
 	if m.Type == pb.MsgProp {
 		ch = n.propc
 	}

+	pm := proposingMsg{m: m, result: nil}


no need to put nil in result. it would be nil by default.

xiang90 · 2018-01-23T22:53:18Z

raft/node.go

 	ch := n.recvc
 	if m.Type == pb.MsgProp {
 		ch = n.propc
 	}

+	pm := proposingMsg{m: m, result: nil}
+	if wait {
+		pm.result = make(chan error)


make(chan error, 1)

xiang90 · 2018-01-23T23:01:11Z

raft/node.go

-	propc      chan pb.Message
-	recvc      chan pb.Message
+	propc      chan proposingMsg
+	recvc      chan proposingMsg


do we really need to change the recvc chan? if the proposal is redirected from other peers, the waiter is actually on another node anyway. there is no easy to propagate back the error even with this change.

probably changing the propc is good enough for the current purpose.

xiang90 · 2018-01-23T23:03:05Z

look good to me in general.

we probably want to hook this change up with etcd, and do some benchmarks. i want to make sure this PR wont bring in any significant performance issue.

absolute8511 · 2018-01-24T02:47:42Z

Yeah, a benchmark is needed. Should it in this PR or a new one?

xiang90 · 2018-01-24T22:59:29Z

Yeah, a benchmark is needed. Should it in this PR or a new one?

Let us do it in the same PR, but in a different commit. Thanks. You can use tools/benchmark to perform benchmarks.

siddontang · 2018-01-28T12:27:23Z

any update?

absolute8511 · 2018-01-30T15:58:08Z

Sorry for the delay, I will start tomorrow after my vacation.

absolute8511 · 2018-02-03T16:23:10Z

@xiang90

I looked into the tools/benchmark. What kinds of benchmark do this pr need? I noticed there are some benchmark for write. In this pr I only changed the Propose behavior, so I just re-run the write benchmark to get a result or a new benchmark tool need?

xiang90 · 2018-02-03T17:26:50Z

yes. Let us start with that

…

On Sat, Feb 3, 2018 at 8:23 AM Vincent Lee ***@***.***> wrote: @xiang90 <https://github.com/xiang90> I looked into the tools/benchmark. What kinds of benchmark do this pr need? I noticed there are some benchmark for write. In this pr I only changed the Propose behavior, so I just re-run the write benchmark to get a result or a new benchmark tool need? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9137 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AERby2YrNfr05HeNJsd-TiM5NbbgYiabks5tRIf7gaJpZM4Rb63X> .

robin865 · 2018-02-06T00:06:26Z

So it seems like these changes won't exactly solve this problem? Just a comment, but node.go; step function:

// Step advances the state machine using msgs. The ctx.Err() will be returned,
// if any.
        func (n *node) step(ctx context.Context, m pb.Message) error {
                ch := n.recvc
                if m.Type == pb.MsgProp {
                        ch = n.propc
                }
                select {
                case ch <- m:
                        return nil
                case <-ctx.Done():
                        return ctx.Err()
                case <-n.done:
                        return ErrStopped
                }
        }

When the leader changes, n.propc is now a nil channel. Hence the "case ch <- m:" will block and only the "ctc.Done()" will exit this select. In the changes proposed for this request; I don't see this problem solved. Hence I don't see this allow for "fast failures" in this scenario.

Seems to me we could do something simple here before we send to 'ch';

if ch == nil {
    return Err;
}

absolute8511 · 2018-02-08T06:45:11Z

@xiang90

I run the benchmark on my MacBook using the version with and without this pr and got the below result. I run several times and noticed the result is not very stable. Maybe a standard GCE machine is needed.
benchmark command as below:

./benchmark --endpoints=127.0.0.1:2379 --target-leader --conns=1 --clients=1  put --key-size=8 --sequential-keys --total=10000 --val-size=128
./benchmark --endpoints=127.0.0.1:2379 --target-leader --conns=1 --clients=64  put --key-size=8 --sequential-keys --total=80000 --val-size=128
./benchmark --endpoints=127.0.0.1:2379 --target-leader --conns=1 --clients=256  put --key-size=8 --sequential-keys --total=80000 --val-size=128
./benchmark --endpoints=127.0.0.1:2379 --conns=1 --clients=64  put --key-size=8 --sequential-keys --total=80000 --val-size=128
./benchmark --endpoints=127.0.0.1:2379 --conns=1 --clients=256  put --key-size=8 --sequential-keys --total=80000 --val-size=128

Physical machines

MacBook Pro (Retina, 13-inch, Mid 2014)
Processor: 2.6 GHz Intel Core i5
Memory: 8 GB 1600 MHz DDR3

Test without this pr

etcd Cluster

3 etcd at Git commit:

etcd Version: 3.3.0+git
Git SHA: c5532ebbf
Go Version: go1.9.3
Go OS/Arch: darwin/amd64

Performance in writing one single key

key size in bytes	number of clients	target etcd server	write QPS	90th Percentile Latency (ms)
8	1	leader only	104	13.9
8	64	leader only	2310	40.54
8	256	leader only	5212	65.5
8	64	all servers	2215	41.5
8	256	all servers	5410	62.5

Test with this pr

etcd Cluster

3 etcd at Git commit:

etcd Version: 3.3.0+git
Git SHA: c5532ebbf with this pr patched
Go Version: go1.9.3
Go OS/Arch: darwin/amd64

Performance in writing one single key

key size in bytes	number of clients	target etcd server	write QPS	90th Percentile Latency (ms)
8	1	leader only	106	13.8
8	64	leader only	2316	40.1
8	256	leader only	5057	69.6
8	64	all servers	2297	40.7
8	256	all servers	5424	59.7

absolute8511 · 2018-03-01T11:44:04Z

@xiang90

I re-run the benchmark on standard GCE machine again and use the hack/benchmark script to do the benchmark.

Physical machines

GCE n1-standard-2（2 vCPU，7.5 GB memory ）

1x dedicated local SSD mounted as etcd data directory
7.5 GB memory
2x CPUs

Test without this pr

etcd Cluster

3 etcd at Git commit:

etcd Version: 3.3.0+git
Git SHA: c5532ebbf
Go Version: go1.9.4
Go OS/Arch: linux/amd64

Performance in writing one single key

key size in bytes	number of clients	target etcd server	write QPS	90th Percentile Latency (ms)
64	1	leader only	524	2.2
64	64	leader only	7025	11.8
64	256	leader only	10664	33.0
64	64	all servers	3856	7.2
64	256	all servers	6776	18.9

Test with this pr

etcd Cluster

3 etcd at Git commit:

etcd Version: 3.3.0+git
Git SHA: c5532ebbf + pr
Go Version: go1.9.4
Go OS/Arch: linux/amd64

Performance in writing one single key

key size in bytes	number of clients	target etcd server	write QPS	90th Percentile Latency (ms)
64	1	leader only	582	1.98
64	64	leader only	7307	11.3
64	256	leader only	9803	35.36
64	64	all servers	4645	6.0
64	256	all servers	6726	20.3

siddontang · 2018-03-02T00:59:43Z

Seem that if we don't have many clients, the PR's performance is better, suprised.

siddontang · 2018-03-09T05:02:09Z

any update? @xiang90

absolute8511 · 2018-03-28T10:20:53Z

Are there any other works need to make it merge? @xiang90 @gyuho

xiang90 · 2018-03-30T21:13:41Z

raft/node.go

-	ch := n.recvc
-	if m.Type == pb.MsgProp {
-		ch = n.propc
+func (n *node) stepWait(ctx context.Context, m pb.Message, wait bool) error {


rename this to stepWithWaitOption. change stepWait to call into stepWithWaitOption(..., true)

xiang90 · 2018-03-30T21:14:19Z

raft/node.go

 		case m := <-n.recvc:
 			// filter out response message from unknown From.
 			if pr := r.getProgress(m.From); pr != nil || !IsResponseMsg(m.Type) {
-				r.Step(m) // raft never returns an error
+				r.Step(m)


do we consume the returned error?

As you comment before:

do we really need to change the recvc chan? if the proposal is redirected from other peers, the waiter is actually on another node anyway. there is no easy to propagate back the error even with this change.
probably changing the propc is good enough for the current purpose.

I leaved this unchanged

xiang90 · 2018-03-30T21:15:36Z

@absolute8511 can you please reply to the comment #9137 (comment)?

siddontang · 2018-03-31T13:32:18Z

@absolute8511

In our test by @nolouch, this still can't fail fast when leader changed sometimes.

absolute8511 · 2018-03-31T14:18:31Z

@Rsm10 @xiang90 @siddontang

When the leader changes, n.propc is now a nil channel.

I think this is not always true. And this pr should only solve part of the fail fast. Only proposal dropped by the local Step will fail fast in this pr. Other fail fast issue may need another pr.

If dropped by remote currently we just wait.

@siddontang could you please give some brief test cases which can't fail fast currently, maybe we need some other pr in future to solve other fail fast issue.

siddontang · 2018-03-31T15:24:29Z

@absolute8511

Only proposal dropped by the local Step will fail fast in this pr. Other fail fast issue may need another pr.

You have already explained. If we isolate a follower and rejoin again, the leader will step down, at that time, all the proposed requests can't be notified quickly but have to wait timeout.

absolute8511 · 2018-04-01T15:33:31Z

@xiang90
The CI result:

Expected commit title format '<package>{", "<package>}: <description>'
Got: 3d504737e add chain core to raft users list

Have no idea what's wrong with my commit title?

xiang90 · 2018-04-01T19:08:00Z

@absolute8511 rebase with the current master?

…st while dropping proposal.

absolute8511 · 2018-04-04T14:18:34Z

@xiang90
All done.

xiang90 · 2018-04-04T16:23:04Z

lgtm

guoger · 2019-03-19T06:45:00Z

@absolute8511 doing some archaeological work here and i'm wondering why wasn't ProposeConfChange changed as part of this PR? thx!

absolute8511 force-pushed the raft-proposal-cancel branch from 86e45ec to 3513727 Compare January 12, 2018 08:07

xiang90 reviewed Jan 12, 2018

View reviewed changes

absolute8511 force-pushed the raft-proposal-cancel branch from 3513727 to 7e70fe9 Compare January 22, 2018 09:47

absolute8511 changed the title ~~raft: add new ProposeWithCancel for raft node so we can fail fast while dropping proposal~~ raft: Propose in raft node wait the proposal result so we can fail fast while dropping proposal Jan 22, 2018

absolute8511 force-pushed the raft-proposal-cancel branch 4 times, most recently from ea9e414 to 2505eed Compare January 23, 2018 03:12

xiang90 reviewed Jan 23, 2018

View reviewed changes

absolute8511 force-pushed the raft-proposal-cancel branch from 2505eed to 068e401 Compare January 24, 2018 02:43

gyuho added the area/raft label Feb 25, 2018

absolute8511 force-pushed the raft-proposal-cancel branch from 068e401 to 5be3085 Compare March 1, 2018 11:59

xiang90 reviewed Mar 30, 2018

View reviewed changes

absolute8511 force-pushed the raft-proposal-cancel branch from 5be3085 to bbb9ba9 Compare March 31, 2018 14:36

absolute8511 force-pushed the raft-proposal-cancel branch from bbb9ba9 to 76fc899 Compare April 1, 2018 14:45

absolute8511 force-pushed the raft-proposal-cancel branch 2 times, most recently from 8071ab1 to cd15aa9 Compare April 2, 2018 02:47

raft: Propose in raft node wait the proposal result so we can fail fa…

f0dffb4

…st while dropping proposal.

absolute8511 force-pushed the raft-proposal-cancel branch from cd15aa9 to f0dffb4 Compare April 3, 2018 03:04

xiang90 merged commit 949c6fe into etcd-io:master Apr 4, 2018

langkeer mentioned this pull request Apr 9, 2018

why it took about 8s to time out after publishing fail? Is there any way to shorten the time? #8977

Closed

nolouch mentioned this pull request Apr 9, 2018

etcdserver: handle proposal error when processing internal raft request #9549

Merged

lavacat mentioned this pull request Apr 11, 2023

stepWait not used in ProposeConfChange etcd-io/raft#43

Open

raft: Propose in raft node wait the proposal result so we can fail fast while dropping proposal #9137

raft: Propose in raft node wait the proposal result so we can fail fast while dropping proposal #9137

Conversation

absolute8511 commented Jan 12, 2018 • edited Loading

codecov-io commented Jan 12, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

absolute8511 Jan 14, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siddontang commented Jan 22, 2018

absolute8511 commented Jan 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 commented Jan 23, 2018

absolute8511 commented Jan 24, 2018

xiang90 commented Jan 24, 2018

siddontang commented Jan 28, 2018

absolute8511 commented Jan 30, 2018

absolute8511 commented Feb 3, 2018

xiang90 commented Feb 3, 2018 via email

robin865 commented Feb 6, 2018 • edited Loading

absolute8511 commented Feb 8, 2018 • edited Loading

Physical machines

Test without this pr

etcd Cluster

Performance in writing one single key

Test with this pr

etcd Cluster

Performance in writing one single key

absolute8511 commented Mar 1, 2018 • edited Loading

Physical machines

Test without this pr

etcd Cluster

Performance in writing one single key

Test with this pr

etcd Cluster

Performance in writing one single key

siddontang commented Mar 2, 2018

siddontang commented Mar 9, 2018

absolute8511 commented Mar 28, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 commented Mar 30, 2018

siddontang commented Mar 31, 2018

absolute8511 commented Mar 31, 2018 • edited Loading

siddontang commented Mar 31, 2018

absolute8511 commented Apr 1, 2018

xiang90 commented Apr 1, 2018

absolute8511 commented Apr 4, 2018

xiang90 commented Apr 4, 2018

guoger commented Mar 19, 2019

absolute8511 commented Jan 12, 2018 •

edited

Loading

codecov-io commented Jan 12, 2018 •

edited

Loading

absolute8511 Jan 14, 2018 •

edited

Loading

robin865 commented Feb 6, 2018 •

edited

Loading

absolute8511 commented Feb 8, 2018 •

edited

Loading

absolute8511 commented Mar 1, 2018 •

edited

Loading

absolute8511 commented Mar 31, 2018 •

edited

Loading