etcdserver: renaming db happens after snapshot persists to wal and snap files #7876

fanminshi · 2017-05-04T18:43:36Z

In the case that a follower receives a snapshot from the leader
and crashes before renaming xxx.snap.db to db but after the
snapshot has persisted to .wal and .snap, restarting the
follower results in loading an old db, a new .wal, and a new .snap.
This causes an index mismatch between the snap metadata index
and the consistent index from db.

This PR forces an ordering where saving/renaming db must
happen after the snapshot is persisted to the wal and snap files.
This guarantees the wal and snap files are newer than db.
On server restart, the etcd server checks if snap index > db consistent index.
If yes, the etcd server attempts to load xxx.snap.db, where xxx = snap index,
if there is one, and panics otherwise.

FIXES #7628

gyuho · 2017-05-04T18:58:07Z

etcdserver/util.go

@@ -95,3 +96,19 @@ func (nc *notifier) notify(err error) {
 	nc.err = err
 	close(nc.c)
 }
+
+func loadBackend(bepath string, be *backend.Backend, QuotaBackendBytes int64) {


can we just return backend.Backend?
And s/QuotaBackendBytes/quotaBackendBytes/?

I'll do that.

gyuho · 2017-05-04T18:58:31Z

snap/db.go

@@ -52,23 +55,26 @@ func (s *Snapshotter) SaveDBFrom(r io.Reader, id uint64) (int64, error) {
 		return n, err
 	}

-	plog.Infof("saved database snapshot to disk [total bytes: %d]", n)
-
+	plog.Infof("saved database snapshot %v to disk [total bytes: %d]", fn, n)


%s to disk?

gyuho · 2017-05-04T19:00:46Z

etcdserver/server.go

+			// 1. check if xxx.snap.db (xxx==snapshot.Metadata.Index) exists.
+			// 2. rename xxx.snap.db to db if exists.
+			// 3. load backend again with the new db file.
+			snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)


Can we just use DBFilePath?

Do you mean the snap.DBFilePath() function? func (s *Snapshotter) DBFilePath(id uint64) (string, error) is tied to an object. That's why I created another function, GetDBFilePathByID(), that's a package-level function.

Oh ok, we can't do s.r.storage.DBFilePath because we haven't created EtcdServer yet.

fanminshi · 2017-05-04T19:18:48Z

all fixed @gyuho

gyuho · 2017-05-04T19:22:13Z

etcdserver/server.go

+			if snapfn != "" {
+				if err := os.Rename(snapfn, bepath); err != nil {
+					plog.Panicf("rename snapshot file error: %v", err)
+				}


Shouldn't we close existing be before we rename and re-open?

Defer to @xiang90 and @heyitsanthony

Thanks!

heyitsanthony · 2017-05-04T19:53:48Z

Why doesn't etcd rename the db snapshot before the raft snapshot is written down? Patching over a lost rename after the raft snapshot is saved could be masking atomicity problems with the snapshot path.

fanminshi · 2017-05-04T20:22:59Z

@heyitsanthony

Saving snapshot from the leader and renaming it is not an atomic procedure in the original code.

1. Follower saves the snapshot from the leader as xxx.snap.db at https://github.com/coreos/etcd/blob/master/rafthttp/http.go#L205
2. Follower then renames xxx.snap.db to db at https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L812

The node from #7628 has proceeded to step 1 but crashed before step 2 happened.

heyitsanthony · 2017-05-04T20:26:43Z

@fanminshi yes, that's what the code does. I'm asking if the snapshot saving protocol is correct since raft is saying a snapshot was applied but on reboot it turns out it is not since the rename didn't take.

fanminshi · 2017-05-04T20:53:41Z

I looked at the code again. It appears to me that the saving protocol is not correct.

Saving .wal and .snap from the incoming snapshot happens in a separate goroutine from renaming xxx.snap.db to db; hence, there is no way to guarantee that db, .wal, and .snap are in sync.

// save snapshot to .wal and .snap
if !raft.IsEmptySnap(rd.Snapshot) {
	// gofail: var raftBeforeSaveSnap struct{}
	if err := r.storage.SaveSnap(rd.Snapshot); err != nil {
		plog.Fatalf("raft save snapshot error: %v", err)
	}
	// gofail: var raftAfterSaveSnap struct{}
	r.raftStorage.ApplySnapshot(rd.Snapshot)
	plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index)
	// gofail: var raftAfterApplySnap struct{}
}

https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L227-L231

Saving protocol must guarantee .wal, .snap, db are saved in one atomic operation, or else two situations can happen: db is new while .wal and .snap are old, or db is old while .wal and .snap are new (as shown in #7628).

I am not sure if that guarantee can be achieved.
One solution is to minimize the spanning window of saving .wal, .snap, db so the chance of divergence is small.
Another solution is to optimistically recover on restart.

heyitsanthony · 2017-05-04T21:05:08Z

Saving protocol must guarantee .wal, .snap, db are saved in one atomic operation

It cannot and that is not necessary. All that's needed is an ordering so that etcd won't try to load the snapshot before it's been entirely applied to disk.

one solution is to minimize the spanning window of saving .wal, .snap, db so the chance of diverge is small.

This will lead to corruption/data loss.

another solution is optimistically recover on restart.

This will lead to corruption/data loss.

fanminshi · 2017-05-04T21:18:21Z

@heyitsanthony the ordering I am thinking about is to save/rename db before .wal and .snap are saved.

case: follower did not save db and crashed.
Ok: the restarted follower will use the old db and old .wal, and the leader will send the snapshot again.

case: follower saved the new db and crashed before saving .wal and .snap.
db is new but .wal and .snap are old. I am not sure how the follower should handle this case. From my testing, I saw the follower panic in the raft layer if the follower is brand new. (I need to investigate this one more.)

case: follower saved db, .wal, and .snap:
Ok: follower should restart without any problem.

heyitsanthony · 2017-05-04T22:05:17Z

@fanminshi any write-out policy that guarantees the invariant snapshot.Metadata.Index <= db.ConsistentIndex() (checked here: https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L457) on reboot, when loading from the starting db file, should be OK.

codecov-io · 2017-05-04T22:51:06Z

Codecov Report

❗ No coverage uploaded for pull request base (master@a53a9e1).
The diff coverage is 60.86%.

@@            Coverage Diff            @@
##             master    #7876   +/-   ##
=========================================
  Coverage          ?   75.62%           
=========================================
  Files             ?      332           
  Lines             ?    26316           
  Branches          ?        0           
=========================================
  Hits              ?    19901           
  Misses            ?     4979           
  Partials          ?     1436

Impacted Files	Coverage Δ
etcdserver/util.go	`78.46% <53.33%> (ø)`
etcdserver/server.go	`80.43% <66.66%> (ø)`
snap/db.go	`66.66% <80%> (ø)`
etcdserver/raft.go	`87.87% <80%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a53a9e1...dfdaf08.

fanminshi · 2017-05-04T23:14:06Z

All fixed. PTAL

heyitsanthony · 2017-05-04T23:17:22Z

etcdserver/raft.go

@@ -223,6 +226,8 @@ func (r *raftNode) start(rh *raftReadyHandler) {
 				// gofail: var raftAfterSave struct{}

 				if !raft.IsEmptySnap(rd.Snapshot) {
+					// waits etcd server to finish renaming snap db to db.
+					<-replaceDBDone


can this reuse raftDone instead of needing a separate channel?

It seems to me that the purpose of raftDone is to notify etcdserver that raft has persisted the log.

The purpose of replaceDBDone is to notify raft that etcdserver has renamed the snapshot db to db.

Both chans serve different purposes. I don't see a good way to control the above logic with one chan.

The applier will always post to raftDone if there's a snapshot and the raft loop will always wait on raftDone if there's a snapshot. What's to control? I want to avoid the extra channel allocation on this path if possible.

kk, it seems to me that raftDone should act as an execution-flow chan between the etcd server and raft, since both need to send msgs to it. I am not sure the name raftDone makes sense anymore.
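
A single channel carrying notifications in both directions, as discussed in this thread, could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the `apply` struct is reduced to one field, `runHandshake` and the event strings are made up for the example, and the channel is unbuffered so that a goroutine can never receive its own notification. The order shown (rename first, then persist) matches the version of the patch under review at this point in the thread.

```go
package main

import "fmt"

// apply is a simplified stand-in for the struct handed from the raft loop
// to the applier; only the shared channel matters for this sketch.
type apply struct {
	notifyc chan struct{} // unbuffered, used in both directions
}

// runHandshake runs one snapshot hand-off and returns the order of steps.
func runHandshake() []string {
	var events []string
	applyc := make(chan apply)
	done := make(chan struct{})

	// Applier side (plays the role of the etcd server).
	go func() {
		defer close(done)
		ap := <-applyc
		events = append(events, "server: rename snapshot db to db")
		ap.notifyc <- struct{}{} // tell the raft loop the rename finished
		<-ap.notifyc             // wait until the raft loop has persisted
		events = append(events, "server: apply entries")
	}()

	// Raft loop side.
	ap := apply{notifyc: make(chan struct{})}
	applyc <- ap
	<-ap.notifyc // wait for the rename
	events = append(events, "raft: persist snapshot to wal and snap")
	ap.notifyc <- struct{}{} // tell the applier that raft is done

	<-done
	return events
}

func main() {
	for _, e := range runHandshake() {
		fmt.Println(e)
	}
}
```

Each append to `events` is separated from the next by a channel operation, so the slice is never touched by both goroutines at once. The trade-off is that the meaning of a message on the channel depends on where each side is in the protocol, which is the naming concern raised above.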

heyitsanthony · 2017-05-04T23:18:15Z

test case?

fanminshi · 2017-05-04T23:39:49Z

@heyitsanthony I manually injected a panic before renaming db https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L812 to simulate #7628.

Is there a simple way to programmatically simulate the following?

1. leader transfers a snap to a follower
2. follower panics before renaming
3. follower restarts with a new binary that doesn't have the panic.

heyitsanthony · 2017-05-04T23:46:30Z

@fanminshi it might be easier to use the Recorder mocks to test the order of operations is obeyed. Something like TestSnapshot in server_test.go.

fanminshi · 2017-05-05T21:06:37Z

Add a test. PTAL

heyitsanthony · 2017-05-05T21:09:47Z

etcdserver/server_test.go

+	}
+}
+
+// func TestSnapshotOrdering2(t *testing.T) {


heyitsanthony · 2017-05-05T21:10:46Z

etcdserver/server_test.go

+
+	testdir, err := ioutil.TempDir(os.TempDir(), "testsnapdir")
+	if err != nil {
+		t.Fatalf("Couldn't open tempdir (%v)", err)


s/C/c; error messages are lower case

heyitsanthony · 2017-05-05T21:10:53Z

etcdserver/server_test.go

+	}
+	defer os.RemoveAll(testdir)
+	if err := os.MkdirAll(testdir+"/member/snap", 0755); err != nil {
+		t.Fatalf("Couldn't make snap dir (%v)", err)


heyitsanthony · 2017-05-05T21:12:31Z

etcdserver/server_test.go

+				seenDBFilePath = true
+			case "SaveSnap":
+				if !seenDBFilePath {
+					t.Fatalf("expect DBFilePath calls before SaveSnap, but it is other way around")


better error message please

"SaveSnap() calls before DBFilePath(), but it shouldn't." ?

SaveSnap called before DBFilePath

heyitsanthony · 2017-05-05T21:15:41Z

etcdserver/server_test.go

+	n.readyc <- ready
+	var seenDBFilePath bool
+	timer := time.After(5 * time.Second)
+	for {


why loop? is the sequence nondeterministic?

Save() https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L217 can be called before or after DBFilePath().

The sequence can be [Save(), DBFilePath(), SaveSnap()] or [DBFilePath(), Save(), SaveSnap()]. All I care about is that DBFilePath() happens before SaveSnap().

yeah I got that from the first comment

heyitsanthony · 2017-05-05T21:19:02Z

etcdserver/server_test.go

+	s.applyV2 = &applierV2store{store: s.store, cluster: s.cluster}
+
+	be, tmpPath := backend.NewDefaultTmpBackend()
+	defer func() {


defer os.RemoveAll(tmpPath)

heyitsanthony · 2017-05-05T21:20:39Z

etcdserver/raft.go

@@ -83,7 +83,7 @@ type RaftTimer interface {
 type apply struct {
 	entries  []raftpb.Entry
 	snapshot raftpb.Snapshot
-	raftDone <-chan struct{} // rx {} after raft has persisted messages
+	notifyc  chan struct{} // notifyc acts a bridge between etcdserver and raftNode


// notifyc synchronizes etcd server applies with the raft node
notifyc chan struct{}

heyitsanthony · 2017-05-05T21:21:31Z

etcdserver/raft.go

@@ -223,6 +223,8 @@ func (r *raftNode) start(rh *raftReadyHandler) {
 				// gofail: var raftAfterSave struct{}

 				if !raft.IsEmptySnap(rd.Snapshot) {
+					// waits etcd server to finish renaming snap db to db.


// wait for snapshot db to be renamed to in-use db

heyitsanthony · 2017-05-05T21:22:21Z

etcdserver/server.go

@@ -812,6 +812,8 @@ func (s *EtcdServer) applySnapshot(ep *etcdProgress, apply *apply) {
 	if err := os.Rename(snapfn, fn); err != nil {
 		plog.Panicf("rename snapshot file error: %v", err)
 	}
+	// notifies raftNode that db has been replaced.


// raft can now claim the snapshot has been applied

heyitsanthony · 2017-05-05T21:23:51Z

etcdserver/server_test.go

@@ -951,6 +951,149 @@ func TestSnapshot(t *testing.T) {
 	<-ch
 }

+// TestSnapshotOrdering ensures that when applying snapshot etcdserver renames snap db to db before raft persists snapshot to wal and snap files.


// TestSnapshotOrdering ensures the snapshot db is applied before acknowledging the snapshot in raft

fanminshi · 2017-05-05T21:55:45Z

All fixed. PTAL

heyitsanthony · 2017-05-05T22:03:49Z

etcdserver/server_test.go

+	}
+
+	rs := raft.NewMemoryStorage()
+	p := mockstorage.NewStorageRecorderStream(testdir)


p := mockstorage.NewStorageRecorder(testdir)

otherwise hangs on failure

guess not, NewStorageRecorder doesn't work with the fix...

fanminshi · 2017-05-05T22:13:03Z

merge when green.

heyitsanthony · 2017-05-08T20:43:45Z

@fanminshi is this ready to merge?

xiang90 · 2017-05-08T20:50:53Z

@heyitsanthony @fanminshi

I have a concern about this approach. If we blindly pick up the db file with the current snapshot recovery ordering (db -> snap), we might end up starting the server with an old snap file and a new db.

Then we have a storage state that is ahead of the raft index.

We probably need to either roll back the db file or move forward with the unfinished snapshot process during a restart when the etcd server failed in the middle of snapshot recovery.

fanminshi · 2017-05-08T21:30:33Z

I think a storage state that's ahead of the raft index doesn't make sense; we expect db state <= raft index. I am working on a new fix to address @xiang90's concerns.

The order right now is: 1. save db snapshot xxx.snap.db, 2. rename xxx.snap.db to db, 3. persist snapshot to snap and wal.

If the etcd server fails before 3, then db is ahead of raft.

Suppose the new ordering is: 1. save db snapshot xxx.snap.db, 2. persist snapshot to snap and wal, 3. rename xxx.snap.db to db.

A failure before or after 1 triggers the leader to resend the snapshot.
A failure after 2 and before 3 results in snap and wal being newer than db. Since step 1 must happen before step 2, all we need to do is rename xxx.snap.db to db and load that on server restart.
A failure after 3 is fine since the etcd server will load the most recent db and snap on restart.
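
The crash cases for the new ordering can be walked through with a small model. This is a sketch under stated assumptions, not etcd code: `diskState`, `crashAfter`, and `restart` are hypothetical names, the on-disk files are reduced to three integers, and the index values are arbitrary.

```go
package main

import "fmt"

// diskState is a simplified model of what a follower has on disk.
type diskState struct {
	dbIndex     uint64 // consistent index stored in db
	raftIndex   uint64 // snapshot index recorded in .wal/.snap
	snapDBIndex uint64 // index of a pending xxx.snap.db file, 0 if none
}

// crashAfter returns the state left on disk if the follower, starting from
// old, crashes right after completing the given step of the new ordering
// (0 means it crashed before step 1).
func crashAfter(step int, old diskState, newIndex uint64) diskState {
	s := old
	if step >= 1 {
		s.snapDBIndex = newIndex // 1. save db snapshot as xxx.snap.db
	}
	if step >= 2 {
		s.raftIndex = newIndex // 2. persist snapshot to snap and wal
	}
	if step >= 3 {
		s.dbIndex = newIndex // 3. rename xxx.snap.db to db
		s.snapDBIndex = 0
	}
	return s
}

// restart describes what the follower does with a given on-disk state.
func restart(s diskState) string {
	switch {
	case s.raftIndex <= s.dbIndex:
		return "start from db as-is"
	case s.snapDBIndex == s.raftIndex:
		return "rename xxx.snap.db to db, then start"
	default:
		return "panic: no snapshot db for raft snapshot index"
	}
}

func main() {
	old := diskState{dbIndex: 3, raftIndex: 3}
	for step := 0; step <= 3; step++ {
		s := crashAfter(step, old, 7)
		fmt.Printf("crash after step %d: %+v -> %s\n", step, s, restart(s))
	}
}
```

In this model a crash before step 2 leaves raft and db at the old index, so the follower starts normally and relies on the leader to resend the snapshot. Only a crash between steps 2 and 3 needs the rename on restart.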

fanminshi · 2017-05-08T22:03:27Z

fix db state ahead of raft issue. PTAL

heyitsanthony · 2017-05-08T22:11:27Z

snap/db.go

+	return GetDBFilePathByID(s.dir, id)
+}
+
+func GetDBFilePathByID(dbPath string, id uint64) (string, error) {


avoid public functions when not used outside of package

dbFilePathFromID

I need to export this function so that etcdserver can use it to find xxx.snap.db.

xiang90 · 2017-05-08T22:56:14Z

etcdserver/util.go

@@ -95,3 +96,20 @@ func (nc *notifier) notify(err error) {
 	nc.err = err
 	close(nc.c)
 }
+
+func loadBackend(bepath string, quotaBackendBytes int64) (be backend.Backend) {


openBackend?

will change.

xiang90 · 2017-05-08T23:09:30Z

etcdserver/server.go

@@ -385,6 +372,25 @@ func NewServer(cfg *ServerConfig) (srv *EtcdServer, err error) {
 				plog.Panicf("recovered store from snapshot error: %v", err)
 			}
 			plog.Infof("recovered store from snapshot at index %d", snapshot.Metadata.Index)
+


probably move this part to a new func.

fanminshi · 2017-05-08T23:09:56Z

etcdserver/server.go

+			kv := mvcc.New(be, &lease.FakeLessor{}, &cIndex)
+			kvindex := kv.ConsistentIndex()
+			if snapshot.Metadata.Index > kvindex {
+				snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)


@heyitsanthony using GetDBFilePathByID() here.

xiang90 · 2017-05-08T23:15:23Z

etcdserver/server.go

+			if snapshot.Metadata.Index > kvindex {
+				snapfn, err := snap.GetDBFilePathByID(cfg.SnapDir(), snapshot.Metadata.Index)
+				if err != nil {
+					plog.Panic(err)


need a better panic message.

"finding xxx.snap.db error: w/e"?

xiang90 · 2017-05-08T23:15:34Z

etcdserver/server.go

+					plog.Panic(err)
+				}
+				if err := os.Rename(snapfn, bepath); err != nil {
+					plog.Panicf("rename snapshot file error: %v", err)


need a better panic message.

we need more context on this error.

"renaming xxx.snap.db to db error: w/e"?

error recovering etcd from snapshot: failed to...

fanminshi · 2017-05-09T18:51:02Z

all fixed. PTAL

gyuho · 2017-05-09T19:01:31Z

etcdserver/util.go

@@ -15,11 +15,19 @@
 package etcdserver

 import (
+	"os"
 	"time"



gofmt this blank line?

xiang90 · 2017-05-09T19:25:16Z

@fanminshi LGTM in general. Defer to @gyuho and @heyitsanthony

xiang90 · 2017-05-09T19:25:34Z

etcdserver/raft.go

 					// gofail: var raftAfterSaveSnap struct{}
 					r.raftStorage.ApplySnapshot(rd.Snapshot)
 					plog.Infof("raft applied incoming snapshot at index %d", rd.Snapshot.Metadata.Index)
 					// gofail: var raftAfterApplySnap struct{}
+


remove this line?

heyitsanthony · 2017-05-09T20:38:12Z

lgtm after fixing line formatting suggestions


fanminshi · 2017-05-09T21:01:40Z

fixed formatting issue. merging when green.

fanminshi force-pushed the fix_7628 branch from ad787e1 to b6ad71c on May 4, 2017 18:44

gyuho reviewed May 4, 2017

fanminshi force-pushed the fix_7628 branch from b6ad71c to 7b16bbd on May 4, 2017 19:17

gyuho reviewed May 4, 2017

fanminshi force-pushed the fix_7628 branch from 7b16bbd to 26a0db8 on May 4, 2017 19:29

fanminshi force-pushed the fix_7628 branch from 26a0db8 to d31e1cb on May 4, 2017 23:13

fanminshi changed the title ~~etcdserver: renames xxx.snap.db to db in NewServer()~~ etcdserver: renaming db happens before snapshot persists to wal and snap files May 4, 2017

heyitsanthony suggested changes May 4, 2017

fanminshi force-pushed the fix_7628 branch 2 times, most recently from 7590bf3 to 1e58b07 on May 5, 2017 21:05

fanminshi force-pushed the fix_7628 branch from 1e58b07 to 2b1b1ba on May 5, 2017 21:21

heyitsanthony suggested changes May 5, 2017

fanminshi force-pushed the fix_7628 branch from 2b1b1ba to 2d6bd2f on May 5, 2017 21:54

heyitsanthony reviewed May 5, 2017

heyitsanthony approved these changes May 5, 2017

fanminshi force-pushed the fix_7628 branch from 2d6bd2f to 8cac929 on May 8, 2017 22:01

heyitsanthony reviewed May 8, 2017

fanminshi changed the title ~~etcdserver: renaming db happens before snapshot persists to wal and snap files~~ etcdserver: renaming db happens after snapshot persists to wal and snap files May 8, 2017

fanminshi force-pushed the fix_7628 branch from 8cac929 to 57e7830 on May 8, 2017 22:46

xiang90 reviewed May 8, 2017

fanminshi commented May 8, 2017

xiang90 reviewed May 8, 2017

fanminshi force-pushed the fix_7628 branch 2 times, most recently from 813a678 to cf77919 on May 9, 2017 17:56

gyuho reviewed May 9, 2017

xiang90 reviewed May 9, 2017

fanminshi added 2 commits May 9, 2017 14:00

etcdserver: add a test to ensure renaming db happens before persistin…

dfdaf08

…g wal and snap files

fanminshi force-pushed the fix_7628 branch from cf77919 to dfdaf08 on May 9, 2017 21:00

fanminshi merged commit 47f5b7c into etcd-io:master May 9, 2017

gyuho mentioned this pull request Mar 23, 2018

*: "--unsafe-overwrite-db" flag to support v2 migration with no previous v3 data #9484

Closed
