Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vendor: Backport bbolt freelist corruption and fragmentation fixes to 3.2 branch #8813

Merged
merged 2 commits into from
Nov 16, 2017

Conversation

jpbetz
Copy link
Contributor

@jpbetz jpbetz commented Nov 2, 2017

This PR backports all of coreos/bbolt v1.3.1-coreos.5 to the etcd 3.2 branch. This includes a larger but manageable set of bolt db related fixes. See https://github.com/coreos/bbolt/releases for full list of changes.

The most critical fixes we need from this backport are:

tx.WriteTo Freelist Corruption Fix

etcd-io/bbolt#67 was identified while testing this PR, but it impacts all etcd 3.2- releases. We tracked it down to data race in the bbolt db WriteTo function that is used by etcd HA recovery that can result in corrupted freelists in db files.

Fragmentation Fix

etcd-io/bbolt#3 prevents "runaway" bolt data file fragmentation under certain workloads. We've tested a fork of etcd 3.0.17 with a fork bolt db v1.3.0 that had the fix applied to it. We found that the fix improved the operability of kubernetes at scale.

For compatibility with all etcd 3.2.* releases, the NoFreelistSync option introduced in etcd-io/bbolt#1 will not be enabled.

@jpbetz jpbetz changed the title vendor: Switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 vendor: For release-3.2 branch, switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 Nov 2, 2017
@jpbetz jpbetz self-assigned this Nov 2, 2017
@gyuho gyuho requested a review from xiang90 November 2, 2017 22:21
@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 2, 2017

@xiang90 @gyuho Alternatively, I'd be willing to do this with a more selective backport of just etcd-io/bbolt#3, but given that this will also help us vet the bbolt improvements for the 3.3 branch, I figured I'd start with this proposal.

Let me know if I missed any backward compatibility issues you're aware of.

@@ -25,7 +25,7 @@ import (
"sync/atomic"
"time"

"github.com/boltdb/bolt"
bolt "github.com/coreos/bbolt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious on why naming the import github.com/coreos/bbolt to bolt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I see what's going on.

Copy link
Contributor Author

@jpbetz jpbetz Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to minimize the code change-- this way I only need to change the import, not any of the package references in the rest of the go file.

Copy link
Contributor Author

@jpbetz jpbetz Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For master this wouldn't make as much sense since that code lives on indefinitely, but for a backport to a branch that has a finite livespan it seemed simpler.

@gyuho
Copy link
Contributor

gyuho commented Nov 2, 2017

As long as NoFreelistSync: false.

Btw, functional-tester in release-3.2 fails https://semaphoreci.com/coreos/etcd/branches/pull-request-8813/builds/1.

connection refused (http://127.0.0.1:22379)

Probably, http://127.0.0.1:22379 node couldn't recover with this patch.

We can investigate more tomorrow.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 3, 2017

Sure enough. RELEASE_TEST=y INTEGRATION=y PASSES='functional' ./test passed locally. I'll dig in tomorrow.

@gyuho
Copy link
Contributor

gyuho commented Nov 3, 2017

@jpbetz Locally reproducible with my Linux VM

# tester log
2017-11-03 19:29:11.580149 I | etcd-agent: restart etcd
2017-11-03 19:29:16.612959 W | etcd-tester: #0 setHealthKey error (context deadline exceeded (http://127.0.0.1:22379))
2017-11-03 19:29:18.987222 I | etcd-tester: transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:54062->127.0.0.1:22379: read: connection reset by peer.
2017-11-03 19:29:18.987246 I | etcd-tester: transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:54046->127.0.0.1:22379: read: connection reset by peer.
2017-11-03 19:29:18.987256 I | etcd-tester: transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:54038->127.0.0.1:22379: read: connection reset by peer.
2017-11-03 19:29:18.987522 I | etcd-tester: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:22379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:22379 <nil>}
2017-11-03 19:29:18.988234 W | etcd-tester: #1 setHealthKey error (rpc error: code = Internal desc = transport is closing (http://127.0.0.1:22379))
2017-11-03 19:29:18.988321 I | etcd-tester: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:22379: operation was canceled"; Reconnecting to {127.0.0.1:22379 <nil>}
2017-11-03 19:29:18.988330 I | etcd-tester: grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
# agent-2/etcd.log shows it fatal-ed
2017-11-03 19:29:18.202508 I | rafthttp: established a TCP streaming connection with peer 82993e6b8783693c (stream Message writer)
2017-11-03 19:29:18.202627 I | rafthttp: peer ddd7d6d5ef97d9b6 became active
2017-11-03 19:29:18.202642 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream Message writer)
2017-11-03 19:29:18.202817 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream MsgApp v2 writer)
2017-11-03 19:29:18.202998 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream Message reader)
2017-11-03 19:29:18.203041 I | rafthttp: established a TCP streaming connection with peer 82993e6b8783693c (stream MsgApp v2 writer)
2017-11-03 19:29:18.203220 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream MsgApp v2 reader)
2017-11-03 19:29:18.203409 W | rafthttp: closed an existing TCP streaming connection with peer 82993e6b8783693c (stream MsgApp v2 writer)
2017-11-03 19:29:18.203427 I | rafthttp: established a TCP streaming connection with peer 82993e6b8783693c (stream MsgApp v2 writer)
2017-11-03 19:29:18.203601 W | rafthttp: closed an existing TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream Message writer)
2017-11-03 19:29:18.203621 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream Message writer)
2017-11-03 19:29:18.208041 W | rafthttp: closed an existing TCP streaming connection with peer 82993e6b8783693c (stream Message writer)
2017-11-03 19:29:18.208183 I | rafthttp: established a TCP streaming connection with peer 82993e6b8783693c (stream Message writer)
2017-11-03 19:29:18.208228 W | rafthttp: closed an existing TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream MsgApp v2 writer)
2017-11-03 19:29:18.208246 I | rafthttp: established a TCP streaming connection with peer ddd7d6d5ef97d9b6 (stream MsgApp v2 writer)
2017-11-03 19:29:18.220388 I | rafthttp: receiving database snapshot [index:426472, from ddd7d6d5ef97d9b6] ...
2017-11-03 19:29:18.304989 W | rafthttp: closed an existing TCP streaming connection with peer 82993e6b8783693c (stream MsgApp v2 writer)
2017-11-03 19:29:18.305018 I | rafthttp: established a TCP streaming connection with peer 82993e6b8783693c (stream MsgApp v2 writer)
2017-11-03 19:29:18.930007 I | snap: saved database snapshot to disk [total bytes: 138330112]
2017-11-03 19:29:18.930035 I | rafthttp: received and saved database snapshot [index: 426472, from: ddd7d6d5ef97d9b6] successfully
2017-11-03 19:29:18.930096 I | raft: 334422354dfdd49d [commit: 349634, lastindex: 349635, lastterm: 6] starts to restore snapshot [index: 426472, term: 7]
2017-11-03 19:29:18.930129 I | raft: log [committed=349634, applied=349634, unstable.offset=349636, len(unstable.Entries)=0] starts to restore snapshot [index: 426472, term: 7]
2017-11-03 19:29:18.930153 I | raft: 334422354dfdd49d restored progress of 334422354dfdd49d [next = 426473, match = 426472, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-11-03 19:29:18.930162 I | raft: 334422354dfdd49d restored progress of 82993e6b8783693c [next = 426473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-11-03 19:29:18.930169 I | raft: 334422354dfdd49d restored progress of ddd7d6d5ef97d9b6 [next = 426473, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
2017-11-03 19:29:18.930178 I | raft: 334422354dfdd49d [commit: 426472] restored snapshot [index: 426472, term: 7]
2017-11-03 19:29:18.930321 I | etcdserver: applying snapshot at index 340992...
2017-11-03 19:29:18.933731 I | etcdserver: raft applied incoming snapshot at index 426472
2017-11-03 19:29:18.945241 I | etcdserver: recovering lessor...
2017-11-03 19:29:18.945431 C | mvcc/backend: cannot commit tx (write agent-2/etcd.data/member/snap/db: file too large)

I just pulled down this branch and ran with Makefile:

make build -f ./hack/scripts-dev/Makefile
TEST_OPTS="PASSES='functional'" make test -f ./hack/scripts-dev/Makefile

Have you seen mvcc/backend: cannot commit tx (write agent-2/etcd.data/member/snap/db: file too large)?

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 6, 2017

Thanks @gyuho, I'll try to reproduce today and get to the bottom of it.

@gyuho
Copy link
Contributor

gyuho commented Nov 13, 2017

@jpbetz Any updates on this?

I have another big set of changes into release-3.2 branch (just for client #8867) with some conflicting code in glide.lock if merged first.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 13, 2017

@gyuho Go ahead with your backport, I'll rebase. I'm looking at this more today (wasn't able to get time last week), but don't yet fully understand the issue.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 14, 2017

The "file too large" error appears to be surfacing a major issue.

Here's the bbolt stack when the error is returned:

os.File.WriteAt() // returns 'file too large' err
bbolt.Tx.db.ops.writeAt(buf, offset)
bbolt.Tx.write()
bbolt.Tx.Commit()
etcd.mvcc.backend.batchTx.commit()

This is failing when writing to a 113 MB file on a mount point with over 200GB of free space:

C | mvcc/backend: cannot commit tx (error writing dirty pages to disk: error writing chunk of size 4096 to file of size 118669312: cannot commit tx (write agent-2/etcd.data/member/snap/db: file too large)

This is intermittent on my local machine where it occurs once every 3-6 runs. On semaphore CI tests, it fail consistently.

Semaphore failures:

2017-11-03 18:38:53.408388 I | etcd-tester: [round#0 case#5] recovering failure "kill leader member for a long time and expect it to recover from an incoming snapshot"
2017-11-03 18:38:53.408497 I | etcd-agent: restart etcd
... 6 minutes pass ...
2017-11-03 18:44:56.929771 W | etcd-tester: [round#0 case#5] functional-tester returning with error (recovery error: dial tcp 127.0.0.1:32379: getsockopt: connection refused (http://127.0.0.1:32379)

This suggests a failure in recovery, or an excessively slow recovery.

@gyuho The best next step may be to improve our functional testing to persist the etcd.log files for semaphore runs as that would tell a lot more about what is actually going on with the etcd instance when attempting to recover. What do you think?

@gyuho
Copy link
Contributor

gyuho commented Nov 14, 2017

@jpbetz

improve our functional testing to persist the etcd.log files

We store data and logs locally, but do not output the etcd server logs in semaphore. Plus, etcd tester (failure injecting process) does not know which node has crashed. We might need a bit of bash hacks to grep those etcd server errors, if we cannot reproduce locally.

This suggests a failure in recovery, or an excessively slow recovery, which oddly, I have not been able to replicate locally.

Yeah, seems like a semaphore issue?

Btw, I am running functional tests with this branch on actual 5-node cluster and there's no failure since this morning (http://dash.etcd.io/dashboard/db/functional-tests?orgId=1).

I will try to reproduce locally in linux VMs.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 14, 2017

We store data and logs locally, but do not output the etcd server logs in semaphore.

We could add semaphore 'After job' commands to cat all 3 agent logs, I don't think they're all that big. Would that be okay?

We'd still have to correlate errors using the correct agent-* log and time, but we'd at least have a record of the etcd errors that we can dig in to when tests fail.

@gyuho
Copy link
Contributor

gyuho commented Nov 14, 2017

We could add semaphore 'After job' commands to cat all 3 agent logs, I don't think they're all that big. Would that be okay?

Sounds good.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 14, 2017

@gyuho Semaphore updated. Trying it out with this PR now.

@gyuho
Copy link
Contributor

gyuho commented Nov 14, 2017

@jpbetz We need to remove this line
https://github.com/coreos/etcd/blob/6f077bd74c5714dfa9221d8f3cb84f856e36ea42/test#L141?

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 14, 2017

@gyuho Just discovered the semaphore SSH option. This is amazing. Hope to get resolution on this today.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 15, 2017

Quick update. SSHd into semaphore and did some debugging. The semaphore failure is consistently caused by the 'file too large' error during the lease stress tests. The error is caused by a file.writeAt() operation on a modest sized file (last failure it was 87 MB). Semaphore jobs run on ext4 and there was around 2.3G free on the mount used by the jobs after the test run. "file size" ulimit is unlimited. The write is always for 4k bytes (bbolt page size).

Current theory is that while the file size is modest and the data being written would fit, the offset passed to file.writeAt() is high and causing the write to increase the size the file dramatically. I'll verify this next. If correct, I'll need to trace back why the offset is set to the value it is.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 15, 2017

writeAt() offset for failing write: 369435907006464

That would expand the file from ~100MB to ~330TB!

Exact logging added to tx.go:write():

if _, err := tx.db.ops.writeAt(buf, offset); err != nil {
  stat, ferr := tx.db.file.Stat()
  if ferr != nil {
    ...
  }
  return fmt.Errorf("error writing chunk of size %d, offset %d, page id %d, (%#+v) to file of size %d (%#+v): %s", sz, offset, p.id, p, stat.Size(), stat, err)
}

output:

mvcc/backend: cannot commit tx (error writing chunk of size 4096, offset 369435907006464, page id 90194313234, (&bolt.page{id:0x1500000012, flags:0x10, count:0x1e, overflow:0x0, ptr:0x7040}) to file of size 117706752 (&os.fileStat{name:"db", size:117706752, mode:0x180, modTime:time.Time{sec:63646376479, nsec:414148099, loc:(*time.Location)(0x145c820)}, sys:syscall.Stat_t{Dev:0xfc00, Ino:0x1114db, Nlink:0x1, Mode:0x8180, Uid:0x0, Gid:0x0, X__pad0:0, Rdev:0x0, Size:117706752, Blksize:4096, Blocks:229904, Atim:syscall.Timespec{Sec:1510779679, Nsec:462149753}, Mtim:syscall.Timespec{Sec:1510779679, Nsec:414148099}, Ctim:syscall.Timespec{Sec:1510779679, Nsec:462149753}, X__unused:[3]int64{0, 0, 0}}}): write agent-3/etcd.data/member/snap/db: file too large)

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

Narrowed this down to reading a bolt db file with a meta like:

bolt page ~/Documents/bbolt-issue-logs/etcd.data/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=28735>
Freelist:   <pgid=28736>
HWM:        <pgid=28737>
Txn ID:     762
Checksum:   2139d06a29492655

where the freelist page id references a page that is not a freelist:

./bolt page ~/Documents/bbolt-issue-logs/etcd.data/member/snap/db 28736
Page ID:    28736
Page Type:  leaf
Total Size: 4096 bytes
Item Count: 30

I've submitted etcd-io/bbolt#66 to detect such a case and fail fast, but I still need to sort out why the freelist was not written correctly. Note that we use NoFreelistSync: false here since we need to be compatible with 3.2 member/snap/db files. On 3.3+ we set NoFreelistSync: true and would never read the freelist from disk and, hence, would not encounter this issue.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

The CI failure happens when the db file is snapshotted by a node using bbolt tx.go:WriteTo and then sent to another node for recovery. WriteTo opens the db file directly, in read only mode in this case, and then copies it to Writer for sending it over the network. I'll add some tests to make see how WriteTo handles NoFreelistSync: false, the CI failure strongly suggests that the freelist does not get recorded correctly here.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

Does bbolt's Tx.WriteTo function have incomplete transaction isolation? https://github.com/coreos/bbolt/blob/master/tx.go#L357

It looks like the copy from the db's file to the writer can occur concurrent with writes to the same file. Could a call to WriteTo from a read-only bbolt transaction concurrent with bbolt write transactions result in corrupted freelists? Bolt mvcc guarantees the leaf and branch nodes will not be overwritten when there are ongoing read transactions, but the freelist pages can be overwritten, right?

If so that might explain the issue here. I’ll try to replicate.

@gyuho

@gyuho
Copy link
Contributor

gyuho commented Nov 16, 2017

If I understand correct, freelist writes can concur with tx.WriteTo.

the freelist pages can be overwritten

*Tx.Commit -> tx.commitFreelist() (overwrites freelists)

But I am not sure how this causes corruption in io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2)), when commitFreelist doesn't modify tx.meta.pgid, tx.db.pageSize ?

Do you know exactly where that corruption happens? I will try to reproduce once I finish the gRPC backports as well.

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

WriteTo breaks the assumption that freelists are only ever needed by write transactions. WriteTo needs the freelist to write a valid db file, but it can be called in read transactions.

The corruption would happen here when a bbolt write transaction persists pages to the db file that overwrite the pages that WriteTo expects to still contain a valid freelist. This can happen because the pages containing the freelist are themselves freed when the first write transaction that writes a newer freelist completes.

Looks like @heyitsanthony just submitted the fix: etcd-io/bbolt#67

@gyuho
Copy link
Contributor

gyuho commented Nov 16, 2017

@jpbetz Yeah can you rebase with https://github.com/coreos/bbolt/releases/tag/v1.3.1-coreos.5? Thanks!

@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

One interesting thing about this issue- It occurred consistently on semaphore CI but rarely on my development box. My current theory is that high latency blockIO on semaphore tickles this WriteTo race condition.

@jpbetz jpbetz changed the title vendor: For release-3.2 branch, switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.3 vendor: For release-3.2 branch, switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.5 Nov 16, 2017
@gyuho
Copy link
Contributor

gyuho commented Nov 16, 2017

Possibly. I couldn't reproduce this with my VMs either.

@jpbetz jpbetz changed the title vendor: For release-3.2 branch, switch from boltdb v1.3.0 to coreos/bbolt v1.3.1-coreos.5 vendor: Backport bbolt freelist corruption and fragmentation fixes to 3.2 branch Nov 16, 2017
…t after

This is primarily so CI tooling can capture the agent logs after the functional tester runs.
@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 16, 2017

Rebased

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. thanks!

@jpbetz jpbetz merged commit 1557f8b into etcd-io:release-3.2 Nov 16, 2017
gyuho added a commit that referenced this pull request Nov 22, 2017
…elease-3.1

Automated cherry pick of #8813 release 3.1
@jpbetz
Copy link
Contributor Author

jpbetz commented Nov 27, 2017

@wojtek-t: Just FYI this is being backported to 3.1 tomorrow as well. I mention it because the tx.WriteTo Freelist Corruption Fix impacts all HA clusters.

@jpbetz
Copy link
Contributor Author

jpbetz commented Oct 16, 2018

For posterity:

This is known to cause corruption in both peer snapshot send operations as well as corruption in backup creation (e.g. 'etcdctl snapshot save'). #10109 adds detection of this corruption, and many others when etcdctl snapshot status is run.

To avoid this issue, the minimum recommended versions of etcd to run in production are:
3.1.11+
3.2.10+
3.3.0+

see also https://groups.google.com/d/msg/etcd-dev/nZQl17RjxHQ/FkC_rZ_4AwAJ

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants