vendor: Backport bbolt freelist corruption and fragmentation fixes to 3.2 branch #8813
Conversation
@xiang90 @gyuho Alternatively, I'd be willing to do this with a more selective backport of just etcd-io/bbolt#3, but given that this will also help us vet the bbolt improvements for the 3.3 branch, I figured I'd start with this proposal. Let me know if I missed any backward compatibility issues you're aware of.
```diff
@@ -25,7 +25,7 @@ import (
 	"sync/atomic"
 	"time"

-	"github.com/boltdb/bolt"
+	bolt "github.com/coreos/bbolt"
```
I am curious why the import github.com/coreos/bbolt is named bolt?
nvm, I see what's going on.
Just to minimize the code change: this way I only need to change the import, not any of the package references in the rest of the Go file.
For master this wouldn't make as much sense since that code lives on indefinitely, but for a backport to a branch that has a finite lifespan it seemed simpler.
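As a side note for readers skimming the diff above, here is a minimal, hypothetical sketch of the aliasing approach; the file, function, and option values are illustrative and not taken from etcd's source:

```go
package main

import (
	"fmt"
	"log"
	"time"

	// The import path changes, but aliasing it to "bolt" means every
	// existing bolt.* reference below compiles unchanged.
	bolt "github.com/coreos/bbolt"
)

func main() {
	// Call sites keep using the old package name even though the
	// underlying dependency is now coreos/bbolt.
	db, err := bolt.Open("example.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	fmt.Println("opened:", db.Path())
}
```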
Btw, functional-tester in release-3.2 fails: https://semaphoreci.com/coreos/etcd/branches/pull-request-8813/builds/1.
Probably. We can investigate more tomorrow.
Sure enough.
@jpbetz Locally reproducible with my Linux VM.
I just pulled down this branch and ran with …
Have you seen … ?
Thanks @gyuho, I'll try to reproduce today and get to the bottom of it.
@gyuho Go ahead with your backport, I'll rebase. I'm looking at this more today (wasn't able to get time last week), but don't yet fully understand the issue.
The "file too large" error appears to be surfacing a major issue. Here's the bbolt stack when the error is returned:
This is failing when writing to a 113 MB file on a mount point with over 200 GB of free space:
This is intermittent on my local machine, where it occurs once every 3-6 runs. On semaphore CI tests, it fails consistently. Semaphore failures:
This suggests a failure in recovery, or an excessively slow recovery. @gyuho The best next step may be to improve our functional testing to persist the etcd server logs.
We store data and logs locally, but do not output the etcd server logs in semaphore. Plus, etcd tester (the failure-injecting process) does not know which node has crashed. We might need a few bash hacks to grep those etcd server errors if we cannot reproduce locally.
Yeah, seems like a semaphore issue? Btw, I am running functional tests with this branch on an actual 5-node cluster and there have been no failures since this morning (http://dash.etcd.io/dashboard/db/functional-tests?orgId=1). I will try to reproduce locally in Linux VMs.
We could add semaphore 'After job' commands to cat all 3 agent logs; I don't think they're all that big. Would that be okay? We'd still have to correlate errors using the correct agent-* log and time, but we'd at least have a record of the etcd errors that we can dig into when tests fail.
Sounds good.
@gyuho Semaphore updated. Trying it out with this PR now.
@jpbetz We need to remove this line.
@gyuho Just discovered the semaphore SSH option. This is amazing. Hope to get resolution on this today.
Quick update. SSHed into semaphore and did some debugging. The semaphore failure is consistently caused by the 'file too large' error during the lease stress tests. The error is caused by a write at a bad offset. Current theory is that while the file size is modest and the data being written would fit, the offset passed to writeAt is enormous.
That would expand the file from ~100MB to ~330TB! Exact logging added to the writeAt call:

```go
if _, err := tx.db.ops.writeAt(buf, offset); err != nil {
	stat, ferr := tx.db.file.Stat()
	if ferr != nil {
		...
	}
	return fmt.Errorf("error writing chunk of size %d, offset %d, page id %d, (%#+v) to file of size %d (%#+v): %s", sz, offset, p.id, p, stat.Size(), stat, err)
}
```

Output:
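To make the 'file too large' symptom concrete, here is a small standalone sketch (not etcd or bbolt code) showing how a write far past the filesystem's maximum file size surfaces as this error; the file name is illustrative and the exact behavior depends on the filesystem (e.g. ext4 rejects offsets past ~16 TiB, while others may just create a sparse file):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.Create("offset-demo.db") // throwaway file, name is illustrative
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// Writing at an offset in the hundreds-of-terabytes range asks the
	// filesystem to extend the file to roughly the scale mentioned above;
	// on ext4 this typically fails with "file too large" (EFBIG).
	const hugeOffset = int64(330) << 40 // ~330 TiB
	if _, err := f.WriteAt([]byte("x"), hugeOffset); err != nil {
		fmt.Println("write failed:", err)
	}
}
```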
Narrowed this down to reading a bolt db file with a meta like:
where the freelist page id references a page that is not a freelist:
I've submitted etcd-io/bbolt#66 to detect such a case and fail fast, but I still need to sort out why the freelist was not written correctly. Note that we use …
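For readers unfamiliar with bolt's page layout, here is a minimal, self-contained sketch of the kind of fail-fast check described above; it is not the actual etcd-io/bbolt#66 patch, and the struct, function, and error message are illustrative:

```go
package main

import "fmt"

// freelistPageFlag is the flag bolt uses to mark freelist pages.
const freelistPageFlag = 0x10

// page mirrors the few bolt page-header fields the check needs.
type page struct {
	id    uint64
	flags uint16
}

// checkFreelistPage verifies that the page the meta's freelist id points at
// is actually flagged as a freelist page. Failing fast at load time surfaces
// the corruption immediately, instead of letting a garbage freelist drive
// later writes to absurd offsets.
func checkFreelistPage(p *page) error {
	if p.flags&freelistPageFlag == 0 {
		return fmt.Errorf("invalid freelist page %d: page flags are %x", p.id, p.flags)
	}
	return nil
}

func main() {
	// A page whose flags do not include the freelist flag triggers the error.
	fmt.Println(checkFreelistPage(&page{id: 3, flags: 0x02}))
}
```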
The CI failure happens when the db file is snapshotted by a node using bbolt's WriteTo.
Does bbolt's WriteTo synchronize with concurrent writes to the db file? It looks like the copy from the db's file to the writer can occur concurrently with writes to the same file. Could a call to WriteTo race with one of those writes? If so, that might explain the issue here. I'll try to replicate.
If I understand correctly, freelist writes can concur with WriteTo. But I am not sure how this causes corruption in the snapshotted db file. Do you know exactly where that corruption happens? I will try to reproduce once I finish the gRPC backports as well.
The corruption would happen here, when a bbolt write transaction persists pages to the db file that overwrite the pages that WriteTo is still reading. Looks like @heyitsanthony just submitted the fix: etcd-io/bbolt#67
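For context, a minimal sketch of how a snapshot path typically drives bbolt's Tx.WriteTo; the snapshotTo helper, file names, and main function are illustrative rather than etcd's actual code. The race discussed above is between this copy and a concurrent read-write transaction committing pages to the same file:

```go
package main

import (
	"io"
	"log"
	"os"

	bolt "github.com/coreos/bbolt"
)

// snapshotTo copies the entire db file to w through a read-only transaction.
// The fix in etcd-io/bbolt#67 is about making this copy safe against pages
// being rewritten by a concurrent read-write transaction.
func snapshotTo(db *bolt.DB, w io.Writer) (int64, error) {
	tx, err := db.Begin(false) // read-only transaction
	if err != nil {
		return 0, err
	}
	defer tx.Rollback()
	return tx.WriteTo(w) // streams the db file contents to w
}

func main() {
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	f, err := os.Create("example.db.snapshot")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := snapshotTo(db, f); err != nil {
		log.Fatal(err)
	}
}
```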
@jpbetz Yeah, can you rebase with https://github.com/coreos/bbolt/releases/tag/v1.3.1-coreos.5? Thanks!
One interesting thing about this issue: it occurred consistently on semaphore CI but rarely on my development box. My current theory is that high-latency block IO on semaphore tickles this race.
Possibly. I couldn't reproduce this with my VMs either.
…t after: This is primarily so CI tooling can capture the agent logs after the functional tester runs.
Rebased.
lgtm. thanks!
…elease-3.1: Automated cherry pick of #8813 to release-3.1
@wojtek-t: Just FYI, this is being backported to 3.1 tomorrow as well. I mention it because the …
For posterity: this is known to cause corruption both in peer snapshot send operations and in backup creation (e.g. 'etcdctl snapshot save'). #10109 adds detection of this corruption, and many others, when … To avoid this issue, run one of the minimum recommended versions of etcd for production; see also https://groups.google.com/d/msg/etcd-dev/nZQl17RjxHQ/FkC_rZ_4AwAJ
This PR backports all of coreos/bbolt v1.3.1-coreos.5 to the etcd 3.2 branch. This includes a larger but manageable set of bolt db related fixes; see https://github.com/coreos/bbolt/releases for the full list of changes. The most critical fixes we need from this backport are:

tx.WriteTo Freelist Corruption Fix

etcd-io/bbolt#67 was identified while testing this PR, but it impacts all etcd releases up to and including 3.2. We tracked it down to a data race in the bbolt WriteTo function, used by etcd HA recovery, that can result in corrupted freelists in db files.

Fragmentation Fix

etcd-io/bbolt#3 prevents "runaway" bolt data file fragmentation under certain workloads. We've tested a fork of etcd 3.0.17 with a fork of bolt db v1.3.0 that had the fix applied to it, and found that the fix improved the operability of Kubernetes at scale.

For compatibility with all etcd 3.2.* releases, the NoFreelistSync option introduced in etcd-io/bbolt#1 will not be enabled.
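For readers curious what enabling that option would look like, here is a minimal sketch of opening a bbolt database with the NoFreelistSync option; the file name and surrounding code are illustrative, and, as the description above notes, this backport deliberately leaves the option at its default (disabled) value:

```go
package main

import (
	"log"

	bolt "github.com/coreos/bbolt"
)

func main() {
	// NoFreelistSync skips persisting the freelist on every commit and
	// rebuilds it when the db is opened. This backport keeps it false
	// (the default) so the on-disk format stays compatible with every
	// etcd 3.2.* release.
	opts := &bolt.Options{NoFreelistSync: false}

	db, err := bolt.Open("example.db", 0600, opts) // "example.db" is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```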