using primary and slave DBs to solve the panic problem caused by DB c… #830
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yingchunliu-zte. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
yingchunliu-zte force-pushed from 4a5d0a5 to 5b80a56

…onflicts Signed-off-by: yingchunliu-zte <[email protected]>

yingchunliu-zte force-pushed from 5b80a56 to 415cfe3
Thanks @yingchunliu-zte - that approach already exists (at a smaller scale) internally via the meta pages. You can see in the recent investigation by @ahrtr that simply rolling back the page helps with recovery. There's another scenario where we believe some pages didn't persist properly during power-down events on virtualized filesystems. I don't think that copying an entire bucket or database file helps here at all.
Thanks @tjungblu. Commit is not an atomic operation, and when the system loses power, the write of the db to disk may be incomplete.
Yes, basically I agree with @tjungblu. Please also refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md#storage-boltdb-feature

On the other hand, it's totally up to applications to implement whatever higher-level protection (e.g. master-slave) they want, though it may not be an easy task. From the bbolt perspective, there are indeed some long-standing data corruption issues. One possible reason could be the filesystem, as mentioned in #778 (comment), but it's also possible that there are bugs in the freelist management; refer to #789. I am open to any thoughts on how to resolve such data corruption issues.
It's atomic. Please refer to the link in my previous comment. To be clearer, we won't accept this PR, but thanks anyway. Please feel free to raise a topic in Discussions if you want.
If commit were not atomic, you would now have a two-phase commit issue without an actual commit, which you would need to solve. I hope you see where this is going :)
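The atomicity claim above rests on bbolt's dual meta pages: two meta slots are written alternately, and on open the valid slot with the highest transaction id wins, so a torn write of one slot simply rolls back to the previous commit. Below is a minimal, illustrative sketch of that recovery rule; the struct fields, checksum, and `pickMeta` helper are hypothetical simplifications, not bbolt's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// meta is a simplified stand-in for a database meta page.
type meta struct {
	txid     uint64
	root     uint64 // root page id (illustrative)
	checksum uint64
}

// sum computes a checksum over the meta contents.
func (m *meta) sum() uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d/%d", m.txid, m.root)
	return h.Sum64()
}

func (m *meta) valid() bool { return m.checksum == m.sum() }

// pickMeta returns the valid meta with the highest txid. A power loss
// mid-commit corrupts at most one slot, and the other slot still
// describes the last fully committed transaction.
func pickMeta(a, b *meta) (*meta, error) {
	switch {
	case a.valid() && b.valid():
		if a.txid > b.txid {
			return a, nil
		}
		return b, nil
	case a.valid():
		return a, nil
	case b.valid():
		return b, nil
	}
	return nil, fmt.Errorf("both meta pages corrupted")
}

func main() {
	old := &meta{txid: 7, root: 3}
	old.checksum = old.sum()
	// Simulate a commit interrupted by power loss: bad checksum.
	torn := &meta{txid: 8, root: 9, checksum: 0}
	m, err := pickMeta(old, torn)
	fmt.Println(m.txid, err) // recovery falls back to the older, intact meta
}
```

In this scheme no partially written state is ever observable after restart, which is why a separate slave copy of the whole file adds little over what the meta pages already provide.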
+1, happy to brainstorm this further along
When the system is powered off and reset, containerd experiences a panic:
panic: freepages: failed to get all reachable pages (key[0]=(hex)** on leaf page(1229) needs to be < than key of the next element in ancestor(hex)**. Pages stack: [1974,1229])
The main steps for using primary and slave DBs to solve the panic caused by DB conflicts are as follows:

When the system is powered off, if a write to the master db causes a master-db conflict, containerd switches to the slave db after the panic. If the slave db was being written, containerd creates a new slave db.
kind: bug