zbox is unable to successfully handle a machine failure #36
Comments
Thank you very much @fin-ger, that's an interesting test. I ran your test script and can see it stopped at the 7th round, but I cannot reproduce the error you posted. That crash happened while writing a file, not while opening the repo. From your error log, it looks like the error happened when reading the super block, which happens while opening an existing repo. Could you share how you built the executable on Alpine Linux? I tried building one on Ubuntu, but it cannot run on Alpine. Also, the published zbox v0.6.1 is quite old; the latest code on the master branch has been heavily refactored, with many bug fixes and performance improvements. Could you test using the latest code instead? Just use the dependency line below in your Cargo.toml:
Another tip is you can turn on zbox debug output by setting environment variable in file
Looking forward to seeing more results, thanks.
The error happens when running The executable was automatically built by the travis-ci configuration. I am using the official alpine:edge docker container:
The executable can be found in I will create a new version of my test now which uses the latest master of zbox and the The error only happens when the test runs inside a VM that gets forcefully stopped. Afterwards (after booting the VM again), the repository is checked against the previously generated
I built a new version (0.4.0) that uses zbox from the current git master and added
Thanks @fin-ger. From what I found, it looks like QEMU didn't flush the written data to its driver. After the test crashed on the 7th round, the repo folder looked like this:
zbox:~# ls -l zbox-fail-test-repo
So you can see there is only one super block, and it is empty. The wal folder was not even created. The super block and the wal must be guaranteed to be persisted to disk. A correct repo should look like this:
/vol # ls -l zbox-fail-test-repo
That means QEMU tells zbox that write() and flush() have completed when they actually have not. A possible reason is that the cache mode was not specified when starting the QEMU VM. You can try adding it in run-test.exp, line 10
An explanation of the different cache modes can be found here. I've tried some, but I still cannot see that the files are guaranteed to be written to disk.
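To make the durability requirement discussed above concrete, here is a minimal sketch of a write that returns only after the OS reports the data as persisted. It uses only the Rust standard library and is not taken from zbox; the function name `write_durably` and the `.tmp` sibling file are assumptions made for illustration.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write `data` to `path` and return only after the
/// OS has reported the file contents as synced to the storage device.
fn write_durably(path: &Path, data: &[u8]) -> io::Result<()> {
    // Write to a temporary sibling first, so a crash cannot leave a
    // half-written file under the final name.
    let tmp = path.with_extension("tmp");
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(&tmp)?;
    file.write_all(data)?;
    // Ask the OS to push file data and metadata to the device (fsync on Unix).
    file.sync_all()?;
    drop(file);

    // Move the finished file into place.
    fs::rename(&tmp, path)?;

    // On Unix, also sync the parent directory so the rename itself survives
    // a crash. A bare file name has an empty parent, so fall back to ".".
    #[cfg(unix)]
    {
        let dir = match path.parent() {
            Some(p) if !p.as_os_str().is_empty() => p,
            _ => Path::new("."),
        };
        fs::File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

Even with these calls in place, the data is only as durable as the layers underneath allow. If a hypervisor or disk cache acknowledges the sync without actually persisting the data, the application cannot detect it, which matches the behaviour suspected of QEMU in this thread.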
Okay, so if this is a QEMU issue, then it is not relevant for zbox. Have you tested failures of real machines with zbox?
Honestly, I haven't tested real machine failures because I can't find a good, reproducible way to run that test. But I did some random I/O error fuzz tests using a special faulty storage. That storage generates I/O errors, and the fuzzer reopens the repo randomly but deterministically. Your test makes me think I could use QEMU for crash fuzz testing, just like this guy did for OS testing, but I still need to figure out how to make writes persistent in QEMU first.
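The idea of a storage that fails "randomly but deterministically" can be sketched as a writer wrapper driven by a seeded pseudo-random generator. This is an illustrative sketch only, not zbox's actual faulty storage; `FaultyWriter`, `XorShift64`, and `failure_pattern` are hypothetical names.

```rust
use std::io::{self, Write};

/// Tiny xorshift64 generator, so the failure pattern depends only on the seed.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical wrapper that makes roughly one in `one_in` writes fail.
struct FaultyWriter<W: Write> {
    inner: W,
    rng: XorShift64,
    one_in: u64,
}

impl<W: Write> FaultyWriter<W> {
    fn new(inner: W, seed: u64, one_in: u64) -> Self {
        // Clamp to 1 to avoid a division by zero in `write`.
        FaultyWriter { inner, rng: XorShift64::new(seed), one_in: one_in.max(1) }
    }
}

impl<W: Write> Write for FaultyWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Inject an error before touching the underlying writer.
        if self.rng.next_u64() % self.one_in == 0 {
            return Err(io::Error::new(io::ErrorKind::Other, "injected I/O error"));
        }
        self.inner.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Record which of `n` writes fail for a given seed.
fn failure_pattern(seed: u64, n: usize, one_in: u64) -> Vec<bool> {
    let mut writer = FaultyWriter::new(Vec::new(), seed, one_in);
    (0..n).map(|_| writer.write(b"block").is_err()).collect()
}
```

Because the generator is seeded, a failing run can be replayed exactly by reusing the seed, which is what makes this kind of fuzzing reproducible. It only simulates I/O errors returned to the caller, though; it does not model a machine that dies with acknowledged writes still unpersisted.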
I tested if a
without any
I have added
QEMU file I/O looks so tricky; I might test the
As this filesystem aims to provide ACID functionality, I tested if it can handle a full machine failure while a write is in progress. My test shows that zbox fails to even open the repository after such a machine failure.
I tried to make the test as reproducible as possible, so you can recreate it on your own machine. The machine failure is simulated by forcefully shutting down a virtual machine in which a zbox program is writing.
Is recovery from a full machine failure not supported yet, or am I using the zbox API in the wrong way? (main.rs)