
nixos/zfs: mitigate data loss issues when resuming from hibernate #208037

Merged
merged 1 commit into NixOS:master on Sep 6, 2024

Conversation

aecay
Contributor

@aecay aecay commented Dec 27, 2022

Description of changes

There are occasional data corruption issues when resuming from hibernate on ZFS. The upstream issue is here: openzfs/zfs#260. The nixpkgs attempt to address this is here: #171680.

Particularly relevant is this comment by @bryanasdev000:

ZFS corruption seems to be caused by either using swap on ZFS (which is not a good idea, see openzfs/zfs#7734) or hibernating and importing a pool before resume/swap (openzfs/zfs#12842 (comment), openzfs/zfs#12842 (comment)).

The NixOS stage 1 boot script is prone to the second failure case, because it imports ZFS pools in the postDeviceCommands step, which runs before the attempt to resume from hibernation. @infinisil made a patch to the stage 1 script that moves postDeviceCommands to after the resume attempt: infinisil@448fe5d

This certainly fixes the problem, but I'm unsure how general a fix it is. postDeviceCommands is used (afaics) to set up some other filesystem containers (for lack of a better word...) such as LUKS and btrfs, and it's valid to resume from a swap file in one of those containers. So I think the correct fix is to move only the ZFS imports to after the resume step. That's what this PR accomplishes, by introducing a new postResumeCommands step that ZFS can hook into.
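For readers unfamiliar with the hooks involved, the shape of the change is roughly the sketch below (not the literal diff from this PR; the pool name is a placeholder, and in the real change the ZFS module itself hooks into the new step rather than user config):

{ lib, ... }:
{
  # Sketch only: the pool import moves out of postDeviceCommands (which runs
  # before the resume attempt) and into the new postResumeCommands hook, which
  # stage 1 runs only after it has tried, and failed, to resume from hibernation.
  boot.initrd.postResumeCommands = lib.mkAfter ''
    zpool import -N rpool   # "rpool" is a placeholder pool name
  '';
}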

This is my first nixpkgs PR and I hope these changes will be more or less suitable to incorporate as is -- but if there's further exploration or discussion that's needed then I will be happy with that outcome too! 🙂

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@NiklasGollenstede
Contributor

Nice! I also looked into this problem about half a year ago (after realizing that it was probably what corrupted my root pool on a good SSD another half year earlier), and I am pretty sure your solution should fix the issue. I solved it (in my host config, not in nixpkgs) by duplicating the resume logic as an early postDeviceCommands step (which probably breaks resume from LUKS-encrypted swap; I haven't really tested that yet).

Your solution is definitely more elegant. It should probably be considered (and documented as) a potentially breaking change. Both the stage 1 script and ZFS's integration are pretty messy (in a legacy way), and this probably does not improve that situation.

This should probably come together with removing boot.zfs.allowHibernation or defaulting it to true. But only after having tests that:

  • show that the issue exists before this patch
  • and that it is solved with it.

@aecay
Contributor Author

aecay commented Dec 29, 2022

@NiklasGollenstede thanks for your comments 🙂

only after having tests that:

  • show that the issue exists before this patch
  • and that it is solved with it

How would this work? (In your answer, you can assume that I know next to nothing about nixos's test architecture, which would be true!) AFAIK, the data corruption bug (which I have never experienced myself, knock on wood) is something that happens only rarely, even when the conditions that trigger it are met (import a ZFS pool early in the boot process, then resume into an image where that pool is already imported).

I can understand writing tests that cause the import-then-resume situation to occur, but I don't expect that such a test would fail reliably without these fixes. We might need to run the test dozens of times to observe a failure.
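For what it's worth, a test that merely sets up that scenario would probably look something like the skeleton below; all names are illustrative, and whether hibernation can even be exercised inside the QEMU test driver is an open question:

import ./make-test-python.nix ({ pkgs, ... }: {
  name = "zfs-hibernate-import-order";   # hypothetical test name
  nodes.machine = { ... }: {
    boot.supportedFilesystems = [ "zfs" ];
    networking.hostId = "8425e349";       # the ZFS module requires a hostId
    boot.zfs.allowHibernation = true;
  };
  testScript = ''
    machine.start()
    machine.wait_for_unit("multi-user.target")
    # create a pool on a scratch disk, write data, hibernate, resume,
    # then run `zpool scrub` and assert that no errors are reported
  '';
})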

It should probably be considered (and documented as) a potentially breaking change

What use cases would it break? The only one I can think of would be resuming from swap on a zvol. But every guide to swap/hibernation/zfs that I have read warns against that in pretty strong terms (examples from arch wiki; nixos wiki). So while this change is technically breaking for that use case, I'd welcome any opinions about whether it's important to call out in a changelog a breaking change to a workflow that's already explicitly flagged as broken/not working. (Or maybe you have case[s] in mind other than resume-from-zvol-swap that would be broken by this change?)

This should probably come together with removing boot.zfs.allowHibernation or defaulting it to true

Yes, that would seem to make sense 👍 I will refrain from adding it to this PR for now, but if a consensus emerges around merging something like this change then I will change the setting to true in the next round of changes.

@NiklasGollenstede
Contributor

It should probably be considered (and documented as) a potentially breaking change

What use cases would it break?

For example, I clear my temporary filesystems, including /, by rolling back a ZFS dataset at boot (and I know of quite a number of other people who do the same). I do it like this:

boot.initrd.postDeviceCommands = lib.mkAfter ''
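    # roll back every dataset under the configured temp dataset to its @empty snapshot (if it has one)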
    ( zfs list -H -o name -t snapshot -r ${cfg.temp.zfs.dataset} | grep '@empty$' | xargs -n1 --no-run-if-empty zfs rollback -r )
'';

At best, that line would become a no-op. It shouldn't make my boot fail outright, but it would not behave as expected (I should probably make it fail loudly).

Generally, any boot.initrd.postDeviceCommands = lib.mkAfter snippet that expects to be able to do ZFS things would break.


How would this work?

Excellent question! I have never worked with NixOS tests either. After looking at examples and documentation a bit, they generally look quite well designed and intuitive to me, but I am still unclear on how one best does a customized installation. So I decided to just work with my installer framework first and worry about integration into the NixOS tests later. But ...

is something that happens only rarely, even when the conditions that trigger it are met

It appears you are right about that: I got it to create a permanent corruption once, but even using the exact same script to repeat the installation and the triggering actions in the running VM, I couldn't reproduce it.
I guess I have proven to myself that this issue is real (at least in my setup [1]), but there is little point in "exporting" my test if it is not reproducible (and I don't know how much more time I want to spend fiddling around with this).

I guess no tests, then (or someone will have to spend more time finding a good test case).

Footnotes

  1. I have to note here that I not only import the pool before resuming, but also do the rollback above. My test writes a big file to /tmp, then adds a mirror to the root pool, hibernates while it is resilvering, does the rollback in the initrd [2] (of the dataset that holds /tmp), resumes, and then in most cases the resilvering happily resumes. Sometimes there is a corruption that zfs scrub/clear fixes, sometimes not. Only once did I get the permanent corruption. And: the file in /tmp is still there, even though the dataset was rolled back!

  2. In the current boot stage 1 script, boot.initrd.postDeviceCommands = lib.mkAfter is the one and only insertion point for doing things with ZFS before mounting /. So the rollback kind of has to be inserted there, before resuming (which really isn't good).

@MagicRB
Contributor

MagicRB commented Jan 23, 2023

I think this should be merged and backported. It may break a few people's boot processes, but the other option is to not backport it and risk data loss. And even then, the worst that can happen is that the computer doesn't boot, which is better than "hey, your zpool is now dead, better format it".

@NiklasGollenstede
Contributor

I think this should be merged

As far as ZFS is concerned, probably yes. But there may be other concerns about adding another *Commands stage to the initrd.

and backported. It may break a few people's boot processes

Personally, I'm firmly against backporting this. This PR is breaking, and we'd have no way to communicate that (no release notes) and no proper way for people to hold back the update (other than not upgrading the release channel).

but the other option is to not backport it and risk data loss

There already is boot.zfs.allowHibernation in 22.11. This PR re-enables hibernation (i.e. adds a feature, and doesn't fix a critical bug).

@bryanasdev000
Member

I think it would be nice to bring the ZFS and stage 1 maintainers into this to discuss, pinging them here or in Matrix, so the issue does not go stale.

I remembered this issue because of https://www.reddit.com/r/NixOS/comments/120f4q7/nixos_didnt_start_anymore_zfs_error/

@MasterCATZ

I really do hope ZFS gets this

Every time I suspend to RAM / hibernate, the ZFS pools get marked as corrupted and I have no choice but to restart the PC, as they are forced into offline mode.

@SuperSandro2000
Member

Every time I suspend to RAM / hibernate, the ZFS pools get marked as corrupted and I have no choice but to restart the PC, as they are forced into offline mode.

I am not using legacy mountpoints and do not have such problems, because ZFS mounts the filesystems in the right places by itself rather than the script doing it.

@NiklasGollenstede
Contributor

NiklasGollenstede commented Apr 8, 2023 via email

Two interesting points!

@MasterCATZ: Does it really happen every time? Then maybe we could craft a reproducible test case from your setup. Also, if restarting restores the corruption, it sounds like it's not on-disk. Did you ever try scrubbing instead? (Though that may persist the corruption ...)

@sandro: I would think the timing of the import and mount vs. the resume is what matters, not so much how/why they are mounted (but maybe I'm wrong). With your non-legacy mountpoints, do you let ZFS mount on import, or do you use canmount=noauto and fileSystems.*.options = [ "zfsutil" ] (to still have them mounted via fstab / systemd)? In either case it would be interesting to know whether /, /nix/store or any other (implicitly) neededForBoot locations are ZFS mounts.

@SuperSandro2000
Member

fileSystems.*.options = [ "zfsutil" ]

I use this, and it also generates fstab entries.
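For context, a setup like that looks roughly like this on the NixOS side (the dataset name is a placeholder):

{
  # Illustrative only: a non-legacy ZFS mountpoint mounted via fstab/systemd
  # with the zfsutil mount option, rather than by `zfs mount` at import time.
  fileSystems."/" = {
    device = "rpool/root";   # placeholder dataset
    fsType = "zfs";
    options = [ "zfsutil" ];
  };
}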

@ElvishJerricco
Contributor

I believe this is the right approach. But I do think hibernation should still be disabled by default with zfs. While the zfs devs believe the issues are resolved as long as you don't import before resuming, this is very untested and they all seem to agree that it should be used with caution.

We also need a fix for systemd stage 1, though as of very recent systemd versions that should be as simple as adding after = ["local-fs-pre.target"];
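Roughly, that would mean ordering the pool-import units after the resume step, along these lines (a sketch, not the actual nixpkgs change; the unit name is illustrative):

{
  # Sketch only: in systemd stage 1, resume from hibernation is ordered before
  # local-fs-pre.target (in recent systemd), so ordering a pool-import unit
  # after that target means the import happens only once the resume attempt is over.
  boot.initrd.systemd.services."zfs-import-rpool".after = [ "local-fs-pre.target" ];
}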

@ElvishJerricco
Contributor

After some conversation with some people more knowledgeable than I am: ZFS should probably not be considered safe to resume after hibernation, even if the initrd handles it carefully. From a comment on an openzfs/zfs issue:

Maybe one more remark regarding just how brittle hibernation currently is: hibernation does a system-wide sync prior to "freezing" userspace and kernel threads. But there is no transactionality here, i.e., userspace and kernel threads can continue to dirty DMU state, issue and execute ZIOs to the pool, until they are "frozen" (sigh, naming, it should be called "descheduled for hibernation").

So ZIOs can occur and end up forgotten in the interim between that system-wide sync and the actual freeze, meaning the hibernation image that you would resume from is essentially from the past compared to the data physically on the disks.

All this is to say: boot.zfs.allowHibernation should always be false, because although data corruption is (arguably) rare, it is an inevitability. ZFS simply lacks safety guarantees for hibernation at the moment. There have been discussions on how to fix this, but nothing has been implemented.

@RaitoBezarius
Member

Given the situation, I guess we should be able to proceed with this change to improve reliability for people who decide to opt in, but we should probably show a warning or something. I don't have a strong opinion on that.

@mabra

mabra commented Aug 22, 2023

This is a reason for me to stop all further migration to Linux now; a precondition was a snapshot-capable filesystem. A BIG PITA.

@RaitoBezarius
Member

This is a reason for me to stop all further migration to Linux now; a precondition was a snapshot-capable filesystem. A BIG PITA.

Apologies for that, but I don't advise going with ZFS for such things. It's mostly a server filesystem, not really a laptop/desktop one.

Consider Btrfs or XFS, or bcachefs if you are not afraid of experimental filesystems.

@ElvishJerricco
Contributor

ElvishJerricco commented Aug 22, 2023

I'm happy with this being merged as long as allowHibernation remains default false. I'd also like to see its description updated, as this does not adequately explain the severity of the risk:

Allow hibernation support, this may be a unsafe option depending on your setup. Make sure to NOT use Swap on ZFS.

There's no "may" about it. It is unsafe, even with this change, and no matter the setup. And the risk is not to any one pool; it's to all imported pools.

@mabra

mabra commented Aug 22, 2023

This is a reason for me to stop all further migration to Linux now; a precondition was a snapshot-capable filesystem. A BIG PITA.

Apologies for that, but I don't advise going with ZFS for such things. It's mostly a server filesystem, not really a laptop/desktop one.

Consider Btrfs or XFS, or bcachefs if you are not afraid of experimental filesystems.

Until my disasters with Linux, I understood it to be the heart of stability, and for my day-to-day work, experimental is never an option - for such things, I use separate hardware.
Btrfs is a one-man show, which needed "some connection" (in a so-called "free" world) to get into the kernel, and I have read enough comparisons to never use it.
The RAID(5) hole may affect XFS, and it does not include a storage manager - my whole work is done in KVMs and LXC containers, with some services added on top (DNS, cache, servers). Nothing of this would be as simple as with ZFS. It is the best that Linux can offer at all (in contrast to all the nightmares I have had in my 18 months on Linux).
And would I be able to restore a complex workstation with "one click"? That is the second biggest reason I came to ZFS; otherwise I could have stayed with Windows and NTFS, which worked stably for me for longer than 10 years. Wikipedia (which may or may not be the best reference) says about snapshots in XFS:
"XFS does not yet[25] provide direct support for snapshots, as it currently expects the snapshot process to be implemented by the volume manager."
BTW, with increasing energy costs these days, I am even hibernating my Samba server, which uses ZFS on root, because it offers other services too.
I have had one really big storage crash on my workstation, and the speculation was about NVMe drives with too little cache RAM - but this was speculation only, since the kernel does not reserve the amount of cache the disks require. In my case, two different NVMe drives failed at the same moment and caused the biggest storage crash in my 40 years in IT. After a reboot, both were completely error free.
Linux appears to me more and more like a construction set on steroids.

@ElvishJerricco
Contributor

@mabra This thread is not a place to debate Linux storage technologies. The recommendation remains, not just from myself, but from upstream ZFS devs, that you should not use hibernation while ZFS is in use.

@mabra

mabra commented Aug 22, 2023

@mabra This thread is not a place to debate Linux storage technologies. The recommendation remains, not just from myself, but from upstream ZFS devs, that you should not use hibernation while ZFS is in use.

Thanks!
No question, I have understood, including the consequences.

@wegank added the 2.status: stale and 2.status: merge conflict labels on Mar 19, 2024
The stale bot removed the 2.status: stale label on Mar 20, 2024
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-good-is-zfs-on-root-on-nixos/40512/29

@ElvishJerricco
Contributor

I was reminded of this today. I'd like to merge the ZFS improvement (even though it only makes hibernation a little safer), but there are conflicts because postResumeCommands has been implemented in another PR that was merged.

@wegank added the 2.status: stale label on Jul 4, 2024
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/hibernation-zfs-root-seperate-swap-safe/51711/2

The stale bot removed the 2.status: stale label on Sep 6, 2024
@eclairevoyant changed the title from "Mitigation for zfs data loss issues when resuming from hibernate" to "nixos/zfs: mitigate data loss issues when resuming from hibernate" on Sep 6, 2024
@eclairevoyant
Contributor

@ElvishJerricco it's now become a one-line change; is that all that's needed?

@ElvishJerricco
Contributor

Should be, yep.

@ElvishJerricco
Contributor

@ofborg test zfs

@ElvishJerricco
Contributor

Tests are failing for unrelated reasons. Going to fix the tests locally, build, and then merge this. Then I'll open a PR with the fixed tests.

@ElvishJerricco ElvishJerricco merged commit 3f4b909 into NixOS:master Sep 6, 2024
14 of 17 checks passed
@MasterCATZ

MasterCATZ commented Sep 7, 2024

Does it really happen every time? Then maybe we could craft a reproducible test case from your setup. Also, if restarting restores the corruption, it sounds like it's not on-disk. Did you ever try scrubbing instead? (Though that may persist the corruption ...)

Sorry, I never noticed this message.

Lately I have not noticed ZFS complaining when resuming:
zfs-2.2.99-281_g07e95b467
zfs-kmod-2.2.99-529_g23a489a41
6.9.12-060912-generic
but I have also been trying my hardest to make sure I export all ZFS pools before I suspend.
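(On a NixOS machine, that manual step could in principle be automated with a pre-sleep unit along the lines of the sketch below; the unit name is made up, "ZFS3WAY" is the data pool from the status output further down, and the export will simply fail if any dataset is busy.)

{ pkgs, ... }:
{
  # Hypothetical automation of "export the data pool before sleeping".
  systemd.services.zpool-export-before-sleep = {
    description = "Export ZFS data pool before suspend/hibernate";
    wantedBy = [ "sleep.target" ];
    before = [ "sleep.target" ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${pkgs.zfs}/bin/zpool export ZFS3WAY";
    };
  };
}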

My ZFS pools span multiple disk shelves, using multipath devices in 3-way mirrors.

If there is still a need to test, I am happy to not export and give suspend a few loops.

Actually, I am spinning up my Steam library shelf now; it does not matter if that gets nuked.

Pre suspend

io@aio:~$ sudo zpool status
  pool: ZFS15k
 state: ONLINE
config:

	NAME           STATE     READ WRITE CKSUM
	ZFS15k         ONLINE       0     0     0
	  raidz3-0     ONLINE       0     0     0
	    ZFS15kB1   ONLINE       0     0     0
	    ZFS15kB2   ONLINE       0     0     0
	    ZFS15kB3   ONLINE       0     0     0
	    ZFS15kB4   ONLINE       0     0     0
	    ZFS15kB5   ONLINE       0     0     0
	    ZFS15kB6   ONLINE       0     0     0
	    ZFS15kB7   ONLINE       0     0     0
	    ZFS15kB8   ONLINE       0     0     0
	    ZFS15kB9   ONLINE       0     0     0
	    ZFS15kB10  ONLINE       0     0     0
	    ZFS15kB11  ONLINE       0     0     0
	    ZFS15kB12  ONLINE       0     0     0
	    ZFS15kB13  ONLINE       0     0     0
	    ZFS15kB14  ONLINE       0     0     0
	    ZFS15kB15  ONLINE       0     0     0
	    ZFS15kB16  ONLINE       0     0     0
	    ZFS15kB17  ONLINE       0     0     0
	    ZFS15kB18  ONLINE       0     0     0
	    ZFS15kB19  ONLINE       0     0     0
	    ZFS15kB20  ONLINE       0     0     0
	    ZFS15kB21  ONLINE       0     0     0
	    ZFS15kB22  ONLINE       0     0     0
	    ZFS15kB23  ONLINE       0     0     0

errors: No known data errors

  pool: ZFS3WAY
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
	The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: resilvered 230M in 00:35:36 with 0 errors on Sun Jun  2 14:03:44 2024
config:

	NAME                                 STATE     READ WRITE CKSUM
	ZFS3WAY                              ONLINE       0     0     0
	  mirror-0                           ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e2d3ca  ONLINE       0     0     0
	    dm-uuid-mpath-350014ee0038627b0  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e04c25  ONLINE       0     0     0
	  mirror-1                           ONLINE       0     0     0
	    dm-uuid-mpath-350014ee6032154c1  ONLINE       0     0     0
	    dm-uuid-mpath-350014ee65876e250  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e0f391  ONLINE       0     0     0
	  mirror-2                           ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e04df1  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234df6934  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e496af  ONLINE       0     0     0
	  mirror-3                           ONLINE       0     0     0
	    dm-uuid-mpath-350014ee603229c06  ONLINE       0     0     0
	    dm-uuid-mpath-35000c500501d05fa  ONLINE       0     0     0
	    dm-uuid-mpath-350014ee6adcbfbfb  ONLINE       0     0     0
	  mirror-4                           ONLINE       0     0     0
	    dm-uuid-mpath-350014ee65876d4dd  ONLINE       0     0     0
	    dm-uuid-mpath-350014ee658766abf  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca234e04b50  ONLINE       0     0     0

errors: No known data errors

I can confirm it still happens, but only to the drives that had been spun down, i.e. the ones internal to the PC.
The ZFS array on the shelf that remained powered on is still OK (and I was writing to / reading from it during suspend).

I did a suspend test with ZFS15k, which has the 15,000 rpm drives that store my Steam library, during a game update.
Its pool was unaffected because those are external drives that remained powered on.

However, ZFS3WAY, where I store my documents and whose drives are connected internally to the PC's power supply, was forced offline and faulted, and that was without any data even being written to or read from it.

Happy to apply any patches or do a ZFS update build.

Screenshot from 2024-09-07 10-30-01

Log file: journellog7-8.txt

I can do another test and power off the disk-shelf to see how that affects the resume ,
it does normally take about 3 mins before all drives finish spinning up however

edit

So: suspending, then powering off the shelves, then powering them back on and waiting ~5 minutes for all the drives to spin up before waking the PC, and ZFS is OK.

Suspending, turning off the shelf, then turning it back on and waking the PC before the drives have all stabilized, and ZFS fails.

So ZFS is trying to use them before they have actually spun back up.

@ElvishJerricco
Contributor

@MasterCATZ I hope this being merged didn't give you the impression that hibernation was ok with ZFS. We explicitly left it disabled by default because it's not. Regular suspend should be ok, but hibernation is not.

@amarshall
Member

This change may break a lot of folks using some form of impermanence with ZFS, as that is often done via e.g. postDeviceCommands = "zfs rollback zroot@blank"; after this change, the pool is not yet imported at that point, so the rollback fails.

At a minimum, I think we need a release notes entry for this as it’s a breaking change for anyone whose config expects the pool to be imported in postDeviceCommands.
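For affected configs, one possible direction (a sketch only, not a tested fix; whether a user snippet in postResumeCommands ends up ordered after the module's own pool import would need checking) is to move the rollback into the new hook:

{ lib, ... }:
{
  # Sketch: with this change, pools are imported after the resume attempt,
  # so a blank-snapshot rollback has to move to the post-resume hook as well.
  boot.initrd.postResumeCommands = lib.mkAfter ''
    zfs rollback -r zroot@blank
  '';
}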
