Track data corruption and other showstopper bugs in releases #13755

Open
tuider opened this issue Aug 9, 2022 · 7 comments
Labels
Type: Feature Feature request or new feature

Comments

@tuider

tuider commented Aug 9, 2022

Describe the feature you would like to see added to OpenZFS

I'm about to upgrade a bunch of old boxes running a mix of 0.6 and 0.7 releases. I'm not going to just straight up upgrade to the latest version because I value stability over new features, so I would like to know in advance which blocking bugs are still present (fix not backported) in which releases before I venture on this journey.

Is this information already available somewhere?

How will this feature improve OpenZFS?

This will help users plan their upgrade path and avoid unpleasant surprises.

Additional context

@tuider tuider added the Type: Feature Feature request or new feature label Aug 9, 2022
@gdevenyi
Contributor

Related, #13624 #13612

@ryao
Contributor

ryao commented Sep 13, 2022

These patches should be backported:

13f2b8f
e5327e7

I do not know if the draid one involves a data corruption issue, but at the very least, hitting that bug is likely to be unpleasant.

The btree one has the potential to cause corruption (although doing the analysis needed to prove that would be a pain). I suggest @tonyhutter do a 2.1.6 release with it just because of how potentially bad it is. This should also affect 0.8.y.

@gdevenyi gdevenyi mentioned this issue Sep 26, 2022
13 tasks
@gdevenyi
Contributor

gdevenyi commented Oct 4, 2022

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.6 does not contain the requested detail

@tonyhutter
Contributor

> so i would like to know in advance which blocking bugs are still present (fix not backported) in which releases before i venture on this journey.

Typically, if you run the latest 0.6.5.x/0.7.x/0.8.x/2.0.x/2.1.x release that's supported by default for your distro, then you should be pretty safe. I would recommend against running a middle version of any of those branches: if we find a showstopper bug, we simply put out a new release with the fix (as we did with 0.7.8 when 0.7.7 had a corruption bug; see the warning in the 0.7.7 release notes: https://github.com/openzfs/zfs/releases/tag/zfs-0.7.7).
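As a rough illustration of the "run the tip of your branch" advice, here is a minimal Python sketch that checks whether an installed version is the newest known release in its branch. The function names are hypothetical, the release list contains only versions named in this thread (it is not a complete list), and treating the first two version components as the branch is a simplifying assumption (so 0.6.5.x is grouped under 0.6).

```python
def parse(version):
    """Turn '0.7.8' into (0, 7, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_per_branch(releases):
    """Map each branch (first two components, an assumption) to its highest release."""
    latest = {}
    for release in releases:
        v = parse(release)
        branch = v[:2]
        if branch not in latest or v > latest[branch]:
            latest[branch] = v
    return latest


def is_branch_tip(installed, releases):
    """True if `installed` is at least the newest known release of its branch."""
    v = parse(installed)
    tip = latest_per_branch(releases).get(v[:2])
    return tip is not None and v >= tip


# Sample data only: versions mentioned in this thread, not a full release list.
known_releases = ["0.7.7", "0.7.8", "2.1.6"]

print(is_branch_tip("0.7.7", known_releases))  # False: 0.7.8 supersedes it
print(is_branch_tip("0.7.8", known_releases))  # True
```

In practice the release list would have to come from the project's release page or the distro's package metadata; the sketch only shows the comparison.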

@ryao
Contributor

ryao commented Oct 4, 2022

The most severe bug fixed in 2.1.6 is fixed by 13f2b8f. I am not sure if it can cause corruption, but I am also not sure if it cannot cause corruption. There have been a few reports of pool corruption that might have been caused by it, but I have no proof beyond suspicion. However, it had been detected in issue reports from people running debug builds of ZFS, since it can trip an assertion. It causes undefined behavior that is difficult to model. This bug is present in all older versions with the btree code, which I think goes back to 0.8.0.

There was also another bug, fixed by 52afc34, that goes back to 0.8.0 and could cause free functions to be called on uninitialized data when processing encrypted data. However, that one is much rarer and requires a hash function to fail in order to trigger it. There were no issue reports involving it. It would not surprise me if it has never been triggered anywhere in the world.

Lastly, we were calling free on stack memory in the skein code, which was fixed by a2163a9. That one was likely introduced in 0.7.0, but I did not check to be certain. The kernel free function probably harmlessly ignored it (since there were no issue reports), but I did not check to be certain.
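The kind of tracking requested in this issue could be sketched roughly as below. This is a hypothetical table and helper, not an existing OpenZFS tool: the entries are transcribed from the three bugs described above, the "introduced" versions are the estimates given in this comment, and a fix release is recorded only where the thread states it explicitly, so an empty list means "not stated here", not "unfixed".

```python
def parse(version):
    """Turn '2.1.6' into (2, 1, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical table built from this comment; "introduced" values are estimates.
KNOWN_BUGS = [
    {"fix": "13f2b8f", "area": "btree", "introduced": "0.8.0", "fixed_in": ["2.1.6"]},
    {"fix": "52afc34", "area": "encryption", "introduced": "0.8.0", "fixed_in": []},
    {"fix": "a2163a9", "area": "skein", "introduced": "0.7.0", "fixed_in": []},
]


def not_confirmed_fixed(version, bugs=KNOWN_BUGS):
    """List fix commits for bugs that may be present in `version`.

    A bug counts as fixed only if a listed fix release is in the same
    branch (first two components, an assumption) and is not newer than
    `version`.
    """
    v = parse(version)
    hits = []
    for bug in bugs:
        if v < parse(bug["introduced"]):
            continue  # release predates the code that contains the bug
        fixed = any(
            parse(f)[:2] == v[:2] and v >= parse(f) for f in bug["fixed_in"]
        )
        if not fixed:
            hits.append(bug["fix"])
    return hits


print(not_confirmed_fixed("2.1.5"))
print(not_confirmed_fixed("0.7.8"))
```

A real tracker would need the fix releases filled in per branch from the release notes; with the data above, the last two entries are reported for every release at or after their estimated introduction.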

These fixes came from an ongoing effort that I am making to find and correct bugs via static analyzers. I have been fixing not only the serious bugs, but also the trivial ones detected, and changing code where appropriate to stop triggering complaints from static analyzers. The idea is to get the codebase into a state where we can use static analysis to find regressions in the code (such as the three that I listed) before they are put into releases.

That would have happened much sooner had I not stepped away from the project after it reached a level of completeness that made me feel that there was not much left for me to do anymore. In hindsight, that was wrong, but it seemed that way as I had done everything I had wanted to do at the time. Coincidentally, the skein bug entered the code around the time I had started feeling that way. I regret that my stepping away from the project left these issues in the codebase for so long, but on the bright side, I am refreshed from my time away and I see plenty of things that need my attention. I expect to remain active for years. :)

@dreamcat4

@ryao Just for clarification: when referring to 'data corruption' (across multiple different bugs), is that wording also meant to include the type of bug where the indexing of files disappears? I.e. files are no longer visible or found on filesystem traversal; they just seem to 'disappear'. Or does 'data corruption' in your bug hunting work here always mean finding out that data within files was being silently corrupted?

Sorry, I do understand that zfs works at a block level instead of a file level. I am really asking, from a user perspective, how to notice these bugs: how they exhibit themselves on a running system, what the full range of possible symptoms is, and what the user can look out for. And whether a given symptom may be attributed to such data corruption bugs, or cannot possibly be (by these definitions) and must instead be something else entirely. So (if you will indulge me), a simple Venn diagram could be drawn of the symptom types that fall into either camp, with an intersection in the middle for the 'maybe or maybe not' cases, where it is uncertain and not clearly known.

& Many thanks for all your long term work on this project BTW. Very much appreciated 👍

@ryao
Contributor

ryao commented Oct 17, 2022

@dreamcat4 Data corruption would typically refer to file contents. Metadata corruption would involve things like directory contents being corrupt (although some might count directories as data too). In any case, I am not aware of any bugs where files disappear, although I do know a bug involving unicode normalization where phantom dentries can make it look like that (#13980).

5 participants