
Possible silent corruption in 0.7.3. A small text file got zeroed. #6931

Closed
fling- opened this issue Dec 7, 2017 · 9 comments

Comments

@fling-
Contributor

fling- commented Dec 7, 2017

After upgrading the kernel to 4.13.15 and zfs to 0.7.3 on one of the boxen, I updated
lxd to 2.19 and noticed it was failing to start. It turned out the initscript had gotten
zeroed. At first I thought it was a breakage in python caused by a gcc-6
bug, as I had also upgraded gcc recently. Then I downgraded lxd to 2.18 just to make
it work, as I still had the binary package.

Later I upgraded another, much slower box to the same kernel and zfs version.
I used quickpkg in a stage4 filesystem to build binary packages for its upgrade
(from the installed files, without compiling anything on this box) because of the
limited time I had for the upgrade. After installing some of the packages,
including glibc, I noticed libm.so got zeroed in a similar way.

But libm.so is fine in both the binary package and the working stage4!
I started thinking something could be wrong with zfs, so I returned to the first
box and checked the lxd initscript in the binary package; it turned out to
be fine and non-zeroed.

It could still be caused by a gcc bug and/or python breakage, but it is not
clear to me how this could happen, as there was no compilation involved on
the second box where libm.so got zeroed in the process of emerging a binary
package. Similarly, the zeroed file on the first box is fine in its binary
package too. (portage has a buildpkg feature which saves binary packages of
everything it builds and installs.)

The files are good in the tarballs but got zeroed in the process of installing or
shortly after. On the other hand, I don't see checksum errors on the zfs side, so it
could also be silent corruption.

Unfortunately, zfs got upgraded together with the kernel, gcc and python on both
affected boxes, which makes it harder to determine what caused the zeroing.

I had never seen this behavior prior to this recent upgrade.

Unfortunately I don't have a snapshot of the first box taken after the lxd
initscript zeroing and before I fixed the file.

I will take a snapshot of the affected filesystem on the second box, with the
file still zeroed, and will keep it for future investigation.

In the stage4 chroot:

localhost ~ # hexdump -C /usr/lib64/libm.so
00000000  2f 2a 20 47 4e 55 20 6c  64 20 73 63 72 69 70 74  |/* GNU ld script|
00000010  0a 2a 2f 0a 4f 55 54 50  55 54 5f 46 4f 52 4d 41  |.*/.OUTPUT_FORMA|
00000020  54 28 65 6c 66 36 34 2d  78 38 36 2d 36 34 29 0a  |T(elf64-x86-64).|
00000030  47 52 4f 55 50 20 28 20  2f 6c 69 62 36 34 2f 6c  |GROUP ( /lib64/l|
00000040  69 62 6d 2e 73 6f 2e 36  20 20 41 53 5f 4e 45 45  |ibm.so.6  AS_NEE|
00000050  44 45 44 20 28 20 2f 75  73 72 2f 6c 69 62 36 34  |DED ( /usr/lib64|
00000060  2f 6c 69 62 6d 76 65 63  5f 6e 6f 6e 73 68 61 72  |/libmvec_nonshar|
00000070  65 64 2e 61 20 2f 6c 69  62 36 34 2f 6c 69 62 6d  |ed.a /lib64/libm|
00000080  76 65 63 2e 73 6f 2e 31  20 29 20 29 0a           |vec.so.1 ) ).|
0000008d

The file is identical to the one in the binary package built with quickpkg in this
stage4 (from the installed files, without any compiling).

In the affected root where the binary package got installed:

localhost ~ # hexdump -C /usr/lib64/libm.so
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
0000008d

System information

Type                  Version/Name
Distribution Name     gentoo
Distribution Version  13
Linux Kernel          4.13.15-gentoo-gnu
Architecture          amd64
ZFS Version           0.7.3
SPL Version           0.7.3
@gmelikov
Member

gmelikov commented Dec 7, 2017

It's already fixed; the patch will be included in 0.7.4, see #3125. Or you can cherry-pick the fix yourself: 454365b.
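
For anyone wanting to apply the fix locally before 0.7.4, here is a minimal sketch of the cherry-pick, assuming you build zfs from a git checkout (the repository URL and tag name below are assumptions, not taken from this thread):

git clone https://github.com/zfsonlinux/zfs.git   # repository URL assumed
cd zfs
git checkout zfs-0.7.3                            # release tag name assumed
git cherry-pick 454365b                           # the fix referenced above
# then rebuild and reinstall the module as usual for your setup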

@gmelikov closed this as completed Dec 7, 2017
@fling-
Contributor Author

fling- commented Dec 7, 2017

The bug is easily reproducible by comparing the file from emerge -K to the file from tar xf.
I will now try #3125 and hope it fixes the issue.
How do I test my existing files for corruption? I used rsync a lot recently, moving large amounts of data between pools.
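
A minimal sketch of that comparison, assuming a binary package under the default PKGDIR; the package path below is a placeholder, not taken from this report:

PKG=/usr/portage/packages/sys-libs/glibc-2.25.tbz2   # hypothetical package path
mkdir /tmp/tbz2
tar xjpf "$PKG" -C /tmp/tbz2 2>/dev/null             # the appended xpak trailer may produce a warning
emerge -K sys-libs/glibc                             # install the same binary package
cmp /tmp/tbz2/usr/lib64/libm.so /usr/lib64/libm.so || echo "installed copy differs"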

@gmelikov
Member

gmelikov commented Dec 7, 2017

Please see #3125, for example #3125 (comment)

@fling-
Contributor Author

fling- commented Dec 7, 2017

@gmelikov this only finds the corrupted files in glibc.
But I want to find corruption in the other files I moved with rsync. They are mostly huge disk images.
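
If the source datasets are still around, one way to re-verify the copies is a checksum-only dry run; the paths below are placeholders and the flags are standard rsync options:

rsync -rn --checksum --itemize-changes /tank-src/images/ /tank-dst/images/
# any file listed in the output differs in content from the source copy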

@gmelikov
Member

gmelikov commented Dec 7, 2017

IIRC this regression was mainly reproducible on Gentoo with portage (https://bugs.gentoo.org/635002) and it filled files with \0, so if you want to check all your files, you should just check whether they are filled with \0.
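
A rough sketch of such a check, reporting regular files whose contents are entirely \0 bytes (the search root and the -xdev option are assumptions; only standard GNU tools are used):

find / -xdev -type f -size +0c -print0 |
  while IFS= read -r -d '' f; do
    sz=$(stat -c %s "$f")
    # the file is all zeroes if it matches /dev/zero for its whole length
    if cmp -s -n "$sz" "$f" /dev/zero; then
      echo "all zeroes: $f"
    fi
  done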

@gmelikov
Member

gmelikov commented Dec 7, 2017

In addition, it was a portage regression too: https://bugs.gentoo.org/635126

@fling-
Contributor Author

fling- commented Dec 7, 2017

@gmelikov does this mean 1. the bug mostly affects files installed with portage, and 2. the bug fills the whole affected file with \0, not only part of it?
I'm asking because I don't care about what portage breaks (that is easily fixed), but I do care about the huge files I moved with rsync between pools (not related to portage).

@gmelikov
Member

gmelikov commented Dec 7, 2017

I wasn't affected by this regression because I don't use Gentoo, so the best thing is to read #3125, but if I understand everything correctly - yes. If rsync had been affected too, it would have been noticed at once.

@fling-
Contributor Author

fling- commented Dec 10, 2017

@gmelikov I can't reproduce the corruption with 454365b applied.
