NULL pointer dereference via zfs_setattr_dir #8072
Another note: user-triggerable NULL pointer dereferences are, under the right circumstances, exploitable to execute arbitrary code in ring0. The zero-page "protection" does not protect against arbitrarily large values of N in 0+N, meaning that if you can influence the shift (offset), you can bypass it. It's more of a gimmick than a real mitigation; it works for the simplest cases, but not where a user-influenced dereference of NULL plus a user-controlled offset makes it possible to go well past the measly 65k default setting. Someone who does not understand this basic concept is too much of a moron to be allowed to touch anything near ring0 code that can be reached by a userland process. Alas, the Linux Code of Conduct seems to have left out 'stupidity' as a disqualifying affliction.
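To make the arithmetic behind that point concrete, here is a minimal, hypothetical C sketch (not ZFS or kernel code; the structure and names are invented for illustration) of why an unmapped zero page alone does not cover a NULL base combined with a large, influenced offset:

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure; the large pad stands in for any influenced
 * offset, index or shift applied to a NULL base pointer. */
struct obj {
	char     pad[1 << 20];   /* ~1 MiB before the interesting field */
	uint64_t value;
};

int main(void)
{
	/* A NULL-based access to 'value' would touch address 0 + offsetof(...),
	 * i.e. about 1 MiB -- far beyond the default 65536-byte low-memory
	 * guard (vm.mmap_min_addr), so "the zero page is unmapped" is not,
	 * by itself, a mitigation once the offset is large and controllable. */
	printf("faulting address would be 0x%zx\n", offsetof(struct obj, value));
	return 0;
}
```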
This code path was introduced by the project quota work to propagate attribute changes into the contents of xattr directories. @vogelfreiheit I suspect this problem would not occur if you were using `xattr=sa`.
Nice. Note that my comment regarding the security impact of these (under the right circumstances) was not really addressed to the ZoL team, but to the Linux kernel and KSPP developers who get mitigations merged that only provide a false sense of security (sometimes ripping off third-party inventions and code). I can retest with xattr=sa, since it is possible to reset the option after creating the dataset (and unfortunately this happened on a production system). I enabled debug symbols for that particular kernel, so if it happens again I hope to have a stack trace and maybe some useful data. This also happened on a system where encrypted datasets are loaded and unloaded somewhat regularly.
@vogelfreiheit On the other hand, I did some static code analysis based on the kernel stack trace above. I introduced zfs_setattr_dir() to resolve an existing ZFS bug (#6957). The logic is relatively simple: it scans the directory's EA objects and changes the UID/GID/ProjectID (as required) by calling the existing sa_bulk_update(), whose logic is not changed by the project quota patch. As for why zfs_setattr_dir() caused the subsequent NULL-pointer dereference during "SA_COPY_DATA()", there are two possible cases:
For case 1), the related logic is quite independent and simple; I double-checked and found nothing. Currently I suspect case 2) more, but I need more clues to locate the root cause. So if you could offer some simple test script with which you can reproduce the issue on your side, that would be quite helpful. Thanks!
I will try to see if I can reproduce the issue, but I remember I was mass-chmod'ing and chown'ing a large tree. I'd rather not repro this on that system, so I'll try to set up a test VM. The kernel is the standard Ubuntu Server one on amd64 from Bionic.
Related issue: if the …, any ideas on how to salvage the dataset? I'd rather not lose ~7 TB if I can avoid it. The scrub came back clean, so the actual data seems intact. FWIW: I'm on Arch, …
I think I experienced the same issue. I reported it on zfs-discuss, but I was not aware of this bug report. Ubuntu 18.04.2, kernel 4.15.0-46-generic, ZoL 0.8.0-762f9ef3d9d897b5baf7c91d6e8a7bf371a9b02f (3rd of March 2019), 32 GB ECC RAM. The ZFS pool was created originally under illumos and still retains all the settings from illumos; see the dataset properties: https://pastebin.com/VmPGy26Q -> notably, xattr=on, acltype=off. I imported+mounted the pool, issued a "chown -R" on a subdir of a dataset, and the pool locked up (more or less). I cannot export it successfully and I cannot mount that dataset anymore; the other ones are fine, however. See the kernel messages: https://pastebin.com/ufNX78FR
I'll try with the OmniOSce install CD; I have a spare SSD in the server. In the meantime, I tried rolling back to the latest snapshot and it froze again. I can sacrifice another dataset from that pool after copying the data somewhere safe... Are there zdb or other commands you would like me to execute before trying to reproduce the issue on my existing pool?
I don't know why the pool is not imported automatically at boot, but anyway I just noticed that when I issued the zpool import today I got another call trace. I'm on mobile now so I cannot easily post it, but I'll do so tomorrow.
We've had this issue from time to time: https://github.com/openzfsonosx/zfs/blob/master/module/zfs/sa.c#L771
Besides the stack trace I already posted, I found another one before it and another one after it. All three are here (the second one is the same as the one above): They are each of a different length. I'll get ready for the next steps.
I closed the issue I had posted. I just imported after a reboot; the import took longer than normal, but then completed. I then tried mounting the encrypted dataset and got the I/O error again. Also, it might be worth noting that the pool was originally created on macOS (as was the encrypted dataset). My MacBook is out of commission, so I moved the drives to my Linux system (which required that I use the 0.8 branch due to the features enabled, including dataset encryption).
I patched it (I'm not a coder, but with some help from someone in #zfsonlinux I was able to replace printk with dprintk to get it to compile). Upon first mount it hung with lots of disk activity, then produced I/O errors (disk read errors) before suspending I/O. That was a separate problem: I was trying over USB3, which has not always been the most stable on this system, so that was my bad. I rebooted, tried from USB2, and started a scrub to make sure the checksum errors I saw before were just read errors from the USB issue; after letting it run a while and getting 0 errors back, I'm convinced the data itself is actually fine. I then loaded the key, tried mounting, and immediately got the "Cannot mount 'tank0/crypt': Input/Output error" message back. dmesg error:
My patched sa.c file: Hopefully this helps someone figure this out.
I can reliably reproduce the issue on my pool simply by booting with the OmniOSce '028 install CD and by rolling back to the latest snapshot, but trying to replicate it on a new pool was not successful. These are the steps: Let me know if I can help further. I'll keep the old pool for a while longer while transferring everything to a new, ZoL-created pool.
I was able to do a zfs send / recv (in my case to a file, then import into my FreeBSD zpool) and had no issues accessing the data that way. It looks like the cause is the fact that I created the pool on OSX originally and moved it to Linux (based on what everyone is saying). I'll move the rest of my data off and recreate under Linux. That's going to take some time (USB2, although I might look at temporarily hooking it up to my desktop via MLSATA & dock as I have one lying around). I'm happy to test with my setup while it's still around, which should be for a few days, at least.
I think I used OmniOS r151002 dated 2012, according to the date I get from "zpool history". Or maybe OpenIndiana.
If it can be useful, I could try a more radical approach: I can leave only a very limited number of files in the problematic dataset(s), then add a USB stick as a stripe and, with device_removal, remove the original HDDs, leaving everything on a much smaller USB stick. If the issue is still reproducible, I can physically mail the USB stick (or electronically send its dd image) to the developer(s). Since it requires quite some work with several reboots to verify the reproducibility of the bug, I'd like to know in advance if you can make use of the resulting image/stick.
@kpande @ahrens I can do that. It will be two files, since I have a stripe, but that's no big change. I completed the data transfer to the new ZoL-created zpool, so the old problematic one can be stripped to remove sensitive data and reduce it to a minimal size. I'll perform the stripping under OmniOSce.
I actually did more than asked: I isolated the file that causes the bug; it was simple. I attach the compressed archive with the two files that represent my pool (since it was a striped pool). The files were obtained as I was instructed by @kpande. The archive was created using tar with the option -S to support sparse files, and it was then compressed with gzip under Windows (using 7-Zip) to split it into 10 MiB chunks. I manually added a second ".gz" extension to have GitHub accept them. zfs_bug.tar.gz.001.gz The pool was purged of all sensitive data and now contains only one dataset (tank/home/olaf) with 4 files in the folder /tank/home/olaf/Movies/Line rider. The file that causes the issue is "Line Rider Extreme.flv". I will wait about a week for feedback, in case you need more information, then I'll destroy the pool. Enjoy the bug hunting!
@dewi-ny-je I was not able to reproduce the issue by importing your pool and doing a chown:
Did you do anything different?
@tcaputi I used the command I pasted, so I would say you did it right. You can see my configuration in my first post: #8072 (comment). It's strange because I could reproduce it reliably. What further information can I provide?
As I described above in #8072 (comment), once the dataset experiences said error, it cannot be mounted anymore. Later I will "freeze" my pool again and then upload the "frozen" pool here: maybe it will be possible to investigate further. After doing that, I will update ZoL to the latest revision and try again.
@dewi-ny-je Do you know if the issue happens consistently? And is it always on the same file every time? This might be a race condition that requires repeated chowning / metadata changing to work.
I'm not sure... The issue happened every time I tried on the whole "Movies" folder (I think I tried two or three times), but I cannot say which file was causing it each time, since I ran … The issue happened again in the "Line rider" folder when I tried the same script on it after I purged the rest of the dataset. The issue did NOT happen when I chown'ed -R another two whole datasets, both with about as many files as the whole "Movies" directory (before the purge). I tried these datasets only once each. So the procedure I'll try later will be:
I see you used 16.04, but unfortunately I cannot perform the test on a different Ubuntu release; this would take too much time, since it involves using the live CD and compiling ZoL 0.8.0.
Well, it was a simple test. I have no more ideas :(
Maybe it's a race condition. Perhaps your script produced the issue because it was doing a lot of chowns at once. Can you try running it in a loop again?
I haven't deleted any file from the "Line rider" folder that caused the lock before uploading the pool here: if you run … For info: after copying the content of the problematic dataset (tank/home/olaf) to my new, ZoL-created pool, I ran … However, if you think it can be of help, I will try …
Unfortunately, it sounds like we're having a really difficult time reproducing this issue. I've removed this issue as a blocker for the 0.8 tag, but we of course still need to identify the root cause here.
I just tried:
I ran it multiple times, no luck. Further ideas? As in the past, I'll give you a week for further suggestions, then I'll get rid of the old pool.
Is this on the new software? Maybe try going back to the old one?
I'm still on ZoL 0.8.0-762f9ef3d9d897b5baf7c91d6e8a7bf371a9b02f (3rd of March 2019).
I've encountered the same problem (see #8597). It does not appear to be a race condition, as I can reproduce it consistently on the same files every time. I can …
@aarononeal Is there a way you could get us a reproducer? The big reason this issue hasn't been fixed yet is that none of us have been able to determine what causes it.
@tcaputi I will try to export a truncated dataset tonight and see if I can get it reproducing under that. If these details help: it was originally a FreeBSD (FreeNAS) dataset on a 2-drive mirror pool, xattrs were added by macOS (over AFP or SMB), and then I did a send/recv of the earliest snapshot to a new Linux encrypted dataset in a new 2-drive mirror pool, followed by incrementals up to a new latest snap. I'm now doing that again by cloning that Linux dataset to a 6-drive raidz2. I decided to update permissions and ownership in the process and hit the issue.
Except for the send/recv, quite similar to my case.
@tcaputi Here's a dataset that reproduces the issue. Repro steps:
You should see:
In my other dataset … And …
@aarononeal I've taken a quick peek at your send stream and here are a couple of notes: First, there's no need to use encryption at all; your steps will reproduce the problem simply by receiving the stream into a normal dataset. Second, the xattrs with names of the form "org." are "invisible" to Linux because they don't have a recognized namespace prefix (such as "user."). This shouldn't, of course, be any problem insofar as ZFS is concerned, but it is worth mentioning. I suspect either myself (when I get a bit more time to look at it) or another developer will be able to figure out why this is happening pretty quickly.
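For anyone who wants to double-check which names Linux will actually surface, here is a minimal sketch using listxattr(); the path below is just a placeholder for a file from the received dataset. Names without a recognized namespace prefix simply won't appear in this listing even though ZFS still carries them, which matches the "invisible" behaviour described above.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
	/* Placeholder path -- point this at a file from the received dataset. */
	const char *path = argc > 1 ? argv[1] : "/tank/recv/somefile";
	char buf[4096];
	ssize_t len = listxattr(path, buf, sizeof (buf));

	if (len < 0) {
		perror("listxattr");
		return 1;
	}
	/* listxattr() returns a NUL-separated list of attribute names;
	 * only names in namespaces the kernel exposes will show up here. */
	for (char *name = buf; name < buf + len; name += strlen(name) + 1)
		printf("%s\n", name);
	return 0;
}
```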
@dweeezil Thanks! I really appreciate the findings and pointers so far, and I hope this turns out to be easy to troubleshoot with the stream. I'm so curious to learn what I did to get my dataset into this situation. :-)
Can confirm this reproduces the issue perfectly. Let me see what I can figure out.
See this note for the likely bug description. I'm testing the obvious fix now.
The bulk[] array index, count, must be reset per-iteration in order to not overwrite the stack.

Signed-off-by: Tim Chase <[email protected]>
Fixes: openzfs#8072 openzfs#8597
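For readers following along, here is a small userspace analogue of the bug class that commit message describes (assumed shape only; the real code lives in zfs_setattr_dir() and uses sa_bulk_attr_t / SA_ADD_BULK_ATTR): a fixed-size stack array is indexed by a counter that must be rebuilt for every object processed in the loop.

```c
#include <stdio.h>

#define BULK_MAX 8	/* stand-in for the fixed bulk[] stack array size */

struct bulk_attr {
	const char *name;
	long        value;
};

/* Stand-in for SA_ADD_BULK_ATTR(): append one attribute and bump the index. */
static void
add_bulk_attr(struct bulk_attr *bulk, int *count, const char *name, long value)
{
	bulk[*count].name = name;
	bulk[*count].value = value;
	(*count)++;
}

int main(void)
{
	struct bulk_attr bulk[BULK_MAX];
	int count;

	/* One iteration per xattr-directory object being updated. */
	for (int obj = 0; obj < 10; obj++) {
		count = 0;	/* the fix: rebuild bulk[] from scratch each time */

		add_bulk_attr(bulk, &count, "uid", 1000);
		add_bulk_attr(bulk, &count, "gid", 1000);

		/* Without the per-iteration reset, count would keep growing
		 * across objects and the helper above would start writing
		 * past bulk[BULK_MAX - 1] by the fifth object, corrupting
		 * the stack -- hence the later NULL/garbage dereference. */
		printf("object %d: updating %d attrs\n", obj, count);
	}
	return 0;
}
```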
I encountered this issue on at least 82e996c, on Fedora 31 (Linux kernel 5.4.5-300.fc31.x86_64 and vanilla 5.4.7). No native encryption datasets.
System information
Describe the problem you're observing
When recursively setting the permissions for a directory via chmod, the process locked up. After inspecting the kernel message buffer, a NULL pointer seems to have been accessed in one of the attr structures. Unfortunately I do not have debugging symbols loaded, but I will load them after reboot.
The dataset involved is encrypted using native encryption.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs