TRIM/UNMAP/DISCARD support for vdevs #924
This adds TRIM (a.k.a. UNMAP, DISCARD, hole punching) support for disk vdevs. The original patch is from Pawel Jakub Dawidek <[email protected]>, who wrote it for FreeBSD. Etienne Dechamps <[email protected]> ported it to ZFS On Linux.

The code builds a map of regions that were freed. On every write the code consults the map and eventually removes ranges that were freed before but are now overwritten. Freed blocks are not TRIMed immediately. There is a tunable that defines how many TXGs we should wait before TRIMming freed blocks (64 by default). A low-priority thread TRIMs ranges when the time comes. During TRIM we keep in-flight ranges on a list to detect colliding writes: we have to delay writes that collide with in-flight TRIMs, in case requests are reordered and a write reaches the disk before the TRIM. We don't have to do the same for in-flight writes, as colliding writes just remove ranges to TRIM.

Most of the code stayed unchanged during the porting to Linux. The only big change is in the vdev disk module, since the FreeBSD and Linux interfaces for issuing discards to block devices are obviously different. On FreeBSD it seems that issuing a DELETE request of any size is sufficient; on Linux we have to be careful not to exceed maximum discard limits. That's why we introduce a new vdev_disk_io_trim() function, inspired by the Linux blkdev_issue_discard() function and the pre-existing vdev_disk_physio() function. The new function takes care of splitting discard requests into smaller ones if necessary.

In theory, the code should work for main pool disk vdevs, slog disk vdevs, and L2ARC disk vdevs, and it supports mirror and raidz. File vdevs are not supported yet. Note that the new feature is disabled by default (zfs_notrim=1). To use it, you have to explicitly set the module parameter "zfs_notrim" to "0". Be aware that this code is largely untested and brings huge risks of potential data corruption. Use at your own risk and expect data loss.
Note: you need dechamps/spl@299d642 for the build to work.
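To make the splitting behaviour described above more concrete, here is a minimal user-space sketch of the idea behind vdev_disk_io_trim(). It is not the actual ZFS code: the MAX_DISCARD_BYTES limit and the issue_one_discard() helper are hypothetical stand-ins for the real block-layer limits and bio submission.

```c
/*
 * Illustrative sketch only (not the actual vdev_disk_io_trim() code): split a
 * discard request into chunks that respect a device's maximum discard size.
 * The limit value and the issue_one_discard() callback are hypothetical
 * stand-ins for whatever the block layer exposes.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_DISCARD_BYTES	(1024 * 1024)	/* hypothetical per-request limit */

static void
issue_one_discard(uint64_t offset, uint64_t size)
{
	/* In the real code this would submit a DISCARD request to the device. */
	printf("discard: offset=%llu size=%llu\n",
	    (unsigned long long)offset, (unsigned long long)size);
}

static void
issue_discard_split(uint64_t offset, uint64_t size)
{
	while (size > 0) {
		uint64_t chunk = size > MAX_DISCARD_BYTES ?
		    MAX_DISCARD_BYTES : size;

		issue_one_discard(offset, chunk);
		offset += chunk;
		size -= chunk;
	}
}

int
main(void)
{
	/* A 2.5 MiB TRIM gets split into three requests. */
	issue_discard_split(4096, 2621440);
	return (0);
}
```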
This warning appeared in cc6cd40. In practice there's no code path where "flags" may be uninitialized, but it seems GCC is not smart enough to notice it.
Pawel Jakub Dawidek, author of most of cc6cd40, noticed a mistake in zio which could lead to a NULL pointer dereference with raidz. Pawel suggested this patch to fix it.
Wouldn't it make more sense to have zfs_trim=0 as a default? (Having options that are enabled by default in order to NOT do something seems like a double negative to me.)
Yes it is. In the general case I would agree with you, but:
Personally I really don't like options that begin with the word "no", but in this case I'm making an exception.
This adds TRIM support for file vdevs, using the newly-added VOP_SPACE compatibility interface which translates to Linux fallocate(FALLOC_FL_PUNCH_HOLE). When a TRIM request is issued to a file vdev, ZFS will "punch a hole" (deallocate space) inside the underlying file. Note that few Linux filesystems support hole punching (currently OCFS2, XFS, and ext4). Note that for this to work, you need to build against a kernel that supports file hole punching (i.e. Linux >= 2.6.38). In addition, for this to work in userspace (e.g. ztest), the installed userspace kernel headers (i.e. /usr/include/linux) must be up-to-date.
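For illustration, here is a small stand-alone program (not ZFS code) showing the underlying Linux primitive the file vdev path relies on: fallocate() with FALLOC_FL_PUNCH_HOLE, which must be combined with FALLOC_FL_KEEP_SIZE. It assumes a glibc recent enough to expose fallocate() and the FALLOC_FL_* flags through <fcntl.h> with _GNU_SOURCE; on older systems the flags live in <linux/falloc.h>.

```c
/*
 * Minimal user-space demonstration of the Linux hole punching interface that
 * the file vdev TRIM path ultimately relies on (fallocate() with
 * FALLOC_FL_PUNCH_HOLE, available since Linux 2.6.38 on ext4/XFS/OCFS2).
 * This is not ZFS code; it just shows the underlying primitive.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <offset> <length>\n", argv[0]);
		return (1);
	}

	int fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	off_t offset = (off_t)strtoll(argv[2], NULL, 0);
	off_t length = (off_t)strtoll(argv[3], NULL, 0);

	/*
	 * PUNCH_HOLE must be combined with KEEP_SIZE: the file length stays
	 * the same, but the punched range is deallocated on disk.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	    offset, length) != 0)
		perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

	close(fd);
	return (0);
}
```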
Support for file vdevs is here, using the Linux file hole punching capabilities (deallocating space inside files). Note that as of 4e789e1 you need openzfs/spl#168 to build. There are a few prerequisites for this to work; see the usage instructions in the pull request description below. You can confirm that you're building with kernel file hole punching support by checking the SPL configure script output. A good way to be sure that TRIM support on file vdevs actually works is also described in the usage instructions below.
When punching holes in file vdevs from userspace using VOP_SPACE(), errors are not reported correctly to the caller. This patch fixes that.
In cc6cd40, vdev_disk_io_trim() returns ENOTSUP if DISCARD isn't supported. This isn't consistent with the rest of the code, which uses EOPNOTSUPP.
The current code doesn't expect the value of zfs_notrim to change under its nose, so don't allow it.
ztest runs with TRIM enabled on a backing filesystem that doesn't support hole punching revealed the following race condition:
1. A write is issued to a vdev and added to the TRIM map inflight writes tree.
2. The very first TRIM request for this vdev is issued.
3. The TRIM request fails with EOPNOTSUPP, as expected. vdev_notrim is set.
4. The write completes. trim_map_write_done() is called.
5. trim_map_write_done() sees that vdev_notrim is set, and returns immediately.
6. As a result, the write stays in the TRIM map inflight writes tree forever.
7. When the vdev is closed, trim_map_destroy() tries to destroy the inflight writes tree, which triggers an assertion failure because it isn't empty.
This patch fixes the issue by removing the check on vdev_notrim in trim_map_write_done().
ztest runs with TRIM enabled revealed a race condition which occurs when a vdev close is pending while there are TRIM requests in flight. When the TRIM ZIOs complete, the SPA lock is released. The SPA lock is then immediately taken by the vdev close path, which calls trim_map_destroy() before trim_map_vdev_commit_done() has a chance to run. trim_map_destroy() then tries to destroy the inflight free tree, which fails with an assertion failure because it is non-empty. This patch fixes the issue by making trim_map_destroy() call trim_map_vdev_commit_done() before destroying the TRIM map.
Bad news: even with the recent fixes, ztest runs always end up corrupting the pool after a few seconds. In other words, the code probably isn't safe in its current state. Debugging this won't be easy.
Currently, vdev_notrim is reset to 0 after the vdev-specific open routine (e.g. vdev_disk_open()) has run. This defeats the purpose of setting vdev_notrim inside vdev_disk_open(), so don't do it.
After some debugging, it seems that the corruption is caused by TRIMming valid data: ztest tries to re-read data that was previously trimmed, and of course fails. Further investigation seems to indicate that the blocks being corrupted are always gang blocks, although I'm not 100% sure. In addition, I don't understand what the connection with RAID-Z is; maybe the issue exists for other types of vdev but only surfaces with RAID-Z, I don't know. Since this seems related to gang blocks, I took a closer look at the gang block code. Anyway, I still don't understand why we have to create a special case for gang blocks. Apparently, ztest tries to read a gang block after it has been freed, which doesn't make much sense to me.
@dechamps I don't have the time right now to help dig into any of this, but it might be worth striking up a conversation with Pawel. These seem like flaws that need to be corrected in the original patch, so he may be interested.
Well, actually, I'm keeping him informed of all the developments in this thread, including the race conditions I found earlier. For some reason he doesn't seem to want to comment on GitHub directly, but he does answer my emails. Or at least he did; I last heard from him 10 days ago. Maybe he's on vacation or something. Anyway, I'm still investigating, with or without his help.
Confirmed: if I test for gang blocks and special-case them, the problem goes away.
Found it!
...
As you can see, when the gang block leader is written (or read), the I/O size used is the size of the gang header itself, not the PSIZE from the block pointer. The PSIZE of a gang block leader is the logical size of the data, not the physical size of the indirect block, so using PSIZE there would be incorrect. It didn't cause any issues until now because the read/write paths already account for this. Unfortunately, the TRIM code wasn't doing that, so it was issuing TRIM requests with incorrect sizes. I think the reason this only happens with RAID-Z is that when testing with other vdev types, we get lucky and the bigger TRIM request doesn't actually touch other valid data. However, if my analysis is correct, then this bug could be a potential corruption hazard with any vdev type. The solution is to make sure the free (TRIM) code uses the correct, fixed gang header size as well. Now, the good news is, we're passing ztest with the fix.
Currently, the PSIZE of the block pointer is used as the I/O size for all free ZIOs. This is incorrect for gang header blocks, for which PSIZE is the logical size of the gang block data. This patch fixes the issue by using SPA_GANGBLOCKSIZE as the I/O size for "free gang header block" ZIOs.
This test is useless since 19afab1, which fixed the gang block issues. Besides, the original code wasn't even working properly, since io_bp is always NULL on a physical vdev ZIO.
In eefbe9b, we fixed the I/O size for "free gang header block" ZIOs. However, the fix also included gang data blocks, which are of variable size. Fix the fix by distinguishing between gang header blocks and gang data blocks.
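The following is an illustrative sketch of the size-selection logic these two fixes converge on. The constant, the enum, and the helper are stand-ins rather than the real ZFS macros; in ZFS the gang header occupies a single minimum-size block (the SPA_GANGBLOCKSIZE referred to in the commit above), while PSIZE of a gang header describes the logical size of the gang data.

```c
/*
 * Sketch of the size-selection logic for free/TRIM I/Os. The constant and the
 * helper are illustrative stand-ins, not the actual ZFS code: gang headers
 * live in a single minimum-size block, so PSIZE (which, for a gang header,
 * reflects the logical size of the gang data) must not be used to size the
 * free I/O for the header itself.
 */
#include <stdint.h>

#define EXAMPLE_GANG_HEADER_SIZE	512	/* stand-in for SPA_GANGBLOCKSIZE */

typedef enum {
	BLOCK_REGULAR,		/* ordinary block: PSIZE is the physical size */
	BLOCK_GANG_HEADER,	/* gang header: fixed-size on-disk block */
	BLOCK_GANG_DATA		/* gang data block: variable size, PSIZE is valid */
} block_kind_t;

static uint64_t
free_io_size(block_kind_t kind, uint64_t psize)
{
	/*
	 * Only the gang *header* needs the special case; gang *data* blocks
	 * (and regular blocks) are freed using their PSIZE as before.
	 */
	if (kind == BLOCK_GANG_HEADER)
		return (EXAMPLE_GANG_HEADER_SIZE);
	return (psize);
}

int
main(void)
{
	/* A gang header is freed as one small block regardless of PSIZE. */
	return (free_io_size(BLOCK_GANG_HEADER, 131072) == 512 ? 0 : 1);
}
```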
Here's an excerpt from a debugging session during a ztest run which failed because of the remaining corruption issue: https://gist.github.com/3706822 Apparently, ZFS writes an "L0 DMU objset" block, then frees it, then tries to read it, which obviously fails. Here we go again...
Currently, labels are preserved when TRIMming the cache vdev on zpool destroy. This is not what we want.
I'm setting up long ztest runs to investigate the assertion failure issue I reported some comments ago. Even a3adbc8 fails, so that's probably not a regression. I will be testing other commits as well. Bottom line: the branch doesn't pass a long (> 30 minutes) ztest run. It can't be considered stable yet.
Out of 63 test instances, 62 failed. 45 of the failed tests (73%) ended with the error message I started investigating some comments ago:
12 of the failed tests (19%) ended with an error message similar to the ones in #989. 3 tests failed with the following message:
1 test failed with the following message:
1 test failed with the following message:
We need to fix #989 before I can continue ztesting. I need reliable results.
I got a 100% failure rate over 185 three-hour ztest passes on dccbfaf (and commits after that). 183 tests (99%) failed with the same error message.
After further investigation, I think that's normal: it happens when an I/O is done on an inaccessible vdev.
Okay, now I'm getting somewhere. I noticed the following stack trace during one of my attempts to reproduce the issue:
I also managed to locate the thread responsible for the conflicting ZIO, i.e. the one that is already in the trim map inflight writes tree which results in the assertion failure:
It resembles a race condition: the second thread was about to remove the write from the tree, but it had to wait for the trim map mutex, which is owned by the first thread adding its own conflicting write. Consequently, instead of removing then adding the write, the first thread tried to add the write before it had a chance to be removed. There are interesting facts about these two writes:
Both writes are targeting the same physical vdev, same offset, same size (hence the conflict). One thing I don't understand is how such a situation can happen in the first place: if I interpret the traces correctly, it means we got two simultaneous ZIOs writing data at exactly the same offset on the same vdev. Isn't that frowned upon? Shouldn't it result in potential data corruption due to the unpredictable ordering of write requests? I would expect the first ZIO to wait until the second ZIO is done before issuing its write request.
I got another occurrence. The traces are the same, except this time I got two conflicting ZIOs with resilver priority and the repair flag set. In the meantime, I'm trying a patch which makes TRIM ignore repair writes. It looks like it fixes the issue, although I need more ztesting to be sure. Ignoring repair writes should be safe considering that, by definition, resilvering/self-healing doesn't write blocks that are freed, so they cannot conflict with inflight TRIM requests. At least that's the theory. I'm going to ask the Illumos folks about this conflicting writes issue.
@dechamps Are you developing with assertions enabled (i.e. --enable-debug)? |
Of course. How would I catch these assertion failures if I didn't? |
The trim map inflight writes tree assumes non-conflicting writes, i.e. that there will never be two simultaneous write I/Os to the same range on the same vdev. This seemed like a sane assumption; however, in actual testing, it appears that repair I/Os can very well conflict with "normal" writes. I'm not quite sure whether these conflicting writes are supposed to happen or not, but in the meantime, let's ignore repair writes. This should be safe considering that, by definition, we never repair blocks that are freed.
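A toy sketch of the filtering idea from this commit follows. The flag name matches the one used in zio.h, but its bit value and the helper function are illustrative stand-ins, not the actual trim_map code.

```c
/*
 * Illustrative sketch of the filtering idea: writes flagged as repair I/O
 * (resilver / self-healing) are simply not tracked in the TRIM map's inflight
 * writes tree. The flag name matches zio.h, but its bit value and this helper
 * are stand-ins, not the actual trim_map code.
 */
#include <stdint.h>

#define ZIO_FLAG_IO_REPAIR	(1u << 0)	/* placeholder bit for illustration */

static int
trim_map_should_track_write(uint32_t io_flags)
{
	/*
	 * Repair writes rewrite data that is still allocated, so by definition
	 * they can never overlap a pending TRIM of a freed range; skipping
	 * them avoids the spurious "conflicting write" assertions.
	 */
	return ((io_flags & ZIO_FLAG_IO_REPAIR) == 0);
}

int
main(void)
{
	/* A repair write is ignored; a normal write is tracked. */
	return (trim_map_should_track_write(ZIO_FLAG_IO_REPAIR) == 0 &&
	    trim_map_should_track_write(0) == 1 ? 0 : 1);
}
```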
Okay, according to my testing, 7be8b0c definitely fixes the conflicting-writes assertion failure.
This patch adds some improvements to the way the trim module considers TXGs:
- Free ZIOs are registered with the TXG from the ZIO itself, not the current SPA syncing TXG (which may be out of date);
- L2ARC I/Os are registered with a zero TXG number, as the L2ARC has no concept of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced TXG, not the currently syncing TXG. Indeed, under extremely unlikely race conditions, there is a risk we could trim blocks which have been freed in a TXG that has not finished syncing, resulting in potential data corruption in case of a crash.
Currently, the TRIM thread waits indefinitely until it is woken up by spa_sync(). This triggers a blocked task warning in Linux if there is no write activity on the pool for a long time. This patch fixes the issue by making the TRIM thread wake up every second to keep the kernel happy.
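As a rough user-space analogue of this change (the kernel code goes through the SPL's Solaris-style condition variable API rather than pthreads), the loop below waits on a condition variable with a one-second timeout instead of blocking indefinitely; the names and structure are illustrative only.

```c
/*
 * User-space analogue of the change above: instead of blocking on the
 * condition variable forever, the worker wakes up at least once per second,
 * so the kernel never sees a task blocked for minutes at a time.
 */
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool work_pending = false;
static bool exiting = false;

void *
trim_thread_like_loop(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!exiting) {
		if (!work_pending) {
			/* Wait at most one second, then recheck. */
			struct timespec deadline;
			clock_gettime(CLOCK_REALTIME, &deadline);
			deadline.tv_sec += 1;
			pthread_cond_timedwait(&cv, &lock, &deadline);
			continue;
		}
		work_pending = false;
		/* ... issue queued TRIM requests here ... */
	}
	pthread_mutex_unlock(&lock);
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, trim_thread_like_loop, NULL);
	sleep(3);	/* let the worker go through a few 1-second wakeups */

	pthread_mutex_lock(&lock);
	exiting = true;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lock);

	pthread_join(tid, NULL);
	return (0);
}
```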
While the current code delays TRIMs by the number of TXGs specified by the trim_txg_limit parameter, in most circumstances it still TRIMs frees from only one TXG at a time, i.e. it doesn't merge frees from adjacent TXGs. This patch allows TRIMs from several consecutive TXGs to be batched in order to merge more TRIM requests, thus reducing their number and making them larger. This should significantly improve TRIM performance, depending on your workload and how your physical vdev handles TRIM requests. The batch size, expressed in TXGs, is specified by the new trim_txg_batch parameter. It defaults to 32. Note that the default for trim_txg_limit has been reduced from 64 to 32 to keep the worst-case trim delay constant.
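To illustrate the batching arithmetic, here is a toy computation (an assumption on my part, not the actual trim_map algorithm) showing how rounding the "old enough to TRIM" cutoff down to a trim_txg_batch boundary causes frees from up to a whole batch of consecutive TXGs to be issued together. The parameter names come from the commit above; the formula itself is only an illustration.

```c
/*
 * Illustrative arithmetic only (not the actual trim_map code): one simple way
 * to batch frees from consecutive TXGs is to round the "old enough to TRIM"
 * cutoff down to a multiple of trim_txg_batch, so that frees accumulate and
 * are then issued for a whole batch of TXGs at once.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t trim_txg_limit = 32;	/* delay before a free may be trimmed */
static uint64_t trim_txg_batch = 32;	/* how many TXGs to merge per batch   */

static uint64_t
trim_cutoff_txg(uint64_t last_synced_txg)
{
	if (last_synced_txg <= trim_txg_limit)
		return (0);	/* nothing is old enough yet */

	uint64_t cutoff = last_synced_txg - trim_txg_limit;

	/* Only advance the cutoff in steps of trim_txg_batch TXGs. */
	return (cutoff - (cutoff % trim_txg_batch));
}

int
main(void)
{
	/* Frees from TXGs 1..32 become eligible only once TXG 64 has synced. */
	for (uint64_t txg = 60; txg <= 66; txg++)
		printf("synced=%llu cutoff=%llu\n", (unsigned long long)txg,
		    (unsigned long long)trim_cutoff_txg(txg));
	return (0);
}
```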
206dddd brings us TRIM TXG batching. This is an optimization aimed at reducing the number of TRIM requests issued by making them larger. This should significantly improve performance depending on your workload and the way your physical vdevs handle TRIM requests. There is a module parameter (trim_txg_batch) that controls the batch size, expressed in TXGs. Testing shows a 50% reduction in TRIM requests over a 5-minute ztest run. This is not bad considering ztest is probably close to the worst-case scenario for this feature (heavy random I/O on a constantly changing pool).
Currently, the trim module uses the same algorithm for data and cache devices when deciding to issue TRIM requests, based on how far in the past the TXG is. Unfortunately, this is not ideal for cache devices, because the L2ARC doesn't use the concept of TXGs at all. In fact, when using a pool for reading only, the L2ARC is written but the TXG counter doesn't increase, and so no new TRIM requests are issued to the cache device. This patch fixes the issue by using time instead of the TXG number as the criterion for trimming on cache devices. The basic delay/batch principle stays the same, but the parameters are expressed in seconds instead of TXGs. The new parameters are named trim_l2arc_limit and trim_l2arc_batch, and both default to 30 seconds.
5445fb9 brings us time-based TRIM for cache devices. This is an optimization over the previous behavior, which used the TXG counter to schedule TRIMs to cache devices even though that doesn't make a ton of sense for the L2ARC. We have two new module parameters: trim_l2arc_limit and trim_l2arc_batch, both expressed in seconds.
The TODO list is now empty. Everything is done: this branch is feature-complete, and I have implemented all the optimization ideas that came to mind. I'll close this pull request and open a new one based on a clean branch.
It could be useful to implement something similar to fstrim so that an administrator can TRIM the free space map. That would allow us to rejuvenate any pools that were created on SSDs prior to this work. Also, it might be useful to test with ashift=13 to ensure that there are no odd issues there. Modern SSDs have 8192-byte pages, so they should be using ashift=13, rather than ashift=12.
@ryao: I agree, but I certainly won't have the time to do what you ask. Unfortunately, I only have until tomorrow to wrap this up. We're almost there. ztest runs on e000967660eb6bc67dd9679b2729d811284a1a9b and eeb919ae2b49faa5f1f8282798201c4e2e6ded3d indicate these commits are now 100% stable (not a single failure over 34 one-hour passes). Yay! Unfortunately, the latest code, that is, my optimizations from yesterday, fails approximately 25% of the time with the following assertion:
Investigating. Hopefully that's the last one!
Continued in #1016. |
What is the status of this? |
See #1016 for the updated pull request |
Please, when discard()ing a newly added device, either make it a background job or add an option to skip it. On the 3500 GB LUN I test ZFS with, discarding the whole device with fstrim takes 8 hours (!!!), versus 21 minutes on a device that has already been discarded on the back end. In other words, if I have a 3500 GB LUN with old, dirty data and I try to discard it, it will hang there for 8 hours. This can potentially be very bad if you need the filesystem NOW (like in a backup recovery, etc.). The LUN in question comes from a Clariion CX4-480, connected via 4 Gb Fibre Channel. I don't see much activity (iostat, vmstat) when fstrim runs, and there is a huge difference between a clean and a dirty LUN, so the point of congestion is in the array itself. It is of course possible that fstrim is in some way horribly inefficient, but I'd err on the safe side. Also, performance suffers horribly on ext4 with -o discard, and the recommendation is to not use it and just use maintenance windows to run fstrim occasionally. I'd prefer such a mechanism in ZFS instead of automatic (even if deferred/optimized) discard.
This pull request is obsolete. Please comment on #1016. That being said, I hear your concerns about discard performance, and we'll do our best to alleviate this kind of issue.
NOTE: this pull request has been replaced by #1016.
This adds TRIM (a.k.a. UNMAP, DISCARD, hole punching) support for vdevs. The original patch is from Pawel Jakub Dawidek ([email protected]), who wrote it for FreeBSD. I'm just porting it to ZFS On Linux. Original discussion is in #598. Note that you need openzfs/spl#168 to build this.
Clean branch (caution: history may be rewritten): https://github.com/dechamps/zfs/compare/trim-clean
This code is largely untested (especially the Linux-specific parts) and carries a high risk of data corruption since it deals with deletion of data. DO NOT APPLY THIS PATCH IF YOU CARE ABOUT YOUR DATA. USE AT YOUR OWN RISK. EXPECT DATA LOSS.
@behlendorf: I'm not asking you to merge this. At least not right now. I'm just opening a pull request to give the patch more visibility and get more testing. As usual, as soon as it's ready and has received enough testing for your taste, I'll clean it up and open another, final pull request.
Features
Current status
zconfig.sh passes with flying colors starting from dccbfaf.
zfault.sh passes with flying colors starting from dccbfaf.
Usage instructions
Note that the new feature is disabled by default (zfs_notrim=1). To use it, you have to explicitly load the module with zfs_notrim=0. For TRIM to work, you need to have vdevs that support it, e.g. SSDs. If you don't have any, you can still test using a Linux RAM block device (the rd module), as it supports discards. Note that if you choose to test with rd, make sure to use ashift=12, else you will trigger a corruption bug in rd.

There is a module parameter (trim_txg_limit) that specifies after how many TXGs the freed data is actually discarded on the device (default 32). Indeed, the algorithm doesn't discard data immediately so that pool disaster recovery (i.e. zpool -F) doesn't break. I suggest you test with the default and with a low value (e.g. 1) to make the algorithm more aggressive, which might trigger different bugs.

There is a module parameter (trim_txg_batch) that specifies how many TXGs are squashed together in order to batch (merge) TRIM requests (default 32). The optimal value heavily depends on your workload and the performance characteristics of your physical vdevs.

Module parameters trim_l2arc_limit and trim_l2arc_batch work the same way but are specific to cache devices. They are expressed in seconds instead of TXGs.

There is a module parameter (zfs_trim_zero) that, when enabled, makes sure TRIMmed data is zeroed. Note that for disk vdevs, this effectively means that TRIM is disabled and replaced by "write zeroes". This mode allows ztest to test TRIM even on filesystems that don't support file hole punching. Outside ztest, it can also be used to diagnose corruption issues more reliably. Note that, for obvious reasons, performance greatly suffers in this mode.

The patch introduces a new kstat page in /proc/spl/kstat/zfs/zio_trim. It allows you to quickly check if TRIM is actually working or not. Keep in mind that if you're using a high trim_txg_limit (e.g. 64), then it may take a while before TRIM requests are actually issued.

TRIM support on file vdevs (a.k.a. file hole punching)
There are a few prerequisites for file (not disk) vdev support using the Linux file hole punching capabilities:
- The filesystem backing the file vdev must support hole punching; currently only ext4, XFS, and OCFS2 do. Some filesystems only expose the older truncate_range interface, so they should work, albeit only from kernel space. Ironically, ZFS itself doesn't support hole punching (yet).
- You need to build against a kernel that supports file hole punching (Linux >= 2.6.38).
- zfs_notrim still needs to be set to 0, no matter what the vdev type is.
- If you want to use this from userspace (e.g. ztest), be sure that your installed Linux headers (i.e. /usr/include/linux) are up to date (>= 2.6.38). Also, as mentioned above, if you're using ztest, be sure to run it on a filesystem that supports file hole punching (tmpfs doesn't). Note that you can't change module parameters using ztest, so you'll have to hack trim_map.c to manually set zfs_notrim=0 by default and rebuild.

A good way to be sure that TRIM support on file vdevs works is to exercise a file vdev with zfs_notrim=0 and trim_txg_limit=1, assuming /store is on ext4 or another filesystem with file hole punching support.

Again, this section only applies to file vdevs, not disk (block device) vdevs. Most users have no use for TRIM support on file vdevs.
TODO list
Note that I have no idea if I'll have enough time to see this whole thing through, so if you want to take care of some items of the TODO list, be my guest.