TRIM/UNMAP/DISCARD support for vdevs (2) #1016
Conversation
Currently, the PSIZE of the block pointer is used as the I/O size for all free ZIOs. This is incorrect for gang header blocks, for which PSIZE is the logical size of the gang block data. This patch fixes the issue by using SPA_GANGBLOCKSIZE as the I/O size for "free gang header block" ZIOs.
Currently, zavl users cannot call avl_is_empty since it is not exported. This patch fixes that.
The code builds a map of regions that were freed. On every write the code consults the map and eventually removes ranges that were freed before, but are now overwritten. Freed blocks are not TRIMed immediately. There is a tunable that defines how many TXGs we should wait before TRIMming freed blocks (64 by default). There is a low priority thread that TRIMs ranges when the time comes. During TRIM we keep in-flight ranges on a list to detect colliding writes - we have to delay writes that collide with in-flight TRIMs in case requests are reordered and a write reaches the disk before the TRIM. We don't have to do the same for in-flight writes, as colliding writes just remove ranges to TRIM. NOTE: this patch only adds the core infrastructure for TRIM. It doesn't implement actual discard requests on vdevs. In other words, the patch in itself doesn't actually do anything (yet). NOTE: you need DKIOCTRIM support in SPL for this patch to build. Signed-off-by: Etienne Dechamps <[email protected]>
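To make the delayed-TRIM bookkeeping concrete, below is a small, self-contained userspace model of the idea. It is not the actual trim_map.c code (which uses AVL trees, per-vdev locking and a dedicated thread); all names in it are illustrative. Freed ranges are remembered together with the TXG of the free, an overlapping write cancels or splits the pending range, and a commit step only TRIMs ranges that are at least trim_txg_limit TXGs old.

```c
/* Illustrative model only -- not the real trim_map.c. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct seg {
	uint64_t start, end;	/* freed byte range, [start, end) */
	uint64_t txg;		/* TXG in which the free happened */
	struct seg *next;
} seg_t;

static seg_t *freed;				/* ranges waiting to be TRIMmed */
static const uint64_t trim_txg_limit = 64;	/* wait this many TXGs */

/* Remember a freed range along with the TXG of the free. */
static void
map_free(uint64_t start, uint64_t end, uint64_t txg)
{
	seg_t *s = malloc(sizeof (*s));
	s->start = start; s->end = end; s->txg = txg;
	s->next = freed; freed = s;
}

/* A later write cancels (or splits) any pending TRIM that overlaps it. */
static void
map_write(uint64_t start, uint64_t end)
{
	seg_t **pp = &freed;
	while (*pp != NULL) {
		seg_t *s = *pp;
		if (s->end <= start || s->start >= end) {	  /* no overlap */
			pp = &s->next;
		} else if (start <= s->start && end >= s->end) {  /* covered */
			*pp = s->next;
			free(s);
		} else if (start <= s->start) {			  /* clip left */
			s->start = end;
			pp = &s->next;
		} else if (end >= s->end) {			  /* clip right */
			s->end = start;
			pp = &s->next;
		} else {					  /* split in two */
			seg_t *tail = malloc(sizeof (*tail));
			*tail = *s;		/* copies end, txg, next */
			tail->start = end;
			s->end = start;
			s->next = tail;
			pp = &tail->next;
		}
	}
}

/* "TRIM" (here: just print) ranges freed at least trim_txg_limit TXGs ago. */
static void
map_commit(uint64_t synced_txg)
{
	seg_t **pp = &freed;
	while (*pp != NULL) {
		seg_t *s = *pp;
		if (s->txg + trim_txg_limit <= synced_txg) {
			printf("TRIM [%llu, %llu)\n",
			    (unsigned long long)s->start,
			    (unsigned long long)s->end);
			*pp = s->next;
			free(s);
		} else {
			pp = &s->next;
		}
	}
}

int
main(void)
{
	map_free(0, 4096, 10);	/* block freed in TXG 10 */
	map_write(1024, 2048);	/* part of it is overwritten: no TRIM there */
	map_commit(100);	/* 100 >= 10 + 64, so the remainder is TRIMmed */
	return (0);
}
```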
Currently, compilation fails because trim_map_free_locked() exceeds the frame size. This is because we're allocating a large structure (zio_t) on the stack. This patch fixes the issue by only allocating the fields we need from the zio structure.
This patch adds a new kstat page: zio_trim, which indicates the total number of bytes TRIMmed, and the number of successful, unsupported and failed TRIM requests. This makes it easy to verify if TRIM is actually working or not.
Since 2fd4ce81a6748b46728127212a1ba384c008696f, gang block free zios have correct size, so there's no need to skip them. Besides, the test fails with a NULL dereference bug on RAID-Z because io_bp is NULL in this case.
Currently, the TRIM module doesn't check for spa_freeze_txg when issuing TRIM requests. In a normal run this isn't an issue since vdev_config_sync() doesn't wake the TRIM thread if the SPA is frozen. However, when closing a frozen pool, trim_thread_destroy() gets called, which wakes the TRIM thread, which results in a TRIM commit. As a result, all TXGs until spa_syncing_txg - txg_limit get TRIMmed, which includes spa_freeze_txg. Consequently, when reopening, ZFS tries to read data from the frozen TXG, and fails. This patch fixes the issue by taking spa_freeze_txg into account when committing TRIMs.
This adds DISCARD support to vdev_disk_physio(), and uses it to execute DKIOCTRIM requests. The implementation is inspired by the Linux blkdev_issue_discard() function.
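The in-kernel path builds its DISCARD requests in the spirit of blkdev_issue_discard(). As a rough userspace analogue (not part of the patch), the sketch below sends a single discard to a block device via the BLKDISCARD ioctl, which is handy for checking whether a scratch device honours discards at all. The device path, offset and length are whatever you pass on the command line, and the range you discard is destroyed.

```c
/* Userspace analogue of one DISCARD request. WARNING: destroys data. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKDISCARD */

int
main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <device> <offset> <length>\n",
		    argv[0]);
		return (1);
	}

	int fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* BLKDISCARD takes an {offset, length} pair, in bytes. */
	uint64_t range[2] = { strtoull(argv[2], NULL, 0),
	    strtoull(argv[3], NULL, 0) };

	if (ioctl(fd, BLKDISCARD, range) != 0)
		perror("BLKDISCARD (device may not support discards)");
	else
		printf("discarded %llu bytes at offset %llu\n",
		    (unsigned long long)range[1],
		    (unsigned long long)range[0]);

	close(fd);
	return (0);
}
```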
This adds TRIM support for file vdevs, using the newly-added VOP_SPACE compatibility interface which translates to Linux fallocate(FALLOC_FL_PUNCH_HOLE). When a TRIM request is issued to a file vdev, ZFS will "punch a hole" (deallocate space) inside the underlying file. Note that few Linux filesystems support hole punching (currently OCFS2, XFS, and ext4). Note that for this to work, you need to build against a kernel that supports file hole punching (i.e. Linux >= 2.6.38). In addition, for this to work in userspace (e.g. ztest), the installed userspace kernel headers (i.e. /usr/include/linux) must be up-to-date. NOTE: you need VOP_SPACE() support in SPL for this patch to build.
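To see the underlying primitive in isolation (the patch itself goes through the SPL VOP_SPACE compatibility layer rather than calling fallocate() directly), here is a minimal userspace sketch that punches a hole in an ordinary file. On a filesystem or kernel without support it fails with EOPNOTSUPP, which is also a quick way to check whether your ztest working directory qualifies.

```c
/* Minimal demonstration of the file hole punching primitive. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>	/* fallocate(), FALLOC_FL_* (with _GNU_SOURCE) */
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}

	int fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* Deallocate 1 MiB at offset 0; the hole reads back as zeroes. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	    0, 1 << 20) != 0) {
		perror(errno == EOPNOTSUPP ?
		    "hole punching not supported here" : "fallocate");
		close(fd);
		return (1);
	}

	printf("hole punched\n");
	close(fd);
	return (0);
}
```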
This patch adds a new module tunable: zfs_trim_zero. When this parameter is set to 1, ZFS makes sure that DKIOCTRIM zeroes the range being TRIMmed. This makes data corruption issues easier to reproduce, at the cost of greatly decreased performance. On disk vdevs, zfs_trim_zero replaces DISCARD with zero page writes, which means it will work even with vdevs that do not support DISCARD. Note that DISCARD won't be used even on vdevs that support it. On file vdevs, if FALLOC_FL_PUNCH_HOLE fails and zfs_trim_zero=1, then a zero buffer is written instead. Writing zeroes is unnecessary when hole punching works, because FALLOC_FL_PUNCH_HOLE guarantees that the punched range reads back as zeroes.
For ztest to reliably check for corruption issues in the TRIM code, we need to make sure TRIM zeroes data, even if the underlying filesystem doesn't support file hole punching.
ztest runs with TRIM enabled on a backing filesystem that doesn't support hole punching revealed the following race condition:
1. A write is issued to a vdev and added to the TRIM map inflight writes tree.
2. The very first TRIM request for this vdev is issued.
3. The TRIM request fails with EOPNOTSUPP, as expected. vdev_notrim is set.
4. The write completes. trim_map_write_done() is called.
5. trim_map_write_done() sees that vdev_notrim is set, and returns immediately.
6. As a result, the write stays in the TRIM map inflight writes tree forever.
7. When the vdev is closed, trim_map_destroy() tries to destroy the inflight writes tree, which triggers an assertion failure because it isn't empty.
This patch fixes the issue by removing the check on vdev_notrim in trim_map_write_done().
ztest runs with TRIM enabled revealed a race condition which occurs when a vdev close is pending while there are TRIM requests in flight. When the TRIM ZIOs complete, the SPA lock is released. The SPA lock is then immediately taken by the vdev close path, which calls trim_map_destroy() before trim_map_vdev_commit_done() has a chance to run. trim_map_destroy() then tries to destroy the inflight free tree, which fails with an assertion failure because it is non-empty. This patch fixes the issue by making trim_map_destroy() call trim_map_vdev_commit_done() before destroying the TRIM map.
With ashift=12, zio_vdev_io_start() expands all ZIOs with size < 4096 using zio_push_transform(), allocating a new ZIO buffer in the process. The allocation is of course unnecessary for DKIOCTRIM ZIOs, which don't hold any data. This patch makes sure a new buffer is only allocated on read/write ZIOs. This also fixes an assertion failure in __vdev_disk_physio because it was called with DISCARD and a non-null data buffer, which makes no sense.
When a FREE ZIO is issued on raidz, vdev_raidz_map_alloc() skips allocating data for the raidz columns because it is not needed. vdev_raidz_map_free(), however, has not been updated accordingly, and still tries to free the column data. As a result, zio_buf_free() is used with a NULL buffer, which corrupts the zio_buf_cache. This patch fixes the issue by making vdev_raidz_map_free() do a NULL check before freeing the column data. This fixes the following panic when running zconfig.sh -c: SPLError: 17651:0:(spl-kmem.c:1936:spl_kmem_cache_alloc()) ASSERTION(obj) failed SPLError: 17651:0:(spl-kmem.c:1936:spl_kmem_cache_alloc()) SPL PANIC
The trim map inflight writes tree assumes non-conflicting writes, i.e. that there will never be two simultaneous write I/Os to the same range on the same vdev. This seemed like a sane assumption; however, in actual testing, it appears that repair I/Os can very well conflict with "normal" writes. I'm not quite sure if these conflicting writes are supposed to happen or not, but in the meantime, let's ignore repair writes for now. This should be safe considering that, by definition, we never repair blocks that are freed.
The current code releases SCL_STATE while it is waiting for TRIM ZIOs to complete. That's dangerous because the vdev configuration could change while the I/O is in progress. This patch makes sure SCL_STATE is held during the whole duration of the I/O. This is consistent with the way e.g. zil_flush_vdevs() does things.
There's no reason to use the ZIO_FLAG_CONFIG_WRITER flag for TRIM ZIOs. Remove it.
With this patch, ZFS will attempt to TRIM the whole vdev in the following cases: any vdev on zpool create, zpool add, and zpool attach; spare and L2ARC devices on zpool import. We do this to make sure we start from a "clean state". This way, the physical device knows that all the data it held before it was added to the pool is now obsolete. This is similar to what mke2fs does.
This adds TRIM support to cache vdevs. When ARC buffers are removed from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(), the size previously occupied by the buffer gets scheduled for TRIMming. As always, actual TRIMs are only issued to the L2ARC after txg_trim_limit.
With this patch: - L2ARC vdevs are trimmed on export, except the labels which are preserved to keep import working. - L2ARC vdevs are trimmed entirely on remove and pool destroy.
This patch adds some improvements to the way the trim module considers TXGs: free ZIOs are registered with the TXG from the ZIO itself, not the current SPA syncing TXG (which may be out of date); L2ARC frees are registered with a TXG number of zero, as the L2ARC has no concept of TXGs; and the TXG limit for issuing TRIMs is now computed from the last synced TXG, not the currently syncing TXG. Indeed, under extremely unlikely race conditions, there is a risk we could trim blocks which have been freed in a TXG that has not finished syncing, resulting in potential data corruption in case of a crash.
Currently, the TRIM thread waits indefinitely until it is woken up by spa_sync(). This triggers a blocked task warning in Linux if there is no write activity on the pool for a long time. This patch fixes the issue by making the TRIM thread wake up every second to keep the kernel happy.
While the current code delays TRIMs by a number of TXGs specified by the trim_txg_limit parameter, in most circumstances it still TRIMs frees from only one TXG at a time, i.e. it doesn't merge frees from adjacent TXGs. This patch batches TRIMs from several consecutive TXGs in order to merge more TRIM requests, thus reducing their number and making them larger. This should significantly improve TRIM performance, depending on your workload and how your physical vdev handles TRIM requests. The batch size, expressed in TXGs, is specified by the new trim_txg_batch parameter. It defaults to 32. Note the default for trim_txg_limit has been reduced from 64 to 32 to keep the worst-case trim delay constant.
Currently, the trim module uses the same algorithm for data and cache devices when deciding to issue TRIM requests, based on how far in the past the TXG is. Unfortunately, this is not ideal for cache devices, because the L2ARC doesn't use the concept of TXGs at all. In fact, when using a pool for reading only, the L2ARC is written but the TXG counter doesn't increase, and so no new TRIM requests are issued to the cache device. This patch fixes the issue by using time instead of the TXG number as the criterion for trimming on cache devices. The basic delay/batch principle stays the same, but the parameters are expressed in seconds instead of TXGs. The new parameters are named trim_l2arc_limit and trim_l2arc_batch, and both default to 30 seconds.
@dechamps Thanks for cleaning up the pull request and for the detailed summary. I'm working through reviewing the changes now, but I've noticed the following ASSERT gets regularly tripped by ztest with the TRIM patch stack. This doesn't occur prior to these changes and I'm able to hit it quite easily. Have you seen this? ztest: ../../cmd/ztest/ztest.c:5733: Assertion `0 == spa_open(ztest_opts.zo_pool, &spa, ((char *)__func__)) (0x0 == 0x5)' failed.
That's strange, I never got that with any recent version of the patch. I do get this error, however, when I forget to disable the reguid test in ztest, but this branch has this test disabled, so I assume that's not it. What is the filesystem behind your ztest working directory (by default)? In addition, can you try with 6a3ceba? That should be the first stable commit in the patch set.
That appears to be it. In my VM's
Okay, now this is getting very strange. Personally, I'm running ztest on top of tmpfs. I have two hypotheses regarding your issue:
Is the ext4 filesystem you're running ztest on mounted with the
@ryao: if you want to work on this, the best approach would probably be to use @skiselkov's code and mix it with some of my additional patches (like TRIM whole devices on add, etc.), since most of these are not related to the actual algorithm used to TRIM data blocks.
@ryao The latest patch version can be found here: http://cr.illumos.org/~webrev/skiselkov/zfs_unmap/ (did a quick rebase on latest illumos master). This implementation of trim works by tracking the metaslab allocator and issuing TRIMs (in a time-delayed fashion) on regions which the metaslab allocator has released. The tracking itself is done using in-core spacemaps, so we effectively coalesce adjacent extents. I try to keep stuff reasonably documented, but in case there are unclear portions, give me a kick.
I just set up a virtual machine that has access to two real SSD partitions in order to help test TRIM on ZoL. Do you have a more-or-less current rebase to something close to ZoL HEAD for me? Please also point me to any stress-testing tools that you know of.
@skiselkov That sounds sufficient. Just put a pool on an iSCSI LUN that is mapped to a zvol and you should be able to test the TRIM functionality. The only difference is that ZFS won't be simulating the deficiencies of real controllers. @niekbergboer There is no rebase yet, but I will see about adapting @skiselkov's work soon.
On Linux a simpler solution is to use a ramdisk (rd) as it supports DISCARD operations; however, keep https://bugzilla.kernel.org/show_bug.cgi?id=47711 in mind.
@skiselkov May I ask, what is the status of this port?
@sopmot It's a work in progress, not yet done.
Hi all, we would love to see TRIM support in ZFS on Linux - how can we support this pull request (e.g. by providing test hardware with many SSDs and full SSH access)? Many thanks for all your hard work and efforts!! Best regards
I added support for the new queuing rewrite to TRIM in FreeBSD, so it may be worth anyone working on the metaslab version looking to include that. It does currently have some temporary fixes for IO reordering panics which I expect to change when I hear back from George Wilson on his fixes for that, but the core is solid.
This pull request is going to need to be updated to use the latest code from George. We certainly want to avoid doing our own thing if possible.
Hi all, we are seriously interested in supporting this pull request and are also ready to donate towards the development. As SSD storage becomes more and more of a standard, we think this may be a big step forward for the ZFS on Linux project. Somebody interested out there? We look forward to the feedback. Best regards,
Agree, would like to see this one prioritized. What work is remaining at this point? A rebase of the patch series?
I asked @ryao's opinion about this on IRC a couple of weeks ago. My understanding is that he would otherwise be interested. :) We also briefly discussed that an fstrim-like solution would probably be the easiest and fastest way, and it may be the best solution (if I recall correctly...).
For some (myself included) a trim of log and l2 devices at boot / import would go a long way, and shouldn't (right? they start "empty" -- after any log replay if needed -- each time) be too involved...
👍
Great news! It means we can drop this pull request (which has bugs that I never got around to fixing, and is probably technically inferior anyway) and just implement the Linux DISCARD calls to back the Illumos stuff. That should be much easier.
Awesome news! I could test it asap! kpande, could you explain how trim could hurt my SSD's performance? From my point of view it's aimed at increasing the performance of my drive.
The "discard" mount option slows down writes considerably, and it could have a detrimental effect on the longevity of your SSD.
Hey guys, I'm the author of this. I'd caution against early porting. This is still a WIP; we need to figure out the whitelist of devices where we can safely enable this by default. Also, raidz is currently untested, so it might eat data, for all I know. Lastly, I'd like to get on-demand trim (aka "fstrim") implemented as well before I'd consider this stuff "ready".
I don't think it's part of public builds yet. It literally got rolled into the gate this Monday to get it into our internal QA builds so QA folk can share in the joys of losing data and having unexplained scrub errors :P I guess people could try and build their own, but that's not for the faint of heart, as this stuff is literally hot off the press. Give me a few more weeks to get it to work well :)
Any update on this? This feature is really useful... at least with fstrim, which can be launched at any time and does not slow down writes.
Any status on this?
@Marlinc #3656 has been working well for me in testing so far. I've now tested with more complex vdev arrangements: raidz, raidz2, raidz3, etc. It was rebased against a post-0.6.5 master over this weekend. I added vdev-level tracing support to handle its new DTRACE2 call and possibly other vdev-related tracing in the future. I still want to augment the documentation a bit and to investigate the txg batching feature somewhat (at the very least, we need a nonzero check for one of the new module parameters).
Closing; we'll pick up this support once it's finalized in illumos. In the meantime @dweeezil has ported this change to Linux and it is working well.
This adds TRIM (a.k.a UNMAP, DISCARD, hole punching) support for vdevs. The original patch is from Pawel Jakub Dawidek who wrote it for FreeBSD. I just ported it to ZFS On Linux with added features and fixed bugs. This is a clean version of #924. Note that you need openzfs/spl#168 to build this.
This patch should be 99% stable. I'm still not sure about the last 1% because I'm sometimes having ztest failures which are difficult to reproduce and might be caused by ZoL ztest bugs. Unfortunately, I'm running out of time and I won't be able to investigate any further, so I'm leaving this to anyone interested.
In any case, if there are indeed bugs, actual failures should be extremely rare and are very unlikely to cause any data corruption. Nevertheless, I would advise anyone playing with this to err on the safe side with a good backup policy, at least until the code is thoroughly reviewed.
@behlendorf: you told me George Wilson is interested in this work. You should direct him to this pull request for review. Hopefully he'll be able to tell if there are stability issues with this code or not.
NOTE: it seems GitHub displays commits in chronological order in this pull request, as opposed to the logical (parent-child) order, which makes it harder to understand and review. If you want to review this, I suggest you use the compare view, which lists commits in their actual order.
Features
TODO list
Usage instructions
Note that the new feature is enabled by default. To disable it, pass `zfs_notrim=1` as a module parameter. For TRIM to work, you need vdevs that support it, e.g. SSDs. If you don't have any, you can still test using a Linux RAM block device (the `rd` module), as it supports discards. Note that if you choose to test with `rd`, make sure to use `ashift=12`, else you will trigger a corruption bug in `rd`.
There is a module parameter (`trim_txg_limit`) that specifies after how many TXGs the freed data is actually discarded on the device (default 32). Indeed, the algorithm doesn't discard data immediately so that pool disaster recovery (i.e. `zpool -F`) doesn't break. I suggest you test with the default and with a low value (e.g. 1 or 0) to make the algorithm more aggressive, which might trigger different bugs.
There is a module parameter (`trim_txg_batch`) that specifies how many TXGs are squashed together in order to batch (merge) TRIM requests (default 32). The optimal value heavily depends on your workload and the performance characteristics of your physical vdevs. I suggest you test with the default and with a low value (e.g. 1 or 0) to make the algorithm more aggressive, which might trigger different bugs.
Module parameters `trim_l2arc_limit` and `trim_l2arc_batch` work the same way but are specific to cache devices. They are expressed in seconds instead of TXGs.
There is a module parameter (`zfs_trim_zero`) that, when enabled, makes sure TRIMmed data is zeroed. Note that for disk vdevs, this effectively means that TRIM is disabled and replaced by "write zeroes". This mode allows ztest to test TRIM even on filesystems that don't support file hole punching. Outside ztest, it can also be used to diagnose corruption issues more reliably. Note that, for obvious reasons, performance greatly suffers in this mode.
The patch introduces a new kstat page in `/proc/spl/kstat/zfs/zio_trim`. It allows you to quickly check if TRIM is actually working or not. Keep in mind that if you're using a high `trim_txg_limit` (e.g. 64), then it may take a while before TRIM requests are actually issued.
TRIM support on file vdevs (a.k.a. file hole punching)
There are a few prerequisites for file (not disk) vdev support using the Linux file hole punching capabilities:
- The filesystem backing your file vdevs must support hole punching; few Linux filesystems currently do (OCFS2, XFS, and ext4). tmpfs supports the `truncate_range` interface, so it should work (albeit only from kernel space). Ironically, ZFS itself doesn't (yet).
- You need to build against a kernel that supports file hole punching (i.e. Linux >= 2.6.38).
- If you want to use it from userspace (e.g. `ztest`), be sure that your installed Linux headers (i.e. `/usr/include/linux`) are up to date (>= 2.6.38). Also, as mentioned above, if you're using `ztest`, be sure to run it on a filesystem that supports file hole punching (tmpfs doesn't). Note that you can't change parameters using `ztest`, so you'll have to hack `trim_map.c` to manually set `zfs_notrim=0` by default and rebuild.
A good way to be sure that TRIM support on file vdevs works is to do something like the following with `trim_txg_limit=0` and `trim_txg_batch=0`, assuming `/store` is on ext4 or another filesystem with file hole punching support: create a pool backed by a file under `/store`, write some data to it, then delete the data and watch the file's allocated size (e.g. as reported by `du`) shrink as holes are punched. A small helper for checking the allocated size follows below.
Again, this section only applies to file vdevs, not disk (block device) vdevs. Most users have no use for TRIM support on file vdevs.