
SATA trim for vdev members #598

Closed
GregorKopka opened this issue Mar 12, 2012 · 39 comments
Labels
Type: Feature (feature request or new feature)

Comments

@GregorKopka
Contributor

ZoL should issue SATA TRIM commands to the underlying block devices when blocks are freed.

For cache devices (unless persistent L2ARC is implemented at some point in the future): on cache device activation (and deactivation, e.g. zpool remove), a TRIM command should free the whole space available on such devices (minus any needed persistent superblocks).

Only then will SSDs be informed of unused blocks so that their wear-leveling algorithms can function as intended.
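
For reference (editor's illustration, not ZoL code): on Linux, the whole-device discard proposed above for a cache device corresponds to the BLKDISCARD ioctl on the block device, the same request the blkdiscard(8) utility issues. A minimal user-space sketch:

```c
/*
 * Editor's illustration only (not ZoL code): discard an entire block
 * device via the BLKDISCARD ioctl, as proposed above for cache
 * devices on activation/removal.  Equivalent to `blkdiscard <dev>`.
 * WARNING: this destroys all data on the device.
 */
#include <fcntl.h>
#include <linux/fs.h>   /* BLKGETSIZE64, BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/<device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) != 0) { perror("BLKGETSIZE64"); return 1; }

    uint64_t range[2] = { 0, size };   /* byte offset, byte length */
    if (ioctl(fd, BLKDISCARD, &range) != 0) { perror("BLKDISCARD"); return 1; }

    printf("discarded %llu bytes\n", (unsigned long long)size);
    close(fd);
    return 0;
}
```

Per-block TRIM on free would issue the same kind of request, just for individual freed ranges instead of the whole device.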

@behlendorf
Contributor

I believe there was some work on this going on at Illumos. We should check and see what they have done; with luck we'll just be able to port their changes with some modifications for Linux.

@Rudd-O
Contributor

Rudd-O commented Jun 10, 2012

What's the story on this? I badly need to TRIM my underlying device here.

@dechamps
Contributor

Should be easy for L2ARC devices (just discard the whole device on add/import) and SLOG devices (discard the log when the txg is synced). Not so easy for actual data devices: as I remember from some discussion on the original ZFS mailing list a few years ago, there were serious issues involved. Most importantly, immediately discarding unused blocks could make zpool disaster recovery (i.e. import -F) impossible.

@Rudd-O
Contributor

Rudd-O commented Jun 11, 2012

What about batched discard? ZFS can determine which areas of the disks are unused by a process of elimination, so it should at least implement what the fstrim command needs to do its job.

@ryao
Contributor

ryao commented Jul 5, 2012

It should be possible to implement this using the free space map. The only problem is that no one has time to do it.

@ryao
Contributor

ryao commented Jul 13, 2012

One of the FreeBSD developers has developed this:

http://people.freebsd.org/~pjd/patches/zfstrim.patch

He told me that it is meant for single-SSD pools. We only need to port it to gain TRIM support.

@dechamps
Contributor

This patch does seem to implement the desired feature, and it even delays discards by some TXGs to avoid breaking zpool import -F, so it should be good. It's also quite complex and intrusive, which is not an issue per se, but we should keep in mind that any bug in this kind of code could have very dangerous consequences (i.e. data loss), especially if we accidentally issue a discard on real, non-freed data. This needs careful reviewing and heavy, thorough testing. I'm especially interested in how the author made sure to prevent some potentially devastating race conditions, like, say, discarding a range while it's being written again at the same time (for example due to command reordering in the block device queue).

So, is there any discussion on this patch somewhere? Code review? Test results? Googling "zfstrim.patch" doesn't seem to yield anything. Quite frankly, merging this patch without extensive code review and testing would be pure madness IMO.

Also, this patch doesn't discard whole cache and log devices on export/import, although that's easy to implement separately (and much less dangerous).

@ryao
Contributor

ryao commented Jul 15, 2012

@dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

@dechamps
Contributor

@dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

Mail sent.

By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

I'm not quite sure yet; it depends on what my workload will be in the coming week. Anyway, as I understand it, the patch is very easy to port: it seems the only thing to do is to replace FreeBSD's BIO_DELETE operation with Linux's BLKDISCARD.

@dechamps
Contributor

Here are pjd's answers to my questions:

I told him that this is mostly for SSD-only pools, not single-SSD pools. The customer I prepared this for is using this with pools that contain multiple SSDs.

Broadly speaking, how does your code work?

The code builds a map of regions that were freed. On every write the code consults the map and removes any ranges that were freed before but are now being overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines how many txgs we should wait before TRIMming freed blocks (64 by default).

There is a low-priority thread that TRIMs ranges when the time comes. During TRIM we keep in-flight ranges on a list to detect colliding writes: we have to delay writes that collide with in-flight TRIMs in case something is reordered and the write reaches the disk before the TRIM. We don't have to do the same for in-flight writes, as colliding writes just remove the ranges to TRIM.

What steps did you take to prevent potentially dangerous race conditions from happening (see my comment)?

The race you were talking about is handled by keeping a list of in-flight TRIMs, so we can detect that a colliding write is happening and delay that write until the TRIM is done.
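
To make the bookkeeping described above concrete, here is a toy sketch of the idea (editor's illustration; the names and the flat list are invented and this is not the FreeBSD patch, which uses proper range trees, locking, and a real TRIM thread): freed ranges are tagged with the txg in which they were freed, a write cancels any pending TRIM it overlaps, and a write that overlaps an in-flight TRIM must be delayed.

```c
/* Toy model of a delayed-TRIM map with collision detection. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TRIM_TXG_DELAY 64           /* txgs to wait before TRIMming */

struct trim_range {
    uint64_t start, len;
    uint64_t freed_txg;             /* txg in which the range was freed */
    bool     inflight;              /* a TRIM for this range has been issued */
    struct trim_range *next;
};

static struct trim_range *trim_map; /* ranges freed but not yet TRIMmed */

/* Record a freed range; it becomes eligible for TRIM after the delay. */
void trim_map_free(uint64_t start, uint64_t len, uint64_t txg)
{
    struct trim_range *r = calloc(1, sizeof(*r));
    r->start = start; r->len = len; r->freed_txg = txg;
    r->next = trim_map; trim_map = r;
}

/*
 * Called for every write.  Overlapping pending ranges are simply dropped
 * (the space is live again, so it must not be TRIMmed).  If the write
 * overlaps an in-flight TRIM, the caller must delay the write so it
 * cannot be reordered ahead of the TRIM by the device or the queue.
 */
bool trim_map_write(uint64_t start, uint64_t len)
{
    bool must_wait = false;
    struct trim_range **rp = &trim_map;

    while (*rp != NULL) {
        struct trim_range *r = *rp;
        bool overlap = start < r->start + r->len && r->start < start + len;

        if (overlap && r->inflight) {
            must_wait = true;               /* delay the write */
            rp = &r->next;
        } else if (overlap) {
            *rp = r->next;                  /* cancel the pending TRIM */
            free(r);
        } else {
            rp = &r->next;
        }
    }
    return must_wait;
}

/* Low-priority thread body: issue TRIMs for ranges that are old enough. */
void trim_thread_step(uint64_t current_txg)
{
    for (struct trim_range *r = trim_map; r != NULL; r = r->next) {
        if (!r->inflight && current_txg >= r->freed_txg + TRIM_TXG_DELAY) {
            r->inflight = true;
            printf("TRIM %llu+%llu\n", (unsigned long long)r->start,
                (unsigned long long)r->len);
            /* real code would issue the discard here and remove the
             * range from the map once the device completes it */
        }
    }
}
```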

How has the code been tested?

It was and still is being tested by my customer. Some bugs were found and fixed.

Has the code been reviewed by anyone (especially experienced ZFS developers)?

No. At least not yet. The code is still pretty fresh.

Do you trust the code to be reliable and safe to use in production environments?

From what I know, my customer has already put it in production, or at least started some production tests on a limited number of machines. I agree the code could use some independent review and testing in different environments. For example, it hasn't been tested with pools that contain log vdevs.

In light of this new information, I think we can proceed with the merge, but maybe we should disable the feature by default (zfs_notrim=1) and expose it as a module parameter marked experimental in the description. @behlendorf: what are your thoughts on this?

@behlendorf
Contributor

I think we should proceed very very very carefully here. We're talking about touching some delicate code with a fairly intrusive change. I know this would be a great feature, but it would be easy to get this subtly wrong.

Why don't we start by porting the FreeBSD patch to a ZoL branch? Then those who are interested in this feature can get some serious testing on the code before we merge it in. I'd rather not risk destabilizing master until we are more confident about this change.

@Rudd-O
Contributor

Rudd-O commented Jul 18, 2012

I think that is very reasonable.


@dechamps
Contributor

I agree. I'm still not sure I'll have the time to come up with the branch though, so if someone wants to jump in, go ahead. Like I said, this should be fairly easy to port.

@ryao
Contributor

ryao commented Jul 19, 2012

I am in agreement with everyone else. My time is being spent on other things though. If someone does the initial port, I am willing to field test it on my desktop. The worst case is that I will have to restore it from a backup.

@maxximino
Contributor

My laptop is available to test the patch; I do regular backups.
My laptop's pool has a single vdev, which is a LUKS device over a partition on an OCZ Agility 3.
ZFS-over-LUKS (on an HDD that died this afternoon) works beautifully.

@Rudd-O
Contributor

Rudd-O commented Jul 19, 2012

I can probably test this too. I, too, have ZFS-over-LUKS on an SSD, fully replicated to my NAS for safety, and I personally hacked the LUKS code to enable discards, so this should be EENTERESTING. Just point me to a branch, I will merge it into a branch of my own systemd-enabled ZFS on Linux, and I'll give it a shot.


@dechamps
Contributor

dechamps commented Sep 1, 2012

Here it is: #924. It definitely needs extensive testing, so if you're not afraid to corrupt your pool, go ahead.

@Rudd-O
Contributor

Rudd-O commented Jan 29, 2013

Status on this?

@behlendorf
Contributor

Largely done and working, but it needs to be rebased on master and seriously tested. I definitely do want to get it merged before it gets much staler.

@clopez
Contributor

clopez commented Apr 12, 2013

Maybe it is time to rebase it? Perhaps a beta/experimental tarball could be released to engage more people in testing the feature.

behlendorf removed this from the 0.7.0 milestone on Oct 6, 2014
@dweeezil
Contributor

I figure I'm far enough along to mention that I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols. It's a good ways off yet but so far is looking good. I'll post more information to this particular issue as it becomes available.

@dechamps
Contributor

What do you mean by "deal with zvols"? Discards are already implemented for zvols and have been working for years now - see #334, #384, #1010. Discard on zvol is a frontend feature, which is very different from discard on vdevs, which is a backend feature.

@richardelling
Contributor

On Jul 24, 2015, at 1:46 PM, Tim Chase [email protected] wrote:

I figure I'm far enough along to mention that I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols.

This makes no sense. Surely you mean initiating TRIM/UNMAP for vdevs?
That is totally different than zvols, as that work is already done.
-- richard


@dweeezil
Contributor

Let me clarify the part regarding zvols: I want to evaluate whether to try replacing the existing zvol discard code with that from Nexenta. The particular use case for which I was asked to do this work is file vdevs, so I figured on porting the code, testing with file and disk vdevs first, and then looking at the zvol situation later.

@dweeezil
Contributor

This appears to be coming together rather nicely and, so far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

I'll also be porting Nexenta's on-demand trim feature.

The WIP spl code is in https://github.com/dweeezil/spl/tree/ntrim and the WIP zfs code is in https://github.com/dweeezil/zfs/tree/ntrim.
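
As a side note on what "trimming" a file vdev via hole-punching means at the syscall level: on Linux this is fallocate(2) with FALLOC_FL_PUNCH_HOLE (which must be combined with FALLOC_FL_KEEP_SIZE), returning a byte range of the backing file to the underlying filesystem while leaving the file size unchanged. A minimal illustrative sketch, not the actual ntrim code:

```c
/* Editor's illustration: punch a hole in a file-backed vdev. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <offset> <length>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = (off_t)atoll(argv[2]);
    off_t length = (off_t)atoll(argv[3]);

    /* Deallocate the range but keep the apparent file size the same. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
        offset, length) != 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return 1;
    }

    close(fd);
    return 0;
}
```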

@richardelling
Contributor

On Jul 26, 2015, at 8:23 AM, Tim Chase [email protected] wrote:

This appears to be coming together rather nicely and, so far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

sigh, so many ZFS ports, so little time :-)
I had not realized that the ZoL code is so far behind illumos in this regard.
Yes, absolutely agree with what you are proposing.

FWIW, Solaris 11 has had vdev-level UNMAP/TRIM for years. By default it is disabled. Given the large variance in drive and array performance during UNMAP/TRIM, it is clear that default=enabled is not a good idea. Perhaps some day the drive vendors will get it right...
-- richard


@tomposmiko

Out of curiosity, will an fstrim-like (offline) solution be available?

@dweeezil
Contributor

dweeezil commented Aug 2, 2015

The patch set seems to be working at this point. Please see #3656.

@tomposmiko With the on-demand trim feature, zpool trim <pool> does the same thing; however, I'm going to look into adding support for the standard Linux FITRIM ioctl, which would allow fstrim to work properly.

@dweeezil
Contributor

dweeezil commented Aug 3, 2015

@tomposmiko I had forgotten how weird FITRIM is. Among other things, it targets a single filesystem, which in ZFS really isn't possible since space allocation is a pool-level attribute. I think zpool trim ought to suffice.
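
For context on why FITRIM fits awkwardly: it is issued on a file descriptor belonging to one mounted filesystem and takes a struct fstrim_range (start, length, minimum extent size) scoped to that filesystem, whereas ZFS tracks free space per pool. A minimal sketch of what fstrim essentially does (editor's illustration only):

```c
/* Editor's illustration: what fstrim(8) essentially does.  FITRIM is
 * issued on an fd belonging to one mounted filesystem, which is why it
 * maps poorly onto ZFS, where free space is tracked per pool. */
#include <fcntl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);   /* any fd inside the filesystem */
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range r = {
        .start  = 0,
        .len    = (__u64)-1,            /* whole filesystem */
        .minlen = 0,
    };

    if (ioctl(fd, FITRIM, &r) != 0) { perror("FITRIM"); return 1; }

    /* On return, r.len holds the number of bytes actually trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)r.len);
    close(fd);
    return 0;
}
```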

@Bronek

Bronek commented Aug 6, 2015

@dweeezil theoretically you could perform a reverse lookup from filesystem to pool(s) and then perform "zpool trim" on the result...

@CMCDragonkai

Would this work on ZFS over LUKS? We would need to enable discard for the LUKS device of course, but what else would be needed?

@dasJ

dasJ commented Nov 22, 2016

@CMCDragonkai I currently run ZFS on LUKS and I just had to open the volume with --allow-discards (or discard when using crypttab). I set autotrim=on on the pool for automatic trimming.

@UralZima

UralZima commented Feb 3, 2017

Hi. Can anyone provide some instructions on how to run the TRIM version of ZFS on Gentoo? I urgently need it; my SSD is failing. Is it possible to add the patches to portage or modify the ebuild file?
@dasJ, how are you running it? Maybe @dweeezil, you can help?
Thank you very much.
P.S. I already posted a question on another thread (the Nexenta one); not sure if someone will answer soon.

@drescherjm

I urgently need it, my SSD is failing.

I don't believe TRIM will help fix a failing SSD.

@awnz

awnz commented Feb 4, 2017

If your SSD is failing, as @drescherjm said, TRIM won't help. What you want to do, urgently, is:

  1. Stop writing anything to that SSD.
  2. Copy everything off it immediately to safer storage. Start with the most important data first.
  3. Replace the SSD with a new one.

Hopefully you are using RAID and backups.

Since this Github issue is for the ZFS TRIM feature and not for support, if you need any volunteer assistance with this, talk to me on Twitter at @aw_nz (https://twitter.com/aw_nz)

@UralZima

UralZima commented Feb 4, 2017

Thanks for the answers. I understand, but I have a heavy DB running on it and I don't want to trash a newly bought SSD again. Of course I have backups/snapshots. I just want to enable/test TRIM on ZFS.
@awnz, thanks for your offer to help. I don't have a Twitter account; let's talk in this web chat room:
https://tlk.io/trimzfs Anyone who is interested, please join.

I think these instructions will help other testers/devs to test the TRIM feature, so it should be on-topic here.
Is it enough to git clone the ntrim branch of spl, run ./configure && make && make install, and do the same for zfs with the ntrim-next branch? Will that generate the .ko files which I then need to move into the initramfs? Or does a cleaner way exist for Gentoo?

Thank you very much.

@cwedgwood
Contributor

@behlendorf, @dweeezil: close this as a duplicate, since it's been superseded by #5925?

@mailinglists35

mailinglists35 commented Mar 22, 2019

Is pull request #8419 the final one that will actually fix this issue? (I don't understand why this issue was closed last year)

@behlendorf
Contributor

@mailinglists35 yes, #8419 is the final version. This was closed as a duplicate of a previous version of the TRIM work.

behlendorf added a commit that referenced this issue Mar 29, 2019
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
it straightforward to add the platform-specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipeline; one
exception is that TRIM zios may exceed the 16M block size limit
since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_trim tree.  The
ms_trim tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Contributions-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8419 
Closes #598
allanjude pushed a commit to allanjude/zfs that referenced this issue Jun 29, 2019
pcd1193182 added a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023