
SATA trim for vdev members #598

Closed
GregorKopka opened this issue Mar 12, 2012 · 39 comments
Labels
Type: Feature (feature request or new feature)

Comments

@GregorKopka
Contributor

ZoL should issue SATA TRIM commands to the underlying block devices when blocks are freed.

For cache devices (unless persistent L2ARC is implemented at some point in the future): on cache device activation (and deactivation, e.g. zpool remove), a TRIM command should free the whole space available on such devices (minus any needed persistent superblocks).

Only then will SSDs be informed of unused blocks so that their wear-leveling algorithms can function as intended.
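
For reference (editor's illustration, not ZoL code): on Linux, the whole-device discard proposed above for a cache device corresponds to the BLKDISCARD ioctl on the block device, the same request the blkdiscard(8) utility issues. A minimal user-space sketch:

```c
/*
 * Editor's illustration only (not ZoL code): discard an entire block
 * device via the BLKDISCARD ioctl, as proposed above for cache
 * devices on activation/removal.  Equivalent to `blkdiscard <dev>`.
 * WARNING: this destroys all data on the device.
 */
#include <fcntl.h>
#include <linux/fs.h>   /* BLKGETSIZE64, BLKDISCARD */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/<device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t size;
    if (ioctl(fd, BLKGETSIZE64, &size) != 0) { perror("BLKGETSIZE64"); return 1; }

    uint64_t range[2] = { 0, size };   /* byte offset, byte length */
    if (ioctl(fd, BLKDISCARD, &range) != 0) { perror("BLKDISCARD"); return 1; }

    printf("discarded %llu bytes\n", (unsigned long long)size);
    close(fd);
    return 0;
}
```

Per-block TRIM on free would issue the same kind of request, just for individual freed ranges instead of the whole device.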

@behlendorf
Contributor

I believe there was some work on this going on at Illumos. We should check and see what they have done; with luck we'll just be able to port their changes with some modifications for Linux.

@Rudd-O
Contributor

Rudd-O commented Jun 10, 2012

What's the story on this? I badly need to TRIM my underlying device here.

@dechamps
Contributor

Should be easy for L2ARC devices (just discard the whole device on add/import) and SLOG devices (discard the log when the txg is synced). Not so easy for actual data devices: as I remember from some discussion on the original ZFS mailing list a few years ago, there were serious issues involved. Most importantly, immediately discarding unused blocks could make zpool disaster recovery (i.e. import -F) impossible.

@Rudd-O
Contributor

Rudd-O commented Jun 11, 2012

What about batched discard? ZFS can determine which areas of the disks are unused by a process of elimination, so it should at least implement what the fstrim command needs to do its job.

@ryao
Contributor

ryao commented Jul 5, 2012

It should be possible to implement this using the free space map. The only problem is that no one has time to do it.

@ryao
Contributor

ryao commented Jul 13, 2012

One of the FreeBSD developers has developed this:

http://people.freebsd.org/~pjd/patches/zfstrim.patch

He told me that it is meant for single-SSD pools. We only need to port it to gain TRIM support.

@dechamps
Contributor

This patch does seem to implement the desired feature, and it even delays discards by some TXGs to avoid breaking zpool import -F, so it should be good. It's also quite complex and intrusive, which is not an issue per se, but we should keep in mind that any bug in this kind of code could have very dangerous consequences (i.e. data loss), especially if we accidentally issue a discard on real, non-freed data. This needs careful reviewing and heavy, thorough testing. I'm especially interested in how the author made sure to prevent some potentially devastating race conditions, like, say, discarding a range while it's being written again at the same time (for example due to command reordering in the block device queue).

So, is there any discussion on this patch somewhere? Code review? Test results? Googling "zfstrim.patch" doesn't seem to yield anything. Quite frankly, merging this patch without extensive code review and testing would be pure madness IMO.

Also, this patch doesn't discard whole cache and log devices on export/import, although that's easy to implement separately (and much less dangerous).

@ryao
Contributor

ryao commented Jul 15, 2012

@dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

@dechamps
Contributor

@dechamps It was a funded project in FreeBSD. pjd told me that the people who funded it have put it into production use. You can email him @freebsd.org with questions.

Mail sent.

By the way, do you intend to try porting this? It is on my to do list, although you probably would get to it sooner than I will.

I'm not quite sure yet; it depends on what my workload will be in the coming week. Anyway, as I understand it, the patch is very easy to port: it seems the only thing to do is to replace FreeBSD's BIO_DELETE operation with Linux's BLKDISCARD.

@dechamps
Contributor

Here are pjd's answers to my questions:

I told him that this is mostly for SSD-only pools, not single-SSD pools. The customer I prepared this for is using this with pools that contain multiple SSDs.

Broadly speaking, how does your code work?

The code builds a map of regions that were freed. On every write the code consults the map and removes any ranges that were freed before but are now being overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines how many txgs we should wait before TRIMming freed blocks (64 by default).

There is a low-priority thread that TRIMs ranges when the time comes. During TRIM we keep in-flight ranges on a list to detect colliding writes: we have to delay writes that collide with in-flight TRIMs in case something is reordered and the write reaches the disk before the TRIM. We don't have to do the same for in-flight writes, as colliding writes just remove the ranges to TRIM.

What steps did you take to prevent potentially dangerous race conditions from happening (see my comment)?

The race you were talking about is handled by keeping a list of in-flight TRIMs, so we can detect that a colliding write is happening and delay that write until the TRIM is done.
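
To make the bookkeeping described above concrete, here is a toy sketch of the idea (editor's illustration; the names and the flat list are invented and this is not the FreeBSD patch, which uses proper range trees, locking, and a real TRIM thread): freed ranges are tagged with the txg in which they were freed, a write cancels any pending TRIM it overlaps, and a write that overlaps an in-flight TRIM must be delayed.

```c
/* Toy model of a delayed-TRIM map with collision detection. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TRIM_TXG_DELAY 64           /* txgs to wait before TRIMming */

struct trim_range {
    uint64_t start, len;
    uint64_t freed_txg;             /* txg in which the range was freed */
    bool     inflight;              /* a TRIM for this range has been issued */
    struct trim_range *next;
};

static struct trim_range *trim_map; /* ranges freed but not yet TRIMmed */

/* Record a freed range; it becomes eligible for TRIM after the delay. */
void trim_map_free(uint64_t start, uint64_t len, uint64_t txg)
{
    struct trim_range *r = calloc(1, sizeof(*r));
    r->start = start; r->len = len; r->freed_txg = txg;
    r->next = trim_map; trim_map = r;
}

/*
 * Called for every write.  Overlapping pending ranges are simply dropped
 * (the space is live again, so it must not be TRIMmed).  If the write
 * overlaps an in-flight TRIM, the caller must delay the write so it
 * cannot be reordered ahead of the TRIM by the device or the queue.
 */
bool trim_map_write(uint64_t start, uint64_t len)
{
    bool must_wait = false;
    struct trim_range **rp = &trim_map;

    while (*rp != NULL) {
        struct trim_range *r = *rp;
        bool overlap = start < r->start + r->len && r->start < start + len;

        if (overlap && r->inflight) {
            must_wait = true;               /* delay the write */
            rp = &r->next;
        } else if (overlap) {
            *rp = r->next;                  /* cancel the pending TRIM */
            free(r);
        } else {
            rp = &r->next;
        }
    }
    return must_wait;
}

/* Low-priority thread body: issue TRIMs for ranges that are old enough. */
void trim_thread_step(uint64_t current_txg)
{
    for (struct trim_range *r = trim_map; r != NULL; r = r->next) {
        if (!r->inflight && current_txg >= r->freed_txg + TRIM_TXG_DELAY) {
            r->inflight = true;
            printf("TRIM %llu+%llu\n", (unsigned long long)r->start,
                (unsigned long long)r->len);
            /* real code would issue the discard here and remove the
             * range from the map once the device completes it */
        }
    }
}
```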

How has the code been tested?

It was and still is being tested by my customer. Some bugs were found and fixed.

Has the code been reviewed by anyone (especially experienced ZFS developers)?

No. At least not yet. The code is still pretty fresh.

Do you trust the code to be reliable and safe to use in production environments?

From what I know, my customer has already put it in production, or at least started some production tests on a limited number of machines. I agree the code could use some independent review and testing in different environments. For example, it hasn't been tested with pools that contain log vdevs.

In light of this new information, I think we can proceed with the merge, but maybe we should disable the feature by default (zfs_notrim=1) and expose it as a module parameter marked experimental in the description. @behlendorf: what are your thoughts on this?

@behlendorf
Contributor

I think we should proceed very very very carefully here. We're talking about touching some delicate code with a fairly intrusive change. I know this would be a great feature, but it would be easy to get this subtly wrong.

Why don't we start by porting the FreeBSD patch to a ZoL branch? Then those who are interested in this feature can get some serious testing on the code before we merge it in. I'd rather not risk destabilizing master until we are more confident about this change.

@Rudd-O
Contributor

Rudd-O commented Jul 18, 2012

I think that is very reasonable.


@dechamps
Contributor

I agree. I'm still not sure I'll have the time to come up with the branch though, so if someone wants to jump in, go ahead. Like I said, this should be fairly easy to port.

@ryao
Contributor

ryao commented Jul 19, 2012

I am in agreement with everyone else. My time is being spent on other things though. If someone does the initial port, I am willing to field test it on my desktop. The worst case is that I will have to restore it from a backup.

@maxximino
Contributor

My laptop is available to test the patch; I do regular backups.
My laptop's pool has a single vdev, which is a LUKS device over a partition on an OCZ Agility 3.
ZFS-over-LUKS (on an HDD that died this afternoon) works beautifully.

@Rudd-O
Contributor

Rudd-O commented Jul 19, 2012

I can probably test this too. I, too, have ZFS-over-LUKS on an SSD, fully replicated to my NAS for safety, and I personally hacked the LUKS code to enable discards, so this should be EENTERESTING. Just point me to a branch, I will merge it into a branch of my own systemd-enabled ZFS on Linux, and I'll give it a shot.


@dechamps
Contributor

dechamps commented Sep 1, 2012

Here it is: #924. It definitely needs extensive testing, so if you're not afraid to corrupt your pool, go ahead.

@Rudd-O
Contributor

Rudd-O commented Jan 29, 2013

Status on this?

@behlendorf
Contributor

Largely done and working, but it needs to be rebased on master and seriously tested. I definitely do want to get it merged before it gets much staler.

@clopez
Contributor

clopez commented Apr 12, 2013

Maybe it is time to rebase it? Perhaps a beta/experimental tarball could be released to engage more people in testing the feature.

behlendorf removed this from the 0.7.0 milestone on Oct 6, 2014
@dweeezil
Contributor

I figure I'm far enough along to mention that I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols. It's a good ways off yet but so far is looking good. I'll post more information to this particular issue as it becomes available.

@dechamps
Contributor

What do you mean by "deal with zvols"? Discards are already implemented for zvols and have been working for years now - see #334, #384, #1010. Discard on zvol is a frontend feature, which is very different from discard on vdevs, which is a backend feature.

@richardelling
Contributor

On Jul 24, 2015, at 1:46 PM, Tim Chase [email protected] wrote:

I figure I'm far enough along to mention that I'm working on porting Nexenta's TRIM support. It uses a rather different approach than the one described earlier in this issue. My plan is to try to get disk and file vdevs working first and then deal with zvols.

This makes no sense. Surely you mean initiating TRIM/UNMAP for vdevs?
That is totally different than zvols, as that work is already done.
-- richard


@dweeezil
Contributor

Let me clarify the part regarding zvols: I want to evaluate whether to try replacing the existing zvol discard code with that from Nexenta. The particular use case for which I was asked to do this work is file vdevs, so I figured on porting the code, testing with file and disk vdevs first, and then looking at the zvol situation later.

@dweeezil
Contributor

This appears to be coming together rather nicely and, so far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

I'll also be porting Nexenta's on-demand trim feature.

The WIP spl code is in https://github.com/dweeezil/spl/tree/ntrim and the WIP zfs code is in https://github.com/dweeezil/zfs/tree/ntrim.
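
As a side note on what "trimming" a file vdev via hole-punching means at the syscall level: on Linux this is fallocate(2) with FALLOC_FL_PUNCH_HOLE (which must be combined with FALLOC_FL_KEEP_SIZE), returning a byte range of the backing file to the underlying filesystem while leaving the file size unchanged. A minimal illustrative sketch, not the actual ntrim code:

```c
/* Editor's illustration: punch a hole in a file-backed vdev. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <offset> <length>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = (off_t)atoll(argv[2]);
    off_t length = (off_t)atoll(argv[3]);

    /* Deallocate the range but keep the apparent file size the same. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
        offset, length) != 0) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        return 1;
    }

    close(fd);
    return 0;
}
```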

@richardelling
Contributor

On Jul 26, 2015, at 8:23 AM, Tim Chase [email protected] wrote:

This appears to be coming together rather nicely and, so far, vdev file trimming (via hole-punching) appears to be working properly and, as expected, the existing zvol code works just fine (tested with a "discard"-mounted ext4 atop a zvol). I've not dealt with vdev disks yet but am instead running a bunch of stress tests with vdev file-based pools to make sure nothing has broken.

sigh, so many ZFS ports, so little time :-)
I had not realized that the ZoL code is so far behind illumos in this regard.
Yes, absolutely agree with what you are proposing.

FWIW, Solaris 11 has had vdev-level UNMAP/TRIM for years. By default it is disabled. Given the large variance in drive and array performance during UNMAP/TRIM, it is clear that default=enabled is not a good idea. Perhaps some day the drive vendors will get it right...
-- richard


@tomposmiko

Out of curiosity, will an fstrim-like (offline) solution be available?

@dweeezil
Contributor

dweeezil commented Aug 2, 2015

The patch set seems to be working at this point. Please see #3656.

@tomposmiko With the on-demand trim feature, zpool trim <pool> does the same thing; however, I'm going to look into adding support for the standard Linux FITRIM ioctl, which would allow fstrim to work properly.

@dweeezil
Contributor

dweeezil commented Aug 3, 2015

@tomposmiko I had forgotten how weird FITRIM is. Among other things, it targets a single filesystem, which in ZFS really isn't possible since space allocation is a pool-level attribute. I think zpool trim ought to suffice.
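
For context on why FITRIM fits awkwardly: it is issued on a file descriptor belonging to one mounted filesystem and takes a struct fstrim_range (start, length, minimum extent size) scoped to that filesystem, whereas ZFS tracks free space per pool. A minimal sketch of what fstrim essentially does (editor's illustration only):

```c
/* Editor's illustration: what fstrim(8) essentially does.  FITRIM is
 * issued on an fd belonging to one mounted filesystem, which is why it
 * maps poorly onto ZFS, where free space is tracked per pool. */
#include <fcntl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);   /* any fd inside the filesystem */
    if (fd < 0) { perror("open"); return 1; }

    struct fstrim_range r = {
        .start  = 0,
        .len    = (__u64)-1,            /* whole filesystem */
        .minlen = 0,
    };

    if (ioctl(fd, FITRIM, &r) != 0) { perror("FITRIM"); return 1; }

    /* On return, r.len holds the number of bytes actually trimmed. */
    printf("trimmed %llu bytes\n", (unsigned long long)r.len);
    close(fd);
    return 0;
}
```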

@Bronek

Bronek commented Aug 6, 2015

@dweeezil theoretically you could perform a reverse lookup from filesystem to pool(s) and then perform "zpool trim" on the result...

@CMCDragonkai

Would this work on ZFS over LUKS? We would need to enable discard for the LUKS device of course, but what else would be needed?

@dasJ

dasJ commented Nov 22, 2016

@CMCDragonkai I currently run ZFS on LUKS and I just had to open the volume with --allow-discards (or discard when using crypttab). I set autotrim=on on the pool for automatic trimming.

@UralZima

UralZima commented Feb 3, 2017

Hi. Can anyone provide some instructions on how to run the TRIM version of ZFS on Gentoo? I urgently need it; my SSD is failing. Is it possible to add the patches to portage or modify the ebuild file?
@dasJ, how are you running it? Maybe @dweeezil, you can help?
Thank you very much.
P.S. I already posted a question on another thread (the Nexenta one); not sure if someone will answer soon.

@drescherjm

I urgently need it, my SSD is failing.

I don't believe TRIM will help fix a failing SSD.

@awnz

awnz commented Feb 4, 2017

If your SSD is failing, as @drescherjm said, TRIM won't help. What you want to do, urgently, is:

  1. Stop writing anything to that SSD.
  2. Copy everything off it immediately to safer storage. Start with the most important data first.
  3. Replace the SSD with a new one.

Hopefully you are using RAID and backups.

Since this Github issue is for the ZFS TRIM feature and not for support, if you need any volunteer assistance with this, talk to me on Twitter at @aw_nz (https://twitter.com/aw_nz)

@UralZima

UralZima commented Feb 4, 2017

Thanks for the answers. I understand, but I have a heavy DB running on it and I don't want to trash a newly bought SSD again. Of course I have backups/snapshots. I just want to enable/test TRIM on ZFS.
@awnz, thanks for your offer to help. I don't have a Twitter account; let's talk in this web chat room:
https://tlk.io/trimzfs Anyone who is interested, please join.

I think these instructions will help other testers/devs to test the TRIM feature, so it should be on-topic here.
Is it enough to git clone the ntrim branch of spl, run ./configure && make && make install, and do the same for zfs with the ntrim-next branch? Will that generate the .ko files which I then need to move into the initramfs? Or does a cleaner way exist for Gentoo?

Thank you very much.

@cwedgwood
Contributor

@behlendorf, @dweeezil: close this as a duplicate, since it's been superseded by #5925?

@mailinglists35

mailinglists35 commented Mar 22, 2019

Is pull request #8419 the final one that will actually fix this issue? (I don't understand why this issue was closed last year)

@behlendorf
Contributor

@mailinglists35 yes, #8419 is the final version. This was closed as a duplicate of a previous version of the TRIM work.

behlendorf added a commit that referenced this issue Mar 29, 2019
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
it straightforward to add the platform-specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipeline; one
exception is that TRIM zios may exceed the 16M block size limit
since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_trim tree.  The
ms_trim tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <[email protected]>
Reviewed-by: Tim Chase <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Contributions-by: Saso Kiselkov <[email protected]>
Contributions-by: Tim Chase <[email protected]>
Contributions-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #8419 
Closes #598
allanjude pushed a commit to allanjude/zfs that referenced this issue Jun 29, 2019
pcd1193182 added a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023