Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRIM/Discard support from Nexenta #3656

Closed
wants to merge 8 commits into from
Closed

Conversation

dweeezil
Copy link
Contributor

@dweeezil dweeezil commented Aug 2, 2015

This patch stack includes Nextenta's support for TRIM/Discard on disk and file vdevs as well as an update to the dkio headers for appropriate Solaris compatibility. It requires the current https://github.com/dweeezil/spl/tree/ntrim patch in order to compile properly.

The usual disclaimers apply at this point: I've performed moderate testing with ext4-backed file vdevs and light testing with SSD-backed disk vdevs and it appears to work properly. Use at your own risk. It may DESTROY YOUR DATA! I'm posting the pull request because it seems to work during initial testing and I'd like the buildbots to get a chance at it (which I'm expecting to fail unless they use the corresponding SPL code).

The initial TRIM support (currently in commit 719301c) caused frequent deadlocks in ztest due to the SCL_ALL spa locking during the trim operations. The follow-on patch to support on-demand trim changed the locking scheme and I'm no longer seeing deadlocks with either ztest or normal operation.

The final last commit (currently 9e5cfd7) adds ZIL logging for zvol trim operations. This code was mostly borrowed from an older Nexenta patch (referenced in the commit log) and has been merged into the existing zvol trim function.

In order to enable the feature, you must use zpool set autotrim=on on the pool and the zfs_trim module parameter must be set to 1 (which is its default value). The zfs_trim parameter controls the lower-level vdev trimming whereas the pool property controls it at a higher level. By default, trims are batched and only applied every 32 transaction groups as controlled by the new zfs_txgs_per_trim parameter. This allows for zpool import -T to continue to be useful. Finally, by default, only regions of at least 1MiB are trimmed as set by the zfs_trim_min_ext_sz module parameter.

@dweeezil dweeezil changed the title WIP - TIM/Discard support from Nexenta WIP - TRIM/Discard support from Nexenta Aug 2, 2015
@edillmann
Copy link
Contributor

Hi @dweeezil

FYI: zpool trim rpool triggers the following warning

[ 61.844048] Large kmem_alloc(101976, 0x1000), please file an issue at:
[ 61.844048] https://github.com/zfsonlinux/zfs/issues/new
[ 61.844053] CPU: 3 PID: 392 Comm: spl_system_task Tainted: P OE 3.19.0 #4
[ 61.844054] Hardware name: Dell Inc. Dell Precision M3800/Dell Precision M3800, BIOS A07 10/14/2014
[ 61.844055] 000000000000c2d0 ffff88041557fc48 ffffffff81732ffb 0000000000000001
[ 61.844057] 0000000000000000 ffff88041557fc88 ffffffffa0209b73 ffff880400000000
[ 61.844059] 00000000000018e3 ffff880409ef9c00 ffff880409ef9c00 ffff8802c42d08c0
[ 61.844061] Call Trace:
[ 61.844068] [] dump_stack+0x45/0x57
[ 61.844091] [] spl_kmem_zalloc+0x113/0x180 [spl]
[ 61.844128] [] zio_trim+0x79/0x1b0 [zfs]
[ 61.844147] [] metaslab_exec_trim+0xa6/0xf0 [zfs]
[ 61.844166] [] metaslab_trim_all+0x10c/0x1a0 [zfs]
[ 61.844188] [] vdev_trim_all+0x13d/0x310 [zfs]
[ 61.844194] [] taskq_thread+0x205/0x450 [spl]
[ 61.844198] [] ? wake_up_state+0x20/0x20
[ 61.844203] [] ? taskq_cancel_id+0x120/0x120 [spl]
[ 61.844206] [] kthread+0xd2/0xf0
[ 61.844208] [] ? kthread_create_on_node+0x180/0x180
[ 61.844210] [] ret_from_fork+0x7c/0xb0
[ 61.844212] [] ? kthread_create_on_node+0x180/0x180

module/zfs/zio.c Outdated
return (sub_pio);

num_exts = avl_numnodes(&tree->rt_root);
dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to change this to vmem_zalloc() to address the issue that @edillmann reported due to the way that we implemented kmem_zalloc(). It should be ifdefed to Linux because the O3X port that will likely merge this code is using the real kmem_zalloc().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to rework zio_trim() as well as the other consumers of the dkioc free lists to work with a linked list rather than an array. This would allow us to avoid the large allocations but would cause the code to diverge a bit from the upstream. Another benefit is that we'd avoid the double allocation and copy which typically occurs in zio_trim() when zfs_trim_min_ext_sz is set to a large value.

@sempervictus
Copy link
Contributor

@dweeezil: sorry to be pain, but curious to know status on this - we've got a few SSD-only pools to play with for a few days before we stuff 'em into prod (making sure our hardware doesnt screw us), so we can do a bit of testing on this without losing production data if you happen to have some test paths for us to run through. Thanks

@dweeezil
Copy link
Contributor Author

dweeezil commented Sep 9, 2015

@sempervictus I'm actively looking for feedback. The patch does need to be refreshed against a current master codebase which I'll try to do today. There's a bit of interference with the recent zvol improvements.

In my own testing, the patch does appear to work properly, although the behavior of the TRIM "batching" needs a bit better documentation and, possibly, slightly different implementation (IIRC, one or both of the parameters only has certain effect upon module load and/or pool import). I'd also like to add some kstats to help monitor its behavior.

I've used the on-demand TRIM quite a bit and it seems to work perfectly. You can TRIM a pool with zpool trim <pool> and monitor its progress with zpool status (although I'd expect it to be pretty instantaneous on most SSDs). Most of my testing, however, has used file-based vdevs but I have also used real SSDs. I just got a couple of new SSDs today which I plan on using for a bit more extensive testing.

There's also a backport to 0.6.4.2 in a branch named "ntrim-0.6.4.2" ("ntrim-0.6.4.1" for SPL).

@sempervictus
Copy link
Contributor

Soon as this is updated to reflect changes in master we'll add it to our stack. One potential caveat is that we generally utilize dm-crypt with the discard option at mount time. Any thoughts on potential side effects from this? Has this sort of setup been tested in any way?

@dweeezil dweeezil changed the title WIP - TRIM/Discard support from Nexenta TRIM/Discard support from Nexenta Sep 14, 2015
@edillmann
Copy link
Contributor

Hi @dweeezil ,

Just to let you know I have been running this pull rq since it was released, and beside the kmem_alloc warning, I did not see any problem or corruption on my test zpool (dual ssd mirror). The system has been crunching video camera recording for 2 months :-)

Is there any hope of having it rebased on master ?

@sempervictus
Copy link
Contributor

Tried to throw this into our stack today and noticed it has some conflicts with ABD in the raidz code. Rumor has it that should be merged "soon after the 0.6.5 tag" so i'm hoping by next rebase it'll be in there (nudge @behlendorf ) :).

@edillmann
Copy link
Contributor

@dweeezil I didn't see it was already rebased, thanks.

@Mic92
Copy link
Contributor

Mic92 commented Oct 18, 2015

So SATA Trim is currently not supported, according to the comments in the source or is this handled by SPL properly?

@dweeezil
Copy link
Contributor Author

@Mic92 SATA TRIM works just fine and I've tested it plenty. The documentation is still the original from Illumos. TRIM will work on any block device vdev supporting BLKDISCARD or any file vdev on which the containing filesystem supports fallocate hole punching.

@greg-hydrogen
Copy link

@dweeezil - I would love to test this patch but it conflicts with the ABD branch (pull 3441) in vdev_raidz.c

I have the .rej file from both sides (patched abd first then this patch, and then patched this patch first then abd) if that helps at all

much appreciated

@skiselkov
Copy link
Contributor

Hey guys, I wanted to get your take on the latest submission on this that we're trying to get upstreamed from Nexenta. I'd primarily like to make bottom end of the ZFS portion more accommodating to Linux & FreeBSD.
If you could, please drop by and take a look at https://reviews.csiden.org/r/263/

@dweeezil
Copy link
Contributor Author

dweeezil commented Nov 4, 2015

@greg-hydrogen I tried transplanting the relevant commits on to ABD awhile ago and other than the bio argument issues, the other main conflict is the logging I added to discards on zvols. The vdev conflicts you likely ran into are pretty easy to fix. I'll try to get an ABD-based version of this working within the next few days.

@skiselkov I'll check it out. It looks to be a port of the same Nexenta code in this pull request, correct?

@skiselkov
Copy link
Contributor

@dweeezil It is indeed, with some minor updates & fixes.

@vaLski
Copy link

vaLski commented Nov 10, 2015

Can't compile it on CentOS 6.7. Attempted to install

spl-0.6.5.3 from github
zfs-0.6.5.3 from github + dweeezil:ntrim / #3656

Used the following part "If, instead you would like to use the GIT version, use the following commands instead:" from http://zfsonlinux.org/generic-rpm.html

On the last step make rpm-utils rpm-dkms

It fails with:

Preparing... ########################################### [100%]
1:zfs-dkms ########################################### [100%]
Removing old zfs-0.6.5 DKMS files...


Deleting module version: 0.6.5

completely from the DKMS tree.

Done.
Loading new zfs-0.6.5 DKMS files...
Building for 2.6.32-573.7.1.el6.x86_64
Building initial module for 2.6.32-573.7.1.el6.x86_64
Error! Bad return status for module build on kernel: 2.6.32-573.7.1.el6.x86_64 (x86_64)
Consult /var/lib/dkms/zfs/0.6.5/build/make.log for more information.
warning: %post(zfs-dkms-0.6.5-36_gafb9fad.el6.noarch) scriptlet failed, exit status 10

Log says

CC [M] /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36:33: error: sys/dkioc_free_util.h: No such file or directory
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c: In function 'vdev_disk_io_start':
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: 'DKIOCFREE' undeclared (first use in this function)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: (Each undeclared identifier is reported only once
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:693: error: for each function it appears in.)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:696: error: 'dkioc_free_list_t' undeclared (first use in this function)
/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:696: error: 'dfl' undeclared (first use in this function)
make[5]: *** [/var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o] Error 1
make[4]: *** [/var/lib/dkms/zfs/0.6.5/build/module/zfs] Error 2
make[3]: *** [module/var/lib/dkms/zfs/0.6.5/build/module] Error 2
make[3]: Leaving directory /usr/src/kernels/2.6.32-573.7.1.el6.x86_64' make[2]: *** [modules] Error 2 make[2]: Leaving directory/var/lib/dkms/zfs/0.6.5/build/module'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/lib/dkms/zfs/0.6.5/build'
make: *** [all] Error 2

strace shows

[pid 20344] open("include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/kernels/2.6.32-573.7.1.el6.x86_64/arch/x86/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/var/lib/dkms/zfs/0.6.5/build/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/spl-0.6.5/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/src/spl-0.6.5/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] open("/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)
[pid 20344] write(2, "sys/dkioc_free_util.h: No such f"..., 48) = 48

Files are at

find / -name dkioc_free_util.h
/usr/include/libspl/sys/dkioc_free_util.h
/usr/src/zfs-0.6.5/lib/libspl/include/sys/dkioc_free_util.h
/usr/src/zfs/lib/libspl/include/sys/dkioc_free_util.h
/var/lib/dkms/zfs/0.6.5/build/lib/libspl/include/sys/dkioc_free_util.h

Symlinked one of the "search" locations to dkioc_free_util.h but it started throwing other errors:

In file included from /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36:
/usr/src/kernels/2.6.32-573.7.1.el6.x86_64/arch/x86/include/sys/dkioc_free_util.h:25: error: expected ')' before '*' token

Attempted to solve those but to no avil.

It seems that dkioc_free_util.h is added by the "ntrim" branch as the dfl_free function is referenced in module/zfs/zio.c, module/zfs/vdev_raidz.c where trim is mentioned.

I really hope that this is the correct place to post this issue as it is directly related with this merge request.

Let me know if there is anything else I can assist with.

The new spa_unload() code add as part of "OpenZFS 7303 - dynamic metaslab
selection" (4e21fd0) would cause in-flight trim zio to fail.  This patch
makes sure each metaslab is finished trimming before removing it during
metaslab shutdown.
@dweeezil
Copy link
Contributor Author

The latest commit, c7654b5, is rebased on a current master. Also, I've re-worded a bunch of the documentation in the spirit of @ahrens' suggestions. This should also fix panicking upon export which started occurring due to 4e21fd0.

@sempervictus
Copy link
Contributor

sempervictus commented Feb 28, 2017

Looks like we have a memory leak in the zpool trim command.
Reproducer showing different bytes are leaked when run without permissions:

#/bin/env ruby
require 'open3'
poolname = ARGV[0] || "tank"
while true
  Open3.capture3("zpool trim #{poolname}").tap {|o,e,i| puts e.index(':').to_s + ' bytes: ' + e[0..(e.index(':') -1)].each_byte.map { |b| b.to_s(16) }.join }
end

This is off the current revision, on Arch Linux in a Grsec/PAX environment using --with-pic=yes.
Out of 4287 executions, 4133 produce unique data...

@dweeezil
Copy link
Contributor Author

@sempervictus The patch in 6c9f7af should fix this.

@sempervictus
Copy link
Contributor

@dweeezil: thanks, will add in to next set. I've got this running on the current test stack and seeing some decent numbers for ZVOL performance atop and SSD with autotrim. If all goes well and it doesnt eat my data, i'll get this on some 10+-disk VDEV hardware soon enough. Any specific rough edges i should be testing?

@skiselkov
Copy link
Contributor

Hey guys, just a heads up that the upstream PR has been significantly updated.

  • trim zios are now processed via vdev_queue.c and limited to at most 10 executing at the same time per vdev.
  • individual trim commands are capped at 262144 sectors (or 128MB) - recommendation from FreeBSD.
  • The rate limiter in manual trimming is now much finer, working now in at most 128MB increments, rather than one-metaslab-at-a-time.
  • The zfs_trim tunable is now examined later, just before issuing out trim commands from vdev_file/vdev_disk. This allows testing the entire pipeline up to this point.
  • I've built in changes suggested by Matt Ahrens regarding range_tree_clear and range_tree_contains.
  • The default minimum extent size to be trimmed reduced to 128k to catch smaller blocks in more fragmented metaslabs.

The most significant departure from what we have in-house at Nexenta is the zio queueing and the manual trim rate limiting. The remaining parts are largely conserved and we've been running them in production for over a year now.

@skiselkov
Copy link
Contributor

@dweeezil I'd really appreciate it if you could find time to drop by the OpenZFS PR for this and give it a look over: openzfs/openzfs#172
Sadly, I'm kinda short on reviewers. We want to share this with everybody, but if nobody steps up to the plate, then there's nothing we can do.

@dweeezil
Copy link
Contributor Author

@skiselkov Thanks for the poke. I'm definitely planning on going over the OpenZFS PR and also getting this one refreshed to match.

@skiselkov
Copy link
Contributor

@dweeezil Thanks, appreciate it a lot!

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 19, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 20, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
@dweeezil
Copy link
Contributor Author

This PR has gotten way too long to comfortably deal with in Github. I've just done a complete refresh of the TRIM patch stack based on the upstream PR for OpenZFS and rebased to a current ZoL master. Once I've done some testing of the new stack, this PR will be closed and will be replaced with a new one.

In the mean time, the soon-to-be-posted-PR is in dweeezil:ntrim-next-2.

@skiselkov Once I do some testing and post the new PR, I'll finally be able to start reviewing the OpenZFS PR. I tried to keep as many notes as I can as to some of the issues I've had to deal with which might be applicable upstream.

@skiselkov
Copy link
Contributor

@dweeezil Thank you, appreciate it.

dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 20, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 22, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 22, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 23, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Mar 24, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
@dweeezil
Copy link
Contributor Author

Replaced with #5925.

@dweeezil dweeezil closed this Mar 25, 2017
tuomari pushed a commit to tuomari/zfs that referenced this pull request Mar 30, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 2, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 2, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 7, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 8, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
dweeezil added a commit to dweeezil/zfs that referenced this pull request Apr 14, 2017
The original implementation could overestimate the physical size
for raidz2 and raidz3 and cause too much trimming.  Update with the
implementation provided by @ironMann in openzfs#3656.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.