TRIM/Discard support from Nexenta #3656
Conversation
Hi @dweeezil, FYI: `zpool trim rpool` triggers the following warning:

    [ 61.844048] Large kmem_alloc(101976, 0x1000), please file an issue at:
module/zfs/zio.c (outdated diff context):

    return (sub_pio);

    num_exts = avl_numnodes(&tree->rt_root);
    dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
We probably need to change this to `vmem_zalloc()` to address the issue that @edillmann reported due to the way that we implemented `kmem_zalloc()`. It should be ifdefed to Linux because the O3X port that will likely merge this code is using the real `kmem_zalloc()`.
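A minimal sketch of what that ifdef might look like, reusing the `num_exts`/`DFL_SZ` context from the hunk above; the `__linux__` guard and the comments are illustrative, not the actual patch:

```c
/*
 * Hypothetical sketch: on Linux, route the potentially large
 * dkioc_free_list_t allocation through vmem_zalloc(), since the SPL's
 * kmem_zalloc() is kmalloc-backed and warns on large requests; other
 * platforms (e.g. the O3X port) keep the real kmem_zalloc().
 */
#if defined(__linux__)
	dfl = vmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
#else
	dfl = kmem_zalloc(DFL_SZ(num_exts), KM_SLEEP);
#endif
```

The corresponding free path would need the matching `vmem_free()`/`kmem_free()` call with the same `DFL_SZ(num_exts)` size.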
I think I'm going to rework `zio_trim()` as well as the other consumers of the dkioc free lists to work with a linked list rather than an array. This would allow us to avoid the large allocations but would cause the code to diverge a bit from the upstream. Another benefit is that we'd avoid the double allocation and copy which typically occurs in `zio_trim()` when `zfs_trim_min_ext_sz` is set to a large value.
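A rough sketch of the linked-list shape being described, assuming the standard illumos/ZFS `list_t` API; the structure and field names are illustrative only, not code from the patch:

```c
/*
 * Hypothetical per-zio trim extent list: each extent is a small,
 * separately allocated node, so no single large array allocation is
 * needed and no SPL large-kmem_alloc warning is triggered.
 */
typedef struct trim_ext {
	uint64_t	te_start;	/* byte offset of the extent */
	uint64_t	te_length;	/* length of the extent in bytes */
	list_node_t	te_node;	/* linkage into the per-zio list */
} trim_ext_t;
```

Extents would be appended one at a time while walking the range tree, and extents smaller than `zfs_trim_min_ext_sz` could simply be skipped instead of being copied into a second, size-filtered array.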
Force-pushed from 207073d to 844ce70.
@dweeezil: sorry to be a pain, but curious to know the status on this - we've got a few SSD-only pools to play with for a few days before we stuff 'em into prod (making sure our hardware doesn't screw us), so we can do a bit of testing on this without losing production data if you happen to have some test paths for us to run through. Thanks
@sempervictus I'm actively looking for feedback. The patch does need to be refreshed against a current master codebase, which I'll try to do today. There's a bit of interference with the recent zvol improvements. In my own testing, the patch does appear to work properly, although the behavior of the TRIM "batching" needs a bit better documentation and, possibly, a slightly different implementation (IIRC, one or both of the parameters only take effect upon module load and/or pool import). I'd also like to add some kstats to help monitor its behavior. I've used the on-demand TRIM quite a bit and it seems to work perfectly. You can TRIM a pool with the `zpool trim` command. There's also a backport to 0.6.4.2 in a branch named "ntrim-0.6.4.2" ("ntrim-0.6.4.1" for SPL).
As soon as this is updated to reflect changes in master we'll add it to our stack. One potential caveat is that we generally utilize dm-crypt with the discard option at mount time. Any thoughts on potential side effects from this? Has this sort of setup been tested in any way?
Force-pushed from fd80424 to 48681eb.
Hi @dweeezil, just to let you know I have been running this pull request since it was released, and besides the kmem_alloc warning, I did not see any problem or corruption on my test zpool (dual SSD mirror). The system has been crunching video camera recordings for 2 months :-) Is there any hope of having it rebased on master?
Tried to throw this into our stack today and noticed it has some conflicts with ABD in the raidz code. Rumor has it that should be merged "soon after the 0.6.5 tag", so I'm hoping by the next rebase it'll be in there (nudge @behlendorf) :).
@dweeezil I didn't see it was already rebased, thanks.
So SATA TRIM is currently not supported, according to the comments in the source, or is this handled properly by SPL?
@Mic92 SATA TRIM works just fine and I've tested it plenty. The documentation is still the original from Illumos. TRIM will work on any block device vdev supporting BLKDISCARD or any file vdev on which the containing filesystem supports fallocate hole punching.
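For anyone curious what those two mechanisms look like from user space, here is a small stand-alone C sketch (not ZFS code) that issues a BLKDISCARD on a block device or punches a hole in a regular file; the path, offset, and length are placeholders, and it should only be pointed at a scratch device or file:

```c
/* gcc -o discard_demo discard_demo.c  --  Linux only; destroys data in the given range! */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>       /* BLKDISCARD */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/tmp/scratch.img"; /* placeholder */
	uint64_t range[2] = { 0, 1024 * 1024 };	/* offset, length: first 1 MiB */
	struct stat st;
	int fd = open(path, O_RDWR);

	if (fd < 0 || fstat(fd, &st) != 0) {
		perror("open/fstat");
		return (1);
	}
	if (S_ISBLK(st.st_mode)) {
		/* Block device vdev case: BLKDISCARD takes an {offset, length} pair. */
		if (ioctl(fd, BLKDISCARD, &range) != 0)
			perror("BLKDISCARD");
	} else {
		/* File vdev case: punch a hole without changing the file size. */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		    (off_t)range[0], (off_t)range[1]) != 0)
			perror("fallocate(PUNCH_HOLE)");
	}
	close(fd);
	return (0);
}
```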
@dweeezil - I would love to test this patch, but it conflicts with the ABD branch (pull 3441) in vdev_raidz.c. I have the .rej files from both sides (patched ABD first then this patch, and then patched this patch first then ABD) if that helps at all. Much appreciated.
Hey guys, I wanted to get your take on the latest submission on this that we're trying to get upstreamed from Nexenta. I'd primarily like to make the bottom end of the ZFS portion more accommodating to Linux & FreeBSD.
@greg-hydrogen I tried transplanting the relevant commits onto ABD a while ago and, other than the bio argument issues, the other main conflict is the logging I added to discards on zvols. The vdev conflicts you likely ran into are pretty easy to fix. I'll try to get an ABD-based version of this working within the next few days. @skiselkov I'll check it out. It looks to be a port of the same Nexenta code in this pull request, correct?
@dweeezil It is indeed, with some minor updates & fixes.
Can't compile it on CentOS 6.7. I attempted to install spl-0.6.5.3 from GitHub, using the part "If, instead you would like to use the GIT version, use the following commands instead:" from http://zfsonlinux.org/generic-rpm.html.

On the last step, `make rpm-utils rpm-dkms` fails with:

    Preparing...  ########################################### [100%]
    Deleting module version: 0.6.5 completely from the DKMS tree. Done.

The log says:

    CC [M] /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.o

and strace shows:

    [pid 20344] open("include/sys/dkioc_free_util.h", O_RDONLY|O_NOCTTY) = -1 ENOENT (No such file or directory)

The files are where `find / -name dkioc_free_util.h` reports them. I symlinked one of the "search" locations to dkioc_free_util.h, but it started throwing other errors:

    In file included from /var/lib/dkms/zfs/0.6.5/build/module/zfs/vdev_disk.c:36:

I attempted to solve those, but to no avail. It seems that dkioc_free_util.h is added by the "ntrim" branch, as the dfl_free function is referenced in module/zfs/zio.c and module/zfs/vdev_raidz.c where trim is mentioned. I really hope that this is the correct place to post this issue as it is directly related to this merge request. Let me know if there is anything else I can assist with.
The new spa_unload() code added as part of "OpenZFS 7303 - dynamic metaslab selection" (4e21fd0) would cause in-flight trim zios to fail. This patch makes sure each metaslab is finished trimming before removing it during metaslab shutdown.
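As a hedged illustration of the idea in that commit message (not the actual patch), the wait could look roughly like this, assuming a hypothetical per-metaslab counter and condvar that track in-flight trim zios:

```c
/*
 * Hypothetical sketch only: ms_trimming_zios and ms_trim_cv are made-up
 * names standing in for whatever the patch actually uses to track
 * in-flight trim I/O against a metaslab.
 */
static void
metaslab_trim_wait(metaslab_t *msp)
{
	mutex_enter(&msp->ms_lock);
	while (msp->ms_trimming_zios != 0)
		cv_wait(&msp->ms_trim_cv, &msp->ms_lock);
	mutex_exit(&msp->ms_lock);
}
```

The metaslab shutdown path would call something like this before tearing the metaslab down, so a trim zio never completes against a metaslab that has already been freed.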
Looks like we have a memory leak in the zpool trim command.
This is off the current revision, on Arch Linux in a Grsec/PAX environment using --with-pic=yes.
@sempervictus The patch in 6c9f7af should fix this.
@dweeezil: thanks, will add it in to the next set. I've got this running on the current test stack and am seeing some decent numbers for ZVOL performance atop an SSD with autotrim. If all goes well and it doesn't eat my data, I'll get this on some 10+-disk VDEV hardware soon enough. Any specific rough edges I should be testing?
Hey guys, just a heads up that the upstream PR has been significantly updated.
The most significant departure from what we have in-house at Nexenta is the zio queueing and the manual trim rate limiting. The remaining parts are largely conserved and we've been running them in production for over a year now.
@dweeezil I'd really appreciate it if you could find time to drop by the OpenZFS PR for this and give it a look over: openzfs/openzfs#172
@skiselkov Thanks for the poke. I'm definitely planning on going over the OpenZFS PR and also getting this one refreshed to match.
@dweeezil Thanks, appreciate it a lot!
The original implementation could overestimate the physical size for raidz2 and raidz3 and cause too much trimming. Update with the implementation provided by @ironMann in openzfs#3656.
This PR has gotten way too long to comfortably deal with in GitHub. I've just done a complete refresh of the TRIM patch stack based on the upstream PR for OpenZFS and rebased it to a current ZoL master. Once I've done some testing of the new stack, this PR will be closed and replaced with a new one. In the meantime, the soon-to-be-posted PR is in dweeezil:ntrim-next-2. @skiselkov Once I do some testing and post the new PR, I'll finally be able to start reviewing the OpenZFS PR. I tried to keep as many notes as I could on the issues I've had to deal with which might be applicable upstream.
@dweeezil Thank you, appreciate it.
Replaced with #5925.
This patch stack includes Nexenta's support for TRIM/Discard on disk and file vdevs as well as an update to the dkio headers for appropriate Solaris compatibility. It requires the current https://github.com/dweeezil/spl/tree/ntrim patch in order to compile properly.
The usual disclaimers apply at this point: I've performed moderate testing with ext4-backed file vdevs and light testing with SSD-backed disk vdevs and it appears to work properly. Use at your own risk. It may DESTROY YOUR DATA! I'm posting the pull request because it seems to work during initial testing and I'd like the buildbots to get a chance at it (which I'm expecting to fail unless they use the corresponding SPL code).
The initial TRIM support (currently in commit 719301c) caused frequent deadlocks in ztest due to the SCL_ALL spa locking during the trim operations. The follow-on patch to support on-demand trim changed the locking scheme and I'm no longer seeing deadlocks with either ztest or normal operation.
The final commit (currently 9e5cfd7) adds ZIL logging for zvol trim operations. This code was mostly borrowed from an older Nexenta patch (referenced in the commit log) and has been merged into the existing zvol trim function.
In order to enable the feature, you must use `zpool set autotrim=on` on the pool, and the `zfs_trim` module parameter must be set to 1 (which is its default value). The `zfs_trim` parameter controls the lower-level vdev trimming whereas the pool property controls it at a higher level. By default, trims are batched and only applied every 32 transaction groups, as controlled by the new `zfs_txgs_per_trim` parameter. This allows `zpool import -T` to continue to be useful. Finally, by default, only regions of at least 1MiB are trimmed, as set by the `zfs_trim_min_ext_sz` module parameter.