
TRIM/UNMAP/DISCARD support for vdevs #924

Closed
wants to merge 43 commits

Conversation

dechamps
Contributor

@dechamps dechamps commented Sep 1, 2012

NOTE : this pull request has been replaced by #1016.

This adds TRIM (a.k.a UNMAP, DISCARD, hole punching) support for vdevs. The original patch is from Pawel Jakub Dawidek ([email protected]) who wrote it for FreeBSD. I'm just porting it to ZFS On Linux. Original discussion is in #598. Note that you need openzfs/spl#168 to build this.

Clean branch (caution: history may be rewritten): https://github.com/dechamps/zfs/compare/trim-clean

This code is largely untested (especially the Linux-specific parts) and carries a high risk of data corruption since it deals with deletion of data. DO NOT APPLY THIS PATCH IF YOU CARE ABOUT YOUR DATA. USE AT YOUR OWN RISK. EXPECT DATA LOSS.

@behlendorf: I'm not asking you to merge this. At least not right now. I'm just opening a pull request to give the patch more visibility and get more testing. As usual, as soon as it's ready and has received enough testing for your taste, I'll clean it up and open another, final pull request.

Features

  • Automatic, fully transparent TRIM on data, slog and cache vdevs. TRIM requests are not issued immediately, so as not to break pool disaster recovery, and are batched for optimal performance.
  • Whole vdev TRIM on zpool create/add/attach (similar to what mke2fs does).
  • Whole vdev TRIM on cache vdevs on export/remove/destroy.
  • Supports disk vdevs using the standard Linux DISCARD interface (which translates to ATA TRIM, SCSI UNMAP, or any other method supported by the vdev block device driver).
  • Supports file vdevs using the Linux file hole punching interfaces.

Current status

  • ztest was OK, but was broken by some of the latest optimizations.
  • zconfig.sh passes with flying colors starting from dccbfaf.
  • zfault.sh passes with flying colors starting from dccbfaf.
  • This patch is feature-complete since 4c1f8e1. The only work that remains is testing/bug hunting.

Usage instructions

Note that the new feature is disabled by default (zfs_notrim=1); to use it, you have to load the module with zfs_notrim=0 explicitly. For TRIM to work, you need vdevs that support it, e.g. SSDs. If you don't have any, you can still test using a Linux RAM block device (the rd module), as it supports discards. If you choose to test with rd, make sure to use ashift=12, or you will trigger a corruption bug in rd.
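
For illustration, a complete test setup might look like the following. This is just a sketch, not part of the patch: the pool name, ramdisk size and device node are arbitrary, on recent kernels the RAM disk driver is called brd rather than rd (rd_size is in KiB), and it assumes a ZFS build that accepts the ashift pool property at creation time.

# modprobe brd rd_size=1048576
# modprobe zfs zfs_notrim=0 trim_txg_limit=1
# zpool create -o ashift=12 trimtest /dev/ram0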

There is a module parameter (trim_txg_limit) that specifies after how many TXGs the freed data is actually discarded on the device (default 32). The algorithm doesn't discard data immediately so that pool disaster recovery (i.e. zpool import -F) doesn't break. I suggest you test both with the default and with a low value (e.g. 1) to make the algorithm more aggressive, which might trigger different bugs.

There is a module parameter (trim_txg_batch) that specifies how many TXGs are squashed together in order to batch (merge) TRIM requests (default 32). The optimal value heavily depends on your workload and the performance characteristics of your physical vdevs.

Module parameters trim_l2arc_limit and trim_l2arc_batch work the same way but are specific to cache devices. They are expressed in seconds instead of TXGs.
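
If the module is already loaded, the current values of these tunables can be read (and, where they are exported writable, changed) through the usual module parameter interface. This is only an illustration; which of the trim_* parameters are writable at runtime isn't documented here, and zfs_notrim itself deliberately cannot be changed once the module is loaded.

# cat /sys/module/zfs/parameters/trim_txg_limit
# cat /sys/module/zfs/parameters/trim_txg_batch
# echo 16 > /sys/module/zfs/parameters/trim_txg_batch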

There is a module parameter (zfs_trim_zero) that, when enabled, makes sure TRIMmed data is zeroed. Note that for disk vdevs, this effectively means that TRIM is disabled and replaced by "write zeroes". This mode allows ztest to test TRIM even on filesystems that don't support file hole punching. Outside ztest, it can also be used to diagnose corruption issues more reliably. Note that, for obvious reasons, performance greatly suffers in this mode.

The patch introduces a new kstat page in /proc/spl/kstat/zfs/zio_trim. It allows you to quickly check if TRIM is actually working or not. Keep in mind that if you're using a high trim_txg_limit (e.g. 64), then it may take a while before TRIM requests are actually issued.
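
For example, assuming the trimtest pool from the sketch above and a low trim_txg_limit, something like this should make the zio_trim counters increase within a few TXGs of the rm (the exact kstat fields depend on the patch revision):

# dd if=/dev/urandom of=/trimtest/junk bs=1M count=64
# rm /trimtest/junk && sync
# cat /proc/spl/kstat/zfs/zio_trim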

TRIM support on file vdevs (a.k.a file hole punching)

There are a few prerequisites for file (not disk) vdev support using Linux file hole punching capabilities:

  • The underlying filesystem (where the file vdevs are stored) must support file hole punching. Currently, only OCFS2, XFS and ext4 support it. tmpfs supports it via the truncate_range interface, so it should work (albeit only from kernel space). Ironically, ZFS itself doesn't (yet).
  • Of course, zfs_notrim still needs to be set to 0, no matter what the vdev type is.
  • In the case of ZFS in userspace (e.g. ztest), be sure that your installed Linux headers (i.e. /usr/include/linux) are up to date (>= 2.6.38). Also, as mentioned above, if you're using ztest, be sure to run it on a filesystem that supports file hole punching (tmpfs doesn't). Note that you can't change parameters using ztest, so you'll have to hack trim_map.c to manually set zfs_notrim=0 by default and rebuild.

A good way to be sure that TRIM support on file vdev works is to do something like this with zfs_notrim=0 and trim_txg_limit=1, assuming /store is stored on ext4 or another filesystem with file hole punching support:

# dd if=/dev/zero of=/store/pool bs=1 count=1 seek=1024M
# zpool create punchhole /store/pool
# du -h /store/pool
1.2M    /store/pool
# dd if=/dev/urandom of=/punchhole/foo bs=1M count=100
# du -h /store/pool
102M    /store/pool
# rm /punchhole/foo
# du -h /store/pool
1.7M    /store/pool

Again, this section only applies to file vdevs, not disk (block device) vdevs. Most users have no use for TRIM support on file vdevs.
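
If you only want to check whether the backing filesystem supports hole punching at all, independently of ZFS, a recent util-linux fallocate(1) can be used directly. A rough sketch (the file name and sizes are arbitrary and the du figures are only indicative; on a filesystem without hole punching support, the fallocate call fails with "Operation not supported"):

# dd if=/dev/zero of=/store/punchtest bs=1M count=16
# du -h /store/punchtest
16M     /store/punchtest
# fallocate --punch-hole --keep-size --offset 0 --length 8M /store/punchtest
# du -h /store/punchtest
8.0M    /store/punchtest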

TODO list

  • Extensive testing for stability and data integrity

Note that I have no idea if I'll have enough time to see this whole thing through, so if you want to take care of some items of the TODO list, be my guest.

This adds TRIM (a.k.a UNMAP, DISCARD, hole punching) support for disk
vdevs. The original patch is from Pawel Jakub Dawidek
<[email protected]> who wrote it for FreeBSD. Etienne Dechamps
<[email protected]> ported it to ZFS On Linux.

The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many TXGs we should wait before TRIMming freed blocks (64 by
default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case requests are reordered and the write reaches the disk before the
TRIM. We don't have to do the same for in-flight writes, as colliding
writes just remove ranges to TRIM.

Most of the code stayed unchanged during the porting to Linux. The only
big change is in the vdev disk module, since the FreeBSD and Linux
interfaces for issuing discards to block devices are quite different.
On FreeBSD it seems that issuing a DELETE request of any size is
sufficient; on Linux we have to be careful not to exceed the maximum
discard limits. That's why we introduce a new vdev_disk_io_trim()
function inspired by the Linux blkdev_issue_discard() function and the
pre-existing vdev_disk_physio() function. The new function takes care
of splitting discard requests into smaller ones if necessary.

In theory, the code should work for main pool disk vdevs, slog disk
vdevs, L2ARC disk vdevs, and supports mirror and raidz. File vdevs are
not supported yet.

Note that the new feature is disabled by default (zfs_notrim=1). To use
it, you have to explicitly set the module parameter "zfs_notrim" to "0".
Be aware that this code is largely untested and carries a high risk of
data corruption. Use at your own risk and expect data loss.
@dechamps
Contributor Author

dechamps commented Sep 1, 2012

Note: you need dechamps/spl@299d642 for the build to work.

This warning appeared in cc6cd40. In
practice there's no code path where "flags" may be uninitialized, but
it seems GCC is not smart enough to notice it.
Pawel Jakub Dawidek, author of most of
cc6cd40, noticed a mistake in zio
which could lead to a NULL pointer dereference with raidz. Pawel
suggested this patch to fix it.
@cwedgwood
Contributor

Wouldn't it make more sense to have zfs_trim=0 as the default? (Having an option that defaults to enabled in order NOT to do something seems like a double negative to me.)

@dechamps
Contributor Author

dechamps commented Sep 2, 2012

Yes, it would. In the general case I would agree with you, but:

  • I'm trying to stay as close as possible to Pawel's original patch.
  • When (if) it gets merged, it will be enabled by default (I think; in the end it's Brian's call), and zfs_notrim is consistent with how other ZFS tunables are named (like zfs_nocacheflush).

Personally I really don't like options that begin with the word "no", but in this case I'm making an exception.

This adds TRIM support for file vdevs, using the newly-added VOP_SPACE
compatibility interface which translates to Linux
fallocate(FALLOC_FL_PUNCH_HOLE). When a TRIM request is issued to a
file vdev, ZFS will "punch a hole" (deallocate space) inside the
underlying file. Note that few Linux filesystems support hole punching
(currently OCFS2, XFS, and ext4).

Note that for this to work, you need to build against a kernel that
supports file hole punching (i.e. Linux >= 2.6.38). In addition, for
this to work in userspace (e.g. ztest), the installed userspace
kernel headers (i.e. /usr/include/linux) must be up-to-date.
@dechamps
Contributor Author

dechamps commented Sep 3, 2012

Support for file vdevs is here, using the Linux file hole punching capabilities (deallocating space inside files). Note that as of 4e789e1 you need openzfs/spl#168 to build.

Note that there are a few prerequisites for this to work:

  • You need Linux 2.6.38 or greater for file hole punching support. This would technically be possible with somewhat older kernels that implement the truncate_range inode operation, but I don't use it. I'll probably add it if someone has a need for it. If your kernel doesn't implement the needed interfaces, the build will complete but file hole punching will be disabled.
  • The underlying file systems (where the file vdevs are stored) need to support file hole punching. Currently, only OCFS2, XFS and ext4 support it. Ironically, ZFS itself doesn't (yet).
  • Of course, zfs_notrim still needs to be set to 0, no matter what the vdev type is.
  • In the case of ZFS in userspace (e.g. ztest), be sure that your installed Linux headers (i.e. /usr/include/linux) are up to date (>= 2.6.38). Also, as mentioned above, if you're using ztest, be sure to run it on a filesystem that supports file hole punching (tmpfs doesn't). Note that you can't change parameters using ztest, so you'll have to hack trim_map.c to manually set zfs_notrim=0 by default and rebuild.

You can confirm that you're building with kernel file hole punching support if the SPL configure script indicates that FALLOC_FL_PUNCH_HOLE is defined. You can confirm that you're building with userspace file hole punching support (only useful for e.g. ztest) if the ZFS configure script indicates that FALLOC_FL_PUNCH_HOLE is defined in userspace.
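
If you want to double-check the configure results after the fact, the generated autoconf headers can be grepped for the relevant define. This is only a sketch: it assumes the usual spl_config.h and zfs_config.h headers in the respective build trees, and the exact macro names depend on the configure checks.

# grep -i punch_hole /path/to/spl/spl_config.h
# grep -i punch_hole /path/to/zfs/zfs_config.h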

A good way to be sure that TRIM support on file vdev works is to do something like this with zfs_notrim=0 and trim_txg_limit=1, assuming /store is stored on ext4 or another filesystem with file hole punching support:

# dd if=/dev/zero of=/store/pool bs=1 count=1 seek=1024M
# zpool create punchhole /store/pool
# du -h /store/pool
1.2M    /store/pool
# dd if=/dev/urandom of=/punchhole/foo bs=1M count=100
# du -h /store/pool
102M    /store/pool
# rm /punchhole/foo
# du -h /store/pool
1.7M    /store/pool

When punching holes in file vdevs from userspace using VOP_SPACE(),
errors are not reported correctly to the caller. This patch fixes that.
In cc6cd40, vdev_disk_io_trim()
returns ENOTSUP if DISCARD isn't supported. This isn't consistent with
the rest of the code, which uses EOPNOTSUPP.
The current code doesn't expect the value of zfs_notrim to change under
its nose, so don't allow it.
ztest runs with TRIM enabled on a backing filesystem that doesn't
support hole punching revealed the following race condition:

 1. A write is issued to a vdev and added to the TRIM map inflight
    writes tree.
 2. The very first TRIM request for this vdev is issued.
 3. The TRIM request fails with EOPNOTSUPP, as expected. vdev_notrim
    is set.
 4. The write completes. trim_map_write_done() is called.
 5. trim_map_write_done() sees that vdev_notrim is set, and returns
    immediately.
 6. As a result, the write stays in the TRIM map inflight writes tree
    forever.
 7. When the vdev is closed, trim_map_destroy() tries to destroy the
    inflight writes tree, which triggers an assertion failure because
    it isn't empty.

This patch fixes the issue by removing the check on vdev_notrim in
trim_map_write_done().
ztest runs with TRIM enabled revealed a race condition which occurs when
a vdev close is pending while there are TRIM requests in flight. When
the TRIM ZIOs complete, the SPA lock is released. The SPA lock is then
immediately taken by the vdev close path, which calls trim_map_destroy()
before trim_map_vdev_commit_done() has a chance to run.
trim_map_destroy() then tries to destroy the inflight free tree, which
fails with an assertion failure because it is non-empty.

This patch fixes the issue by making trim_map_destroy() call
trim_map_vdev_commit_done() before destroying the TRIM map.
@dechamps
Contributor Author

dechamps commented Sep 7, 2012

Bad news: even with the recent fixes, ztest runs always end up corrupting the pool after a few seconds. In other words, the code probably isn't safe in its current state. Debugging this won't be easy.

Currently, vdev_notrim is set to 0 after the actual vdev has been
opened (e.g. in vdev_disk_open()). This defeats the purpose of setting
vdev_notrim in vdev_disk_open(), so don't do it.
@dechamps
Contributor Author

ztest -r 1 works, so the corruption issue seems to be raidz-specific. That's good news since there's not much code to debug in vdev_raidz.c.

@dechamps
Contributor Author

After some debugging, it seems that the corruption is caused by TRIMming valid data: ztest tries to re-read data that was previously trimmed, and of course fails.

Further investigation seems to indicate that the blocks being corrupted are always gang blocks, although I'm not 100% sure. In addition, I don't understand what the connection with RAID-Z is - maybe the issue exists for other types of vdev but it only surfaces with RAID-Z, I don't know.

Since this seems related to gang blocks, I took a closer look at a BP_IS_GANG() test in zio_vdev_io_start(), which doesn't register the free in the trim map if we're dealing with a gang block. However, my tests indicate that zio->io_bp is always NULL when it's being tested, so the code never actually tests for BP_IS_GANG(). I think that's normal considering we're in the ZIO for the leaf vdev, so we don't have access to BP information, which is only available in the top-level vdev ZIO. If that is indeed the issue, we need a way to stop the FREE ZIO for the gang block in its tracks before it gets down to physical vdevs.

Anyway, I still don't understand why we have to create a special case for gang blocks. Apparently, ztest tries to read a gang block after it has been freed, which doesn't make much sense to me.

@behlendorf
Contributor

@dechamps I don't have the time right now to help dig in to any of this, but it might be worth striking up a conversation with Pawel. These seems like flaws that need to be corrected in the original patch so he may be interested.

@dechamps
Contributor Author

Well, actually, I'm keeping him informed of all the developments in this thread, including the race conditions I found earlier. For some reason he doesn't seem to want to comment on GitHub directly, but he does answer my emails. Or at least he did - I last heard from him 10 days ago. Maybe he's on vacation or something. Anyway, I'm still investigating, with or without his help.

@dechamps
Contributor Author

Confirmed: if I test for ZIO_TYPE_FREE && BP_IS_GANG() in vdev_raidz_io_start() and skip the I/O if true, then ztest succeeds. I still don't understand why, though.

@dechamps
Contributor Author

Found it!

vdev_raidz_io_start(WRITE,   0,   49524736,   512): DVA[0]=<0:2f3b000:15c00> DVA[1]=<0:5ce800:400> [L0 other uint64[]] fletcher4 uncompressed LE gang unique single size=10000L/10000P birth=17L/17P fill=1 cksum=222b711e61:84a30879a8dfc:172367601977866:7b2aae1fdc4a9bd6
vdev_raidz_io_start(WRITE,   0,    6088704,   512): DVA[0]=<0:2f3b000:15c00> DVA[1]=<0:5ce800:400> [L0 other uint64[]] fletcher4 uncompressed LE gang unique single size=10000L/10000P birth=17L/17P fill=1 cksum=222b711e61:84a30879a8dfc:172367601977866:7b2aae1fdc4a9bd6

...

vdev_raidz_io_start(FREE ,   0,   49524736, 65536): DVA[0]=<0:2f3b000:15c00> DVA[1]=<0:5ce800:400> [L0 other uint64[]] fletcher4 uncompressed LE gang unique single size=10000L/10000P birth=17L/17P fill=1 cksum=222b711e61:84a30879a8dfc:172367601977866:7b2aae1fdc4a9bd6
vdev_raidz_io_start(FREE ,   0,    6088704, 65536): DVA[0]=<0:2f3b000:15c00> DVA[1]=<0:5ce800:400> [L0 other uint64[]] fletcher4 uncompressed LE gang unique single size=10000L/10000P birth=17L/17P fill=1 cksum=222b711e61:84a30879a8dfc:172367601977866:7b2aae1fdc4a9bd6

As you can see, when the gang block leader is written (or read), zio->io_size is 512, which is the physical size of the block, so all is good. However, when the gang block is freed, zio->io_size is 65536, which seems to be the logical size of the gang block data. The reason lies within zio_free_sync(), which is called by zio_free_gang():

zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp), ...)

The PSIZE of a gang block leader is the logical size of the data, not the physical size of the indirect block, so that's incorrect. It didn't cause any issues until now because metaslab_free_dva() has code to handle this special case:

if (DVA_GET_GANG(dva))
    size = vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE);

Unfortunately, the TRIM code wasn't doing that, so it was issuing TRIM requests with incorrect sizes.

I think the reason this only happens with RAID-Z is that, when testing with other vdev types, we get lucky and the oversized TRIM request doesn't actually touch other valid data. However, if my analysis is correct, then this bug could be a potential corruption hazard with any vdev type.

The solution is to make sure zio->io_size is correct for "gang header free" ZIOs. That's what 19afab1 does.

Now, the good news is, we're passing ztest with trim_txg_limit=64. The bad news is, ztest still corrupts the pool with low trim_txg_limit values (e.g. 8), so we still have some debugging ahead of us. But at least we're getting somewhere.

Currently, the PSIZE of the block pointer is used as the I/O size for
all free ZIOs. This is incorrect for gang header blocks, for which
PSIZE is the logical size of the gang block data.

This patch fixes the issue by using SPA_GANGBLOCKSIZE as the I/O size
for "free gang header block" ZIOs.
This test is useless since 19afab1,
which fixed gang block issues. Besides, the original code wasn't even
working properly since io_bp is always NULL on a physical vdev ZIO.
In eefbe9b, we fixed the I/O size for
"free gang header block" ZIOs. However, the fix also included gang data
blocks, which are of variable size. Fix the fix by distinguishing
between gang header blocks and gang data blocks.
@dechamps
Contributor Author

Here's an excerpt from a debugging session during a ztest which failed because of the remaining corruption issue: https://gist.github.com/3706822

Apparently, ZFS writes an "L0 DMU objset" block, then frees it, then tries to read it, which obviously fails. Here we go again...

Currently, labels are preserved when TRIMming the cache vdev on zpool
destroy. This is not what we want.
@dechamps
Contributor Author

I'm setting up long ztest runs to investigate the assertion failure issue I reported some comments ago. Even a3adbc8 fails, so that's probably not a regression. I will be testing master overnight to make sure it's OK in order to have a solid point of reference.

Bottom line is: the branch doesn't pass a long (> 30 minutes) ztest run. It can't be considered stable yet.

@dechamps
Contributor Author

Of 63 instances of ztest -P 1200 -T 3600 on a3adbc8, 1 succeeded and 62 failed.

45 of the failed tests (73%) ended with the error message I started investigating some comments ago:

lt-ztest: ../../module/avl/avl.c:634: avl_add: Assertion `0' failed.

12 of the failed tests (19%) ended with an error message similar to the ones in #989 (issue in master).

3 tests failed with the following message:

ztest: failed to complete within 300 seconds of deadline

1 test failed with the following message:

zdb: can't open 'zfstest_2012-09-25_155308_12714': Input/output error

1 test failed with the following message:

../../lib/libzpool/kernel.c:272: Assertion `mp->m_magic == 0x9522f51362a6e326ull (0x0 == 0x9522f51362a6e326)' failed.

We need to fix #989 before I can continue ztesting. I need reliable results.

@dechamps
Contributor Author

dechamps commented Oct 1, 2012

I got a 100% failure rate over 185 three-hour-pass ztest runs on dccbfaf (and commits after that). 183 of them (99%) failed with the avl_add() assertion failure described earlier. Investigating.

@dechamps
Contributor Author

dechamps commented Oct 2, 2012

While investigating, I also noticed that trim_map_write_done() sometimes concludes that the ZIO isn't in the tree, so it doesn't remove it. This should never happen while ztesting because vdev_notrim is never true in this case.

After further investigation, I think that's normal. If an I/O is done on an inaccessible vdev, trim_map_write_start() isn't called because zio_vdev_io_start() returns prematurely (with ENXIO). When zio_vdev_io_done() is called, trim_map_write_done() is called with a write that isn't in the tree, as expected. This should be fine, at least in theory. So the avl_add() issue might not be related to this.

@dechamps
Contributor Author

dechamps commented Oct 2, 2012

Okay, now I'm getting somewhere. I noticed the following stack trace during one of my attempts to reproduce the issue:

#2  0x00007f8820160301 in *__GI___assert_fail (assertion=0x7f88210cf20e "0", file=<value optimized out>, line=634, 
    function=0x7f88210cf49a "avl_add") at assert.c:81
#3  0x00007f88210c8532 in avl_add (tree=0x3d6ce60, new_node=0x4d89a10) at ../../module/avl/avl.c:634
#4  0x00007f8820de3246 in trim_map_write_start (zio=0x4d89a10) at ../../module/zfs/trim_map.c:339
#5  0x00007f8820e1be60 in zio_vdev_io_start (zio=0x4d89a10) at ../../module/zfs/zio.c:2541
#6  0x00007f8820e1753d in __zio_execute (zio=0x4d89a10) at ../../module/zfs/zio.c:1332
#7  zio_nowait (zio=0x4d89a10) at ../../module/zfs/zio.c:1387
#8  0x00007f8820dfbd4e in vdev_raidz_io_done (zio=0x3a33a10) at ../../module/zfs/vdev_raidz.c:2143
#9  0x00007f8820e1c087 in zio_vdev_io_done (zio=0x3a33a10) at ../../module/zfs/zio.c:2590
#10 0x00007f8820e1ee7d in __zio_execute (zio=0xb2a7e0) at ../../module/zfs/zio.c:1332
#11 zio_notify_parent (zio=0xb2a7e0) at ../../module/zfs/zio.c:559
#12 zio_done (zio=0xb2a7e0) at ../../module/zfs/zio.c:3160
#13 0x00007f8820e16f86 in __zio_execute (zio=0xb2a7e0) at ../../module/zfs/zio.c:1332
#14 zio_execute (zio=0xb2a7e0) at ../../module/zfs/zio.c:1281
#15 0x00007f8820d4d3ad in taskq_thread (arg=0x7f8810019010) at ../../lib/libzpool/taskq.c:232
#16 0x00007f8820d47e49 in zk_thread_helper (arg=0x7f881001bab0) at ../../lib/libzpool/kernel.c:135

I also managed to locate the thread responsible for the conflicting ZIO, i.e. the one that is already in the trim map inflight writes tree which results in the assertion failure:

#3  0x00007f8820d48e1a in mutex_enter (mp=0x3d6cea8) at ../../lib/libzpool/kernel.c:276
#4  0x00007f8820de333f in trim_map_write_done (zio=0x7f87ec33e5e0) at ../../module/zfs/trim_map.c:366
#5  0x00007f8820e1bf69 in zio_vdev_io_done (zio=0x7f87ec33e5e0) at ../../module/zfs/zio.c:2563
#6  0x00007f8820e16f86 in __zio_execute (zio=0x7f87ec33e5e0) at ../../module/zfs/zio.c:1332
#7  zio_execute (zio=0x7f87ec33e5e0) at ../../module/zfs/zio.c:1281
#8  0x00007f8820d4d3ad in taskq_thread (arg=0x7f8810012cd0) at ../../lib/libzpool/taskq.c:232
#9  0x00007f8820d47e49 in zk_thread_helper (arg=0x7f881001d2d0) at ../../lib/libzpool/kernel.c:135

It resembles a race condition: the second thread was about to remove the write from the tree, but it had to wait for the trim map mutex, which is owned by the first thread adding its own conflicting write. Consequently, instead of removing then adding the write, the first thread tried to add the write before it had a chance to be removed.

There are interesting facts about these two writes:

  • The first write (the one that the first thread fails to add to the tree) is a self-healing, raidz repair I/O.
  • The second write (the one waiting for the mutex) is a DDT-related rewrite requested by ddt_sync() (more precisely ddt_repair_entry()). I can say this because the ZIO has the DDT_INHERIT and SPECULATIVE flags, and this seems to be the only way that can happen.

Both writes are targeting the same physical vdev, same offset, same size (hence the conflict).

One thing I don't understand is how such a situation can happen in the first place: if I interpret the traces correctly, it means we got two simultaneous ZIOs writing data at exactly the same offset on the same vdev. Isn't that frowned upon? Shouldn't it result in potential data corruption due to the unpredictable ordering of write requests? I would expect the first ZIO to wait until the second ZIO is done before issuing its write request.

@dechamps
Contributor Author

dechamps commented Oct 2, 2012

I got another occurrence. The traces are the same, except this time I got two conflicting ZIOs with resilver priority and the IO_REPAIR flag (without SELF_HEALING). So now I have two ZIOs trying to resilver the same block on the same vdev at the same time... WTF?

In the meantime, I'm trying a patch which makes TRIM ignore repair writes. It looks like it fixes the issue, although I need more ztesting to be sure. Ignoring repair writes should be safe considering that, by definition, resilvering/self-healing doesn't write blocks that are freed, so they cannot conflict with in-flight TRIM requests. At least that's the theory.

I'm going to ask the Illumos folks about this conflicting writes issue.

@ryao
Contributor

ryao commented Oct 3, 2012

@dechamps Are you developing with assertions enabled (i.e. --enable-debug)?

@dechamps
Contributor Author

dechamps commented Oct 3, 2012

Of course. How would I catch these assertion failures if I didn't?

The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure whether these conflicting writes are supposed to
happen or not, but in the meantime let's ignore repair writes. This
should be safe considering that, by definition, we never repair blocks
that are freed.
@dechamps
Contributor Author

dechamps commented Oct 3, 2012

Okay, according to my testing, 7be8b0c definitely fixes the avl_add() assertion failure bug. Great. 1-hour-pass ztests are now OK with dabccd1. The latest version, however, fails unusually often while ztesting. The failures are all related to mp->m_magic being corrupted, which is a known ztest issue (see #989), but the frequency seems significantly higher (7 out of 19, 37%), which seems to indicate that I somehow made things worse between dabccd1 and now (probably the L2ARC-related code). Not quite sure though.

This patch adds some improvements to the way the trim module considers
TXGs:

 - Free ZIOs are registered with the TXG from the ZIO itself, not the
   current SPA syncing TXG (which may be out of date);
 - L2ARC frees are registered with a zero TXG number, as the L2ARC has
   no concept of TXGs;
 - The TXG limit for issuing TRIMs is now computed from the last synced
   TXG, not the currently syncing TXG. Otherwise, under extremely
   unlikely race conditions, there is a risk we could trim blocks which
   have been freed in a TXG that has not finished syncing, resulting in
   potential data corruption in case of a crash.
Currently, the TRIM thread waits indefinitely until it is woken up by
spa_sync(). This triggers a blocked task warning in Linux if there is
no write activity on the pool for a long time.

This patch fixes the issue by making the TRIM thread wake up every
second to keep the kernel happy.
While the current code delays TRIMs by a number of TXGs specified by
the trim_txg_limit parameter, in most circumstances it only TRIMs frees
from one TXG at a time, i.e. it doesn't merge frees from adjacent TXGs.

This patch allows batching TRIMs from several consecutive TXGs in
order to merge more TRIM requests, thus reducing their number and
making them larger. This should significantly improve TRIM performance,
depending on your workload and how your physical vdevs handle TRIM
requests.

The batch size, expressed in TXGs, is specified by the new
trim_txg_batch parameter. It defaults to 32. Note the default for
trim_txg_limit has been reduced from 64 to 32 to keep the worst-case
trim delay constant.
@dechamps
Contributor Author

dechamps commented Oct 3, 2012

206dddd brings us TRIM TXG batching. This is an optimization aimed at reducing the number of TRIM requests issued by making them larger. This should significantly improve performance depending on your workload and the way your physical vdevs handle TRIM requests.

There is a module parameter (trim_txg_batch) that indicates the number of TXGs that are squashed together in order to batch (merge) TRIM requests. It is set to 32 by default. Note that trim_txg_limit has been reduced from 64 to 32 in order to keep the worst-case trim delay constant.

Testing shows a 50% reduction in TRIM requests over a 5-minute ztest run. This is not bad considering ztest is probably close to the worst-case scenario for this feature (heavy random I/O on a constantly changing pool).

Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.

Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.

This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay/batch
principle stays the same, but parameters are expressed in seconds
instead of TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 seconds.
@dechamps
Contributor Author

dechamps commented Oct 3, 2012

5445fb9 brings us time-based TRIM for cache devices. This is an improvement over the previous behavior, which used the TXG counter to schedule TRIMs to cache devices - something that doesn't make a ton of sense, since the L2ARC has no concept of TXGs.

We have two new module parameters: trim_l2arc_limit and trim_l2arc_batch. They work the same way as trim_txg_limit and trim_txg_batch, except that they are expressed in seconds instead of TXGs.

@dechamps
Contributor Author

dechamps commented Oct 3, 2012

The TODO list is now empty. Everything is done: this branch is feature-complete, and I've implemented all the optimization ideas that came to mind. I'll close this pull request and open a new one on trim-clean as soon as I confirm there are no more ztest issues.

@ryao
Contributor

ryao commented Oct 3, 2012

It could be useful to implement something similar to fstrim so that an administrator can TRIM the free space map. That would allow us to rejuvenate any pools that were created on SSDs prior to this work.

Also, it might be useful to test with ashift=13 to ensure that there are no odd issues there. Modern SSDs have 8192-byte pages, so they should be using ashift=13, rather than ashift=12.

@dechamps
Contributor Author

dechamps commented Oct 4, 2012

@ryao: I agree, but I certainly won't have the time to do what you ask. Unfortunately, I only have until tomorrow to wrap this up.

We're almost there. ztest runs on e000967660eb6bc67dd9679b2729d811284a1a9b and eeb919ae2b49faa5f1f8282798201c4e2e6ded3d indicate these commits are now 100% stable (not even a single failure over 34 1-hour passes). Yay!

Unfortunately, the latest code, that is, my optimizations from yesterday, fails approximately 25% of the time with the following assertion:

lt-ztest: ../../lib/libzpool/kernel.c:277: Assertion `mp->m_owner == ((void *)((void *)0)) (0xffffffffffffffff == 0x0)' failed.

Investigating. Hopefully that's the last one!

@dechamps
Contributor Author

dechamps commented Oct 5, 2012

Continued in #1016.

@dechamps dechamps closed this Oct 5, 2012
@Rudd-O
Contributor

Rudd-O commented Jan 29, 2013

What is the status of this?

@behlendorf
Contributor

See #1016 for the updated pull request

@zviratko

Please, when discarding the newly added device, either make it a background job or add an option to skip it.

On the 3500GB LUN I test ZFS with, discarding the whole device with fstrim takes 8 hours(!!!), versus 21 minutes on a device that has already been discarded on the backend.

In other words, if I have a 3500GB LUN with old, dirty data and I try to discard it, it will hang there for 8 hours. This can potentially be very bad if you need the filesystem NOW (like in a backup recovery, etc.).

The LUN in question is connected from a Clariion CX4-480 via 4Gb Fibre. I don't see much activity (iostat, vmstat) when fstrim runs, and there is a huge difference between a clean and a dirty LUN, so the point of congestion is in the array itself.

It is of course possible that fstrim is in some way horribly inefficient, but I'd take the safe side.

Also, performance suffers horribly on ext4 with -o discard, and the recommendation is not to use it and instead just run fstrim occasionally during maintenance windows - I'd prefer such a mechanism in ZFS over automatic (even if deferred/optimized) discard.

@dechamps
Contributor Author

This pull request is obsolete. Please comment on #1016. That being said, I hear your concerns about discard performance, and we'll do our best to alleviate this kind of issue.
