Optimized Large File Deletion to Prevent OOM #16708

serjponomarev · 2024-10-30T20:31:00Z

Describe the feature you would like to see added to OpenZFS

I propose adding an iterative approach for deleting large files in ZFS pools with deduplication enabled. Instead of calling unlink to remove the entire file at once, we can implement a mechanism that reduces the file size from the end, freeing blocks incrementally.

How will this feature improve OpenZFS?

This feature addresses the issue of Out-Of-Memory (OOM) errors that occur when deleting large files. Currently, when unlink is called, ZFS loads all entries from the Deduplication Data Table (DDT) related to the file into memory, which can lead to memory overload, especially on systems with limited RAM. By implementing an iterative file reduction process, we can significantly reduce memory consumption and improve stability.

Additional context

The proposed algorithm includes the following steps:

Iterative File Truncation: Implement internal logic to incrementally truncate the file from the end, allowing ZFS to load only the necessary metadata associated with the current data size, thus minimizing memory usage.
Final unlink Call: Once the file is completely truncated, perform a final unlink to remove any remaining metadata.

Benefits:

Reduces the risk of OOM errors on systems with limited memory.
Easy integration into the existing ZFS architecture without needing changes to system calls.
Enhances overall system performance by managing memory more effectively.

Experimental Evidence

The following experiment demonstrates the basis for this proposed improvement:

Environment:

Virtual machine with 8 vCPUs, 8 GB of RAM, and 3 NVMe drives configured into a 1.5 TB ZFS pool with deduplication enabled and recordsize=16K.
Debian 12.7 with ZFS version 2.1.11 from the Debian repository.

Procedure:

Populate the pool with a file containing random data to fully utilize the DDT:

fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --fallocate=0 --filename=/zpool/test.io --filesize=1T

Attempt to delete the file using rm /zpool/test.io, resulting in an OOM event.

Reboot and delete the file iteratively, reducing its size by 1 GB in each iteration before final deletion:

filename=/zpool/test.io
# Shrink the file 1 GiB at a time, from its current size (per du) down to 0
for i in $(seq "$(du -BG "$filename" | cut -f1 | tr -d 'G')" -1 0); do
    truncate -s "${i}G" "$filename"
    echo "truncated to $i G"
done
rm -v "$filename"

Observation:
Memory consumption can be monitored with watch arc_summary throughout the process.


robn · 2024-10-30T21:23:58Z

I assume you're talking about (at least): #6783 #16037 #16697.

If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).

This specific method can't be done, as unlink() has to appear atomic to the filesystem - it's all or nothing. That said, the technique of pacing the frees rather than dumping them all at once I suspect is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

gmelikov · 2024-10-30T22:24:39Z

Maybe it's a little bit off-topic, but zfs frees blocks, not files (ddt is per-block too), so you can truncate part of your file (plus iterate over whole file), and only unused blocks would be freed. Maybe it's a workaround, yes.

Hope I didn't miss something.

serjponomarev · 2024-10-30T23:24:12Z

> Maybe it's a little bit off-topic, but zfs frees blocks, not files (ddt is per-block too), so you can truncate part of your file (plus iterate over whole file), and only unused blocks would be freed. Maybe it's a workaround, yes.
>
> Hope I didn't miss something.

Yes, you’re absolutely correct.

My approach to finding a solution to this issue went as follows:

I tried various combinations of ZFS module parameters, but this didn’t resolve the problem.
I examined ZFS’s data structures at the block level and confirmed what you described — ZFS indeed frees blocks, not entire files.
To test further, I divided a 1 TB file into 1024 files of 1 GB each, then deleted them sequentially while monitoring memory usage with arc_summary. In this case, there was no excessive memory consumption.
I concluded that what I needed was a way to delete a large 1 TB file as if it were 1024 separate 1 GB files.
I realized that truncate might help achieve this, tested it, and it worked, providing the same memory-efficient behavior as deleting 1024 smaller files.

That’s why I decided to share this approach with the community — to discuss possible ways to implement such a mechanism within the ZFS codebase.

serjponomarev · 2024-10-30T23:36:54Z

> I assume you're talking about (at least): #6783 #16037 #16697.
>
> If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).
>
> This specific method can't be done, as unlink() has to appear atomic to the filesystem - it's all or nothing. That said, the technique of pacing the frees rather than dumping them all at once I suspect is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

In searching for a solution to this issue, I reviewed all the issues you referenced. I understand that the problem isn’t specifically limited to deduplication; it’s broader in scope. However, in the case of deduplication, this problem is 100% reproducible and testable.

That’s why I chose a more general title for this issue.

robn · 2024-10-30T23:47:56Z

Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.

It also wouldn't solve the problem properly anyway, because the real problem is in the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.

It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.

So yeah, if controlling this way from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.
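For anyone who wants to apply the userspace workaround to files of arbitrary size, here is a minimal sketch. The function name `shrink_and_remove`, the default 1 GiB step and the optional pause are illustrative assumptions, not something from ZFS itself; it relies on GNU `stat` and `truncate`. The pause is only there as a hedge against consecutive truncates landing in the same transaction group, and whether it is needed on a given system is untested here.

```shell
#!/usr/bin/env bash
# Shrink a file from the end in fixed-size steps, then unlink it.
# Usage: shrink_and_remove FILE [STEP_BYTES] [PAUSE_SECONDS]
# (hypothetical helper; step size and pause are assumptions to tune per system)
shrink_and_remove() {
    local file=$1
    local step=${2:-$((1024 * 1024 * 1024))}   # default: 1 GiB per step
    local pause=${3:-0}                        # optional delay between steps
    local size

    size=$(stat -c %s -- "$file") || return 1  # apparent size in bytes (GNU stat)
    while [ "$size" -gt 0 ]; do
        if [ "$size" -gt "$step" ]; then
            size=$((size - step))
        else
            size=0
        fi
        truncate -s "$size" -- "$file" || return 1
        if [ "$pause" != 0 ]; then sleep "$pause"; fi
    done
    rm -- "$file"                              # final unlink of the now-empty file
}
```

For example, `shrink_and_remove /zpool/test.io` would mirror the 1 GB steps used in the original experiment. This remains a workaround only: it does not make unlink() itself any safer.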

amotin · 2024-10-31T00:08:17Z

I haven't looked there lately and may misremember, but IIRC we've had a mechanism to throttle deletes by splitting them between transaction groups. I am not sure it would help a single huge file, but for many smaller ones it would be the proper solution.

serjponomarev · 2024-10-31T00:19:03Z

> Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.
>
> It also wouldn't solve the problem properly anyway, because the real problem is in the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.
>
> It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.
>
> So yeah, if controlling this way from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.

@robn
Yes, I understand the aspects you’ve mentioned.
My intention was simply to contribute to solving a broader issue, as I can indeed address my specific problem from userspace.

I currently have access to the same host described in my experiment, but with smaller NVMe drives. The maximum size for the ZFS pool I can create is approximately 1.09 TB, which would allow me to create a zvol of around 800-900 GB, assuming the pool is filled to 80-90%.

I would be happy to assist in gathering information to tackle this broader problem. Please provide the parameters for the zvol experiment, including the zvol size and block size. I will fill it with random data and then perform a blkdiscard.

Also, please clarify what specific data you are looking to obtain from this experiment. If I understand correctly, you aim to test the hypothesis regarding the sequential discarding of blocks and its impact on memory behavior.

The blkdiscard operation should mimic the behavior observed in my truncate experiment, but at the block level rather than the file level, correct?

robn · 2024-10-31T00:21:03Z

Possibly you mean the zfs_unlinked_drain stuff. Not sure; I don't fully understand it myself. Maybe something further down though, there is a lot of back and forth in the file delete path. Whatever is there didn't save #16037 though, which claims "lots of small files" so.

For big objects though, it just ends up adding the entire object length to dn_free_ranges, and then dnode_sync -> dnode_sync_free_range and beyond just blasts out a mass of frees. ("beyond" is a long way, I have notes which I'll write up before long).

Anyway, I think I have a plan now: repurpose async_destroy. I'm working on a simple prototype and test case now, hopefully something to show in a day or two.

amotin · 2024-10-31T00:44:40Z

@robn I am not sure what exactly I mean, but you may see that dmu_tx_count_free() accounts not only for the blocks that will be modified in the process of deletion, but also, in the form of txh_memory_tohold, for how much memory it will require to hold the indirects. But it obviously does not account for DDT, BRT, ZIO and other stuff. I have a feeling there was something else, I just don't remember what.

robn · 2024-10-31T01:05:19Z

Ahh yeah, that might be it. And I understand why it's not working here.

In zio_free_sync, any free that will create IO (gang, dedup, maybe BRT) will zio_create and put it on the pipeline. A zio_t is 1280 bytes. So if you delete a 2T file of 128K dedup blocks, that'll create 16M zio_t, so ~20G just off the zio_cache slab. (This is exactly the scenario in #16697). And of course nothing in the DMU is able to anticipate that.
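As a rough check of the arithmetic above, the estimate can be reproduced in shell. This is only a sketch: the helper name `zio_free_estimate` is made up for illustration, the 1280-byte `zio_t` size is the figure quoted in this thread and may differ between builds, and it assumes the worst case where every block of the object turns into a FREE zio.

```shell
#!/usr/bin/env bash
# Estimate zio_cache memory if every block of an object becomes a FREE zio.
# Usage: zio_free_estimate OBJECT_BYTES BLOCK_BYTES [ZIO_BYTES]
# (hypothetical helper; 1280 is the zio_t size quoted in the thread)
zio_free_estimate() {
    local object_bytes=$1 block_bytes=$2
    local zio_bytes=${3:-1280}
    local blocks=$(( object_bytes / block_bytes ))
    local total=$(( blocks * zio_bytes ))
    echo "$blocks blocks, $(( total / 1024 / 1024 / 1024 )) GiB of zio_t"
}

# 2 TiB object made of 128 KiB blocks
zio_free_estimate $(( 2 * 1024**4 )) $(( 128 * 1024 ))
# prints: 16777216 blocks, 20 GiB of zio_t
```

The same helper can be pointed at other sizes, for example the 16K blocks used in the original experiment, to see how quickly the block count grows as the block size shrinks.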

I'm currently looking at async_destroy as a way of reusing an existing facility (with a nice side effect of background deletes in general, so very fast unlink() calls).

In the longer term, the whole zio pipeline needs a lot of work. Reducing zio_t size at least, maybe frees shouldn't really be done there (since they're not really IO), but also maybe stuff about generally not allocating space until we need it. I had a similar issue in a customer job a few weeks ago where I loaded up a ton of read IOs on the queue, and OOMed the system because all the ABDs needed to be allocated up front, even though they weren't needed until the IO got to vdev_io_start. There's loads to be done, but I definitely didn't want to just start down this road for this mass-free issue, because it needs real thought and input from more people than just me.

serjponomarev · 2024-10-31T02:35:30Z

@robn
I did some experiments with zvol.

  pool: zpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        zpool       ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0

errors: No known data errors

zfs create -s -b 16K -V 900G zpool/zvol

NAME         USED  AVAIL     REFER  MOUNTPOINT
zpool        383M  1.06T       96K  /zpool
zpool/zvol    56K  1.06T       56K  -

Filling:
fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/zd0

blkdiscard by default, without specifying a step, discards all data.

Without deduplication:

blkdiscard -v /dev/zd0 - works, memory consumption is almost unchanged.

With deduplication:

blkdiscard -v /dev/zd0 - OOM
blkdiscard -v --step 1G /dev/zd0 - OOM
blkdiscard -v --step 1M /dev/zd0 - works

robn · 2024-11-05T04:37:17Z

@serjponomarev thanks for all the info. The zvol/blkdiscard test supported the theory. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today.

If you're able, could you please rerun your test with this patch? Thanks!
robn@52beaf5

serjponomarev · 2024-11-05T08:56:48Z

> @serjponomarev thanks for all the info. The zvol/blkdiscard test supported the theory. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today.
>
> If you're able, could you please rerun your test with this patch? Thanks! robn@52beaf5

@robn
I’ve tested everything, and it’s working great!

Environment:

Virtual machine with 8 vCPUs, 8 GB of RAM, 3 NVMe drives configured into a 1 TB ZFS pool, with recordsize=16K.
Debian 12.7, kernel 6.1.0-26, ZFS 2.3-rc2 with your patch.

For testing, all data written was purely random to maximize the deduplication table.

I tested both native deduplication and fast-deduplication methods, performing deletions of a 900 GB file and discarding a 900 GB zvol. This resulted in four test cases in total.
In all cases, I observed asynchronous space reclamation behavior similar to what occurs with snapshot deletion. The file deletion operation completed quickly, while discarding was slower but still significantly faster than before applying this patch. Both approaches completed without OOM (Out of Memory) issues.

Additionally, the zfs_max_async_dedup_frees parameter now works correctly.
This parameter controls the number of DDT entries freed and the speed of asynchronous deduplication space reclamation. Its default value is 100000.

To test it further, I increased this parameter by 10x and then reduced it by 10x:

In the first case, I encountered an OOM, which was expected due to the limited 8 GB of RAM. However, after a reboot, the block release process resumed and completed correctly.
In the second case, memory usage dropped to 4 GB, with all cores evenly utilized at around 30-40%. The space reclamation continued slightly slower but without the risk of OOM.

I monitored the behavior of this parameter through watch -d -n 0.3 zpool status -D, where the DDT entries could clearly be seen being flushed in multiples of the set parameter value.
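For reference, a minimal sketch of how the parameter can be read and changed at runtime on Linux. It assumes the usual sysfs location for ZFS module parameters, `/sys/module/zfs/parameters`; the helper names `zfs_param_get` and `zfs_param_set` are made up for illustration, and a value written this way is not expected to survive a reboot or module reload.

```shell
#!/usr/bin/env bash
# Read or set a ZFS module parameter through sysfs (Linux only).
# PARAM_DIR is overridable so the helpers can be tried without ZFS loaded.
PARAM_DIR=${PARAM_DIR:-/sys/module/zfs/parameters}

# Hypothetical helpers, not part of ZFS
zfs_param_get() { cat "$PARAM_DIR/$1"; }
zfs_param_set() { printf '%s\n' "$2" > "$PARAM_DIR/$1"; }   # needs root on a real system

# Example (as root): reduce the limit tenfold from the default of 100000,
# then watch the DDT entries drain.
#   zfs_param_set zfs_max_async_dedup_frees 10000
#   zfs_param_get zfs_max_async_dedup_frees
#   watch -d -n 0.3 zpool status -D
```

On a machine with little RAM, lowering the value trades slower space reclamation for lower memory use, which matches the second test case above.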

robn · 2024-11-05T09:52:44Z

@serjponomarev this is fantastic info. You've confirmed pretty much exactly what I was hoping it would do, in more ways than I thought of. Thanks so much!

PR already posted in #16722.

shodanshok · 2024-11-08T11:20:01Z

> Possibly you mean the zfs_unlinked_drain stuff.

@robn Maybe I am missing something, but is zfs_per_txg_dirty_frees_percent related to this issue?

robn · 2024-11-08T11:30:03Z

> Maybe I am missing something, but is zfs_per_txg_dirty_frees_percent related to this issue?

@shodanshok I don't believe so. As I understand it, that sets the threshold for how much free can go on a single txg, but it doesn't split or anything, just doesn't allow any more once you've gone past it. So if you decide to put a single 2T "free range" on a txg, it goes in and no more will be allowed, but by then it's too late.

dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722
Closes openzfs#16697

serjponomarev added the "Type: Feature" label Oct 30, 2024

amotin mentioned this issue Oct 31, 2024: block clone and bulk deletions with regards to files generated by Veeam. #16680 (Open)

serjponomarev mentioned this issue Nov 4, 2024: File truncate operation does not work with dedup on and fast dedup enabled #16718 (Closed)

robn mentioned this issue Nov 5, 2024: dsl_dataset: put IO-inducing frees on the pool deadlist #16722 (Merged)

jtblck90 mentioned this issue Nov 6, 2024: OOM after files remove with dedup on and fast dedup enabled #16697 (Closed)

behlendorf closed this as completed in #16722 Nov 13, 2024

behlendorf closed this as completed in 46c4f2c Nov 13, 2024
