Optimized Large File Deletion to Prevent OOM #16708

Closed
serjponomarev opened this issue Oct 30, 2024 · 16 comments · Fixed by #16722
Labels
Type: Feature Feature request or new feature

Comments

@serjponomarev

Describe the feature you would like to see added to OpenZFS

I propose adding an iterative approach for deleting large files in ZFS pools with deduplication enabled. Instead of calling unlink to remove the entire file at once, we can implement a mechanism that reduces the file size from the end, freeing blocks incrementally.

How will this feature improve OpenZFS?

This feature addresses the issue of Out-Of-Memory (OOM) errors that occur when deleting large files. Currently, when unlink is called, ZFS loads all entries from the Deduplication Data Table (DDT) related to the file into memory, which can lead to memory overload, especially on systems with limited RAM. By implementing an iterative file reduction process, we can significantly reduce memory consumption and improve stability.

Additional context

The proposed algorithm includes the following steps:

  1. Iterative File Truncation: Implement internal logic to incrementally truncate the file from the end, allowing ZFS to load only the necessary metadata associated with the current data size, thus minimizing memory usage.
  2. Final unlink Call: Once the file is completely truncated, perform a final unlink to remove any remaining metadata.

Benefits:

  • Reduces the risk of OOM errors on systems with limited memory.
  • Easy integration into the existing ZFS architecture without needing changes to system calls.
  • Enhances overall system performance by managing memory more effectively.

Experimental Evidence

The following experiment demonstrates the basis for this proposed improvement:

Environment:

  • Virtual machine with 8 vCPUs, 8 GB of RAM, and 3 NVMe drives configured into a 1.5 TB ZFS pool with deduplication enabled and recordsize=16K.
  • Debian 12.7 with ZFS version 2.1.11 from the Debian repository.

Procedure:

  1. Populate the pool with a file containing random data to fully utilize the DDT:

    fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --fallocate=0 --filename=/zpool/test.io --filesize=1T

  2. Attempt to delete the file using rm /zpool/test.io, resulting in an OOM event.

  3. Reboot and delete the file iteratively, reducing its size by 1 GB in each iteration before final deletion:

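    filename=/zpool/test.io
    # Allocated size in GiB, rounded up, as reported by du (e.g. "1024G" -> 1024).
    size_gb=$(du -BG "$filename" | cut -f1 | tr -d 'G')
    # Shrink the file 1 GiB at a time, from its current size down to 0.
    for i in $(seq "$size_gb" -1 0); do
        truncate -s "${i}G" "$filename"
        echo "truncated to $i G"
    done
    rm -v "$filename"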
    

Observation:
Memory consumption can be monitored with watch arc_summary throughout the process.

@serjponomarev serjponomarev added the Type: Feature Feature request or new feature label Oct 30, 2024
@robn
Member

robn commented Oct 30, 2024

I assume you're talking about (at least): #6783 #16037 #16697.

If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).

This specific method can't be done, as unlink() has to appear atomic to the filesystem - it's all or nothing. That said, I suspect the technique of pacing the frees rather than dumping them all at once is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

@gmelikov
Member

Maybe it's a little bit off-topic, but zfs frees blocks, not files (ddt is per-block too), so you can truncate part of your file (plus iterate over whole file), and only unused blocks would be freed. Maybe it's a workaround, yes.

Hope I didn't miss something.

@serjponomarev
Author

Maybe it's a little bit off-topic, but zfs frees blocks, not files (ddt is per-block too), so you can truncate part of your file (plus iterate over whole file), and only unused blocks would be freed. Maybe it's a workaround, yes.

Hope I didn't miss something.

Yes, you’re absolutely correct.

My approach to finding a solution to this issue went as follows:

  1. I tried various combinations of ZFS module parameters, but this didn’t resolve the problem.
  2. I examined ZFS’s data structures at the block level and confirmed what you described — ZFS indeed frees blocks, not entire files.
  3. To test further, I divided a 1 TB file into 1024 files of 1 GB each, then deleted them sequentially while monitoring memory usage with arc_summary. In this case, there was no excessive memory consumption.
  4. I concluded that what I needed was a way to delete a large 1 TB file as if it were 1024 separate 1 GB files.
  5. I realized that truncate might help achieve this, tested it, and it worked, providing the same memory-efficient behavior as deleting 1024 smaller files.

That’s why I decided to share this approach with the community — to discuss possible ways to implement such a mechanism within the ZFS codebase.

@serjponomarev
Author

I assume you're talking about (at least): #6783 #16037 #16697.

If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).

This specific method can't be done, as unlink() has to appear atomic to the filesystem - it's all or nothing. That said, I suspect the technique of pacing the frees rather than dumping them all at once is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

In searching for a solution to this issue, I reviewed all the issues you referenced. I understand that the problem isn’t specifically limited to deduplication; it’s broader in scope. However, in the case of deduplication, this problem is 100% reproducible and testable.

That’s why I chose a more general title for this issue.

@robn
Member

robn commented Oct 30, 2024

Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.

It also wouldn't solve the problem properly anyway, because the real problem is in the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.

It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.

So yeah, if controlling this way from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.
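
A minimal sketch of that userspace workaround, written with ftruncate rather than the truncate command from the original report. The function name, the 1 GiB default step, and the optional pause between steps are illustrative assumptions, not anything OpenZFS provides. The file is visibly shrinking while this runs, which is exactly why it cannot stand in for unlink().

```python
import os
import time


def shrink_then_unlink(path: str, step: int = 1 << 30, pause: float = 0.0) -> int:
    """Truncate `path` from the end in `step`-byte decrements, then unlink it.

    Returns the number of ftruncate calls made. This is NOT atomic: other
    processes can observe the intermediate sizes.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    calls = 0
    fd = os.open(path, os.O_WRONLY)
    try:
        size = os.fstat(fd).st_size
        while size > 0:
            size = max(0, size - step)
            os.ftruncate(fd, size)  # frees only the blocks past the new end
            calls += 1
            if pause > 0:
                # Assumption: spacing the steps out makes it less likely that
                # all frees land in one transaction group. The original shell
                # loop had no explicit pause.
                time.sleep(pause)
    finally:
        os.close(fd)
    os.unlink(path)  # the file is empty by now, so little is left to free
    return calls
```

Calling `shrink_then_unlink("/zpool/test.io")` would mirror the 1 GiB steps of the shell loop; a smaller `step` trades more system calls for smaller bursts of frees.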

@amotin
Member

amotin commented Oct 31, 2024

I haven't looked there lately and may misremember, but IIRC we've had a mechanism to throttle deletes by splitting them between transaction groups. I am not sure it would help with a single huge file, but for many smaller ones it would be the proper solution.

@serjponomarev
Author

serjponomarev commented Oct 31, 2024

Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.

It also wouldn't solve the problem properly anyway, because the real problem is in the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.

It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.

So yeah, if controlling this way from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.

@robn
Yes, I understand the aspects you’ve mentioned.
My intention was simply to contribute to solving a broader issue, as I can indeed address my specific problem from userspace.

I currently have access to the same host described in my experiment, but with smaller NVMe drives. The maximum size for the ZFS pool I can create is approximately 1.09 TB, which would allow me to create a zvol of around 800-900 GB, assuming the pool is filled to 80-90%.

I would be happy to assist in gathering information to tackle this broader problem. Please provide the parameters for the zvol experiment, including the zvol size and block size. I will fill it with random data and then perform a blkdiscard.

Also, please clarify what specific data you are looking to obtain from this experiment. If I understand correctly, you aim to test the hypothesis regarding the sequential discarding of blocks and its impact on memory behavior.

The blkdiscard operation should mimic the behavior observed in my truncate experiment, but at the block level rather than the file level, correct?

@robn
Member

robn commented Oct 31, 2024

Possibly you mean the zfs_unlinked_drain stuff. Not sure; I don't fully understand it myself. Maybe something further down though, there is a lot of back and forth in the file delete path. Whatever is there didn't save #16037 though, which claims "lots of small files" so.

For big objects though, it just ends up adding the entire object length to dn_free_ranges, and then dnode_sync -> dnode_sync_free_range and beyond just blasts out a mass of frees. ("beyond" is a long way, I have notes which I'll write up before long).

Anyway, I think I have a plan now: repurpose async_destroy. I'm working on a simple prototype and test case now, hopefully something to show in a day or two.

@amotin
Member

amotin commented Oct 31, 2024

@robn I am not sure what exactly I mean, but you may see that dmu_tx_count_free() accounts not only for the blocks that will be modified in the process of deletion, but also, in the form of txh_memory_tohold, for how much memory it will require to hold the indirects. But it obviously does not account for DDT, BRT, ZIO and other stuff. I have a feeling there was something else, though; I just don't remember what.

@robn
Member

robn commented Oct 31, 2024

Ahh yeah, that might be it. And I understand why it's not working here.

In zio_free_sync, any free that will create IO (gang, dedup, maybe BRT) will zio_create and put it on the pipeline. A zio_t is 1280 bytes. So if you delete a 2T file of 128K dedup blocks, that'll create 16M zio_t, so ~20G just off the zio_cache slab. (This is exactly the scenario in #16697). And of course nothing in the DMU is able to anticipate that.
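
As a quick check of those numbers, here is a back-of-the-envelope sketch. It assumes binary units (2T = 2 TiB, 128K = 128 KiB), takes the 1280-byte zio_t size from the comment above, and assumes every block needs its own FREE zio. The function name is made up for illustration.

```python
ZIO_T_BYTES = 1280  # size of one zio_t, as quoted in the comment above

KiB = 1024
GiB = 1024 ** 3
TiB = 1024 ** 4


def free_zio_footprint(object_bytes: int, block_bytes: int,
                       zio_bytes: int = ZIO_T_BYTES) -> tuple[int, int]:
    """Return (block count, total zio_t bytes) if every block needs a FREE zio."""
    blocks = -(-object_bytes // block_bytes)  # ceiling division
    return blocks, blocks * zio_bytes


blocks, mem = free_zio_footprint(2 * TiB, 128 * KiB)
print(f"{blocks:,} blocks -> {mem / GiB:.1f} GiB of zio_t")
# prints: 16,777,216 blocks -> 20.0 GiB of zio_t
```

Under the same assumptions, the 1 TiB file with recordsize=16K from the original report works out to 67,108,864 blocks and 80 GiB of zio_t, far more than the 8 GB of RAM in that test VM.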

I'm currently looking at async_destroy as a way of reusing an existing facility (with a nice side effect of background deletes in general, so very fast unlink() calls).

In the longer term, the whole zio pipeline needs a lot of work. Reducing zio_t size at least, maybe frees shouldn't really be done there (since they're not really IO), but also maybe stuff about generally not allocating space until we need it. I had a similar issue in a customer job a few weeks ago where I loaded up a ton of read IOs on the queue, and OOMed the system because all the ABDs needed to be allocated up front, even though they weren't needed until the IO got to vdev_io_start. There's loads to be done, but I definitely didn't want to just start down this road for this mass-free issue, because it needs real thought and input from more people than just me.

@serjponomarev
Author

@robn
I did some experiments with zvol.

  pool: zpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        zpool       ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0

errors: No known data errors

zfs create -s -b 16K -V 900G zpool/zvol

NAME         USED  AVAIL     REFER  MOUNTPOINT
zpool        383M  1.06T       96K  /zpool
zpool/zvol    56K  1.06T       56K  -

Filling:
fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/zd0

blkdiscard by default, without specifying a step, discards all data.

Without deduplication:

  1. blkdiscard -v /dev/zd0 - works, memory consumption is almost unchanged.

With deduplication:

  1. blkdiscard -v /dev/zd0 - OOM
  2. blkdiscard -v --step 1G /dev/zd0 - OOM
  3. blkdiscard -v --step 1M /dev/zd0 - works

@robn
Member

robn commented Nov 5, 2024

@serjponomarev thanks for all the info. The zvol/blkdiscard test supported the theory. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today.

If you're able, could you please rerun your test with this patch? Thanks!
robn@52beaf5

@serjponomarev
Author

@serjponomarev thanks for all the info. The zvol/blkdiscard test supported the theory. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today.

If you're able, could you please rerun your test with this patch? Thanks! robn@52beaf5

@robn
I’ve tested everything, and it’s working great!

Environment:

  • Virtual machine with 8 vCPUs, 8 GB of RAM, 3 NVMe drives configured into a 1 TB ZFS pool, with recordsize=16K.
  • Debian 12.7, kernel 6.1.0-26, ZFS 2.3-rc2 with your patch.

For testing, all data written was purely random to maximize the deduplication table.

I tested both native deduplication and fast-deduplication methods, performing deletions of a 900 GB file and discarding a 900 GB zvol. This resulted in four test cases in total.
In all cases, I observed asynchronous space reclamation behavior similar to what occurs with snapshot deletion. The file deletion operation completed quickly, while discarding was slower but still significantly faster than before applying this patch. Both approaches completed without OOM (Out of Memory) issues.

Additionally, the zfs_max_async_dedup_frees parameter now works correctly.
This parameter controls the number of DDT entries freed and the speed of asynchronous deduplication space reclamation. Its default value is 100000.

To test it further, I increased this parameter by 10x and then reduced it by 10x:

  • In the first case, I encountered an OOM, which was expected due to the limited 8 GB of RAM. However, after a reboot, the block release process resumed and completed correctly.
  • In the second case, memory usage dropped to 4 GB, with all cores evenly utilized at around 30-40%. The space reclamation continued slightly slower but without the risk of OOM.

I monitored the behavior of this parameter through watch -d -n 0.3 zpool status -D, where it was clear to see the DDT entries being flushed in multiples of the set parameter value.
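
To put the parameter in perspective, the sketch below estimates how many batches it takes to drain the frees. It assumes the limit applies once per transaction group, that "900 GB" means 900 GiB, and that every 16 KiB block is a unique DDT entry (which random data should approximate). The function name is hypothetical and the result says nothing about wall-clock time.

```python
KiB = 1024
GiB = 1024 ** 3


def txgs_to_drain(total_bytes: int, block_bytes: int, max_frees_per_txg: int) -> int:
    """Estimate how many txgs are needed if each frees at most max_frees_per_txg blocks."""
    entries = -(-total_bytes // block_bytes)     # ceiling division
    return -(-entries // max_frees_per_txg)      # ceiling division


for limit in (10_000, 100_000, 1_000_000):
    print(limit, txgs_to_drain(900 * GiB, 16 * KiB, limit))
# prints:
# 10000 5899
# 100000 590
# 1000000 59
```

A tenfold smaller limit means roughly ten times as many, but much smaller, batches, which is consistent with the lower memory use and slower reclamation reported above.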

@robn
Member

robn commented Nov 5, 2024

@serjponomarev this is fantastic info. You've confirmed pretty much exactly what I was hoping it would do, in more ways than I thought of. Thanks so much!

PR already posted in #16722.

@shodanshok
Contributor

Possibly you mean the zfs_unlinked_drain stuff.

@robn Maybe I am missing something, but is zfs_per_txg_dirty_frees_percent related to this issue?

@robn
Member

robn commented Nov 8, 2024

Maybe I am missing something, but is zfs_per_txg_dirty_frees_percent related to this issue?

@shodanshok I don't believe so. As I understand it, that sets the threshold for how much free can go on a single txg, but it doesn't split or anything, just doesn't allow any more once you've gone past it. So if you decide to put a single 2T "free range" on a txg, it goes in and no more will be allowed, but by then it's too late.
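
A toy model of the behaviour described in this reply, not the actual OpenZFS logic: a threshold that only stops admitting new frees once it has been crossed, and never splits a request. The function and its arguments are hypothetical.

```python
GiB = 1024 ** 3
TiB = 1024 ** 4


def admit_frees(requests: list[int], threshold: int) -> tuple[list[int], int]:
    """Admit whole free requests into one txg until the running total passes threshold."""
    admitted: list[int] = []
    total = 0
    for size in requests:
        if total >= threshold:
            break  # later requests would have to wait for another txg
        admitted.append(size)  # the request goes in whole; it is never split
        total += size
    return admitted, total


# Many small frees are capped close to the threshold...
small, small_total = admit_frees([GiB] * 200, threshold=100 * GiB)
print(len(small), small_total // GiB)  # prints: 100 100

# ...but a single huge free range sails straight past it.
big, big_total = admit_frees([2 * TiB, GiB], threshold=100 * GiB)
print(len(big), big_total // GiB)      # prints: 1 2048
```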

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Nov 15, 2024
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.

Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times on a single
transaction (eg a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T
object that requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.

This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.

For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722 
Closes openzfs#16697
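
The approach in the commit message can be pictured with a toy model. This is an illustrative sketch only: the class, field, and method names are invented, and the per-txg limit stands in for whatever pacing the async destroy machinery applies. Blocks that can be freed without IO are released at once; blocks that would need a FREE IO are queued and trickled out a bounded number per txg.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    ident: int
    gang: bool = False
    dedup: bool = False
    cloned: bool = False

    def needs_free_io(self) -> bool:
        # The commit message names gang, dedup and cloned blocks as needing IO.
        return self.gang or self.dedup or self.cloned


class ToyPool:
    def __init__(self, max_deferred_per_txg: int) -> None:
        self.max_deferred_per_txg = max_deferred_per_txg
        self.freelist: deque[Block] = deque()  # deferred, IO-inducing frees
        self.freed: list[int] = []             # identifiers of released blocks

    def free(self, block: Block) -> None:
        """Release cheap blocks immediately; defer the ones that need IO."""
        if block.needs_free_io():
            self.freelist.append(block)
        else:
            self.freed.append(block.ident)

    def sync_txg(self) -> int:
        """Drain at most max_deferred_per_txg deferred blocks; return how many."""
        count = min(self.max_deferred_per_txg, len(self.freelist))
        for _ in range(count):
            self.freed.append(self.freelist.popleft().ident)
        return count
```

With a limit of 3 and ten freed blocks of which seven are dedup, three are released immediately and successive `sync_txg()` calls drain 3, 3, 1 and then 0 blocks, so the amount of work in flight stays bounded regardless of how large the deleted object was.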