Large Deletes & Memory Consumption #6783

Closed
BloodBlight opened this issue Oct 25, 2017 · 34 comments · Fixed by #16722
Labels
Bot: Not Stale · Component: Memory Management

Comments

@BloodBlight

First time posting to GitHub, be gentle. :)

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  16.04 LTS / 16.10 on USB
Linux Kernel          Will post soon, current in LTS / 4.8.0-22 on USB
Architecture          Intel Xeon
ZFS Version           Will post soon, current in LTS / zfsutils-linux 0.6.5.8 on USB
SPL Version           Will post soon, current in LTS / ?
RAM                   96GB of ECC RAM

Other Config Information

  • 2 Mirrored VDEVs for a total of 10TBs usable.
  • L2ARC, metadata only
  • Dedup and compression are enabled. (Ya I know… But I have 96GBs, that should be plenty).
  • About 11TBs of data and about 2TBs free (after dedup and compression).
  • Just a handful of apps and users. No major load on the system (it just hosts files).
  • Exact details to follow once I can get back in.
  • Will post these as soon as I can:
      • modinfo zfs | grep -iw version
      • modinfo spl | grep -iw version

Trigger

Delete a large file 1TB+

Issue

The system will slowly consume all memory over the course of several hours (about 12) and hard lock. This happens both after the delete and while importing the zpool on reboot.

I have had this happen before; I added a 32GB swap file (on SSD) and that seemed to help. It eventually cleared up after several attempts to reboot (took about two weeks, 12 hours a pop). I made the assumption that the delete was working, but something was causing the memory to not be released. So eventually...

This time I booted off of a live USB, added zfs-utils, and I was surprised that not only did it attempt to mount the zpool right away (while in apt), but after about an hour it succeeded!

I thought “Cool, it cleared!” and rebooted. No go, 12 hours later, out of memory and locked (still at the boot screen with an out of memory error).

Alright, booted back into the USB stick, again, hung for about an hour, then booted! “Alright, that’s odd.”

At this point I noticed that the mount point for the tank was already taken and I could not access the volume. So I exported the zpool, which took a bit but completed. I moved the folder and re-mounted, then watched the memory slowly climb and lock after 12 hours.

I moved the USB boot to another system and removed ZFS. I now have the box booted again, re-blocked the mount point, and have just re-installed ZFS. I am waiting for the mount to complete. I am hoping it will complete in an hour or so.

FYI, I will be on vacation for several days and unable to access the server after tomorrow.

What else should I grab, as I am limited in what I can get right now? Is this a known issue? Should I go to a newer build?

I have looked at several other open and closed issues including:
#3725
#5706
#5449
#3976
#5923

@MyPod-zz

Deleting that much data in a single go while having dedup enabled is known to cause issues such as this and to be quite memory- and time-hungry. AFAIK this is not considered a bug, as there are plenty of warnings in all the documentation regarding deduplication, and I also recall a few instances in which pools had to be moved to a platform with a greater amount of RAM to successfully finish a zfs destroy.

I'd suggest disabling the L2ARC, at least temporarily, in an attempt to drop the need to store its metadata in RAM, and possibly bumping up the amount of RAM in the system.
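(For reference, removing a cache device is non-destructive and can be done on a live pool; a minimal sketch, with the pool and device names as placeholders:)

  # find the device listed under the "cache" section of the pool layout
  zpool status MediaTank
  # detach the L2ARC device; pool data is unaffected
  zpool remove MediaTank sdX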

@BloodBlight
Author

BloodBlight commented Oct 25, 2017

Some updates.

There is no L2ARC, my mistake. It used to have one but it was removed.

Blocking the mount point did indeed allow the import to complete in about one hour with minimal memory usage (about 9GBs). There does not appear to be any pending IO after several hours. It is accepting zdb commands and zpool modification commands.

At this point I am not sure what the deal is, this is really odd!

As for memory consumption and dedup, yep, I am aware. However, I have 81.2M allocated blocks, so the entire DDT should be under 29GB. Even if you use the referenced block count (102M) and 512 bytes per entry rather than 380, that is still under 49GB. So with 98GB (right now 88GB free) that should not be an issue.
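(For anyone following along, those DDT counts can be pulled straight from the pool; a minimal sketch, using the pool name that appears later in this thread:)

  # summary of DDT entries plus on-disk and in-core sizes per entry
  zdb -D MediaTank
  # -DD adds a histogram of entries by reference count
  zdb -DD MediaTank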

I am more confused by the change in behavior based on the mount state. I am 90%+ confident that if I export, move the folder, and re-import, it will tank the box…

In #3725 this was discussed and the issue was supposedly patched. He was using 0.6.4, and after more digging, it sounds like this ( #5706 ) was fixed in zfs-0.7.0-rc4 and may be a starvation issue causing memory to not be freed.

I think I am going to have to block the mounting of the drive and try upgrading… But I am still really confused by the change in behavior due to the mount point location being available...

@BloodBlight
Author

Alright, it looks like the pending transaction is processed on mount and not on import. So that explains that...

Looking at the upgrade-from-source process. :/

@BloodBlight
Author

BloodBlight commented Oct 26, 2017

Installed 0.7.2.3, started the mount several hours ago. Looks like we are on our way to a lockup, but I will get one last look at it tomorrow morning.

Here is data from my current slabtop:

slabtop -o -ss

Active / Total Objects (% used)    : 40680649 / 48957290 (83.1%)
Active / Total Slabs (% used)      : 1613380 / 1613380 (100.0%)
Active / Total Caches (% used)     : 93 / 140 (66.4%)
Active / Total Size (% used)       : 28517271.18K / 33163274.38K (86.0%)
Minimum / Average / Maximum Object : 0.01K / 0.68K / 18.50K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
12537657 8479926 67.00% 1.09K 432333 29 13,834,656 KB zio_cache
3165056 3165053 99.00% 4.00K 395632 8 12,660,224 KB kmalloc-4096
5293908 5293908 100.00% 0.44K 147053 36 2,352,848 KB xfrm_dst_cache
3228096 3228096 100.00% 0.32K 134504 24 1,076,032 KB taskstats
3119283 3119283 100.00% 0.29K 115529 27 924,232 KB dmu_buf_impl_t
2894976 2894970 99.00% 0.25K 90468 32 723,744 KB kmalloc-256
12547190 8490904 67.00% 0.05K 147614 85 590,456 KB ftrace_event_field
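(A quick way to track that zio_cache figure continuously rather than via one-off slabtop runs; a minimal sketch:)

  # log the zio_cache slab line once a minute with a timestamp
  while true; do date; grep zio_cache /proc/slabinfo; sleep 60; done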

And from /proc/spl/kstat/zfs/arcstats:

name type data
hits 4 23699472
misses 4 3308407
demand_data_hits 4 0
demand_data_misses 4 0
demand_metadata_hits 4 23699472
demand_metadata_misses 4 3308398
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 0
prefetch_metadata_misses 4 9
mru_hits 4 4862763
mru_ghost_hits 4 0
mfu_hits 4 18836709
mfu_ghost_hits 4 0
deleted 4 65
mutex_miss 4 0
evict_skip 4 30
evict_not_enough 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 730112
evict_l2_ineligible 4 0
evict_l2_skip 4 0
hash_elements 4 3201684
hash_elements_max 4 3201684
hash_collisions 4 323909
hash_chains 4 261002
hash_chain_max 4 4
p 4 25328530432
c 4 50657060864
c_min 4 33554432
c_max 4 50657060864
size 4 18326494616
hdr_size 4 1357514440
data_size 4 0
metadata_size 4 16052901376
other_size 4 916078800
anon_size 4 409600
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 5683290112
mru_evictable_data 4 0
mru_evictable_metadata 4 4941213696
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 10369201664
mfu_evictable_data 4 0
mfu_evictable_metadata 4 10369193472
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 18326494616
arc_meta_limit 4 37992795648
arc_meta_max 4 10889914016
arc_meta_min 4 8589934592
arc_need_free 4 0
arc_sys_free 4 1583030272

@BloodBlight
Author

Dead.
[screenshot: 20171026_080510 01]

@BloodBlight
Author

No one wants to chime in on this? This has been a known bug in the past. A 1TB delete really should not be an issue with this much memory...

I have been online for about 24 hours on this last run (last one died around 10 hours in). It's still crunching along at the moment.

Memory is right where I would expect it. I am using 24GB of ARC for 81.2M DDT entries, so that sounds perfect for the math: 81.2M * 320 bytes / 1024^3 ≈ 24.2GB.

arc_summary:

ZFS Subsystem Report			Fri Nov 03 19:44:43 2017
ARC Summary: (HEALTHY)
Memory Throttle Count:			0

ARC Misc:
Deleted:				32
Mutex Misses:				0
Evict Skips:				0

ARC Size:				60.65%	28.61	GiB
Target Size: (Adaptive)			100.00%	47.18	GiB
Min Size (Hard Limit):			0.07%	32.00	MiB
Max Size (High Water):			1509:1	47.18	GiB

ARC Size Breakdown:
Recently Used Cache Size:		50.00%	23.59	GiB
Frequently Used Cache Size:		50.00%	23.59	GiB

ARC Hash Breakdown:
Elements Max:				4.40m
Elements Current:			10.21%	449.47k
Collisions:				745.12k
Chain Max:				5
Chains:					5.55k

ARC Total accesses:			88.52m
Cache Hit Ratio:			93.36%	82.65m
Cache Miss Ratio:			6.64%	5.87m
Actual Hit Ratio:			93.36%	82.65m

Data Demand Efficiency:			100.00%	0

CACHE HITS BY CACHE LIST:
  Most Recently Used:			10.33%	8.54m
  Most Frequently Used:			89.67%	74.11m
  Most Recently Used Ghost:		0.00%	0
  Most Frequently Used Ghost:		0.00%	0

CACHE HITS BY DATA TYPE:
  Demand Data:				0.00%	0
  Prefetch Data:			0.00%	0
  Demand Metadata:			100.00%	82.65m
  Prefetch Metadata:			0.00%	0

CACHE MISSES BY DATA TYPE:
  Demand Data:				0.00%	0
  Prefetch Data:			0.00%	0
  Demand Metadata:			98.31%	5.78m
  Prefetch Metadata:			1.69%	99.21k

@MyPod-zz

MyPod-zz commented Nov 4, 2017

The math might not be as simple as 81.2M * 320 / 1024^3 = 24.2GB. I don't recall whether this applies to the DDT or not, but certain kinds of data can take up only a certain fraction of the ARC by default - e.g. metadata by default goes up to 1/4. Your system might be capped like that, in which case you might want to find the related tunables. Sadly I can't offer help with dedup, as it's something I've never dealt with.
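(For reference, the current metadata cap and usage are visible in arcstats, and the cap is a module parameter; a minimal sketch, with the 48 GiB value purely illustrative:)

  # compare metadata actually cached against the current limit
  grep -E '^(arc_meta_used|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats
  # raise the limit at runtime if it turns out to be the cap
  echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_meta_limit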

@BloodBlight
Author

Yep, totally understand.

From everything I have read, and what the devs have stated here, ZFS should slow to a crawl, not crash the box (as it needs to constantly go to disk for every meta op). And it's so common for ZFS systems to be under-built on RAM and then folks ask why it doesn't work. So I tried to take the max possible usage and then double it (really, it just worked out that way). This is a prototyping box for other builds, so I am less interested in getting it back online than I am in either speccing the system properly or fixing whatever the issue is. Really, I can get this box back online by just rebooting it a bunch / adding a bunch of RAM (I think).

Last night, shortly after making that post, consumption spiked rapidly and tanked the box. So what I am going to do is install Splunk and see if I can come up with a script to watch all of the key details. I am still not sure exactly what is going nuts at the end.

So I will try and watch the slab, arcstats, iostats, and anything else I can think of. Some of this will take some work on my part due to formatting.

Any recommendations on what I should be capturing?

Also, looks like my tunables got chopped in editing on my last post! OOPS!

ZFS Tunable:  
metaslab_debug_load 0
zfs_arc_min_prefetch_lifespan 0
zfetch_max_streams 8
zfs_nopwrite_enabled 1
zfetch_min_sec_reap 2
zfs_dbgmsg_enable 0
zfs_dirty_data_max_max_percent 25
zfs_arc_p_aggressive_disable 1
spa_load_verify_data 1
zfs_zevent_cols 80
zfs_dirty_data_max_percent 10
zfs_sync_pass_dont_compress 5
l2arc_write_max 8388608
zfs_vdev_scrub_max_active 2
zfs_vdev_sync_write_min_active 12
zvol_prefetch_bytes 131072
metaslab_aliquot 524288
zfs_no_scrub_prefetch 0
zfs_arc_shrink_shift 0
zfetch_block_cap 256
zfs_txg_history 0
zfs_delay_scale 500000
zfs_vdev_async_write_active_min_dirty_percent 30
metaslab_debug_unload 0
zfs_read_history 0
zvol_max_discard_blocks 16384
zfs_recover 0
l2arc_headroom 2
zfs_deadman_synctime_ms 1000000
zfs_scan_idle 50
zfs_free_min_time_ms 1000
zfs_dirty_data_max 10131410944
zfs_vdev_async_read_min_active 12
zfs_mg_noalloc_threshold 0
zfs_dedup_prefetch 0
zfs_vdev_max_active 1000
l2arc_write_boost 8388608
zfs_resilver_min_time_ms 3000
zfs_vdev_async_write_max_active 12
zil_slog_limit 1048576
zfs_prefetch_disable 1
zfs_resilver_delay 2
metaslab_lba_weighting_enabled 1
zfs_mg_fragmentation_threshold 85
l2arc_feed_again 1
zfs_zevent_console 0
zfs_immediate_write_sz 32768
zfs_dbgmsg_maxsize 4194304
zfs_free_leak_on_eio 0
zfs_deadman_enabled 1
metaslab_bias_enabled 1
zfs_arc_p_dampener_disable 1
zfs_object_mutex_size 64
zfs_metaslab_fragmentation_threshold 70
zfs_no_scrub_io 0
metaslabs_per_vdev 200
zfs_dbuf_state_index 0
zfs_vdev_sync_read_min_active 12
metaslab_fragmentation_factor_enabled 1
zvol_inhibit_dev 0
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_cache_size 0
zfs_vdev_mirror_switch_us 10000
zfs_dirty_data_sync 67108864
spa_config_path /etc/zfs/zpool.cache
zfs_dirty_data_max_max 25328527360
zfs_arc_lotsfree_percent 10
zfs_zevent_len_max 384
zfs_scan_min_time_ms 1000
zfs_arc_sys_free 0
zfs_arc_meta_strategy 1
zfs_vdev_cache_bshift 16
zfs_arc_meta_adjust_restarts 4096
zfs_max_recordsize 1048576
zfs_vdev_scrub_min_active 1
zfs_vdev_read_gap_limit 32768
zfs_arc_meta_limit 0
zfs_vdev_sync_write_max_active 12
l2arc_norw 0
zfs_arc_meta_prune 10000
metaslab_preload_enabled 1
l2arc_nocompress 0
zvol_major 230
zfs_vdev_aggregation_limit 131072
zfs_flags 0
spa_asize_inflation 24
zfs_admin_snapshot 0
l2arc_feed_secs 1
zio_taskq_batch_pct 75
zfs_sync_pass_deferred_free 2
zfs_disable_dup_eviction 0
zfs_arc_grow_retry 0
zfs_read_history_hits 0
zfs_vdev_async_write_min_active 12
zfs_vdev_async_read_max_active 12
zfs_scrub_delay 4
zfs_delay_min_dirty_percent 60
zfs_free_max_blocks 10000
zfs_vdev_cache_max 16384
zio_delay_max 30000
zfs_top_maxinflight 32
spa_slop_shift 5
zfs_vdev_write_gap_limit 4096
spa_load_verify_metadata 1
spa_load_verify_maxinflight 10000
l2arc_noprefetch 1
zfs_vdev_scheduler noop
zfs_expire_snapshot 300
zfs_sync_pass_rewrite 2
zil_replay_disable 0
zfs_nocacheflush 0
zfs_arc_max 0
zfs_arc_min 0
zfs_read_chunk_size 1048576
zfs_txg_timeout 5
zfs_pd_bytes_max 52428800
l2arc_headroom_boost 200
zfs_send_corrupt_data 0
l2arc_feed_min_ms 200
zfs_arc_meta_min 8589934592
zfs_arc_average_blocksize 8192
zfetch_array_rd_sz 1048576
zfs_autoimport_disable 1
zfs_arc_p_min_shift 0
zio_requeue_io_start_cut_in_line 1
zfs_vdev_sync_read_max_active 12
zfs_mdcomp_disable 0
zfs_arc_num_sublists_per_state 8

@BloodBlight
Author

Alright, capturing data every 60 seconds, let's see...
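(For anyone wanting to reproduce this kind of capture, a minimal sketch of a once-a-minute collection loop; the output directory is illustrative:)

  mkdir -p /var/log/zfs-capture
  while true; do
      ts=$(date +%Y%m%d-%H%M%S)
      cp /proc/spl/kstat/zfs/arcstats /var/log/zfs-capture/arcstats.$ts
      grep zio_cache /proc/slabinfo >> /var/log/zfs-capture/zio_cache.log
      zpool iostat MediaTank        >> /var/log/zfs-capture/zpool_iostat.log
      sleep 60
  done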

[screenshot from 2017-11-04 14-43-01]

@BloodBlight
Author

Crashed just after 3:04 AM:
[screenshot from 2017-11-05 10-48-42]
All numbers are in GBs.

Not sure that this really exposes anything though... :(

@dweeezil
Contributor

dweeezil commented Nov 6, 2017

One of the clues here is the 12537657 objects in zio_cache. A 1TiB file with a 128KiB recordsize will have 8.3M blocks, each of which needs to be freed. When dedup is enabled, ZFS launches a furious storm of ZIO free operations to release the space. However, when a blkptr has been deduped, a ZAP lookup must first be performed. It's a common misconception that simply having "enough RAM" to hold "the" dedup table will allow some form of reasonable performance. Unfortunately, a ZAP lookup is a very expensive process which involves lots of locking, etc., and when there are many to be performed simultaneously (one needs to be done for each blkptr freed), there's a crazy amount of lock contention. Of course, all those zios need to be stored somewhere, and that's in the zio_cache. I ran a few rudimentary tests on a system with 40 cores and 80 threads, and deleting a 100GiB file frequently caused spikes of 80K+ entries in zio_cache, but that system had a whole lot of horsepower to perform those ZAP lookups.
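(To put rough numbers on that, a back-of-the-envelope sketch using the ~1.09K zio_cache object size from the slabtop output above:)

  # blocks in a 1 TiB file at 128 KiB recordsize
  echo $(( 1024**4 / (128*1024) ))              # 8388608, i.e. ~8.4M frees
  # memory held in zio_cache if all of those frees are pending at once
  echo $(( 8388608 * 1116 / 1024 / 1024 )) MB   # roughly 8.9 GB for a single 1 TiB file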

I suggest that anyone contemplating dedup view Matt's talk at https://www.youtube.com/watch?v=PYxFDBgxFS8 (slides at http://open-zfs.org/w/images/8/8d/ZFS_dedup.pdf). One of the key points is that "on-disk hash tables suck". The ZAPs, which are the structures in which they're stored, are a glorified hash table, but much more complex.

Here's a flame graph that demonstrates how much time is spent spinning for locking when deleting a meager 100GiB (approx) deduped file: https://gist.github.com/dweeezil/d663bb371cb927a4f3ccc4d124effbd3

I'd also like to reference #6823 and also any other issue which involves deleting large files with dedup enabled.

Without turning this into a full-fledged wiki page, here are a few suggestions for anyone wanting to deploy dedup: First off, giant files can be a real problem. If your application involves huge files, consider whether dedup will buy anything at all. Large blocksizes can help because they'll reduce the number of entries in the dedup ZAPs for large files. Dedup can be very useful in certain circumstances, but it requires a great deal of understanding to determine what those actually are. Finally (almost), better dedup will probably happen (see Matt's talk).

It does seem that something could be done to help the delete situation with the current dedup implementation. A new delete throttle to prevent too many zios from piling up seems like it would help quite a bit.

@BloodBlight
Author

BloodBlight commented Nov 11, 2017

Sorry for the long delay. :( Thanks for the info; I have tried to educate myself as best as possible and know of the extreme memory requirements to keep things moving even slowly, and that not having enough can drag your system to a crawl. Been there, done that: 50KBps of IO to a 9-spindle VDEV (was intentionally trying to break it).

Just to be clear, I am not seeing extremely heavy load (everything is golden until the crash). This is hard locking the box, as in there is no disk activity (triple checking that now), you cannot ssh into it, and the kernel goes through and kills all threads (effectively panicking the box).

It is my understanding this is not expected behavior.

@BloodBlight
Author

Confirmed, zero disk activity.

@BloodBlight
Author

Looks like I see a small amount of write IO just before the lock:

root@MediaServer:~# cat zpool.stat
[2017-11-11 20:44:13] capacity operations bandwidth
[2017-11-11 20:44:13] pool alloc free read write read write
[2017-11-11 20:44:13] ---------- ----- ----- ----- ----- ----- -----
[2017-11-11 20:44:13] MediaTank 7.71T 1.37T 218 70 1.07M 157K
[2017-11-11 20:45:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-11 20:46:13] MediaTank 7.71T 1.37T 61 0 133K 0
[2017-11-11 20:47:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-11 20:48:13] MediaTank 7.71T 1.37T 62 0 134K 0
[2017-11-11 20:49:13] MediaTank 7.71T 1.37T 62 0 132K 0
[2017-11-11 20:50:13] MediaTank 7.71T 1.37T 61 0 132K 0
--- all zeros ---
[2017-11-12 04:18:13] MediaTank 7.71T 1.37T 61 0 130K 0
[2017-11-12 04:19:13] MediaTank 7.71T 1.37T 61 0 131K 0
[2017-11-12 04:20:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:21:13] MediaTank 7.71T 1.37T 61 0 132K 0
[2017-11-12 04:22:13] MediaTank 7.71T 1.37T 62 0 132K 0
[2017-11-12 04:23:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:24:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:25:13] MediaTank 7.71T 1.37T 61 0 131K 0
[2017-11-12 04:26:13] MediaTank 7.71T 1.37T 16 27.4K 36.6K 38.0M

@BloodBlight
Author

Makes sense, but I was under the impression that @behlendorf added a fix in #5449 and #5706 in 0.7.0-rc4 for this scenario.

if (zfs_per_txg_dirty_frees_percent <= 100)

But that is just my interpretation, I may be reading it wrong. :(
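(The throttle and its current setting are easy to check on a running system; a minimal sketch:)

  # prints the current percentage; 0 disables the throttle entirely
  cat /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent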

@BloodBlight
Author

Alright, some interesting developments…

I was ready to give up and just move on. Before I could get some additional memory for the server I wanted to get some files off (as it would be a while).

To do so, I blocked the mount, imported the pool, and marked it as read only.

I then unblocked the mount and ran the mount command. Interestingly, it started doing HUGE amounts of writes! I am not sure exactly how much, but over the course of an hour I would guess somewhere around a GB worth of writes… After an hour, the file system mounted…

Alright…. Wasn’t what I would have expected…

So, I said WTH, set the pool to R/W and waited. Zero IO… Hmmm… K… So I rebooted the box to clear any cache.

I have auto import disabled, so once the system was back, I ran the import command. It took a few minutes, but the system mounted! But… it started doing the IO churn again in the background; however, this time space is slowly being freed and I am hovering around 30% memory usage after several hours with about 600GBs of the 1TB delete freed.

There is SOMETHING up with the delete logic...

@BloodBlight
Author

Maybe it cleared because I upgraded but was in a bugged state until then??? IDK.

@nwf
Contributor

nwf commented Nov 19, 2017

Once upon a time, I crudely forced the system's hand by forcing a txg commit periodically during large deletes: nwf@66dbeba ; the official fix 194202e uses a percentage-based threshold rather than just counting, but amounts to something similar. Clamping down on the number of frees in a txg for dedup'd data means decreasing the number of DDT lookups and mutations that must be done in sync phase, which helped me survive exactly the situation described here.

I'd have been curious to know how things would have evolved if you'd set zfs_per_txg_dirty_frees_percent to "absurdly small" (e.g., 1) and clamped down on zfs_dirty_data_max (down from its current 10G).

(It'd be better, of course, to have a DDT-aware estimate of the amount of work to be done per free operation, which I don't believe is present.)
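(Both of those are module parameters that can typically be changed at runtime, so no rebuild is needed; a minimal sketch using the values suggested above:)

  # clamp frees to 1% of dirty data per txg and shrink the dirty data cap to 1 GiB
  echo 1          > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent
  echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max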

@BloodBlight
Author

I see where to set zfs_per_txg_dirty_frees_percent, but not zfs_dirty_data_max; I will keep looking and re-compile if/when I find it. I will start with just zfs_per_txg_dirty_frees_percent for now and let you know (compiling now).

More interesting notes.

It eventually hard locked last night.

If I block the mount point and mark the pool RO, then mount (now taking just a few seconds) and then mark it R/W, everything works perfectly! I wrote a small file to the volume, exported the pool, and re-imported using this method. The file is there…

If I import the volume WITHOUT blocking the mount while the volume is in R/W mode, it goes right into the “loop”, and eventually locks the box. But it IS mounting now (before, it was blocked).

If I put the volume into RO before exporting, it also re-imports immediately without issue.

If I attempt to put the volume into RO mode while it is in this state, the command hangs and never completes (appears to at least, not going to wait until it locks).

@nwf
Contributor

nwf commented Nov 20, 2017

You shouldn't need to recompile; these are tunable at module insertion time (and perhaps more generally during system execution).

@BloodBlight
Author

OH!

@BloodBlight
Author

Changed, rebooting, will watch...

/etc/modprobe.d/zfs.conf

#options zfs zfs_arc_meta_max=51539607552
options zfs zfs_arc_meta_min=8589934592
#options zfs arc_meta_max=51539607554
options zfs arc_meta_min=8589934594

#debug
options zfs zfs_arc_meta_max=8589934592
options zfs arc_meta_max=8589934592
options zfs zfs_prefetch_disable=1
options zfs zfs_dedup_prefetch=0
options zfs zfs_free_max_blocks=100
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_per_txg_dirty_frees_percent=1
options zfs zfs_dirty_data_max=1073741824
options zfs zfs_vdev_async_read_max_active=2
options zfs zfs_vdev_async_read_min_active=1
options zfs zfs_vdev_async_write_max_active=2
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_sync_read_max_active=2
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev_sync_write_max_active=2
options zfs zfs_vdev_sync_write_min_active=1
options zfs zfs_dirty_data_max_percent=10

@BloodBlight
Author

We are back to the not mounting behavior. :/

So, I let it run for a while (maybe an hour) and zpool iostat reported zero writes.

Should I let it continue? Anything I can be watching?

@nwf
Contributor

nwf commented Nov 20, 2017

When I tripped over this problem myself, I saw lots of random reads (for bits of the DDT) saturating the disk's IOPS. Writes will be "bursty" as the iteration through the DDT happens (in sync phase, I think) and are then all flushed to disk at once. The goal of tamping down on dirty_frees_percent is to let each of these syncs have smaller working sets.

It's entirely possible, on a machine with gobs of RAM, that even 1% is too much, but I don't know. For my hacked patch I think I let through very few deletes per TXG, as proof of concept.

@BloodBlight
Author

I let it run, and I got zero writes until just before it hard locked and then had a sudden burst of writes.

In fact, once it goes into this state, I haven't found any combination that generates any writes until the very last second.

@nwf
Contributor

nwf commented Nov 21, 2017

That matches my experience: the deletes translate into changes to the DDT, which, being a hash table, is accessed randomly, resulting in a very seek-y, read-only workload while changes accumulate in RAM and then are all flushed out at once when the transaction commits. This will happen repeatedly for every transaction group that commits. You should be able to see this effect, too, by tracking the anon_size in /proc/spl/kstat/zfs/arcstats (I think).
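(A minimal sketch of that tracking, using the field names from the arcstats dump earlier in the thread:)

  # refresh every 10 seconds and highlight changes
  watch -d -n 10 "grep -E '^(anon_size|size|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats"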

When you say "hard locked and then", I am confused; do you just mean that it experienced a (brief?) pause (of I/O?) before beginning a large series of writes? Usually one uses "locked" in this context to mean a more permanent state of affairs, a la dead- or live-locked.

If the stalls are still too much for you, further tamping down on the amount of dirty data permitted per transaction may be useful, as might changing the denominator of dirty_frees_percent from 100 to 1000 or larger, to really limit the number of DDT mutations that can arise from deletions. I have found that, even ignoring the DDT issues, ZoL (at least as of the 0.6.5 series) is wildly optimistic about how many IOPS my disks can actually achieve, and so tends to be a little stop-and-go when things start to saturate.

@BloodBlight
Author

Again, apologies for the extended delay. It's been a hard month...

Here is what happens as best as I can tell (when importing in R/W mode):

  1. Prolonged reads as it figures out what it needs to do (12-36 hours with zero writes); the file system is not available during this time.
  2. The system runs out of memory and starts killing processes.
  3. At this point the system becomes non-responsive, but some tasks are still working (such as Splunk).
  4. Part of ZFS gets killed (*assumption) and now another part is free to write and I see a short burst of writes.
  5. The system completely stops responding, all disks go silent, all threads stop.

From stage 2 to stage 5 is less than 5 minutes, probably closer to 60 seconds.

Some more information on what I am seeing right now.

If I bring the pool in as R/O and switch it to R/W, I can perform deletes. These deletes trigger the normal delete process without issue. I have observed, with the changes made so far, that it will consume upwards of 20GB of RAM purging a small batch of “normal”-sized files. This memory is returned abruptly once completed.

If I stop and let everything quiet down, set the pool to R/O, export and re-import, I am keeping freed space. So, deletes are working!

So I am guessing there is a "stuck" transaction group that wants to roll out when I import in R/W mode. Is there any way to break up this pending commit into smaller chunks, or should it be doing this already? Or perhaps trigger it manually after bringing it online? Does the logic to process these commits differ between startup and normal runtime?

I am confused about zfs_per_txg_dirty_frees_percent: should I be setting it low or high? Right now I have it set very low:

options zfs zfs_per_txg_dirty_frees_percent=1
options zfs zfs_dirty_data_max=1073741824 #1GB

@gmelikov
Member

gmelikov commented Dec 4, 2017

zfs_per_txg_dirty_frees_percent - a value of zero will disable this throttle. 1% is the worst thing if you have starvation on delete; try 0.

Larger zfs_dirty_data_max would give benefits too (by default it's 10 percent of all memory, capped at zfs_dirty_data_max_max).
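(As a sanity check against the tunables dump earlier in this thread, a quick sketch of what those defaults work out to on this box:)

  # zfs_dirty_data_max from the tunables dump, converted to GiB (~10% of the 96GB of RAM)
  echo $(( 10131410944 / 1024**3 )) GiB   # prints 9 GiB
  # zfs_dirty_data_max_max (the 25% cap) for comparison
  echo $(( 25328527360 / 1024**3 )) GiB   # prints 23 GiB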

@nwf
Contributor

nwf commented Dec 4, 2017

@gmelikov The desire to minimize dirty_frees_percent, and dirty data in general, is an attempt to force a DDT-modification-heavy workload across as many txgs as possible, so that the amount of DDT paged back in, and dirtied, per txg is small. The DDT mutations are not properly accounted during frees, and so will act as a large amplification factor between ZFS's estimate of disk traffic and the actual traffic. Right now, the system is running out of memory attempting to perform a transaction, and I think the culprit is all the DDT mutations; it has been, at least, in my experience with such things. In light of that aim, do you still think that maximizing dirty_data_max and frees_percent is the correct action?

@BloodBlight It sounds like import might be attempting to replay the ZIL all at once; maybe dirty_frees_percent has no effect on ZIL replay, which may be why you OOM on import. I don't know if it's possible to force ZIL replay across several txgs.

@BloodBlight
Author

I don't have a ZIL, but I still assume there is a replay log of some sort happening here (the pending delete). Would that still be called a ZIL (for my future reference)?

Filesystem      Size  Used Avail Use% Mounted on
MediaTank        12T  9.0T  2.2T  82% /mnt
##################################
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
MediaTank  9.97T  7.51T  2.46T         -    70%    75%  1.21x  ONLINE  -

I am not 100% sure how to interpret the differences in the sizes here (it probably doesn't help that I extended the volume during this), but I would assume there is somewhere between 260GB and 1.8TB of data to process.

Alright, that is what I was fearing. From one stance it makes sense to process deletes at mount time, but processing deletes before mounting the volume also seems problematic. I would understand if it was required to replay everything linearly for consistency, but because I can still mount and write to the volume, I can be fairly sure that isn't the case. Seems like this could be improved. At the very least, if it is required, it should be done in the most memory-efficient way possible (even if extremely slow), with some way to monitor progress.

This box is now slated for being re-purposed and I plan to migrate the data off. But I still have some time with it (a week or so maybe). Should we continue, or just call it?

@nwf
Contributor

nwf commented Dec 5, 2017

What I think you don't have is a "separate log" device. The ZIL is intrinsic to the operation of ZFS.

You might see if "zpool get freeing" says something and, notably, if it's decreasing in all the attempts at import.
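(A minimal sketch of checking that, with the pool name from earlier in the thread:)

  # 'freeing' reports how many bytes are still queued to be freed asynchronously
  zpool get freeing MediaTank
  # or watch it (hopefully) decrease over time
  watch -n 60 zpool get freeing MediaTank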

As a last-ditch effort, you might try cherry-picking nwf@66dbeba. I cannot guarantee that it lands cleanly these days, but the changes are pretty straightforward if not. The goal there is to really restrict how many deletions can be done in a txg: set zfs_dmu_free_long_range_yield to a tiny value (like 1000?) before importing and see if that helps?

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "Status: Stale" label Aug 25, 2020
@BloodBlight
Author

Looks like I never responded. Not sure if this should die or not. To the best of my knowledge, this is still an issue.

@stale stale bot removed the "Status: Stale" label Aug 30, 2020
@stale

stale bot commented Aug 31, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "Status: Stale" label Aug 31, 2021
@behlendorf behlendorf added the "Bot: Not Stale" and "Component: Memory Management" labels and removed the "Status: Stale" label Sep 2, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Nov 15, 2024
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.

Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times on a single
transaction (eg a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T
object that requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.

This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.

For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722 
Closes openzfs#16697