
ARC prefetch usage accounting and larger lifespan until usage in limit. #11980

Open
jsai20 opened this issue Apr 30, 2021 · 0 comments
Labels
Type: Feature Feature request or new feature

Comments


jsai20 commented Apr 30, 2021

Describe the feature you would like to see added to OpenZFS

Problem Statement:
Many ZFS workflows prefetch data/metadata blocks into the ARC that are expected to be needed soon by that particular workflow, such as block traversal, metaslab preload, and primary/live workloads like readdir.
When such workflows prefetch in parallel via multiple streams and ARC consumption is at its maximum (arc_size >= arc_c_max), accommodating the prefetched data/metadata in the ARC requires evicting existing ARC buffers.
Because multiple parallel streams prefetch data/metadata, buffers prefetched by one stream/workflow can be evicted before their demand read is done, to make ARC space for data/metadata blocks prefetched by another workflow stream. This is possible whenever a prefetched block from one workflow stream is not accessed within zfs_arc_min_prefetch_ms.

So, as a quick workaround to this problem, zfs_arc_min_prefetch_ms can be increased to a larger value so that prefetched data remains cached until it is consumed (demand read). But zfs_arc_min_prefetch_ms applies to indirect blocks as well as prefetched blocks, so increasing its value could have side effects on cached indirect blocks. Also, increasing zfs_arc_min_prefetch_ms without any limit could overflow the ARC with prefetched blocks and cause other side effects. So we need a new parameter that defines a lifespan only for prefetched blocks, plus mechanisms to control/limit the maximum ARC consumption by prefetched blocks.

Higher level design:
When a prefetch read is issued for ARC buffers, the ARC_FLAG_PREFETCH flag is set in arc_buf_hdr->b_flags. The flag is cleared when the buffer is accessed by a non-prefetch (demand) read. So, when setting ARC_FLAG_PREFETCH, account the corresponding ARC buffer's usage in a new ARC stat counter, say arc_prefetch_size, and decrease that counter when the flag is cleared (demand read done).

Account the estimated prefetch target from different workflows like block traversal, metaslab preload, and live workloads; call it arc_prefetch_evictskip_target.
It may be hard to estimate the target from all live/primary workflow streams exactly, so we may simply set a rough consolidated estimate to accommodate all live workflows, say ZFS_LIVE_PREFETCH_TARGET.
So, basically, account the roughly estimated prefetch targets from the different workflows in arc_prefetch_evictskip_target.

To keep the prefetch target from growing beyond a certain point, define a maximum limit, say arc_prefetch_evictskip_limit.
When the system is under memory pressure, the ARC is shrunk aggressively and arc_c (the ARC target) is reduced toward its minimum (arc_c_min). It therefore seems better to define arc_prefetch_evictskip_limit as a function of arc_c, so that it shrinks along with arc_c when needed. E.g.: define arc_prefetch_evictskip_limit as 10% of arc_c.

Now, define a somewhat larger lifespan for prefetch buffers; call it arc_min_prefetch_inlimit_ms. Set it to a higher value, probably in the range of 1 to 5 minutes, whatever works well for the primary/secondary workload characteristics of the system.

Now, during ARC buffer eviction, if the buffer being evicted has ARC_FLAG_PREFETCH set, arc_prefetch_size is less than or equal to MIN(arc_prefetch_evictskip_target, arc_prefetch_evictskip_limit), and the selected buffer's lifespan is less than arc_min_prefetch_inlimit_ms, then skip evicting that buffer, except in specific scenarios where eviction cannot be skipped. For example, when arc_no_grow is set: that indicates the system is under memory pressure and the ARC must shrink, so eviction cannot be skipped. Similarly, when eviction happens in arc_flush() context, because a particular spa or the module is being unloaded, all corresponding ARC buffers must be evicted and eviction cannot be skipped.

If, for any reason, the workflow stream that prefetched a block terminates halfway, before its demand reads are done, such prefetched buffers would stay marked as prefetched in the ARC. That is OK: these are corner cases, and those buffers become eligible for eviction anyway once arc_min_prefetch_inlimit_ms has elapsed.
When a prefetched buffer is evicted from the ARC, reduce arc_prefetch_size appropriately when the ABD (arc buf data) attached to the buffer header is freed. And when an ARC read is done for the corresponding block, increase arc_prefetch_size again during ABD allocation for that buffer header. This ensures arc_prefetch_size accounting remains consistent across a prefetched buffer being evicted (moved to a ghost state) and brought back (moved back to a normal state).

With this prefetch usage accounting, and by skipping eviction of prefetch buffers while prefetch usage is within the evict-skip limit (absent the exceptions explained above), the effective ARC hit rate for demand reads would increase.

How will this feature improve OpenZFS?

It optimizes the core ARC code around prefetch buffer usage accounting. This helps define a prefetch usage target and limit, and helps avoid evicting prefetch buffers when not necessary (absent the exceptions explained in the design). This effectively increases the ARC buffer hit rate for demand reads done in the context of a workflow stream.

Additional context

Scope of code changes:

arc_init(void); --- Initializing Usage/Target/Limit Variables.
arc_fini(void); --- De-initializing same.

arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
arc_read_done_func_t *done, void *private, zio_priority_t priority,
int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb); -- Around ARC_FLAG_PREFETCH set.
arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock); -- Around ARC_FLAG_PREFETCH reset.

arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); -- arc_prefetch_size increment;
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); -- arc_prefetch_size decrement;
arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags); --- arc_prefetch_size increment;
arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata); -- arc_prefetch_size decrement;
arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock); -- Changes to skip prefetch buffer eviction;

New arc_stats {} fields for accounting prefetch size and the prefetch evict-skip target/limit.
ARC stat counters showing prefetch buffers whose eviction was skipped, prefetch buffers evicted, and so on.

New Module parameters.

@jsai20 jsai20 added the Type: Feature Feature request or new feature label Apr 30, 2021
jsai20 added a commit to jsai20/zfs that referenced this issue Jun 11, 2021
Account ARC usage for prefetched buffers (arcstat_prefetch_size)
and allow prefetched buffers to live for a larger lifespan
(zfs_arc_min_prefetch_inlimit_lifespan) if usage is within the target/limit
(MIN(zfs_arc_prefetch_target, zfs_arc_prefetch_evictskip_limit)), except
for cases where eviction can't be skipped, such as when arc_no_grow is set
because the system is under memory pressure, or in arc_flush() context,
where eviction can't be skipped.

Signed-off-by: Jitendra Patidar <[email protected]>
Closes openzfs#11980