
Don't use pure managed memory for the target threshold #7421

Open

crusaderky opened this issue Dec 19, 2022 · 0 comments · Fixed by dask/zict#101

crusaderky commented Dec 19, 2022

Current design

  1. Every time a new key is inserted in Worker.data, if the managed memory (the output of sizeof()) exceeds the target threshold, keys are spilled from the bottom of the LRU cache until the managed memory goes back below target (see the sketch after this list).
    This is a synchronous process that does not release the event loop. That isn't great, but it is bounded: it will never spill more bytes than the size of the key that has just been inserted.

  2. Every 100ms (distributed.worker.memory.monitor-interval), measure the process memory through psutil. If the process memory exceeds the spill threshold, spill keys until the process memory goes below the target threshold (a hysteresis cycle). Along the way, re-measure the process memory, invoke garbage collection, and release the event loop multiple times; this can potentially take many seconds.
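
To make point 1 concrete, here is a minimal, self-contained sketch of the insert-time eviction. It is illustrative only and does not mirror the real SpillBuffer/zict internals; the class and attribute names are made up:

```python
from collections import OrderedDict

class NaiveSpillBuffer:
    """Illustrative stand-in for Worker.data: spill on insert, by managed memory only."""

    def __init__(self, target_bytes, sizeof):
        self.target_bytes = target_bytes  # memory_limit * distributed.worker.memory.target
        self.sizeof = sizeof              # same role as dask.sizeof.sizeof
        self.fast = OrderedDict()         # in-memory LRU, oldest key first
        self.slow = {}                    # stand-in for the on-disk store
        self.managed = 0                  # sum of sizeof() over fast

    def __setitem__(self, key, value):
        self.fast[key] = value
        self.managed += self.sizeof(value)
        # Synchronous eviction that never releases the event loop:
        # spill from the bottom of the LRU until managed memory is below target.
        # Note that only managed (sizeof) memory is considered here.
        while self.managed > self.target_bytes and self.fast:
            old_key, old_value = self.fast.popitem(last=False)
            self.managed -= self.sizeof(old_value)
            self.slow[old_key] = old_value  # in reality: serialize and write to disk
```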

The intent of this design is to have a very responsive, cheap, but inaccurate first threshold and a slow-to-notice, expensive, but accurate second one. The design, however, is problematic in two cases:

  1. When unmanaged memory (process minus managed) is very high, e.g. due to a leak, a large heap from the running user functions, or an underestimating sizeof(). In the extreme case of a memory leak, you will reach the spill threshold without ever having hit the target threshold, and then spill the whole contents of Worker.data all at once.
  2. When unmanaged memory is negative, due to an overestimating sizeof(). This causes target to start spilling too soon, while there is still plenty of memory available. (A worked example follows this list.)
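
For example, take a worker with memory_limit=16 GiB and the default thresholds of target=0.6 (9.6 GiB) and spill=0.7 (11.2 GiB). If 10 GiB of the process memory is an unmanaged leak, the spill threshold is crossed when managed memory is only about 1.2 GiB, long before managed memory alone ever reaches the 9.6 GiB target, and the monitor then spills essentially the whole contents of Worker.data in one go. Conversely, if sizeof() overestimates by 2x, managed memory nominally reaches the 9.6 GiB target when real heap usage is only about 4.8 GiB, so spilling starts while most of the 16 GiB is still free.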

Proposed design

In zict:

  • Add an offset property to zict.LRU. This property is added to the LRU's total weight for the purpose of deciding when to evict (a sketch of the intended semantics follows).
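
A minimal sketch of the intended semantics, not zict's actual implementation; the class below, its internals, and the evict_until_below_target() name (reused in the monitor sketch further down) are illustrative:

```python
from collections import OrderedDict

class OffsetLRU:
    """LRU mapping that evicts while total_weight + offset exceeds the capacity n."""

    def __init__(self, n, d, weight):
        self.n = n                 # capacity, in the same unit returned by weight()
        self.d = d                 # slow mapping that evicted values go to
        self.weight = weight
        self.data = OrderedDict()  # fast mapping, oldest key first
        self.weights = {}
        self.total_weight = 0
        self.offset = 0            # externally-set estimate of unmanaged memory

    def evict_until_below_target(self):
        # The offset counts towards the eviction decision, so a large
        # unmanaged-memory estimate triggers eviction earlier.
        while self.total_weight + self.offset > self.n and self.data:
            key, value = self.data.popitem(last=False)
            self.total_weight -= self.weights.pop(key)
            self.d[key] = value

    def __setitem__(self, key, value):
        if key in self.data:  # replacing an existing key
            self.total_weight -= self.weights.pop(key)
            del self.data[key]
        self.data[key] = value
        w = self.weight(key, value)
        self.weights[key] = w
        self.total_weight += w
        self.evict_until_below_target()

    def __getitem__(self, key):
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]
```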

In distributed.worker_memory:

  • Every 100ms, measure process memory and calculate unmanaged memory.
  • If process memory is above the spill threshold and there is data in Worker.fast, garbage collect and re-measure it.
  • Update Worker.data.fast.offset to the amount of unmanaged memory.
  • Manually trigger spilling in zict (see the sketch after this list).
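
Roughly, one monitor iteration under the proposed design could look like the sketch below; the function, the evict_until_below_target() hook, and the way the worker exposes its thresholds are all illustrative, not the actual distributed.worker_memory API:

```python
import gc
import psutil

def memory_monitor_step(fast, spill_threshold_bytes):
    """One (hypothetical) 100ms monitor iteration.

    `fast` is assumed to be an LRU with the proposed `offset` attribute,
    e.g. the OffsetLRU sketch above.
    """
    process = psutil.Process().memory_info().rss
    managed = fast.total_weight
    unmanaged = process - managed  # may be negative if sizeof() overestimates

    if process > spill_threshold_bytes and len(fast.data):
        # Part of the unmanaged memory may just be garbage that has not been
        # collected yet, so collect and re-measure before acting on it.
        gc.collect()
        process = psutil.Process().memory_info().rss
        unmanaged = process - managed

    # Feed the unmanaged-memory estimate into the LRU, so that both this manual
    # trigger and every future insertion evict against managed + unmanaged
    # memory rather than managed memory alone. A negative offset (overestimating
    # sizeof) correspondingly delays spilling.
    fast.offset = unmanaged
    fast.evict_until_below_target()
```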

In distributed.worker_state_machine._transition_to_memory, distributed.Worker.execute, and distributed.Worker.get_data: no change is needed, but the offset is now taken into account every time a key is inserted into fast.

Notes

  • This could cause zict to synchronously spill many GiBs at once, without ever releasing the event loop. This change should be paired with Asynchronous Disk Access in Workers #4424.
  • With the current thresholds left unchanged, spilling will start a lot earlier; effectively, target becomes the new spill. I think it's safe to bump both by 0.1, making spill the same as pause (see the config sketch below).
  • We should rename "spill" to "aggressive_gc" to clarify its new meaning.
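
For instance, bumping both thresholds by 0.1 while leaving pause and terminate alone could look like this (a sketch assuming the existing distributed.worker.memory.* configuration keys and their 0.6/0.7 defaults):

```python
import dask

# Under the proposed design, target effectively takes over the role
# that spill plays today, so both thresholds are raised by 0.1.
dask.config.set({
    "distributed.worker.memory.target": 0.7,  # default: 0.6
    "distributed.worker.memory.spill": 0.8,   # default: 0.7 (now equal to pause's default)
})
```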