Skip to content
Peter Kehl edited this page Mar 20, 2023 · 28 revisions

Cooperative (de/re)allocation

Summary

  • Primary goal: Improve (CPU) cache usage on de-allocation and resize.
  • Secondary goal (but less likely): Decrease CPU cycles.
  • Benefits not confirmed yet - to be benchmarked.
  • Backwards compatible (both source code & binary if using defaults).
  • For any application using heap (whether std or no_std).
  • Controversial:
    • Wide changes to Rust's core and alloc (and hence std, if used in std applications).
    • Beneficial only with allocators that support it (but fallbacks to default for any classic allocators).
    • Hooking into/exposing the allocator's internals, and depending on the language & libraries to ensure data consistency. Hence, Rust is the ideal language for this. (Probably possible also in languages/methodologies with RAII, with support for custom allocators & good const generics...).

Disambiguation

Not related/specific to Rust co-routines (async/await), neither to any cooperative multithreading/multitasking/scheduling.

Goal

Leverage Rust memory & data safety guarantees to streamline de-allocation (and resizing of data on heap).

The problem/waste being solved

When de-allocating, all that Rust passes through the (standard) allocator API is a pointer (to the data being de-allocated). The allocator has to look it up for its housekeeping in its internal data structures (often hash-based).

How (in short)

Rust could keep some extra metadata along every smart pointer. (This metadata would come from the allocator's new functions co_allocate, co_allocate_zeroed, co_grow, co_grow_zeroed, co_shrink.) Then Rust could pass that metadata back (to the allocator) when de-allocating (or resizing). That hopefully saves the allocator (listing the highest benefit first):

  • data cache access (looking up auxiliary/hashmap-like record(s) for the given pointer, or its hash, and/or the allocated data length...), and
  • code cache access (no need to re-calculate the hash, or its precursor), and
  • CPU cycles (calculating the hash, or its precursor, and locating the related entries) - least likely benefit (since this gain is decreased by CPU cycles needed to move the metadata around).

The size of the metadata (per smart pointer) must be known in compile time. Hence, this is supported only for the global allocator, which must define the metadata and make it (somehow) known to the compiler. That may need some new compiler configuration/directives and/or rustc target configuration.

However, this doesn't affect Rust language itself (other than fixing some rustc internal compiler errors). (Would you like this to be clearer? Get in touch & help shape this, or at least shine the light, please.)

Controversy: CPU cycles & Data size

Of course, this increases the size of (affected) smart pointers (wherever used/stored/passed). So, the initial candidates are primarily non-leaf smart pointers (Vec and other containers):

  • Passed around (up & down the stack, or stored/moved around on the heap), in total, less than leaf objects. (As data flow accumulated over time. Well designed data structures have more leaves than non-leaves for common input.)
  • Small in total size (as per the previous). Then any small metadata (a small multiple of usize, adjacent to those smart pointers) may not increase RAM requirements too much.
  • So, (as per the previous two) this metadata may not excessively affect stack depth, neither the data flow & CPU cycles (for many applications).
  • Due to the cache waiting times, handling the metadata may increase execution time even less so than it increases the number of CPU cycles (since the CPU waits for the cache anyway). Especially so when these smart pointers are stored on stack (which grows/shrinks sequentially). Therefore the runtime speed cost of handling the metadata may not be as high as it may sound.
  • Resizable (which can make use of the metadata): Vec (and hopefully HashMap/HashSet, BTreeMap/BTreeSet).
  • (De)allocated or resized frequently. Hence String, or any frequently resized Vec of primitive or static data, too (even though it's a leaf in terms of smart pointer "levels".)
  • Of course, leaf smart pointers are more frequently (de)allocated than non-leaf ones (in well designed data structures). But the overall data flow may prohibit them (from using cooperative (de)allocation - unless they are resizable).

Hence, (in general) this would be much less beneficial (or even detrimental) for Box, Rc, Arc. (Or, if anyone implements this for those smart pointers, too, we would see when we are benchmarking this.)

Fringe benefit?

Allocators can store metadata (together with the allocated data) - even without passing it around. How? They can put it "left" of the allocated data: At (consecutive) addresses just below the returned pointer. Then the allocator could read such metadata (and update it, if necessary) on deallocation/resize. This is allocator-agnostic, as the size of metadata can vary from one allocator to another.

However, such allocation, deallocation and resize has to access that metadata from that location (next to the user's allocated data). If the application is not using the allocated data near that moment (soon after allocation, or just before deallocation or resizing), this increases the cache reloads/noise.

Also, many use cases of Vec (and similar resizable containers) initiate them at multiples of (mainstream) cache line. (And even if initialized smaller, if such a Vec, or anything based on RawVec, grows, it very soon gets to multiples of cache line, since RawVec::grow_amortized(...) grows exponentially.) But, if this (adjacent) metadata size varies between allocators, the overall data size may be more difficult to optimize, so as to make the total size ("pure" + meta data) a multiple of cache line.

TODO Combine both: Passing metadata around, or storing it left of the allocated data (or neither). Have this configurable: Instead of a bool const generic, pass an enum representing this preference. Then have two where bounds:

  • one for usize of metadata passed around, and
  • one for the usize of metadata stored left of the allocated data.

Then change the minimum "pure" data size (in RawVec::grow_amortized, initialization, and similar): Subtract the size of metadata left of the allocated data (so that it all fits into a multiple of cache line).

Controversy & How: Choice, Complexity, Safety, Compatible

Choice

Whether a type is "cooperative" is highly configurable with const generics. The user (developer) provides value of a const generic bool parameter called COOP_PREFERRED. That indicates whether this type should use cooperative (de)allocation.

There can also be "weighted" versions (for Vec such version is WeVec - name can be discussed). Such types accept an integral const generic "weight" (suggested u8). Then the actual rustc target, and/or some configuration parameter (to be discussed), specify the threshold. "Weighted" types with their "weight" above the threshold operate as "cooperative"; otherwise they operate as "non-cooperative".

However, we don't have "weighted" types yet. As of early 2023, they would depend on #![feature(generic_const_exprs)], which would be enforced on any consumer crates, too. Give thumbs up to its tracking issue, please.

If the user doesn't provide a const generic parameter COOP_PREFERRED, then the type uses its default. Again, that will (hopefully) be configurable per build (or at least per target).

Rust internal complexity

Adding the new functions to the Allocator API (listed above; with defaults/fallbacks) is easy. Implementing them is up to the allocator, and out of scope here. (Do you know of any allocators written primarily in Rust - rather than in C - with developers open to adding new functionality? Please connect us.)

However, most of the work in Rust (alloc/std and core) is about storing and handling the metadata, and adding the const generic parameters (and their where bounds)for the developer's choices and defaults.

This affects a lot of Vec, VecDeque (and RawVec) in library/alloc(which is aliased tolibrary/std). Implementing this for String affectsCowandToOwned(inlibrary/core`), too.

Questions

  • Names of new methods (Vec::new_co) and types (CoVec, ... - see above)...
  • There are smart pointers (and similar/related) structs/enums/traits that currently only work with the global allocator (for example, String). When adding the COOP_PREFERRED preference flag, should we add the allocator generic parameter, too? (Then we'd need related methods that accept the allocator instance, similar to the existing Vec::new_in.) If we ever want this, it has to be now.

Application complexity

Types with different COOP_PREFERRED are not compatible.

"Downgrading" from a "cooperative" type to its "non-cooperative" version is zero cost.

However, "upgrading" is not possible. That's unless we make the "cooperative" version always check its metadata to be non-null first. (That would have a fallback to non-cooperative allocator API.)

Safety

HIGHLY safe. TODO explain here. Safe Rust code will work fine. unsafe code will work, even if it dismantles raw parts of Vec (and similar), as far as it uses the type's functions (rather than using unsafe to cast the smart pointer parts to some other types and extract parts manually).

However, TODO Fix/check MVP code where constructing Vec and similar from raw parts. Investigate if we could make it work for non-cooperative versions only. (E.g. panic! if called to construct a cooperative version.)

Compatible

Existing crates will stay compatible.

HIGHLY compatible with const generic default values. If a crate doesn't specify COOP_PREFERRED, it will work with whatever COOP_PREFERRED the consumer crate chooses. TODO explain more.

Scope & Status

  • Primarily, and initially, Vec, VecDeque and RawVec (MVP mostly complete, not tested, not benchmarked. Please help.).
  • Possibly String and related (Cow, ToOwned). String: Should it get an Allocator optional generic parameter, too?
  • Hopefully HashMap/HashSet, BTreeMap/BTreeSet)
  • Unlikely (possible, but with little benefit - as per above), and would need major changes in the compiler: Box, Rc, Arc.

Development

See Contributing.

Side Fruit

Help, please

Please,

  • upvote/help move forward the above issues (in #Side_Fruit),
  • upvote/help move forward rust-lang/rust #76560 for generic_const_exprs,
  • suggest how to benchmark this fast on x64 Linux.