Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

stuhood · 2021-10-05T04:20:39Z

Exploring profiles of larger runs showed significant amounts of time spent in intrinsics::merge_digests_request_to_digest and intrinsics::digest_to_snapshot (13% and 5% respectively of the total runtime), mostly in database access.

After briefly exploring moving to storing Trees rather than Directorys (which would be less space efficient, but more time efficient), it occurred to me that many of the Directorys that we persist will never be used in places that actually need persistence (sent to the network, included as the output (not input) of a cache entry, etc), and will instead just be manipulated in memory.

This suggests that in the medium term we could use a more efficient in-memory-only structure to represent Directorys, and only digest them and commit them to disk when we need to use them in a relevant position (while Files would of course stay digested and on disk). To get to this future, it's possible that a first incremental step would be to introduce a memory-only Store for Directorys, and introduce an explicit "persist this and get me a Digest" step. But: it will be important to avoid increasing the complexity of the code too much... it seems possible that some codepaths would become significantly simpler when implemented on a memory-only structure, and that that would justify using custom non-protobuf types.

This relates to Bazel's depsets, in that the structure underlying a DAG of in-memory Directorys could be a generic, native "nested set". That native nested set structure could also be used for usecases like #13087 and #13492.

The text was updated successfully, but these errors were encountered:

stuhood · 2021-11-05T00:07:39Z

This showed up in profiles again recently. There is no Node created for snapshot/digest merge operations (they're not memoized), so when profiling workunits, the time spent here will be blamed to whichever @rule is requesting the work. So pants.core.util_rules.source_files.determine_source_files shows up warmer in profiles due to its usage of MergeDigests.

stuhood · 2021-11-05T16:14:06Z

It also occurs to me that this is potentially related to #13087 and #13492: the Directorys that we avoid persisting would likely be similar in structure to those recursive dataclasses (although I think that we'll still want to do this on the Rust side).

stuhood · 2022-02-27T17:52:03Z

This showed up again in profiles recently, so I'm going to try tackling it this week. Design sketch below.

A new DigestTree structure will be added, with Arcs for structure sharing. It will be very similar in structure to the protobuf Directory entry, except with Arc'd references to child nodes. All operations which currently manipulate or create Digests by persisting intermediate nodes to the Store will be updated to instead operate on DigestTrees. Each DigestTree entry will eagerly compute its own Digest, but will not persist it to disk.

The Digest struct will gain an optional DigestTree reference which contains its contents. If a Digest has a DigestTree reference, then that Digest might not be persisted to the Store. If the Digest does not hold a DigestTree, then that Digest must have been persisted to the Store (either locally or remotely). The field thus acts likes a cache in some cases, but in other cases is an indication that the tree must first be persisted (or loaded) before the Digest may be operated on.

This intentionally does not change the Python API (yet): we'll continue to have separate Snapshot and Digest types, since the tree being optional allows for lazily loading it from a remote, for example. But Snapshot and DigestEntries can each begin to contain a DigestTree rather than a list of PathStats, which should improve memory usage.

cc @tdyas, @Eric-Arellano, @illicitonion, @jsirois for feedback.

stuhood · 2022-02-28T06:03:00Z

Slight adjustment to the above. We had begun to differentiate "file digests" from "directory digests" in the Python API, but had not yet begun doing that on the Rust side. Rather than modifying hashing::Digest (which is used interchangeably for files and directories) to add the optional field, I've taken the opportunity to add a DirectoryDigest type with the optional DigestTree.

stuhood · 2022-03-01T22:50:36Z

The first of probably three or four PRs is now posted as #14654. The next one or two PRs will port a few of the Digest/Directory manipulating operations to consuming a DigestTrie (MergeDigests, DigestSubset, etc).

stuhood · 2022-03-04T20:52:07Z

Ok, the stack of PRs is now:

Benchmarking on #14677 shows that the performance without IO contention is about equal to main. Given that the changes also significantly simplify all of the snapshot-manipulation implementations, I think we're ready to start landing them.

…14654) As described in #13112: we currently persist _all_ `remexec::Directory` structures to disk, although many of them are intermediate results which never actually need to be persisted across restarts (the only position which actually needs to be persisted is a cache output). That extra IO (and complexity) has performance costs: roughly 20% of some profiles. In order to skip persisting `remexec::Directory`s to disk, we need an in-memory structure to replace them (which we will persist or load from a `Store` when necessary). To that end, this change introduces a `DigestTrie` type, which is a trie representing a `Digest`ed filesystem tree. A `DigestTrie` is sufficient to replace the list of `PathStats` previously held by a `Snapshot`. But because it uses interned names for path members, and is structure shared using `Arc`s, it does so using a lot less memory. This is the first of three or four PRs which will replace all methods which recursively load/store `remexec::Directory` with operations on `DigestTrie`s instead (e.g. `MergeDigests`, `DigestSubset`, etc). As methods are ported, their boundaries will be adjusted to skip calling `store.record_digest_tree` (which persists the `DigestTree` to disk using our existing format), until only the local cache persists directory structures.

…14697) #13112 is porting more operations on `Digest`/`Snapshot` to use of `DigestTrie` in order to reduce IO by allowing them to operate entirely in memory. When we have a `DigestTrie` in memory, we are able to persist all of its recursive `remexec::Directory` structures in parallel. To that end, this change adds batch persistence to the local LMDB store, which allows us to use a single context switch and write transaction (*per LMDB shard) to persist an entire `DigestTrie`.

…4677) This change continues #13112 by porting most of `Snapshot`'s operations to use of `DigestTrie`, including: * `merge_directories` * `add_prefix` * `remove_prefix` * `contents_for_directory` * `entries_for_directory` But because many operations still assume that an input `Digest` has been persisted (with boundaries marked by `DirectoryDigest::{todo_as_digest,todo_from_digest}`), modified methods in this change will continue to persist `DigestTrie`s that they produce. #14723 will remove that persistence.

stuhood · 2022-03-09T05:27:40Z

Preliminary micro-benchmarks of pants.core.util_rules.source_files.determine_source_files with #14723 show a 56% improvement. Memory usage according to top drops by ~10%. Will work to stabilize and land #14723 tomorrow.

This change ports the most meaningful remaining methods for #13112 to use of `DigestTrie`: in particular: * `Process` inputs and outputs * `materialize_directory` A few other notable usecases remain un-ported for followup work (marked by #13112 TODOs and `todo_as_digest`/`todo_from_digest`), but this change shows a speedup of 56% for a microbenchmark of `pants.core.util_rules.source_files.determine_source_files`, and drops `top`-reported memory usage for common cases by 10%.

stuhood · 2022-03-10T22:01:30Z

Ok, #14758 will likely be the last one here for a while. The last two-ish methods (subset, ensure_remote_has_recursive) have diminishing returns for performance purposes... and subset might be a bit of a bear. Let's leave the ticket open until those can be resolved, but I'll unassign to pick up some other 2.11.x work.

Continues #13112 by porting operations which consume and produce `remexec::Tree`s to `DigestTrie`. In particular: conversions in both directions are implemented via: ```rust impl TryFrom<remexec::Tree> for DigestTrie { .. } impl From<&DigestTrie> for remexec::Tree { .. } ```

stuhood · 2022-03-23T21:24:54Z

#14889 fixes the last major case here: while there is one other usage of DirectoryDigest::todo_as_digest, it doesn't block closing this issue.

… by subsetting (#14889) #14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code. Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see #9967), and fixing the bug by unifying the codepaths kills two birds with one stone. This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see #14890. Fixes #9967, fixes #14858, fixes #12462, and fixes #13112.

… by subsetting (pantsbuild#14889) pantsbuild#14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code. Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see pantsbuild#9967), and fixing the bug by unifying the codepaths kills two birds with one stone. This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see pantsbuild#14890. Fixes pantsbuild#9967, fixes pantsbuild#14858, fixes pantsbuild#12462, and fixes pantsbuild#13112.

… by subsetting (cherrypick of #14889) (#14896) #14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code. Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see #9967), and fixing the bug by unifying the codepaths kills two birds with one stone. This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see #14890. Fixes #9967, fixes #14858, fixes #12462, and fixes #13112. [ci skip-build-wheels]

stuhood mentioned this issue Oct 5, 2021

Native support for Trees in Store #13006

Closed

stuhood added this to 2022 Performance Push Jan 5, 2022

stuhood added the estimate: >3W label Jan 5, 2022

stuhood changed the title ~~Only persist Directories when necessary~~ Move to an in-memory, native Directory structure (aka: only persist Directories when necessary) Jan 5, 2022

stuhood changed the title ~~Move to an in-memory, native Directory structure (aka: only persist Directories when necessary)~~ Move to an in-memory, native Directory structure (aka: lazily persist Directories) Jan 5, 2022

stuhood mentioned this issue Feb 22, 2022

Consider recursively merging sources/requirements for Python #14561

Open

stuhood self-assigned this Feb 28, 2022

stuhood mentioned this issue Feb 28, 2022

High total runtime for test, even with very high cache rate toolchainlabs/issues#12

Closed

stuhood mentioned this issue Mar 1, 2022

[internal] Implement Snapshot in terms of a new DigestTrie type #14654

Merged

stuhood mentioned this issue Mar 4, 2022

[internal] Store DigestTries to local storage using batched writes #14697

Merged

stuhood mentioned this issue Mar 9, 2022

[internal] Port most Snapshot operations to use of DigestTrie #14677

Merged

stuhood mentioned this issue Mar 10, 2022

[internal] Port Process operations to use of DigestTrie #14723

Merged

stuhood mentioned this issue Mar 10, 2022

[internal] Port remexec::Tree interactions to DigestTrie #14758

Merged

stuhood removed their assignment Mar 10, 2022

stuhood mentioned this issue Mar 23, 2022

Unify glob-matching implementations to fix malformed snapshot created by subsetting #14889

Merged

stuhood closed this as completed in #14889 Mar 23, 2022

stuhood moved this to Done in 2022 Performance Push Mar 23, 2022

stuhood mentioned this issue Mar 23, 2022

Unify glob-matching implementations to fix malformed snapshot created by subsetting (cherrypick of #14889) #14896

Merged

stuhood mentioned this issue Aug 23, 2022

Optimize CreateDigest implementation. #16617

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

stuhood commented Oct 5, 2021 •

edited

Loading

stuhood commented Nov 5, 2021

stuhood commented Nov 5, 2021

stuhood commented Feb 27, 2022 •

edited

Loading

stuhood commented Feb 28, 2022

stuhood commented Mar 1, 2022

stuhood commented Mar 4, 2022 •

edited

Loading

stuhood commented Mar 9, 2022 •

edited

Loading

stuhood commented Mar 10, 2022

stuhood commented Mar 23, 2022

Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112

Comments

stuhood commented Oct 5, 2021 • edited Loading

stuhood commented Nov 5, 2021

stuhood commented Nov 5, 2021

stuhood commented Feb 27, 2022 • edited Loading

stuhood commented Feb 28, 2022

stuhood commented Mar 1, 2022

stuhood commented Mar 4, 2022 • edited Loading

stuhood commented Mar 9, 2022 • edited Loading

stuhood commented Mar 10, 2022

stuhood commented Mar 23, 2022

stuhood commented Oct 5, 2021 •

edited

Loading

stuhood commented Feb 27, 2022 •

edited

Loading

stuhood commented Mar 4, 2022 •

edited

Loading

stuhood commented Mar 9, 2022 •

edited

Loading