Move to an in-memory, native Directory structure (aka: lazily persist Directories) #13112
Comments
This showed up in profiles again recently. There is no |
This showed up again in profiles recently, so I'm going to try tackling it this week. Design sketch below. A new The This intentionally does not change the Python API (yet): we'll continue to have separate cc @tdyas, @Eric-Arellano, @illicitonion, @jsirois for feedback. |
Slight adjustment to the above. We had begun to differentiate "file digests" from "directory digests" in the Python API, but had not yet begun doing that on the Rust side. Rather than modifying |
The first of probably three or four PRs is now posted as #14654. The next one or two PRs will port a few of the |
Ok, the stack of PRs is now:
Benchmarking on #14677 shows that the performance without IO contention is about equal to |
…14654)

As described in #13112: we currently persist _all_ `remexec::Directory` structures to disk, although many of them are intermediate results which never actually need to be persisted across restarts (the only position which actually needs to be persisted is a cache output). That extra IO (and complexity) has performance costs: roughly 20% of some profiles.

In order to skip persisting `remexec::Directory`s to disk, we need an in-memory structure to replace them (which we will persist or load from a `Store` when necessary). To that end, this change introduces a `DigestTrie` type, which is a trie representing a `Digest`ed filesystem tree. A `DigestTrie` is sufficient to replace the list of `PathStats` previously held by a `Snapshot`. But because it uses interned names for path members, and is structure-shared using `Arc`s, it does so using a lot less memory.

This is the first of three or four PRs which will replace all methods which recursively load/store `remexec::Directory` with operations on `DigestTrie`s instead (e.g. `MergeDigests`, `DigestSubset`, etc). As methods are ported, their boundaries will be adjusted to skip calling `store.record_digest_tree` (which persists the `DigestTrie` to disk using our existing format), until only the local cache persists directory structures.
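To make the structure-sharing point concrete, here is a minimal sketch of a trie whose subtrees are held in `Arc`s. It is not the actual `DigestTrie` implementation: the names `Name`, `Entry`, `Trie`, and `shared_example` are hypothetical, `Arc<str>` merely stands in for an interned name, and file digests are reduced to a length.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical stand-in for an interned path component:
/// cloning an `Arc<str>` copies a pointer, not the string.
type Name = Arc<str>;

#[allow(dead_code)]
#[derive(Debug)]
enum Entry {
    /// A real trie would hold a content digest here; a length is enough for the sketch.
    File { len: usize },
    /// Subtrees are reference counted, so identical subtrees can be shared.
    Directory(Arc<Trie>),
}

#[derive(Debug, Default)]
struct Trie {
    entries: BTreeMap<Name, Entry>,
}

impl Trie {
    /// Walks the trie and returns every file path in sorted order.
    fn file_paths(&self) -> Vec<String> {
        let mut out = Vec::new();
        self.collect("", &mut out);
        out
    }

    fn collect(&self, prefix: &str, out: &mut Vec<String>) {
        for (name, entry) in &self.entries {
            let path = if prefix.is_empty() {
                name.to_string()
            } else {
                format!("{prefix}/{name}")
            };
            match entry {
                Entry::File { .. } => out.push(path),
                Entry::Directory(child) => child.collect(&path, out),
            }
        }
    }
}

/// Builds a root in which two directory entries point at one shared subtree.
fn shared_example() -> (Trie, Arc<Trie>) {
    let mut leaf = Trie::default();
    leaf.entries.insert(Arc::from("lib.rs"), Entry::File { len: 12 });
    let leaf = Arc::new(leaf);

    let mut root = Trie::default();
    root.entries.insert(Arc::from("a"), Entry::Directory(Arc::clone(&leaf)));
    root.entries.insert(Arc::from("b"), Entry::Directory(Arc::clone(&leaf)));
    (root, leaf)
}
```

In this sketch the directories `a` and `b` occupy one allocation between them, which is the kind of saving the commit message attributes to `Arc`-based sharing.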
…14697)

#13112 is porting more operations on `Digest`/`Snapshot` to use `DigestTrie`, in order to reduce IO by allowing them to operate entirely in memory.

When we have a `DigestTrie` in memory, we are able to persist all of its recursive `remexec::Directory` structures in parallel. To that end, this change adds batch persistence to the local LMDB store, which allows us to use a single context switch and a single write transaction (per LMDB shard) to persist an entire `DigestTrie`.
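The batching idea can be sketched without LMDB at all. The toy store below is an assumption-laden illustration, not the real local store: `ShardedStore`, `Fingerprint`, `shard_for`, and `store_batch` are hypothetical names, shards are plain `HashMap`s, and a "transaction" is just a counter. It only shows the grouping step that lets a whole batch cost one write transaction per touched shard.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical fingerprint type; a real store would key entries by a full content hash.
type Fingerprint = [u8; 4];

/// Toy sharded store: each shard is a map, and `transactions` counts
/// how many (simulated) write transactions have been opened.
struct ShardedStore {
    shards: Vec<HashMap<Fingerprint, Vec<u8>>>,
    transactions: usize,
}

impl ShardedStore {
    /// `shard_count` must be non-zero.
    fn new(shard_count: usize) -> Self {
        ShardedStore {
            shards: vec![HashMap::new(); shard_count],
            transactions: 0,
        }
    }

    /// Picks a shard from the first fingerprint byte (an assumption of this sketch).
    fn shard_for(&self, fp: &Fingerprint) -> usize {
        fp[0] as usize % self.shards.len()
    }

    /// Groups the batch by shard, then writes each group inside one simulated transaction.
    fn store_batch(&mut self, items: Vec<(Fingerprint, Vec<u8>)>) {
        let mut by_shard: BTreeMap<usize, Vec<(Fingerprint, Vec<u8>)>> = BTreeMap::new();
        for (fp, bytes) in items {
            by_shard.entry(self.shard_for(&fp)).or_default().push((fp, bytes));
        }
        for (shard, group) in by_shard {
            self.transactions += 1; // one write transaction per touched shard
            self.shards[shard].extend(group);
        }
    }
}
```

Storing the same items one at a time would open one transaction per item; grouping first bounds the count by the number of shards.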
…4677)

This change continues #13112 by porting most of `Snapshot`'s operations to use `DigestTrie`, including:
* `merge_directories`
* `add_prefix`
* `remove_prefix`
* `contents_for_directory`
* `entries_for_directory`

But because many operations still assume that an input `Digest` has been persisted (with boundaries marked by `DirectoryDigest::{todo_as_digest,todo_from_digest}`), the methods modified in this change will continue to persist the `DigestTrie`s that they produce. #14723 will remove that persistence.
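As an illustration of why operations such as `add_prefix` and `remove_prefix` need no IO once the tree is in memory, here is a rough sketch on a simplified tree. The `Node` and `Dir` types and both function signatures are hypothetical, and the rule that `remove_prefix` fails when sibling entries would be dropped is an assumption of this sketch rather than a statement about the ported code.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum Node {
    File,
    Dir(Arc<Dir>),
}

#[derive(Debug, Default, PartialEq)]
struct Dir {
    entries: BTreeMap<String, Node>,
}

/// Wraps `root` in one new directory per component of `prefix`, innermost first.
/// The original subtree is reused as-is rather than copied.
fn add_prefix(root: Arc<Dir>, prefix: &str) -> Arc<Dir> {
    prefix
        .split('/')
        .filter(|c| !c.is_empty())
        .rev()
        .fold(root, |child, component| {
            let mut parent = Dir::default();
            parent.entries.insert(component.to_string(), Node::Dir(child));
            Arc::new(parent)
        })
}

/// Descends along `prefix`. Fails if a component is missing, or if removing it
/// would silently drop sibling entries (an assumption of this sketch).
fn remove_prefix(root: Arc<Dir>, prefix: &str) -> Result<Arc<Dir>, String> {
    let mut current = root;
    for component in prefix.split('/').filter(|c| !c.is_empty()) {
        if current.entries.len() != 1 {
            return Err(format!("entries exist outside of prefix component `{component}`"));
        }
        let next = match current.entries.get(component) {
            Some(Node::Dir(child)) => Arc::clone(child),
            _ => return Err(format!("no directory named `{component}`")),
        };
        current = next;
    }
    Ok(current)
}
```

Both functions only allocate or follow pointers, so a round trip through `add_prefix` and `remove_prefix` returns the very same subtree allocation.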
This change ports the most meaningful remaining methods for #13112 to use `DigestTrie`, in particular:
* `Process` inputs and outputs
* `materialize_directory`

A few other notable use cases remain un-ported for followup work (marked by #13112 TODOs and `todo_as_digest`/`todo_from_digest`), but this change shows a speedup of 56% for a microbenchmark of `pants.core.util_rules.source_files.determine_source_files`, and drops `top`-reported memory usage for common cases by 10%.
Ok, #14758 will likely be the last one here for a while. The last two-ish methods ( |
Continues #13112 by porting operations which consume and produce `remexec::Tree`s to `DigestTrie`. In particular, conversions in both directions are implemented via:

```rust
impl TryFrom<remexec::Tree> for DigestTrie { .. }
impl From<&DigestTrie> for remexec::Tree { .. }
```
#14889 fixes the last major case here: while there is one other usage of |
… by subsetting (#14889)

#14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code.

Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see #9967), and fixing the bug by unifying the codepaths kills two birds with one stone.

This change implements subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see #14890.

Fixes #9967, fixes #14858, fixes #12462, and fixes #13112.
… by subsetting (cherrypick of #14889) (#14896) [ci skip-build-wheels]
Exploring profiles of larger runs showed significant amounts of time spent in `intrinsics::merge_digests_request_to_digest` and `intrinsics::digest_to_snapshot` (13% and 5% respectively of the total runtime), mostly in database access.

After briefly exploring moving to storing `Tree`s rather than `Directory`s (which would be less space efficient, but more time efficient), it occurred to me that many of the `Directory`s that we persist will never be used in places that actually need persistence (sent to the network, included as the output (not input) of a cache entry, etc), and will instead just be manipulated in memory.

This suggests that in the medium term we could use a more efficient in-memory-only structure to represent `Directory`s, and only digest them and commit them to disk when we need to use them in a relevant position (while Files would of course stay digested and on disk). To get to this future, it's possible that a first incremental step would be to introduce a memory-only `Store` for `Directory`s, and introduce an explicit "persist this and get me a `Digest`" step. But: it will be important to avoid increasing the complexity of the code too much... it seems possible that some codepaths would become significantly simpler when implemented on a memory-only structure, and that that would justify using custom non-protobuf types.

This relates to Bazel's `depsets`, in that the structure underlying a DAG of in-memory `Directory`s could be a generic, native "nested set". That native nested set structure could also be used for use cases like #13087 and #13492.