[internal] Implement `Snapshot` in terms of a new `DigestTrie` type #14654
Force-pushed from ed91450 to 4ddf601.
Title changed from "[internal] Implement `Snapshot` in terms of a new `DigestTree` type" to "[internal] Implement `Snapshot` in terms of a new `DigestTrie` type".
src/rust/engine/fs/src/directory.rs
Outdated
/// Creates a DirectoryDigest which asserts that the given Digest represents a Directory structure
/// which is persisted in a Store.
pub fn new(digest: Digest) -> Self {
  Self { digest, tree: None }
}
This method could almost have been called `DirectoryDigest::todo_port_to_consuming_digest_trees`, as all usages other than usage at a remote execution boundary represent positions where we should be holding a `DirectoryDigest` already.
I like that name, if it isn't too much of a pain to update w/ merge conflicts.
I went ahead and did (roughly) this, because it will really help with tracking down the various cases where we are converting.
src/rust/engine/fs/src/directory.rs
Outdated
fn from_sorted_paths(
  prefix: PathBuf,
  paths: Vec<TypedPath>,
  file_digests: &HashMap<PathBuf, Digest>,
) -> Self {
This method was ported from `Snapshot::ingest_sorted_path_stats`. Most of #13112 involves porting methods like this from operating against the `Store` (and lazy loading/storing `remexec::Directory`s) to synchronous in-memory operations.
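To illustrate what "synchronous in-memory operations" can look like, here is a minimal sketch that builds a directory tree from file paths without touching any store. It is not the PR's `from_sorted_paths`: the `Digest` and `Entry` types and the `tree_from_paths` function are hypothetical stand-ins, and it inserts into nested maps rather than grouping pre-sorted paths by prefix.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a content digest; not the engine's real `Digest`.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Digest(pub String);

/// Hypothetical in-memory node: a file with a digest, or a directory of named children.
#[derive(Debug, PartialEq, Eq)]
pub enum Entry {
    File(Digest),
    Directory(BTreeMap<String, Entry>),
}

/// Builds a tree from `(path, digest)` pairs entirely in memory.
/// Assumes paths are relative, '/'-separated, and that each one names a file.
pub fn tree_from_paths(files: &[(&str, Digest)]) -> Entry {
    let mut root = BTreeMap::new();
    for (path, digest) in files {
        let mut components: Vec<&str> = path.split('/').collect();
        // `split` always yields at least one item, so the file name exists.
        let file_name = components.pop().expect("split yields at least one item");
        // Walk (and create) the parent directories of the file.
        let mut dir = &mut root;
        for component in components {
            let child = dir
                .entry(component.to_string())
                .or_insert_with(|| Entry::Directory(BTreeMap::new()));
            dir = match child {
                Entry::Directory(children) => children,
                Entry::File(_) => panic!("{} is both a file and a directory", component),
            };
        }
        dir.insert(file_name.to_string(), Entry::File(digest.clone()));
    }
    Entry::Directory(root)
}
```

Because a `BTreeMap` keeps its keys ordered, the children of each directory come out sorted by name regardless of input order. The real method instead relies on its input already being sorted.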
I know this is one of several PRs, but thoughts on a benchmark?
I'm fairly certain that this will be slower, since it is both creating the tree in memory and then persisting it to disk. It basically only adds work. The next PR will begin to remove work by allowing the tree to be used in e.g. We also don't have to land this until more of the followup work is posted.
Looks good. I do recommend considering the variable/function renames here, but fine to punt on unit tests in a followup if necessary.
///
/// If this DirectoryDigest has been persisted to disk (i.e., does not have a DigestTrie) then
/// this will only include the root.
pub fn digests(&self) -> Vec<Digest> {
Maybe test this? The stack code looks good but is non-trivial.
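For readers without the diff open, the following is a rough sketch of the kind of stack-based walk being discussed, together with the behaviour described in the doc comment (only the root digest when no tree is held). All types here are simplified, hypothetical stand-ins rather than the PR's actual definitions.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a content digest.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Digest(pub u64);

/// Hypothetical tree entry: a file, or a directory holding a nested tree.
pub enum Entry {
    File { digest: Digest },
    Directory { digest: Digest, tree: Tree },
}

/// Hypothetical in-memory tree: a shared list of entries.
pub struct Tree(pub Arc<Vec<Entry>>);

/// A root digest, plus the in-memory tree if one is held.
pub struct DirectoryDigest {
    pub digest: Digest,
    pub tree: Option<Tree>,
}

impl DirectoryDigest {
    /// Collects the root digest and every digest reachable through the tree.
    /// Uses an explicit stack instead of recursion.
    pub fn digests(&self) -> Vec<Digest> {
        let mut out = vec![self.digest];
        // Without a tree (i.e. persisted only), the stack starts empty.
        let mut stack: Vec<&Tree> = self.tree.iter().collect();
        while let Some(tree) = stack.pop() {
            for entry in tree.0.iter() {
                match entry {
                    Entry::File { digest } => out.push(*digest),
                    Entry::Directory { digest, tree } => {
                        out.push(*digest);
                        stack.push(tree);
                    }
                }
            }
        }
        out
    }
}
```

A unit test along the lines requested would cover both cases: a `DirectoryDigest` with `tree: None` yields just the root, and one with a nested tree yields every file and directory digest exactly once.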
Force-pushed from a3bb4c4 to f713e82.
Unit tests would still be really good to add in a followup because there are lots of small functions w/ somewhat non-trivial functionality. But fine for a followup.
Force-pushed from f713e82 to 9ca6a49.
Force-pushed from 9ca6a49 to 8e6137b.
The `--stats-memory-summary` added in #14638/#14652 was [reporting surprisingly large sizes](#12662 (comment)) for native `NodeKey` structs -- even when excluding the actual Python values that they held. Investigation showed that both the `Task` and `Entry` structs were contributing significantly to the size of the `Task` struct.

The [`internment` crate](https://crates.io/crates/internment) used here (and in #14654) is an alternative to giving these values integer IDs. They become pointers to a unique, shared (technically: leaked) copy of the value. They are consequently 1) much smaller, 2) much faster to compare.

The `top`-reported memory usage of `./pants dependencies --transitive ::`:
* `313M` before (summary [before.txt](https://github.com/pantsbuild/pants/files/8175461/before.txt))
* `220M` after (summary [after.txt](https://github.com/pantsbuild/pants/files/8175462/after.txt))

[ci skip-build-wheels]
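The idea of interning by leaking can be sketched with the standard library alone. This is not the `internment` crate's API or implementation: `Interner` and `Interned` below are hypothetical names, and the sketch only shows why an interned handle is pointer-sized and why equality reduces to a pointer comparison.

```rust
use std::collections::HashSet;
use std::hash::Hash;
use std::sync::Mutex;

/// Hypothetical pointer-sized handle to a leaked, deduplicated value.
pub struct Interned<T: 'static>(&'static T);

impl<T: 'static> Interned<T> {
    /// Returns the shared value behind the handle.
    pub fn get(&self) -> &'static T {
        self.0
    }
}

impl<T: 'static> PartialEq for Interned<T> {
    /// Equal values share one allocation, so comparing addresses is sufficient.
    fn eq(&self, other: &Self) -> bool {
        std::ptr::eq(self.0, other.0)
    }
}

/// Hypothetical interner: remembers every value it has leaked so far.
pub struct Interner<T: 'static>(Mutex<HashSet<&'static T>>);

impl<T: Eq + Hash + 'static> Interner<T> {
    pub fn new() -> Self {
        Interner(Mutex::new(HashSet::new()))
    }

    /// Returns a handle to the unique copy of `value`, leaking it on first sight.
    pub fn intern(&self, value: T) -> Interned<T> {
        let mut seen = self.0.lock().unwrap();
        if let Some(&existing) = seen.get(&value) {
            return Interned(existing);
        }
        // The value is deliberately never freed: that is the "leaked" part.
        let leaked: &'static T = Box::leak(Box::new(value));
        seen.insert(leaked);
        Interned(leaked)
    }
}
```

The trade-off is that interned values live for the rest of the process, which suits values that are created once and referenced many times.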
…ore a `DirectoryDigest` in `PyDigest`. [ci skip-build-wheels]
Force-pushed from 8e6137b to 1e033f2.
I've posted a few more PRs from the stack, and it looks like although performance is not better (yet!) it at least hasn't regressed: #13112 (comment)

Landing.
As described in #13112: we currently persist all `remexec::Directory` structures to disk, although many of them are intermediate results which never actually need to be persisted across restarts (the only position which actually needs to be persisted is a cache output). That extra IO (and complexity) has performance costs: roughly 20% of some profiles.

In order to skip persisting `remexec::Directory`s to disk, we need an in-memory structure to replace them (which we will persist or load from a `Store` when necessary). To that end, this change introduces a `DigestTrie` type, which is a trie representing a `Digest`ed filesystem tree. A `DigestTrie` is sufficient to replace the list of `PathStats` previously held by a `Snapshot`. But because it uses interned names for path members, and is structure shared using `Arc`s, it does so using a lot less memory.

This is the first of three or four PRs which will replace all methods which recursively load/store `remexec::Directory` with operations on `DigestTrie`s instead (e.g. `MergeDigests`, `DigestSubset`, etc). As methods are ported, their boundaries will be adjusted to skip calling `store.record_digest_tree` (which persists the `DigestTrie` to disk using our existing format), until only the local cache persists directory structures.
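As a rough illustration of the structure sharing mentioned above, the sketch below shows how an `Arc`-backed tree lets a derived value reuse a subtree instead of copying it. The `Trie` and `Entry` types and the `subdirectory` method are hypothetical and much simpler than the PR's `DigestTrie`; `Arc<str>` stands in for an interned name.

```rust
use std::sync::Arc;

/// Hypothetical tree entry; `Arc<str>` stands in for an interned path component.
pub enum Entry {
    File { name: Arc<str> },
    Directory { name: Arc<str>, tree: Trie },
}

/// Hypothetical trie: cloning it copies a pointer, not the entries.
#[derive(Clone)]
pub struct Trie(pub Arc<[Entry]>);

impl Trie {
    /// Returns the subtree of a direct child directory, shared with `self`.
    pub fn subdirectory(&self, wanted: &str) -> Option<Trie> {
        self.0.iter().find_map(|entry| match entry {
            // `tree.clone()` only bumps a reference count.
            Entry::Directory { name, tree } if &**name == wanted => Some(tree.clone()),
            _ => None,
        })
    }
}
```

Under these assumptions, an operation that selects part of a tree can return a handle to the existing allocation, which `Arc::ptr_eq` would confirm. That is the property that keeps memory usage low when many values overlap.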