Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[internal] Implement Snapshot in terms of a new DigestTrie type #14654

Merged
merged 4 commits into from
Mar 4, 2022

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Mar 1, 2022

As described in #13112: we currently persist all remexec::Directory structures to disk, although many of them are intermediate results which never actually need to be persisted across restarts (the only position which actually needs to be persisted is a cache output). That extra IO (and complexity) has performance costs: roughly 20% of some profiles.

In order to skip persisting remexec::Directorys to disk, we need an in-memory structure to replace them (which we will persist or load from a Store when necessary). To that end, this change introduces a DigestTrie type, which is a trie representing a Digested filesystem tree. A DigestTrie is sufficient to replace the list of PathStats previously held by a Snapshot. But because it uses interned names for path members, and is structure shared using Arcs, it does so using a lot less memory.

This is the first of three or four PRs which will replace all methods which recursively load/store remexec::Directory with operations on DigestTries instead (e.g. MergeDigests, DigestSubset, etc). As methods are ported, their boundaries will be adjusted to skip calling store.record_digest_tree (which persists the DigestTree to disk using our existing format), until only the local cache persists directory structures.

@stuhood stuhood force-pushed the stuhood/digest-tree branch 5 times, most recently from ed91450 to 4ddf601 Compare March 1, 2022 21:23
@stuhood stuhood changed the title Implement Snapshot in terms of a new DigestTree type Implement Snapshot in terms of a new DigestTrie type Mar 1, 2022
@stuhood stuhood marked this pull request as ready for review March 1, 2022 22:47
Comment on lines 63 to 97
/// Creates a DirectoryDigest which asserts that the given Digest represents a Directory structure
/// which is persisted in a Store.
pub fn new(digest: Digest) -> Self {
Self { digest, tree: None }
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could almost have been called DirectoryDigest::todo_port_to_consuming_digest_trees, as all usages other than usage at a remote execution boundary represent positions where we should be holding a DirectoryDigest already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that name, if it isn't too much of a pain to update w/ merge conflicts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and did (roughly) this, because it will really help with tracking down the various cases where we are converting.

Comment on lines 210 to 214
fn from_sorted_paths(
prefix: PathBuf,
paths: Vec<TypedPath>,
file_digests: &HashMap<PathBuf, Digest>,
) -> Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was ported from Snapshot::ingest_sorted_path_stats. Most of #13112 involves porting methods like this from operating against the Store (and lazy loading/storing remexec::Directorys) to synchronous in-memory operations.

@Eric-Arellano
Copy link
Contributor

I know this is one of several PRs, but thoughts on a benchmark?

@stuhood
Copy link
Member Author

stuhood commented Mar 1, 2022

I know this is one of several PRs, but thoughts on a benchmark?

I'm fairly certain that this will be slower, since it is both creating the tree in memory and then persisting it to disk. It basically only adds work. The next PR will begin to remove work by allowing the tree to be used in e.g. MergeDigests without being loaded: will include benchmarks there.

We also don't have to land this until more of the followup work is posted.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I do recommend considering the variable/function renames here, but fine to punt on unit tests in a followup if necessary.

src/rust/engine/fs/src/tree.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/tree.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/tree.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/tree.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/tree.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/directory.rs Show resolved Hide resolved
Comment on lines 63 to 97
/// Creates a DirectoryDigest which asserts that the given Digest represents a Directory structure
/// which is persisted in a Store.
pub fn new(digest: Digest) -> Self {
Self { digest, tree: None }
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that name, if it isn't too much of a pain to update w/ merge conflicts.

src/rust/engine/fs/src/directory.rs Outdated Show resolved Hide resolved
///
/// If this DirectoryDigest has been persisted to disk (i.e., does not have a DigestTrie) then
/// this will only include the root.
pub fn digests(&self) -> Vec<Digest> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test this? The stack code looks good but is non-trivial.

src/rust/engine/fs/src/directory.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/directory.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests would still be really good to add in a followup because there are lots of small functions w/ somewhat non-trivial functionality. But fine for a followup

src/rust/engine/fs/src/directory.rs Outdated Show resolved Hide resolved
src/rust/engine/fs/src/directory.rs Show resolved Hide resolved
stuhood added a commit that referenced this pull request Mar 3, 2022
The `--stats-memory-summary` added in #14638/#14652 was [reporting surprisingly large sizes](#12662 (comment))  for native `NodeKey` structs -- even when excluding the actual Python values that they held. Investigation showed that both the `Task` and `Entry` structs were contributing significantly to the size of the `Task` struct.

The [`internment` crate](https://crates.io/crates/internment) used here (and in #14654) is an alternative to giving these values integer IDs. They become pointers to a unique, shared (technically: leaked) copy of the value. They are consequently 1) much smaller, 2) much faster to compare.

The `top`-reported memory usage of `./pants dependencies --transitive ::`:
* `313M` before (summary [before.txt](https://github.com/pantsbuild/pants/files/8175461/before.txt))
* `220M` after (summary [after.txt](https://github.com/pantsbuild/pants/files/8175462/after.txt))

[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this pull request Mar 3, 2022
…uild#14683)

The `--stats-memory-summary` added in pantsbuild#14638/pantsbuild#14652 was [reporting surprisingly large sizes](pantsbuild#12662 (comment))  for native `NodeKey` structs -- even when excluding the actual Python values that they held. Investigation showed that both the `Task` and `Entry` structs were contributing significantly to the size of the `Task` struct.

The [`internment` crate](https://crates.io/crates/internment) used here (and in pantsbuild#14654) is an alternative to giving these values integer IDs. They become pointers to a unique, shared (technically: leaked) copy of the value. They are consequently 1) much smaller, 2) much faster to compare.

The `top`-reported memory usage of `./pants dependencies --transitive ::`:
* `313M` before (summary [before.txt](https://github.com/pantsbuild/pants/files/8175461/before.txt))
* `220M` after (summary [after.txt](https://github.com/pantsbuild/pants/files/8175462/after.txt))

[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Mar 3, 2022
…pick of #14683) (#14689)

The `--stats-memory-summary` added in #14638/#14652 was [reporting surprisingly large sizes](#12662 (comment))  for native `NodeKey` structs -- even when excluding the actual Python values that they held. Investigation showed that both the `Task` and `Entry` structs were contributing significantly to the size of the `Task` struct.

The [`internment` crate](https://crates.io/crates/internment) used here (and in #14654) is an alternative to giving these values integer IDs. They become pointers to a unique, shared (technically: leaked) copy of the value. They are consequently 1) much smaller, 2) much faster to compare.

The `top`-reported memory usage of `./pants dependencies --transitive ::`:
* `313M` before (summary [before.txt](https://github.com/pantsbuild/pants/files/8175461/before.txt))
* `220M` after (summary [after.txt](https://github.com/pantsbuild/pants/files/8175462/after.txt))

[ci skip-build-wheels]
@stuhood stuhood changed the title Implement Snapshot in terms of a new DigestTrie type [internal] Implement Snapshot in terms of a new DigestTrie type Mar 4, 2022
[ci skip-build-wheels]
…ore a `DirectoryDigest` in `PyDigest`.

[ci skip-build-wheels]
[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Mar 4, 2022

I know this is one of several PRs, but thoughts on a benchmark?

I'm fairly certain that this will be slower, since it is both creating the tree in memory and then persisting it to disk. It basically only adds work. The next PR will begin to remove work by allowing the tree to be used in e.g. MergeDigests without being loaded: will include benchmarks there.

We also don't have to land this until more of the followup work is posted.

I've posted a few more PRs from the stack, and it looks like although performance is not better (yet!) it at least hasn't regressed: #13112 (comment)

Landing.

@stuhood stuhood merged commit b599eb5 into pantsbuild:main Mar 4, 2022
@stuhood stuhood deleted the stuhood/digest-tree branch March 4, 2022 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants