Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize glob matching for the DigestTrie subset case (and others) #14890

Open
stuhood opened this issue Mar 23, 2022 · 0 comments
Open

Optimize glob matching for the DigestTrie subset case (and others) #14890

stuhood opened this issue Mar 23, 2022 · 0 comments

Comments

@stuhood
Copy link
Member

stuhood commented Mar 23, 2022

In #14889, we removed an optimization from glob matching of Snapshots: the ability to "match everything in a directory recursively" if we encounter a recursive wildcard which is guaranteed to match everything.

We should reintroduce this optimization (likely as an optional method of Vfs which can return "all Stats recursively under a directory"), but we should also see what other optimizations can be made to the Vfs interface and glob matching to improve performance. For example:

  1. Holding onto a generic state parameter and passing it to Vfs::scandir as we recurse in glob matching (which impl Vfs for DigestTrie could use to maintain the current directory::Entry, rather than repeatedly recursing down to it via DigestTrie::entry).
  2. Adjusting glob matching to use a recursive-group-by strategy similar to DigestTrie::merge, rather than recursing independently for each PathGlob.
stuhood added a commit that referenced this issue Mar 23, 2022
… by subsetting (#14889)

#14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code.

Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see #9967), and fixing the bug by unifying the codepaths kills two birds with one stone.

This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see #14890.

Fixes #9967, fixes #14858, fixes #12462, and fixes #13112.
stuhood added a commit to stuhood/pants that referenced this issue Mar 23, 2022
… by subsetting (pantsbuild#14889)

pantsbuild#14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code.

Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see pantsbuild#9967), and fixing the bug by unifying the codepaths kills two birds with one stone.

This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see pantsbuild#14890.

Fixes pantsbuild#9967, fixes pantsbuild#14858, fixes pantsbuild#12462, and fixes pantsbuild#13112.
stuhood added a commit that referenced this issue Mar 24, 2022
… by subsetting (cherrypick of #14889) (#14896)

#14858 reported that an invalid `Snapshot` was being created, and reproducing in debug mode triggered `debug_assertions` related to the validity of `remexec::Directory`s created by the subset matching code.

Although porting the existing subset-matching code to `DigestTrie` _and_ fixing the bug would be one option, we've wanted to remove duplication between the "apply globs to a `Snapshot`" and "apply globs to the filesystem" codepaths for a long time (see #9967), and fixing the bug by unifying the codepaths kills two birds with one stone.

This change ports to implementing subset matching using `Vfs::expand_globs`, followed by the creation of a new `Snapshot` from the matches. This is definitely not as optimized as the direct subset matching was (`./cargo bench -p store -- subset` reports a change from ~20ms to ~100ms for 10k files), but it opens the door to unified optimization of _both_ our glob-expansion code and in-memory glob matching in parallel: see #14890.

Fixes #9967, fixes #14858, fixes #12462, and fixes #13112.

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant