Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WORK-IN-PROGRESS] Introduce the path walk API into Git for Windows #5146

Closed
wants to merge 45 commits into from

Conversation

dscho
Copy link
Member

@dscho dscho commented Sep 16, 2024

This is a huge feature, and it would be good to let users play with it. To make it safer to play with, it needs to be guarded behind command-line/config options.

derrickstolee and others added 30 commits September 15, 2024 18:46
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

RFC TODO: It would be helpful to create a test-tool that allows printing of
each batch for strong testing.

Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin.

RFC TODO: When preparing this for a full implementation, make sure it is
based on the newest standards introduced by [1].

[1] https://lore.kernel.org/git/[email protected]/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7
    [PATCH 1/3] builtin: add a repository parameter for builtin functions

Signed-off-by: Derrick Stolee <[email protected]>
The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to specify a minimum batch size for their needs. This is only
a minimum: the path-walk API provides a list of OIDs that correspond to the
same path, and thus it is optimal to allow delta compression across those
objects in a single server request.

We could consider limiting the request to have a maximum batch size in the
future.

Signed-off-by: Derrick Stolee <[email protected]>
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensie as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout.

Signed-off-by: Derrick Stolee <[email protected]>
This adds the ability to ask for the commits as a single list. This will
also reduce the calls in 'git backfill' to be a BUG() statement if called
with anything other than blobs.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change introduced the '--[no-]sparse' option for the 'git
backfill' command, but did not assume it as enabled by default. However,
this is likely the behavior that users will most often want to happen.
Without this default, users with a small sparse-checkout may be confused
when 'git backfill' downloads every version of every object in the full
history.

However, this is left as a separate change so this decision can be reviewed
independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout
to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

Signed-off-by: Derrick Stolee <[email protected]>
Start work on a new `git survey` command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

The initial goal is to complement the scanning and analysis performed
by the GO-based `git-sizer` (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from GO to gain further insight into potential scaling
problems.

RFC TODO: Adapt this boilerplat to match the upcoming changes to builtin
methods that include a 'struct repository' pointer.

Co-authored-by: Derrick Stolee <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
By default we will scan all references in "refs/heads/", "refs/tags/"
and "refs/remotes/".

Add command line opts let the use ask for all refs or a subset of them
and to include a detached HEAD.

Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
Collect the set of requested branches, tags, and etc into a ref_array and
collect the set of requested patterns into a strvec.

RFC TODO: This patch has some changes that should be in the previous patch,
to make the diff look a lot better.

Co-authored-by: Derrick Stolee <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
When 'git survey' provides information to the user, this will be presented
in one of two formats: plaintext and JSON. The JSON implementation will be
delayed until the functionality is complete for the plaintext format.

The most important parts of the plaintext format are headers specifying the
different sections of the report and tables providing concreted data.

Create a custom table data structure that allows specifying a list of
strings for the row values. When printing the table, check each column for
the maximum width so we can create a table of the correct size from the
start.

The table structure is designed to be flexible to the different kinds of
output that will be implemented in future changes.

Signed-off-by: Derrick Stolee <[email protected]>
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.

For example, this is the reachable object summary output for my local repo:

REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |      0
    Commits | 178573
      Trees | 312745
      Blobs | 183035

(Note: the "Tags" are zero right now because the path-walk API has not been
integrated to walk tags yet. This will be fixed in a later change.)

RFC TODO: make sure tags are walked before this change.

Signed-off-by: Derrick Stolee <[email protected]>
The sparse tree walk algorithm was created in d5d2e93 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.

Signed-off-by: Derrick Stolee <[email protected]>
Now that we have explored objects by count, we can expand that a bit more to
summarize the data for the on-disk and inflated size of those objects. This
information is helpful for diagnosing both why disk space (and perhaps
clone or fetch times) is growing but also why certain operations are slow
because the inflated size of the abstract objects that must be processed is
so large.

Signed-off-by: Derrick Stolee <[email protected]>
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERSTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do no continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.

Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <[email protected]>
In future changes, we will make use of these methods. The intention is to
keep track of the top contributors according to some metric. We don't want
to store all of the entries and do a sort at the end, so track a
constant-size table and remove rows that get pushed out depending on the
chosen sorting algorithm.

Co-authored-by: Jeff Hostetler <[email protected]>
Signed-off-by; Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
Since we are already walking our reachable objects using the path-walk API,
let's now collect lists of the paths that contribute most to different
metrics. Specifically, we care about

 * Number of versions.
 * Total size on disk.
 * Total inflated size (no delta or zlib compression).

This information can be critical to discovering which parts of the
repository are causing the most growth, especially on-disk size. Different
packing strategies might help compress data more efficiently, but the toal
inflated size is a representation of the raw size of all snapshots of those
paths. Even when stored efficiently on disk, that size represents how much
information must be processed to complete a command such as 'git blame'.

Since the on-disk size is likely to be fragile, stop testing the exact
output of 'git survey' and check that the correct set of headers is
output.

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

RFC TODO: list other helpful test cases, as well as the ones where the
behavior breaks if this is enabled...

Signed-off-by: Derrick Stolee <[email protected]>
To test the benefits of the new --path-walk option in 'git
pack-objects', create a performance test that times the process but also
compares the size of the output.

Against the microsoft/fluentui repo [1] against a particular commit [2],
this has reproducible results of a similar scale:

Test                                            this tree
---------------------------------------------------------------
5313.2: thin pack                               0.39(0.48+0.03)
5313.3: thin pack size                                     1.2M
5313.4: thin pack with --path-walk              0.09(0.07+0.01)
5313.5: thin pack size with --path-walk                   20.8K
5313.6: big recent pack                         2.13(8.29+0.26)
5313.7: big recent pack size                              17.7M
5313.8: big recent pack with --path-walk        3.18(4.21+0.22)
5313.9: big recent pack size with --path-walk             15.0M

[1] https://github.com/microsoft/reactui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

RFC TODO: Note that the path-walk version is slower for the big case,
but the delta calculation is single-threaded with the current
implementation! It's still faster for the small case that mimics a
typical push.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

For the microsoft/fluentui repo [1], the results are very interesting:

Test                                            this tree
-------------------------------------------------------------------
5313.10: full repack                            97.91(663.47+2.83)
5313.11: full repack size                                449.1K
5313.12: full repack with --path-walk           105.42(120.49+0.95)
5313.13: full repack size with --path-walk               159.1K

[1] https://github.com/microsoft/fluentui

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <[email protected]>
RFC NOTE: this is essentially the same as the patch introduced
independently of the RFC, but now is on top of the --path-walk option
instead. This is included in the RFC for comparison purposes.

RFC NOTE: As you can see from the details below, the --full-name-hash
option essentially attempts to do similar things as the --path-walk
option, but sometimes misses the mark. Collisions still happen with the
--full-name-hash option, leading to some misses. However, in cases where
the default name-hash algorithm has low collision rates and deltas are
actually desired across objects with similar names but different full
names, the --path-walk option can still take advantage of the default
name hash approach.

Here are the new performance details simulating a single push in an
internal monorepo using a lot of paths that collide in the default name
hash. We can see that --full-name-hash gets close to the --path-walk
option's size.

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.43(2.92+0.14)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.31(0.49+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

However, when simulating pushes on repositories that do not have issues
with name-hash collisions, the --full-name-hash option presents a
potential of worse delta calculations, such as this example using my
local Git repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.03(0.01+0.01)
5313.3: thin pack size                                     475
5313.4: thin pack with --full-name-hash        0.02(0.01+0.01)
5313.5: thin pack size with --full-name-hash             14.8K
5313.6: thin pack with --path-walk             0.02(0.01+0.01)
5313.7: thin pack size with --path-walk                    475

Note that the path-walk option found the same delta bases as the default
options in this case.

In the full repack case, the --full-name-hash option may be preferable
because it interacts well with other advanced features, such as using
bitmap indexes and tracking delta islands.

Signed-off-by: Derrick Stolee <[email protected]>
Using this tool, we can count how many distinct name-hash values exist
within a list of paths. Examples include

 git ls-tree -r --name-only HEAD | \
	     test-tool name-hash | \
  	      awk "{print \$1;}" | \
  		 sort -ns | uniq | wc -l

which outputs the number of distinct name-hash values that appear at
HEAD. Or, the following which presents the resulting name-hash values of
maximum multiplicity:

 git ls-tree -r --name-only HEAD | \
	     test-tool name-hash | \
	      awk "{print \$1;}" | \
	       sort -n | uniq -c | sort -nr | head -n 25

For an internal monorepo with around a quarter million paths at HEAD,
the highest multiplicity for the standard name-hash function was 14,424
while the full name-hash algorithm had only seven hash values with any
collision, with a maximum multiplicity of two.

Signed-off-by: Derrick Stolee <[email protected]>
This test helps inform someone as to the behavior of the name-hash
algorithms for their repo based on the paths at HEAD.

For example, the microsoft/fluentui repo had these statistics at time of
committing:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                       19.6K
5314.2: number of distinct name-hashes                       8.2K
5314.3: number of distinct full-name-hashes                 19.6K
5314.4: maximum multiplicity of name-hashes                   279
5314.5: maximum multiplicity of fullname-hashes                 1

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

In this repository, no collisions occur for the full-name-hash
algorithm.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                              this tree
-----------------------------------------------------------------
5314.1: paths at head                                      221.6K
5314.2: number of distinct name-hashes                      72.0K
5314.3: number of distinct full-name-hashes                221.6K
5314.4: maximum multiplicity of name-hashes                 14.4K
5314.5: maximum multiplicity of fullname-hashes                 2

Even in this repository with many more paths at HEAD, the collision rate
was low and the maximum number of paths being grouped into a single
bucket by the full-path-name algorithm was two.

Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
In order to debug what is going on during delta calculations, add a
--debug-file=<file> option to 'git pack-objects'. This leads to sending
a JSON-formatted description of the delta information to that file.

Signed-off-by: Derrick Stolee <[email protected]>
These patches introduce the 'git backfill' builtin including its
'--batch-size' and '--sparse' options. While this is the first and
simplest application, it is also the lowest priority in terms of user
need.
These patches reimplement a subset of the functionality of 'git survey'
as well as generalize some of the data structures that Jeff's
implementation made. The flexibility hopefully comes through to show the
potential for future extensions. These patches are quite rough and will
need more attention before they can be sent for full review.
Here is where I think the meat of the RFC really lies. There are still
some rough edges, but the data will show that 'git pack-objects
--path-walk' has the potential to be an extremely effective way to pack
objects. (Caveats will come later in the analysis section.)
This is a simplified version of the patch series that was split out by
itself earlier for full review. This was split out on its own partly
because it doesn't actually use the path-walk API. This has benefits and
drawbacks, but it seems like a quick win for many scenarios.
Copy link
Member Author

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only got around to looking at the backfill part yet, will continue reviewing later.

path-walk.c Show resolved Hide resolved
path-walk.c Show resolved Hide resolved
list->type = type;
strmap_put(&ctx->paths_to_lists, path.buf, list);
string_list_append(&ctx->path_stack, path.buf);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to add this, out of an abundance of caution:

else if (list->type != type)
    warning("both blob and tree objects found at '%s'", path.buf);

Comment on lines +147 to +148
for (size_t i = 0; i < list->oids.nr; i++) {
ret |= add_children(ctx,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing on errors may be the right thing to do always. On the off-chance that a user may want to return early on error, we may want to add a flag for that.

}

oid_array_clear(&list->oids);
strmap_remove(&ctx->paths_to_lists, path, 1);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this is safe. If the same path appears multiple times on the stack, e.g. if two merged branches separately removed the same file, wouldn't we then remove the paths_to_lists item prematurely? This might be intentional, though, as we stop once we're hitting a blob that we'd already seen. Therefore in a history like this:

... - A - B - C
        \   /
          D

where file.txt was present in A, deleted in B and D, and missing from C, we would probably hit the file in B first, walk its history all the way until the commit that added it, then remove the item from paths_to_lists, then re-add it when encountering it in D, see that we handled it in A already, and re-remove it. That might be what we need here, but I have not been able to wrap my head around this enough yet to be confident.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to realize that a path appears at most once. The walk goes like this:

  1. Walk all commits that we are going to walk. Initialize the "stack" to include only the root path ("")
  2. Walk all root trees that we are going to walk, initializing the "stack" with the list of all paths that appear as entries of those trees.
  3. Continue walking the stack, knowing that we visit each path only once.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the crucial part I was missing is that we first walk all commits, build up the huge, giant list for "" (i.e. all the top-level trees). Then we iterate over all of those before going any further. Crucially, no single tree walk happens until every last of the commits has been walked.

For some reason, I got the impression that we're doing a depth-first search where we immediately iterate into the full tree of the first commit, until we exhausted all of the blobs reachable from that root tree, and only then walk to the next commit. But that's not what we do ;-)

We basically walk all the paths in a depth-first manner, but not in alphabetical order because we end up plucking off the paths from the top of the stack (i.e. the "alphabetically last" item seen so far, but that only holds true if the root tree has no child trees, otherwise the order gets messy real quickly). So the order is somewhat jumbled, I guess ;-)

Thank you for explaining this so patiently. I think my confusion might be a good motivation to add a little primer about the order to the commit message of the "path-walk: introduce an object walk by path" commit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working on a technical doc for the API but don't have it committed or pushed anywhere. This confirms that it is necessary.

{
struct tree_desc desc;
struct name_entry entry;
struct strbuf path = STRBUF_INIT;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance reasons, it might make sense to pass in base_path as a strbuf, so that only one scratch buffer is used for the entirety of the path walk (at least in single-threaded mode). But we can do that later.

@@ -84,6 +86,12 @@ static int do_backfill(struct backfill_context *ctx)
struct path_walk_info info = PATH_WALK_INFO_INIT;
int ret;

if (ctx->sparse) {
CALLOC_ARRAY(info.pl, 1);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to release the allocated memory somewhere?

path-walk.c Show resolved Hide resolved
This is currently primarily here for me to test out a couple of ideas or
to investigate how this path walk API thing works.

Signed-off-by: Johannes Schindelin <[email protected]>
Copy link
Member Author

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another batch of my review.

}
}

info->path_fn("initial", &tags, OBJ_TAG, info->path_fn_data);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably say "tags" instead of "initial"? And likewise the other path_fn("initial", ...) call may want to say "commits" instead?

And maybe even "tags" is misleading, because there could be blobs tracked under the file name tags? It is a valid path name, after all. We could play games by using a path that is invalid, such as .git/refs/tags (but what to do for commits? .git/refs/heads is wrong, they are not necessarily branch tips). Maybe better just use "".

The same goes for tagged-trees and tagged-blobs, too.

Also, do we want to prevent the function from being called with zero tags in case the caller asked for tags to be included but there were none to be found?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "initial" string comes from the way that this is emitted in the other object walk. I'll try to find pointers to that or reasoning for that to be important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means that you have to look at the object type first in the callback, and only interpret the path parameter as actual path in the OBJ_BLOB case?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we want the path to matter for trees, too. When considering deltas in the pack-objects case, these paths don't matter for commits or tags because everything is sorted by type before compared by name-hash, so the set of deltas that are considered are the same in each case.

Comment on lines +272 to +276
/*
* Walk any pending objects at this point, but they should only
* be tags.
*/
for (size_t i = 0; i < info->revs->pending.nr; i++) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that we have to handle tags in revs->pending first, accumulating the list before walking the commits, but then only processing them afterward (to avoid double-accounting for blobs that are both tagged directly and that are reachable via a tree).

I've just added a test helper in 489ce0c and used the left-over of t8100-*.sh -d:

$ ./bin-wrappers/test-tool -C ./t/trash\ directory.t8100-git-survey/ path-walk --no-blobs --tags --no-commits -v four
visiting path 'initial' (count: 0) of type 'tag'

$ git -C ./t/trash\ directory.t8100-git-survey/ rev-parse four four^0
dcd3534d11452b820a49532b41e7494f3ddfa2a8
5e53728740668d567dad45ca14d258b8f75fb7c1

$ ./bin-wrappers/test-tool -C ./t/trash\ directory.t8100-git-survey/ path-walk --no-blobs --tags --commits -v four
visiting path 'initial' (count: 7) of type 'commit'
        5e53728740668d567dad45ca14d258b8f75fb7c1
        6cf975cad5d4c3afe9d9c1beb519506ab26f2968
        5e51a08864ae54b185d9ccb441e3b84272a1d962
        bd080f6764f92d2e5f3c14d9ab1241495b919820
        123d744603d8c13997f7dd86b4f066cc68305762
        446a0a30049ffec3f6a1af16c07e8c0941b11660
        d8452bb98e031b9f6d1e3804a8cecc14e3124c9e
visiting path 'initial' (count: 0) of type 'tag'

As you can see, there is a tag object four but it is not reported because the revision walk removes it from the pending list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tag stuff is the weakest part of the path-walk API and demonstrates the need for a git rev-list-like test helper that checks which objects are collected for each path and object type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully the test helper I implemented is a good start?


int cmd_survey(int argc, const char **argv, const char *prefix)
{
if (argc == 2 && !strcmp(argv[1], "-h"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking from experience, we must prefix this with a message that the command is experimental and that its options and its output and operation are subject to frequent change. I.e. something like this.

@dscho
Copy link
Member Author

dscho commented Sep 28, 2024

This PR is superseded by #5171, #5172 and #5174.

@dscho dscho closed this Sep 28, 2024
@dscho dscho deleted the path-walk-api-g4w branch September 28, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants