Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shallowness of git fetching #9402

Open
Ericson2314 opened this issue Nov 20, 2023 · 6 comments
Open

Shallowness of git fetching #9402

Ericson2314 opened this issue Nov 20, 2023 · 6 comments
Labels
fetching Networking with the outside (non-Nix) world, input locking

Comments

@Ericson2314
Copy link
Member

Thanks to @DavHau for bringing up these issues in #9240 & #9376

TODO write problem

Tentative solution (two alternatives):

  1. Benchmark the different options and see whether they are natively doable with libgit
  2. Depending on the results of the above
    1. Go the treeless route if reasonably fast and supported
      • Always eagerly fetch all commit history (the "spine")
        • roberth: Doesn't have to be eager, but eager is easier to implement for now
      • Always fetch just one tree
      • revCount supported because have commit history
      • cannot fetch from shallow repo trivially, but ways around this
        • users encouraged to likewise do treeless clones locally rather than traditional shallow to not hit this restriction.
      • Open Questions:
        1. can libgit do treeless clones? (otherwise we can shell out)? Is libgit open to adding such functionality?
        2. Can libgit manage/manipulate a treeless clone?
    2. Continue to have fetching all the history be conditional.
      1. Optional (because needed for revCount)
      2. Default for fetchTree is no revCount, but for fetchGit for backwards-compat we have to keep revCount by default.
      3. @tomberek: not fetching all the history means to use "shallow = true" / --depth 1
  3. (later) consider using that for fetchFromGithub and friends if the performance is good enough

With either version of (2) we never need to fetch any trees of other commits. The only question is whether we fetch all the commits and support revCount.

@Ericson2314 Ericson2314 changed the title Shallowness fetch of git fetching Shallowness of git fetching Nov 20, 2023
@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-11-20-nix-team-meeting-minutes-105/35902/1

@DavHau
Copy link
Member

DavHau commented Nov 23, 2023

I'll start working on the benchmarks and post here as soon as I have some results.

@DavHau
Copy link
Member

DavHau commented Nov 30, 2023

Git shallow fetching benchmark

Goals

Collect data to decide if fetching the git history can be optimized so that it becomes viable to have this as the only strategy for fetching git repos, or if it is better to make fetching history conditional via a flag.

The two variants that are raced against each other in this benchmark:

  • Variant 1: Fetch only the tree (the files) of a single given revision
  • Variant 2: Like Variant 1 but also fetch the commit history (treeless), in order to be able to return revCount.

Using the git cli tool:

  • Variant 1 requires 1 invocation:
    • fetch tree
  • Variant 2 requires 3 invocations:
    • fetch tree
    • fetch commits
    • compute rev-count

Setup of the benchmark

Repo: nixpkgs

Remotes:

  • github
  • gitlab
  • bitbucket
  • gitea

Tested operations (measured in seconds):

  1. Fetch tree
  2. Fetch commits
  3. Compute rev-count

Test Machines:

  • cloud VM
  • notebook + 5G hotspot

Measured properties

  • tree seconds: Seconds it took to fetch the tree shallowly
  • commit seconds: Seconds it took to fetch the commit history treeless
  • rev-count seconds: Seconds it took to compute the revision count

Computed properties

  • total duration: Sum of all 3 steps
  • slowdown factor: Total fetching time of Variant 2 relative to Variant 1

Results

https://docs.google.com/spreadsheets/d/1yLo4XMCKj-4xZVHBz2z4BwxBWI76xjVe7gtNgaNG7co/edit?usp=sharing

Reproduce

nix build github:DavHau/nix-git-research#benchmark-fod -L

Observations

Support for commit fetching varies

Fetching the commit history in an optimized way requires using a --filter for git fetch. Bitbucket, which was one of the tested providers does not support filtering, therefore git falls back to full cloning, leading to very bad performance just for getting the rev-count.

Overhead of Variant 2 is significant

Even in the best cases, fetching the commit history slows down the fetching by at least a factor of 3.7.

For example, from a cloud VM fetching a nixpkgs revision from github takes 8 seconds, while fetching the commit history and computing the revCount takes an additional 24 seconds, which is a slowdown by factor of 4.

In the case of nixpkgs, the number of objects that need to be fetched for the history is far larger than the number of tree objects for a single revision. It's ~500k objects for the history vs. ~60k for the tree.

Inconsistent overhead of Variant 2

While fetching the tree gives relatively consistent results, fetching the commit history is fast on some runs, but much slower on others, even for the same provider. This was only experienced with github on the 5G connection, and not from within the cloud. It might be an outlier, but it might be possible that github doesn't serve all kinds of requests with the same priority.

Opinion

My opinion after these observations is that we should go with approach 2/ii from above, meaning the history fetching becomes optional and with it the revCount attribute as well.

Always fetching the history eagerly as described in 2/i would impose:

  • significant slowdown (at least 3.7x slower for nixpkgs, adding at least 22s)
  • and either:
    • relying on servers to support filtering
    • or fetching the full history which performs much worse (40x slower for nixpkgs)

Roadmap Proposal

  1. Port the shallow fetching PR over to libgit2
  2. Add a shallowRev = true/false argument to fetchTree which is false by default in fetchTree and true by default in fetchGit. (I think it is useful for this flag to have shallow in its name, because that is what people will most likely search for in manuals).
  3. Add support for submodules
  4. Investigate if caching shallow repos provides any benefit
  5. Add caching or not
  6. Investigate filtering support of libgit2
  7. Add filtering to reduce the overhead when fetching the commit history in case shallowRev = true.

After the nix team adopts and/or confirms the plan, I can start working on it.

EDIT: an alternative plan could be to first optimize the current case, which means adding filtering support for the commit history right away. If libgit2 supports this, then this might be easy to do and it already provides quite a significant performance gain in many cases.

@thufschmitt
Copy link
Member

Thanks for the detailed analysis @DavHau !

Pursuing 2.ii seems to be the right call indeed given your benchmarks (I didn't know about --unshallow, that's pretty neat btw).

The roadmap makes sense too, although I wouldn't be too strict about the ordering of the steps (different people will probably have different priorities here).

I'm not sure about the shallowRev name (at least it doesn't mean much to me), but that can be a separate dicsussion.

@fricklerhandwerk fricklerhandwerk added the fetching Networking with the outside (non-Nix) world, input locking label Dec 18, 2023
@thufschmitt thufschmitt moved this to To triage in Nix team Dec 22, 2023
@DavHau
Copy link
Member

DavHau commented Dec 24, 2023

I was on holiday and will pick this up now.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-12-15-nix-team-meeting-minutes-112/38155/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fetching Networking with the outside (non-Nix) world, input locking
Projects
Status: Defined work
Development

No branches or pull requests

5 participants