Fetch and clone support (bare) #450
I recently ran into problems cloning a large repository over an extremely slow data link. After a certain timeout, the server (or an intermediate proxy) terminated the connection. Each time, the server generated a huge batch of objects for the head commit (in fact, to get a commit, you need to get all of its objects, even those that were introduced by earlier commits). Git gets an error and doesn't unpack the truncated response, so it has to be unpacked manually. The list of 'have' directives in the protocol request didn't help me either. Please consider optimizing the algorithm so that data that was already transmitted is not thrown away when the connection breaks. (P.S. This shallow clone took me 24 GB over 1 week.)
Did you try the
That's true - the reason might be that it is unable to validate the received objects as the trailing hash of the received pack would be missing. However, I have also been burned by this, which is why there is a special restore mode when receiving a pack. It salvages the received objects at least. However, the way the git protocol works, the server may still send all the objects the next time the reference is requested, as the algorithm's granularity is only per commit. With partial packs, it's entirely unclear which objects are present and which aren't unless they are all traversed and verified. So, in order to actually benefit from keeping a partial pack, one would have to see which commits are completely available (while handling That said, there are a bunch of things one could implement to help this case if the client and server implemented some custom extensions.
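To illustrate what "completely available" means here, the following is a minimal sketch under stated assumptions, not gitoxide's restore mode. It uses a hypothetical toy model in which object ids are plain strings and `refs_of` records, for every object salvaged from the partial pack, the objects it points to. A commit is only usable if everything reachable from it was received.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical toy model: `refs_of` maps each salvaged object id to the ids
/// it references (commit -> tree and parents, tree -> blobs and subtrees).
/// An id that is missing from the map was not received.
fn is_complete<'a>(root: &'a str, refs_of: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // Skip objects that were already visited via another path.
        if !seen.insert(id) {
            continue;
        }
        match refs_of.get(id) {
            // Object is present: continue with everything it references.
            Some(children) => stack.extend(children.iter().copied()),
            // Object is missing: the root cannot be used as a `have`.
            None => return false,
        }
    }
    true
}
```

Only commits for which such a traversal succeeds could safely be advertised to the server, which is why a partial pack needs a full traversal before it is of any use.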
Awesome, I love it! I would have given up for sure!
It would certainly be interesting to learn more about the algorithm you used to split up big clones into many smaller ones, as it wouldn't require a server and client extension to the protocol. Such a client-side-only algorithm could possibly be implemented in
That way the most complex thing will be the validation along with the matching.
Some possible states are still missing though, like deletion in pushes.
It's the simplest possible one, but it shows the test framework is up to the task now so it can be test-driven. We should be able to construct a test for each possible instruction and eventually pass all tests, including the baseline ones.
Even though for now everything is without validation
It's not documented in `git-push`, even though git parses it fine for some reason.
…e::Name`. (#450) That way it's made clear the remote can also be a URL, while rejecting ill-formed UTF-8. The latter isn't valid for remote names anyway as these only support a very limited character set. Note that this error is currently degenerate, making it appear as if the remote name doesn't exist if ill-formed UTF-8 is found in what appears to be a symbolic ref.
…#450) We can also parse it, adding yet another variant to `fetch::Refs`.
That way the caller has to be aware of the possibility of an unborn branch (probably the only unborn branch) on the remote.
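The effect of adding a variant can be sketched as follows. The enum, its variant names and its fields are hypothetical stand-ins chosen for illustration, not the actual `fetch::Refs` definition. The point is that an exhaustive `match` forces every caller to decide what an unborn branch means for them.

```rust
/// Hypothetical stand-in for a parsed remote reference; names are assumptions.
#[derive(Debug, PartialEq)]
enum RemoteRef {
    /// A reference pointing directly at an object.
    Direct { name: String, object: String },
    /// A symbolic reference whose target exists on the remote.
    Symbolic { name: String, target: String, object: String },
    /// A symbolic reference whose target has no commit yet.
    Unborn { name: String, target: String },
}

/// Returns the object id to fetch, or `None` for an unborn branch.
fn object_to_fetch(r: &RemoteRef) -> Option<&str> {
    match r {
        RemoteRef::Direct { object, .. } | RemoteRef::Symbolic { object, .. } => {
            Some(object.as_str())
        }
        // Nothing to fetch: the branch exists by name only.
        RemoteRef::Unborn { .. } => None,
    }
}
```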
…stination) (#450) That's exactly what git does, so it's probably the right thing to do if in doubt.
Previously we assumed this could only happen for `HEAD`, but in fact dangling symrefs are possible and they might end up in the server response that way.
Don't use `static` unless it's really needed
However, it's not yet refreshed in the repository we create, so that needs fixing. Implementing `repo.config()` would be too much effort for now, so let's continue forcing it in another way.
gitoxide integration: fetch

This PR is the first step towards resolving #1171. In order to get there, we integrate `gitoxide` into `cargo` in such a way that one can control its usage in nightly via `-Zgitoxide` or `-Zgitoxide=<feature>[,featureN]`.

Planned features are:

* **fetch** - all fetches are done with `gitoxide` (this PR)
* **shallow_index** - the crates index will be a shallow clone (_planned_)
* **shallow_deps** - git dependencies will be a shallow clone (_planned_)
* **checkout** - plain checkouts with `gitoxide` (_planned_)

The above list is a prediction and might change as we understand the requirements better.

### Testing and Transitioning

By default, everything stays as is. However, relevant tests can be re-run with `gitoxide` using

```
RUSTFLAGS='--cfg always_test_gitoxide' cargo test git
```

There are about 200 tests with 'git' in their name and I plan to enable them one by one. That way the costs for CI stay manageable (my first measurement with one test was 2min 30s), while allowing us to take one step at a time.

Custom tests shall be added once we realize that more coverage is needed.

That way we should be able to maintain running `git2` and `gitoxide` side by side until we are willing to switch over to `gitoxide` entirely on stable cargo. Then turning on `git2` might be a feature toggle for a while until we finally remove it from the codebase.
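How a `--cfg` flag passed through `RUSTFLAGS` can reach the code is sketched below. The helper name and its parameter are hypothetical and only stand in for however `cargo` actually combines the two switches; `cfg!` itself is standard Rust and is evaluated at compile time.

```rust
/// Hypothetical helper: `unstable_flag_set` stands for `-Zgitoxide=fetch`
/// having been passed on the command line.
#[allow(unexpected_cfgs)]
fn use_gitoxide_for_fetch(unstable_flag_set: bool) -> bool {
    // True only when the crate was compiled with `--cfg always_test_gitoxide`,
    // which forces the `gitoxide` code path for every test.
    cfg!(always_test_gitoxide) || unstable_flag_set
}
```

Because the flag is baked in at compile time, it also applies to binaries that are built and run by the tests, which environment variables may not reach.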
_Please see the above paragraph as an invitation for discussion, it's merely a basis to explore from and improve upon._

### Tasks

* [x] add feature toggle
* [x] setup test system with one currently successful test
* [x] implement fetch with `gitoxide` (MVP)
* [x] fetch progress
* [x] detect spurious errors
* [x] enable as many git tests as possible (and ignore what's not possible)
* [x] fix all git-related test failures (except for 1: built-in upload-pack, skipped for now)
* [x] validate that all HTTP handle options that come from `cargo` specific values are passed to `gitoxide`
* [x] a test to validate `git2` code can handle crates-index clones created with `gitoxide` and vice-versa
* [x] remove patches that enabled `gitoxide` enabled testing - it's not used anymore
* [x] ~~remove all TODOs and use crates-index version of `git-repository`~~ The remaining 2 TODOs are more like questions for the reviewer.
* [x] run all tests with gitoxide on the fastest platform as another parallel task
* [x] switch to released version
* [x] [Tasks from first review round](#11448 (comment))
* [x] create a new `gitoxide` release and refer to the latest version from crates.io (instead of git-dependency)
* [x] [address 2nd review round comments](#11448 (comment))

### Postponed Tasks

I suggest going breadth-first and implementing the most valuable features first, and then aiming for a broad replacement of `git2`. What's left is details and improved compatibility with the `git2` implementation that will be required once `gitoxide` should become the default implementation on stable to complete the transition.

* **built-in support for serving the `file` protocol** (i.e. without using `git`). Simple cases like `clone` can probably be supported quickly, `fetch` needs more work though due to negotiation.
* SSH name fallbacks via a native (probably ~~libssh~~ (avoid LGPL) `libssh2` based) transport. Look at [this issue](#2399) for some history.
* additional tasks from [this tracking issue](GitoxideLabs/gitoxide#450 (comment))

### Proposed Workflow

I am now using [stacked git](https://stacked-git.github.io) to keep commits meaningful during development. This will also mean that before reviews I will force-push a lot as changes will be bucketed into their respective commits.

Once review officially begins I will stop force-pushing and create small commits to address review comments. That way it should be easier to understand how things change over time.

Those review-comments can certainly be squashed into one commit before merging.

_Please let me know if this is feasible or if there are other ways of working you prefer._

### Development notes

* unrelated: [this line](https://github.com/rust-lang/cargo/blob/9827412fee4f5a88ac85e013edd954b2b63f399b/src/cargo/ops/registry.rs#L620) refers to an issue that has since been resolved in `curl`.
* Additional tasks related to a correct fetch implementation are collected in this [tracking issue](GitoxideLabs/gitoxide#450). **These affect how well the HTTP transport can be configured, needs work**
* _authentication_ [is quite complex](https://github.com/rust-lang/cargo/blob/37cad5bd7f7dcd2f6d3e45312a99a9d3eec1e2a0/src/cargo/sources/git/utils.rs#L490) and centred around making SSH connections work. This feature is currently the weakest in `gitoxide` as it simply uses `ssh` (the program) and calls it a day. No authentication flows are supported there yet and the goal would be to match `git` there at least (which it might already do by just calling `ssh`). Needs investigation. Once on par with `git` I think `cargo` can restart the whole fetch operation to try different user names like before.
  - the built-in `ssh`-program based transport can now understand permission-denied errors, but the capability isn't used after all since a builtin ssh transport is required.
* It would be possible to implement `git::Progress` and just ignore most of the calls, but that's known to be too slow as the implementation assumes a `Progress::inc()` call is as fast as an atomic increment and makes no attempt to reduce its calls to it.
* learning about [a way to get custom traits in `thiserror`](dtolnay/thiserror#212) could make spurious error checks nicer and less error prone during maintenance. It's not a problem though.
* I am using `RUSTFLAGS=--cfg` to influence the entire build and unit-tests as environment variables didn't get through to the binary built and run for tests.

### Questions

* The way `gitoxide` is configured the user has the opportunity to override these values using more specific git options, for example using url specific http settings. This looks like a feature to me, but if it's not `gitoxide` needs to provide a way to disable applying these overrides. Please let me know what's desired here - my preference is to allow overrides.
* `gitoxide` currently opens repositories similar to how `git` does which respects git specific environment variables. This might be a deviation from how it was before and can be turned off. My preference is to see it as a feature.

### Prerequisite PRs

* #11602
Is this actually complete despite the unfinished tasks in the OP?
It works for all intents and purposes but isn't perfect in some details. These are still tracked here, maybe they can be moved into a follow-up issue.
We want shallow clones and this issue tracks what needs to be done to get there.
Prerequisite tasks for bare clones

- `url.base.insteadOf` and `….pushInsteadOf`
- `gix fetch` with fast-forward support #548
- clone for `git-repository` #551
- unbuffered progress messages - lines are buffered line by line, but that's it. Hence we receive everything in real-time already.
- `naive` negotiation in favor of proper `consecutive` one (or else clones from some servers may fail) via integrate `gix-negotiate` #861

Follow-ups of
ditch naive implementation
Most of these are optional, but represent opportunities to make `gix` better, so shouldn't be lost.

- see if `-commit_graph()` can return our own type connected to `Repo`, or if the graph can be made to be more convenient to use with `gix::Id`
  - not really, but getting traversal with commitgraph support would be great. Probably it can simply be retro-fitted to the existing traversal. But then again, it would speed up generating ids, but most people using that kind of traversal would just want to access commits plainly, which forces loading them anyway. So it's probably OK to keep it as is.
  - retro-fitted commit-graph support, because it will be useful to some
- `gix corpus` MVP #897 (initial version with tracing)
- `gix corpus` with a little more to do

Additional tasks
These are for correctness, but don't block `cargo` integration as no `cargo` tests depend on them.

- `git` itself uses `content-length` as the buffer is pre-created in memory.
- `git fetch --update-head-ok`. Cargo passes it to the CLI and maybe it's something we will need too just to make its updates work.

Tasks for proper transport configuration

- `http.<url>.*` based option overrides

Tasks for shallow cloning

Research needed, but the libgit2 issue might be helpful for more hints.

Research

- `.git/shallow` is present (containing the commits that are the shallow boundary, present, but without parents)

Watch out

- `git-repository`, which is tracked in `gix` towards 1.0 #470.
- `depth > 1` or converting it back to having full history.
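As a small illustration of the `.git/shallow` note under Research, here is a minimal sketch of reading the shallow boundary. It assumes the file holds one 40-character hex SHA-1 commit id per line; the function name is made up for this example and is not part of `gix`.

```rust
/// Parse the contents of a `.git/shallow` file into boundary commit ids.
/// Assumption: one 40-character hex SHA-1 per line; anything else is rejected.
fn parse_shallow(contents: &str) -> Result<Vec<String>, String> {
    contents
        .lines()
        .map(str::trim)
        // Tolerate blank lines, e.g. a trailing newline at the end of the file.
        .filter(|line| !line.is_empty())
        .map(|line| {
            if line.len() == 40 && line.bytes().all(|b| b.is_ascii_hexdigit()) {
                // Normalize to lowercase so ids compare consistently.
                Ok(line.to_ascii_lowercase())
            } else {
                Err(format!("not a SHA-1 hex id: {line:?}"))
            }
        })
        .collect()
}
```

A traversal would treat each returned commit as having no parents, since the boundary commits are present while their ancestors are not.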