The Omega Repo
An omega-repo is a special type of mono-repo wherein the meta-repo and sub-repos live in one Git repository.
I have under review a command that will enable a full omega-repo workflow. To create a brand-new sub-repo:
$ git meta new my/sub/repo
Created new sub-repo my/sub/repo. It is currently empty. Please
stage changes and/or make a commit before finishing with 'git meta commit';
you will not be able to use 'git meta commit' until you do so.
$ touch my/sub/repo/README.md
$ git meta add .
Or, to create a sub-repo by importing existing history:
$ git meta new -i http://example.com/my-repo.git my/imported/repo
At this point, your new repositories are staged for commit:
On branch master
Changes to be committed:
(use "git meta reset HEAD <file>..." to unstage)
new file: my/imported/repo (submodule, newly created)
new file: my/sub/repo (submodule, newly created)
new file: my/sub/repo/README.md
You can commit them and push to a branch as normal:
$ git meta commit -m "created some sub-repos"
$ git meta push origin master:new-repos
It's important to note that the new command does not interact with any back-end and has no side-effects outside the repository on which it operates. The changes it makes and any created repositories (e.g., my/imported/repo) exist only in your local repository until pushed upstream.
Implementing this scheme is easy, and doesn't conflict with the existing
proposal for mono-repos: when sub-repos are created, they simply have the same
URL as the meta-repo. That is, all sub-repos and the meta-repo share the same
URL. Specifically, each submodule has an entry of "." in the .gitmodules
file. Existing methods of opening, fetching from, and pushing to sub-repos
remain the same. Alternatively, we could completely ignore URLs configured
with a submodule.
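For illustration, the .gitmodules entry for the sub-repo created above might look like the following under this scheme; the exact entry git-meta would write is an assumption here, but the key point is the "." URL:

```
[submodule "my/sub/repo"]
	path = my/sub/repo
	url = .
```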
A sub-repo is created by mounting a locally-generated repository as a submodule. It must have a commit before you can push, but otherwise an ordinary git meta push is sufficient to land a newly-created repository.
Removing a submodule is sufficient to remove a sub-repo.
Rename is effectively implemented as delete + create.
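At the plain-Git level, these lifecycle operations reduce to ordinary edits of the index and the .gitmodules file. The following sandbox sketches "rename = delete + create" by manipulating gitlink entries directly rather than through git-meta (whose exact commands for this are not shown in this proposal); the paths are hypothetical:

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q meta
cd meta
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "init"

# "Create": record a gitlink (mode 160000) plus a .gitmodules entry whose url is "."
printf '[submodule "old/path"]\n\tpath = old/path\n\turl = .\n' > .gitmodules
git update-index --add --cacheinfo 160000,"$(git rev-parse HEAD)",old/path
git add .gitmodules
git commit -q -m "create sub-repo old/path"

# "Rename" = delete the old entry + create one at the new path
git rm -q --cached old/path
printf '[submodule "new/path"]\n\tpath = new/path\n\turl = .\n' > .gitmodules
git update-index --add --cacheinfo 160000,"$(git rev-parse HEAD)",new/path
git add .gitmodules
git commit -q -m "rename sub-repo to new/path"

# The tree now records the sub-repo only at its new path
git ls-tree -r --name-only HEAD
```

Note that nothing above touches a server: both commits exist only in the local clone until pushed.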
It's important to note that none of the lifecycle operations described above require interacting with the back-end. If a user locally creates a sub-repo then changes his or her mind, there is no "mess" left on the back-end.
This approach solves several problems:
- Submodule URLs -- Our current prototype relies on a unique capability of gitolite that allows repository names to contain nested paths, e.g. foo/bar/baz. No other widely-used Git hosting solution supports this feature. Without it, we would need a scheme to map nested paths to flat names, or to translate URLs somewhere along the way.
- Sub-repo lifecycle transitions -- There are many potential corner cases, especially in the back-end, involved in creating, deleting, and renaming repositories. If sub-repos don't exist as separate repositories, these issues mostly disappear, or at worst are resolved using normal Git conflict resolution.
- Sub-repo lifecycle APIs -- The Git protocol does not provide for server-side operations such as repository creation or deletion. Thus, such operations are tied to specific Git hosting solutions and fall outside the purview of git-meta. With the new approach, we can create, delete, and rename sub-repos using standard Git operations.
- Sub-repo lifecycle changes are local -- When sub-repos must be backed one-to-one by back-end repos, you cannot make lifecycle changes without interacting with, and manipulating, the back-end. With the new approach, sub-repos can be created, removed, renamed, etc., with purely local operations.
- Sub-repo lifecycle changes are first-class Git changes -- The entire history of a sub-repo lives in the mono-repo. If a user creates a sub-repo and never pushes the change that introduced it, it actually never happened.
Additionally, this approach has other advantages and allows for new possibilities:
- Easier management -- Maintaining and working with a single repository on the back-end may dramatically simplify mono-repo maintenance.
- Better mono-repo mobility -- Because everything lives in a single Git repository, it's extremely easy to move the mono-repo around. If a developer wants to work entirely locally, for example, it's very easy to fetch all necessary refs in one shot rather than issuing potentially thousands of fetch commands.
- Any Git repo can be a mono-repo! -- With the new approach, any individual repository can be a mono-repo. A normal Github repository can be a mono-repo. You don't need a managed server instance with programmatic access for repository creation.
- Mono-repos can be distributed again -- Because they are self-contained, mono-repos can be local, and they can be peers. Particularly, if we choose to ignore submodule URLs, we can have true remotes and leverage normal forking.
I think there are more advantages that I haven't considered. This approach gets us a little closer to the ideal of thinking of code as being in a single repository, where sub-repos are defined as an optimization to avoid the need to fetch and check out the whole world.
I can think of a few disadvantages to this approach over having separate back-end repositories for sub-repos, and I'm sure others will find more:
- We lose repo-level permissioning -- One advantage of breaking code into sub-repos is the ability to leverage, e.g., Github's ability to specify permissions on a per-repository basis. By using a single repository, we lose this. However, as we've discussed the mono-repo more, I've come to believe that this ability is not sufficient anyway; we want to think of the mono-repo as a single repo, and need the ability to specify permissions at a greater-than-repo level. For example, if all external repositories live in a single tree, we may wish to designate a single user (or group) to review all external code additions.
- Performance -- It may be less efficient to put all commits in a single repository, and with truly large systems we may lose some ability to distribute work. I do not think this issue is truly a problem; the physical size of even very large repositories can be readily handled by modern distributed filesystems.
- Discoverability -- We may get less use from built-in facilities for searching and navigation from, e.g., Github, if there are no formal sub-repos.
- Cloning -- There is no way to quickly and easily clone a single sub-repo.
- Submodule Refs -- You lose the ability to mirror meta-level refs inside the individual submodules, since pushing them to the server would cause a collision. But someday git meta will be able to manage these locally for you.
We may be able to mitigate some of the above disadvantages by using Git namespaces. We could establish a sub-repo namespace, where all branches in the meta-repo are mapped to branches in each sub-repo namespace; these branches and namespaces would be maintained by server-side hooks. Thus, there would be branches available to discovery tools to allow inspection of the heads in each sub-repo, and users could easily make local clones containing only refs from a specific sub-repo.
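As a rough sketch of the namespace idea (the namespace name, ref names, and the hook-driven mirroring step are all hypothetical here), a server-side hook could copy a meta-repo branch into a per-sub-repo namespace, and a client restricted to that namespace would then see only that sub-repo's view of the refs:

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q --bare server.git
cd server.git
git config user.email dev@example.com
git config user.name dev

# Build a commit to stand in for the head of a meta-repo branch
tree=$(git mktree </dev/null)
commit=$(git commit-tree "$tree" -m "meta commit")
git update-ref refs/heads/master "$commit"

# A server-side hook could mirror the branch into a per-sub-repo namespace
git update-ref refs/namespaces/my-sub-repo/refs/heads/master "$commit"

# A client pointed at the namespace sees the mirrored ref with the
# namespace prefix stripped, i.e. as refs/heads/master
GIT_NAMESPACE=my-sub-repo git ls-remote .
```

Since upload-pack and receive-pack honor GIT_NAMESPACE, discovery tools and local clones could be given a sub-repo-only view without any risk of ref collisions in the shared repository.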
@novalis and others raised questions about possible server-side performance problems with this scheme, especially for fetches and pushes. To address those concerns and see how a repository implemented this way would feel at scale, I created the generate-repo.js script and used it to create an enormous mono-repo, which I have hosted here: https://github.com/bpeabody/mongo. This repository has 260,000 commits on master and around 26,000 submodules.
Unfortunately, you can't clone that repository and begin using it like a mono-repo, because Github doesn't appear to support the uploadpack.allowReachableSHA1InWant setting we need to directly fetch commits (nor do Gitlab or Bitbucket). You can, however, clone that repository and host it locally.
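When hosting the clone yourself, the missing capability is a single server-side configuration switch. A minimal sketch, using a fresh bare repository as a stand-in for the clone:

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
# Stand-in for a local clone of the mono-repo
git init -q --bare mono.git

# Allow clients to fetch arbitrary reachable commits by SHA-1, which is
# needed to fetch the exact submodule commits recorded in the meta-repo
git -C mono.git config uploadpack.allowReachableSHA1InWant true

# Confirm the setting took effect
git -C mono.git config uploadpack.allowReachableSHA1InWant
```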
Findings so far are very promising. The clone operation shows the performance that would be expected from a repository with so many commits and submodules, taking about 45s. Similarly, push and fetch take just a few seconds each. I have yet to see any evidence of the supralinear degradation that we feared, which would have been very evident in a repository of this size.