Implement GC #12
Until this is implemented, is there a way to completely eliminate data from the repo?
I'm afraid not, at least not without a lot of manual steps. You'd need to manually roll back to an earlier operation and manually remove any refs keeping the unwanted commits alive.
If someone implements GC, it would also be a good time to "compact" the keep refs (assuming that jj doesn't already do that). Some Git operations scale with the number of refs, so having thousands of jj/keep refs can slow them down.
The current workaround for the refs building up is running … It'd obviously be nicer if …
@ilyagr I mention it because, at my work, a certain Phabricator script invokes …
I think they are the cause of #293. We could even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.
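As a sketch of that idea, here is a toy reachability model (made-up names; not jj's code): with a single merge commit whose parents are all visible heads, one ref keeps everything reachable.

```python
commits = {}  # commit id -> list of parent ids

def add_commit(cid, parents=()):
    commits[cid] = list(parents)

def reachable(roots):
    """All commits reachable from the given ref targets."""
    seen, stack = set(), list(roots)
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(commits[cid])
    return seen

# Three independent heads that would otherwise need three jj/keep refs.
add_commit("a"); add_commit("b"); add_commit("c")
add_commit("keep-all", parents=["a", "b", "c"])  # one octopus merge

refs = {"refs/jj/keep-all": "keep-all"}  # a single ref suffices
assert reachable(refs.values()) == {"a", "b", "c", "keep-all"}
```

The trade-off, discussed below, is that this produces commits with very many parents, which some tools may not handle efficiently.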
For the record, I hope it will be mitigated by gitoxide. Last time I checked, …
I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2), but I haven't done anything to confirm that.
Good point, good to know.
On one hand, I think it's clever. On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has. It made me wonder how many git tools might have code like that. In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents. (Lines 388 to 399 in b37293f)
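To make the worry concrete, here is a hedged sketch (not the actual code at b37293f) of why a pairwise check over a commit's parents grows quadratically with the parent count:

```python
def count_pairwise_comparisons(parents):
    """Count the comparisons a naive pairwise check performs."""
    comparisons = 0
    for i in range(len(parents)):
        for j in range(i + 1, len(parents)):
            comparisons += 1
            # e.g. `if parents[i] == parents[j]: ...` in a dedup check
    return comparisons

# n parents cost n*(n-1)/2 comparisons: fine for 2, painful for thousands.
assert count_pairwise_comparisons(list(range(10))) == 45
assert count_pairwise_comparisons(list(range(1000))) == 499500
```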
It also might be fine; I'm unsure.
iirc, libgit2 unpacks commits to sort refs by committer timestamp (and to exclude non-commit refs), then picks the first ~500 refs or so. I don't know which phase this function belongs to. I think the …
When I reviewed that, I was also wondering to myself how long it'll take before someone runs into that :)
I suspect it's rare that tools inspect unusual refs like the jj/keep refs.
This is also done by Git, not for sorting, but for finding a cutoff date (calculated from the remote refs) to determine which commit dates would be considered "recent". See how the … Also note that I don't think libgit2 has protocol v2, so there is no reduced list of refs.
(🤔 out loud) Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also seems conceptually nice (assuming it's correct). Unfortunately, it only makes …

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. After that, "normal" git gc will do more or less the right thing.

Update: The content of the commits is not important here, so this is more of a "verbatim rebase" AKA "reparenting" from #1027. Similarly, this could look like a rebase in the operation log (create one operation for all the changes up to operation X-1 and rebase operation X onto that). I think operation log ids are randomly generated, so we can consider them as change ids. (We could also have commit ids for them.) Alternatively, we could even not do any sort of GC on the operation log, but teach …
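The merge-commit chain and the reparenting step can be modeled with a small toy (hypothetical names; a sketch of the idea above, not jj's backend):

```python
commits = {}  # commit id -> list of parent ids

def add(cid, parents=()):
    commits[cid] = list(parents)

def reachable(root):
    seen, stack = set(), [root]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(commits[c])
    return seen

# Two operations, each with its own view heads.
add("h1"); add("h2")                       # heads of op1's view
add("op1", parents=["h1", "h2"])           # no previous operation
add("h3")                                  # head of op2's view
add("op2", parents=["op1", "h3"])          # first parent = previous op's merge

assert "h1" in reachable("op2")            # everything is still reachable

# GC at op2: drop the link to op1's merge commit (a "verbatim rebase").
commits["op2"] = ["h3"]
assert reachable("op2") == {"op2", "h3"}   # op1's history is now collectible
```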
Also thinking out loud...
This means that when we GC, we cannot go back to all previous operations anymore; don't we want to save some (say, from the past few weeks)? I guess we can solve this by making a commit that refers to the heads of all the operations we want to save, perhaps using Martin's giant merge commit idea. (That also means that if/when we develop a way to simplify a list of commits down to only the commits necessary to refer to all of them, we can use that optimization when creating the giant merge commit.) Then, when rebasing the latest merge commit, we not only remove one parent but also add this newly made commit as one of its parents.
This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed. Having said that, being able to store metadata about commits seems like a useful idea (e.g., going off the partial/shallow clone idea, we could say that we know about this commit because a child commit was fetched from such-and-such remote on such-and-such day). Currently, though, we don't have a design for how to send/receive this data, or, if we do send it, whether it needs to have a hash (which is probably not needed at such an early stage).
Yes, but I didn't think we would want to save all of them. In the model I proposed, choosing the cutoff point is simply choosing which operation's merge commit is "rebased/reparented" (see also the new note I added to my previous comment about reparenting; I just mean that the contents of the merge commit are irrelevant, only the parents matter). I'll also mention that there's only a minor difference between Martin's huge merge commits and my tree of merge commits; my version simply creates less humongous (but still large) merge commits and makes some of them children of others to achieve the same effect. OTOH, in Martin's version, only one humongous merge commit is relevant at any one time.
Good point. I guess in my model the GC of the op log has to correspond to the GC of the actual commits in those operations. I think that might be OK, especially in a first version. It could alternatively be worked around: we could remember some (or all) of the older operations without keeping the corresponding "merge commits" around, so the commits for those older operations can be GCed.
In order to implement GC (#12), we'll need to somehow prune old operations. Perhaps the easiest implementation is to just remove unwanted operation files and put a tombstone file in their place (like Git's shallow file). However, the removed operations might be referenced by another jj process running in parallel. Since the parallel operation thinks all the historical head commits are reachable, the removed operations would have to be resurrected (or the index data fixed up, etc.) when the op heads get merged.

The idea behind this patch is to split the "op log" GC into two steps:

1. recreate the operations to be retained and make the old history unreachable;
2. delete unreachable operations if the head was created e.g. 3 days ago.

The latter will be run by "jj util gc". I don't think GC can be implemented 100% safely against lock-less append-only storage, and we'll probably need some timestamp-based mechanism to avoid removing objects that might be referenced by an uncommitted operation.

FWIW, another nice thing about this implementation is that the index is automatically invalidated as the op id changes. The bad thing is that the "undo" description would contain an old op id. The performance seems pretty okay.

This patch also includes support for "op abandon root.." because it can be implemented for free. I'm not sure what should be checked as a prerequisite. I made it require "op restore root && op abandon root.." for now.
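The two-step scheme can be sketched with a toy op store (made-up structures and a 3-day grace period, per the description above; not jj's real storage format). Step 1 rewrites the retained operation so the old chain becomes unreachable; step 2 deletes unreachable operation files, but only those older than the grace period, since a concurrent process may still reference recently written files.

```python
import time

ops = {}  # op id -> {"parent": id or None, "mtime": float}

def add_op(oid, parent, mtime):
    ops[oid] = {"parent": parent, "mtime": mtime}

def reachable(head):
    """Walk the (linear, in this toy) op chain from the head."""
    seen, cur = set(), head
    while cur is not None and cur not in seen:
        seen.add(cur)
        cur = ops[cur]["parent"]
    return seen

now = time.time()
add_op("op1", None, now - 10 * 86400)   # 10 days old
add_op("op2", "op1", now - 5 * 86400)   # 5 days old
add_op("op3", "op2", now - 60)          # 1 minute old

# Step 1: recreate the head operation with no parent (new id,
# making the old history unreachable).
add_op("op3-rewritten", None, now)
head = "op3-rewritten"

# Step 2: delete unreachable ops, keeping anything newer than 3 days
# because a parallel, uncommitted operation might still reference it.
keep_newer = now - 3 * 86400
live = reachable(head)
deleted = [o for o in list(ops)
           if o not in live and ops[o]["mtime"] < keep_newer]
for o in deleted:
    del ops[o]

assert sorted(deleted) == ["op1", "op2"]
assert "op3" in ops  # too recent to delete; grace period protects it
```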
Since new operations and views may be added concurrently by another process, there's a risk of data corruption. The keep_newer parameter mitigates this problem: it is set to preserve files modified within the last 2 weeks, which is the default of "git gc". Still, a concurrent process may replace an existing view that is about to be deleted by the GC process, and that view file would be lost. #12
With my jj repo, the number of jj/keep refs went down from 87887 to 27733. The .git directory size is halved, but we'll need to clean up extra and index files to save disk space. "git gc --prune=now && jj debug reindex" passed, so the repo shouldn't be corrupted. #12