
Implement GC #12

Open
martinvonz opened this issue Apr 7, 2021 · 15 comments
Labels
enhancement New feature or request

Comments

@martinvonz
Owner

No description provided.

@BatmanAoD
Collaborator

Until this is implemented, is there a way to completely eliminate data from the .jj repository, such as an inadvertently-snapshotted key file or target directory? (For colocated repos, I can of course delete .jj itself, but I'd rather not.)

@martinvonz
Owner Author

I'm afraid not, at least not without a lot of manual steps. You'd need to manually roll back to an earlier operation and manually remove any refs/jj/keep/* keeping the unwanted commits alive. It's honestly not that complicated to implement GC. Let us know if you're interested and we can provide some guidance. I actually thought I had already described the basic idea here, but I clearly haven't, so let me at least do that. Here is what I had in mind:

  1. Allow pruning old operations from the operation log. This involves teaching the algorithms that walk the operation log to behave reasonably when an older operation is missing. For example, jj op log must not crash. I don't know if we should record pruned operation IDs so we can distinguish that case from a broken operation store.

  2. Implement GC by walking the operation log and collecting all commits pointed to by refs and anonymous heads. Then pass that set into a new Backend::gc(&self, commits: &[CommitId]) function. The GitBackend will then remove refs/jj/keep/* refs not pointing into that set, and then run Git's normal GC (which might mean shelling out to git if neither libgit2 nor gitoxide supports GC).
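A rough sketch of step 2, with hypothetical types and names (`OpView`, `live_commits`, `refs_to_delete` are illustrative, not jj's actual API):

```rust
use std::collections::HashSet;

type CommitId = String;

/// A minimal stand-in for an operation's view: the commits its refs and
/// anonymous heads point to.
struct OpView {
    head_ids: Vec<CommitId>,
}

/// Walk the retained operations and collect every commit any view still
/// points to; this is the set handed to the backend's gc function.
fn live_commits(ops: &[OpView]) -> HashSet<CommitId> {
    ops.iter().flat_map(|op| op.head_ids.iter().cloned()).collect()
}

/// Backend side: keep refs (name, target) whose target is outside the
/// live set are the ones to delete before running Git's normal GC.
fn refs_to_delete(keep_refs: &[(String, CommitId)], live: &HashSet<CommitId>) -> Vec<String> {
    keep_refs
        .iter()
        .filter(|(_, target)| !live.contains(target))
        .map(|(name, _)| name.clone())
        .collect()
}
```

The real implementation would of course read operations and refs from storage; this only shows the set computation.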

@arxanas
Collaborator

arxanas commented Nov 24, 2023

If someone implements GC, it would also be a good time to "compact" the keep refs (assuming that jj doesn't already do that). Some Git operations scale with the number of refs, so having thousands of jj/keep/* refs can slow them down. As part of GC, we could calculate the set of heads of the live commits and create/preserve only those refs, while deleting all other jj/keep/* refs.
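The compaction could look roughly like this; `live_heads` and the `(child, parent)` edge list are illustrative assumptions, not jj's API:

```rust
use std::collections::HashSet;

/// Among the live commits, only the heads (commits that are not a parent
/// of any other live commit) need a jj/keep/* ref; one ref per head
/// anchors the whole live set. `edges` holds (child, parent) pairs.
fn live_heads<'a>(live: &[&'a str], edges: &[(&'a str, &'a str)]) -> HashSet<&'a str> {
    let non_heads: HashSet<&str> = edges.iter().map(|&(_, parent)| parent).collect();
    live.iter().copied().filter(|c| !non_heads.contains(c)).collect()
}
```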

@ilyagr
Collaborator

ilyagr commented Nov 24, 2023

The current workaround for the refs building up is running git pack-refs --all occasionally. This is mentioned offhandedly in https://martinvonz.github.io/jj/v0.11.0/git-compatibility/#co-located-jujutsugit-repos. I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs still take up disk space but might not actually slow things down much.

It'd obviously be nicer if jj could do this itself, combined with GC.

@arxanas
Collaborator

arxanas commented Nov 24, 2023

@ilyagr I mention it because, at my work, a certain Phabricator script invokes git for-each-ref --contains X, which hangs because I have 100k (packed) git-branchless refs 😄.

@martinvonz
Owner Author

I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs take up disk space, but might not actually slow things down much.

I think they are the cause of #293.

We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.

@yuja
Collaborator

yuja commented Nov 24, 2023

For the record, jj/keep/* refs are now excluded from git::import_refs(). git pack-refs --all will still help when importing a large number of tags and branches, though.

I think they are the cause of #293.

I hope it will be mitigated by gitoxide. Last time I checked, libgit2 unpacked the commit object for each ref, which was the major CPU overhead of jj git fetch. OTOH, git fetch is reasonably fast.

@martinvonz
Owner Author

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2), but I haven't done anything to confirm that.

@ilyagr
Collaborator

ilyagr commented Nov 24, 2023

I think they are the cause of #293.

Good point, good to know. I had a feeling that git pack-refs helps with this, but I'm not entirely sure. (Yuya just posted something much more informed about this point above.)

We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.

On one hand, I think it's clever.

On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has. It made me wonder how many other Git tools might have code like that. In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

jj/lib/src/rewrite.rs

Lines 388 to 399 in b37293f

fn new_parents(&self, old_ids: &[CommitId]) -> Vec<CommitId> {
    // This should be a set, but performance of a vec is much better since we
    // expect 99% of commits to have <= 2 parents.
    let mut new_ids = vec![];
    let mut add_parent = |id: &CommitId| {
        // This can trigger if we abandon an empty commit, as both the empty
        // commit and its parent are succeeded by the same commit.
        if !new_ids.contains(id) {
            new_ids.push(id.clone());
        }
    };
    for old_id in old_ids {
        // ... (remainder of the quoted snippet elided)

It also might be fine, I'm unsure.

@yuja
Collaborator

yuja commented Nov 24, 2023

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

iirc, libgit2 unpacks commits to sort refs by committer timestamp (and to exclude non-commit refs), then picks the first ~500 refs or so. I don't know which phase this function belongs to. I think the git CLI implements more modern features, so git fetch vs (libgit2-based) jj git fetch isn't a 1:1 comparison of the same logic.

@martinvonz
Owner Author

On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has.

When I reviewed that, I was also wondering to myself how long it'll take before someone runs into that :)

In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

I suspect it's rare that tools inspect unusual refs like refs/jj/keep/* to start with, so it's probably not going to happen often. And if it does happen, it should be fixed in the tool. If we're worried about it, we could create such commits as a tree with at most 100 parents per commit.
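The tree-of-merge-commits idea can be sketched as follows; `build_keep_tree` and `make_commit` are hypothetical names, with commit creation abstracted as a closure:

```rust
/// Combine `heads` into synthetic keep commits with at most `max_parents`
/// parents each, repeating until a single root remains; that root is the
/// only commit that needs a keep ref. Assumes `heads` is non-empty.
fn build_keep_tree<F>(mut heads: Vec<String>, max_parents: usize, mut make_commit: F) -> String
where
    F: FnMut(&[String]) -> String,
{
    while heads.len() > 1 {
        heads = heads
            .chunks(max_parents)
            .map(|chunk| make_commit(chunk))
            .collect();
    }
    heads.pop().expect("heads must be non-empty")
}
```

For 250 heads and a fan-out of 100, this creates three intermediate merge commits and one root, i.e. the whole live set hangs off a single ref via commits of bounded width.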

@jonathantanmy
Collaborator

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

iirc, libgit2 unpacks commits to sort refs by committer timestamp (and exclude non-commit refs)

This is also done by Git, not for sorting, but for finding a cutoff date (calculated from the remote refs) to determine what commit dates would be considered "recent". See how the cutoff variable in mark_complete_and_common_ref() in fetch-pack.c is used.

Also note that I don't think libgit2 has protocol v2, so there is no reduced list of refs.

@martinvonz martinvonz self-assigned this Dec 1, 2023
@ilyagr
Collaborator

ilyagr commented Dec 3, 2023

(🤔 out loud)

Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also conceptually nice (assuming it's correct). Unfortunately, it only makes the git log --all problem discussed in that PR worse.

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

Update: The content of the commits is not important here, so this is more of a "verbatim rebase" AKA "reparenting" from #1027.
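A minimal sketch of the reparenting step, assuming a simplified model where each operation's merge commit is represented just by its parent list, with parents[0] pointing at the previous operation's merge commit (all names hypothetical):

```rust
/// `op_merges` is the chain of per-operation merge commits (oldest first),
/// each entry being that commit's parent list. Dropping the chain before
/// `cutoff` and detaching the new oldest merge commit from its predecessor
/// makes everything only reachable through older operations collectable
/// by a normal `git gc`.
fn reparent_at_cutoff(op_merges: &mut Vec<Vec<String>>, cutoff: usize) {
    // Discard the chain before the cutoff entirely ...
    op_merges.drain(..cutoff);
    // ... and detach the new oldest merge commit from its predecessor.
    if let Some(first) = op_merges.first_mut() {
        if !first.is_empty() {
            first.remove(0);
        }
    }
}
```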

Similarly, this could look like a rebase in the operation log (create one operation for all the changes up to operation X-1 and rebase operation X onto that). I think operation log ids are randomly generated, so we can consider them as change ids. (We could also have commit ids for them).

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).
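The three-state idea might look like this as a sketch (names illustrative, not jj's API):

```rust
use std::collections::HashSet;

/// `Unavailable` covers commits whose objects were garbage-collected
/// (or never fetched, in a partial/shallow clone); only the id is known.
#[derive(Debug, PartialEq)]
enum CommitState {
    Visible,
    Hidden,
    Unavailable,
}

/// Classify a commit id against the current view's visible set and the
/// set of objects the backend actually still stores.
fn classify(id: &str, visible: &HashSet<&str>, stored: &HashSet<&str>) -> CommitState {
    if visible.contains(id) {
        CommitState::Visible
    } else if stored.contains(id) {
        CommitState::Hidden
    } else {
        CommitState::Unavailable
    }
}
```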

@jonathantanmy
Collaborator

(🤔 out loud)

Also thinking out loud...

Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also seems conceptually nice (assuming it's correct). Unfortunately, it only makes git log --all problem discussed in that PR worse.

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

This means that when we GC, we cannot go back to all previous operations anymore - don't we want to save some (say, from the past few weeks)? I guess we can solve this by making a commit that refers to the heads of all the operations we want to save (maybe using Martin's giant merge commit idea; if/when we develop a way to simplify a list of commits into only those necessary to refer to all of them, we could use that optimization when creating the giant merge commit). Then, when rebasing the latest merge commit, we would not only remove one parent but also add this newly-made commit as one of its parents.

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).

This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Having said that, being able to store metadata about commits seems like a useful idea (e.g., going off the partial/shallow clone idea, we could record that we know about this commit because a child commit was fetched from such-and-such remote on such-and-such day), but currently we don't have a design for how to send/receive this data and, if we do, whether it needs to have a hash (not sure that's needed at such an early stage, though).

@ilyagr
Collaborator

ilyagr commented Dec 12, 2023

This means that when we GC, we cannot go back to all previous operations anymore - don't we want to save some (say, from the past few weeks)?

Yes, but I didn't think we'd want to save all of them. In the model I proposed, choosing the cutoff point is simply choosing which operation's merge commit gets "rebased/reparented" (see also the new note I added to my previous comment about reparenting; I just mean that the contents of the merge commit are irrelevant, only the parents matter).

I'll also mention that there's only a minor difference between Martin's huge merge commits and my tree of merge commits; my version simply creates less humongous (but still large) merge commits and makes some of them children of others to achieve the same effect. OTOH, in Martin's version, only one humongous merge commit is relevant at any one time.

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states

This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Good point. I guess in my model the GC of the op log has to correspond to the GC of the actual commits in those operations. I think that might be OK, especially in a first version. It could alternatively be worked around: we could remember some (or all) of the older operations without keeping the corresponding "merge commits" around, so the commits for those older operations can be GCed.

yuja added a commit to yuja/jj that referenced this issue Dec 29, 2023
…ns (PoC)

In order to implement GC (martinvonz#12), we'll need to somehow prune old operations.
Perhaps the easiest implementation is to just remove unwanted operation files
and put a tombstone file in their place (like git's shallow file). However, the
removed operations might be referenced by another jj process running in
parallel. Since the parallel operation thinks all the historical head commits
are reachable, the removed operations would have to be resurrected (or the
index data fixed up, etc.) when the op heads get merged.

The idea behind this patch is to split the "op log" GC into two steps:
 1. recreate the operations to be retained and make the old history unreachable,
 2. delete unreachable operations if the head was created e.g. 3 days ago.
The latter will be run by "jj util gc". I don't think GC can be implemented
100% safely against lock-less append-only storage, and we'll probably need some
timestamp-based mechanism to avoid removing objects that might be referenced by
an uncommitted operation.

FWIW, another nice thing about this implementation is that the index is
automatically invalidated as the op id changes. The bad thing is that the
"undo" description would contain an old op id. It seems the performance is
pretty okay.
yuja added a commit to yuja/jj that referenced this issue Jan 2, 2024
(Same description as above, with this addition:)
This patch also includes support for "op abandon root.." because it can be
implemented for free. I'm not sure what should be checked as prerequisite.
I made it require "op restore root && op abandon root.." for now.
yuja added a commit that referenced this issue Jan 4, 2024
(Same description as above.)
yuja added a commit to yuja/jj that referenced this issue Jan 8, 2024
Since new operations and views may be added concurrently by another process,
there's a risk of data corruption. The keep_newer parameter is a mitigation
for this problem. It's set to preserve files modified within the last 2 weeks,
which is the default of "git gc". Still, a concurrent process may replace an
existing view which is about to be deleted by the gc process, and the view
file would be lost.

martinvonz#12
yuja added a commit that referenced this issue Jan 9, 2024
(Same description as above.)
yuja added a commit to yuja/jj that referenced this issue Jan 17, 2024
With my jj repo, the number of jj/keep refs went down from 87887 to 27733.
The .git directory size is halved, but we'll need to clean up extra and index
files to save disk space. "git gc --prune=now && jj debug reindex" passed, so
the repo wouldn't be corrupted.

martinvonz#12
yuja added a commit that referenced this issue Jan 27, 2024
(Same description as above.)