
Implement GC #12

Open
martinvonz opened this issue Apr 7, 2021 · 15 comments
Labels
enhancement New feature or request

Comments

@martinvonz
Owner

No description provided.

@BatmanAoD
Collaborator

Until this is implemented, is there a way to completely eliminate data from the .jj repository, such as an inadvertently-snapshotted key file or target directory? (For colocated repos, I can of course delete .jj itself, but I'd rather not.)

@martinvonz
Owner Author

I'm afraid not, at least not without a lot of manual steps. You'd need to manually roll back to an earlier operation and manually remove any refs/jj/keep/* keeping the unwanted commits alive. It's honestly not that complicated to implement GC. Let us know if you're interested and we can provide some guidance. I actually thought I had already described the basic idea here, but I clearly haven't, so let me at least do that. Here is what I had in mind:

  1. Allow pruning old operations from the operation log. This involves teaching the algorithms that walk the operation log to behave reasonably when an older operation is missing. For example, jj op log must not crash. I don't know if we should record pruned operation IDs so we can distinguish that case from a broken operation store.

  2. Implement GC by walking the operation log and collecting all commits pointed to by refs and anonymous heads. Then pass that set into a new Backend::gc(&self, commits: &[CommitId]) function. The GitBackend will then remove refs/jj/keep/* refs not pointing into that set, and then run Git's normal GC (which might mean shelling out to git if neither libgit2 nor gitoxide supports GC).
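A rough sketch of step 2, with hypothetical types and names (`OpView`, `live_commits`, `refs_to_delete` are illustrative, not jj's actual API):

```rust
use std::collections::HashSet;

type CommitId = String;

/// A minimal stand-in for an operation's view: the commits its refs and
/// anonymous heads point to.
struct OpView {
    head_ids: Vec<CommitId>,
}

/// Walk the retained operations and collect every commit any view still
/// points to; this is the set handed to the backend's gc function.
fn live_commits(ops: &[OpView]) -> HashSet<CommitId> {
    ops.iter().flat_map(|op| op.head_ids.iter().cloned()).collect()
}

/// Backend side: keep refs (name, target) whose target is outside the
/// live set are the ones to delete before running Git's normal GC.
fn refs_to_delete(keep_refs: &[(String, CommitId)], live: &HashSet<CommitId>) -> Vec<String> {
    keep_refs
        .iter()
        .filter(|(_, target)| !live.contains(target))
        .map(|(name, _)| name.clone())
        .collect()
}
```

The real implementation would of course read operations and refs from storage; this only shows the set computation.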

@arxanas
Collaborator

arxanas commented Nov 24, 2023

If someone implements GC, it would also be a good time to "compact" the keep refs (assuming that jj doesn't already do that). Some Git operations scale with the number of refs, so having thousands of jj/keep/* refs can slow them down. As part of GC, we could calculate the set of heads of the live commits and create/preserve only those refs, while deleting all other jj/keep/* refs.
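The compaction could look roughly like this; `live_heads` and the `(child, parent)` edge list are illustrative assumptions, not jj's API:

```rust
use std::collections::HashSet;

/// Among the live commits, only the heads (commits that are not a parent
/// of any other live commit) need a jj/keep/* ref; one ref per head
/// anchors the whole live set. `edges` holds (child, parent) pairs.
fn live_heads<'a>(live: &[&'a str], edges: &[(&'a str, &'a str)]) -> HashSet<&'a str> {
    let non_heads: HashSet<&str> = edges.iter().map(|&(_, parent)| parent).collect();
    live.iter().copied().filter(|c| !non_heads.contains(c)).collect()
}
```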

@ilyagr
Collaborator

ilyagr commented Nov 24, 2023

The current workaround for the refs building up is running git pack-refs --all occasionally. This is mentioned offhandedly in https://martinvonz.github.io/jj/v0.11.0/git-compatibility/#co-located-jujutsugit-repos. I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs still take up disk space but might not actually slow things down much.

It'd obviously be nicer if jj could do this itself, combined with GC.

@arxanas
Collaborator

arxanas commented Nov 24, 2023

@ilyagr I mention it because, at my work, a certain Phabricator script invokes git for-each-ref --contains X, which hangs because I have 100k (packed) git-branchless refs 😄.

@martinvonz
Owner Author

I'm guessing (but haven't checked) that after packing, the extraneous jj/keep refs take up disk space, but might not actually slow things down much.

I think they are the cause of #293.

We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.

@yuja
Collaborator

yuja commented Nov 24, 2023

For the record, jj/keep/* refs are now excluded from git::import_refs(). git pack-refs --all will still help when importing a large number of tags and branches, though.

I think they are the cause of #293.

I hope it will be mitigated by gitoxide. Last time I checked, libgit2 unpacked the commit object for each ref, which was the major CPU overhead of jj git fetch. OTOH, git fetch is reasonably fast.

@martinvonz
Owner Author

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2), but I haven't done anything to confirm that.

@ilyagr
Collaborator

ilyagr commented Nov 24, 2023

I think they are the cause of #293.

Good point, good to know. I had a feeling that git pack-refs helps with this, but I'm not entirely sure. (Yuya just posted something much more informed about this point above.)

We can even create a single commit with all visible heads as parents and attach a ref to that commit. That way we only need one ref to preserve all commits.

On one hand, I think it's clever.

On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has. It made me wonder how many other Git tools might have code like that. In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

jj/lib/src/rewrite.rs

Lines 388 to 399 in b37293f

fn new_parents(&self, old_ids: &[CommitId]) -> Vec<CommitId> {
    // This should be a set, but performance of a vec is much better since we
    // expect 99% of commits to have <= 2 parents.
    let mut new_ids = vec![];
    let mut add_parent = |id: &CommitId| {
        // This can trigger if we abandon an empty commit, as both the empty
        // commit and its parent are succeeded by the same commit.
        if !new_ids.contains(id) {
            new_ids.push(id.clone());
        }
    };
    for old_id in old_ids {
        // ... (remainder of the quoted snippet elided)

It also might be fine, I'm unsure.

@yuja
Collaborator

yuja commented Nov 24, 2023

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

iirc, libgit2 unpacks commits to sort refs by committer timestamp (and to exclude non-commit refs), then picks the first ~500 refs or so. I don't know which phase this function belongs to. I think the git CLI implements more modern features, so git fetch vs (libgit2-based) jj git fetch isn't a 1:1 comparison of the same logic.

@martinvonz
Owner Author

On the other hand, it reminded me of the following piece of code we recently added, which is quadratic in the number of parents a commit has.

When I reviewed that, I was also wondering to myself how long it'll take before someone runs into that :)

In other words, I'm worried some tools won't be optimized to deal smoothly with commits that have hundreds or thousands of parents.

I suspect it's rare that tools inspect unusual refs like refs/jj/keep/* to start with, so it's probably not going to happen often. And if it does happen, it should be fixed in the tool. If we're worried about it, we could create such commits as a tree with at most 100 parents per commit.
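The tree-of-merge-commits idea can be sketched as follows; `build_keep_tree` and `make_commit` are hypothetical names, with commit creation abstracted as a closure:

```rust
/// Combine `heads` into synthetic keep commits with at most `max_parents`
/// parents each, repeating until a single root remains; that root is the
/// only commit that needs a keep ref. Assumes `heads` is non-empty.
fn build_keep_tree<F>(mut heads: Vec<String>, max_parents: usize, mut make_commit: F) -> String
where
    F: FnMut(&[String]) -> String,
{
    while heads.len() > 1 {
        heads = heads
            .chunks(max_parents)
            .map(|chunk| make_commit(chunk))
            .collect();
    }
    heads.pop().expect("heads must be non-empty")
}
```

For 250 heads and a fan-out of 100, this creates three intermediate merge commits and one root, i.e. the whole live set hangs off a single ref via commits of bounded width.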

@jonathantanmy
Collaborator

I thought the problem was with "ref advertisement", which protocol v2 skips (https://git-scm.com/docs/protocol-v2),

iirc, libgit2 unpacks commits to sort refs by committer timestamp (and exclude non-commit refs)

This is also done by Git, not for sorting, but for finding a cutoff date (calculated from the remote refs) to determine what commit dates would be considered "recent". See how the cutoff variable in mark_complete_and_common_ref() in fetch-pack.c is used.

Also note that I don't think libgit2 has protocol v2, so there is no reduced list of refs.

@martinvonz martinvonz self-assigned this Dec 1, 2023
@ilyagr
Collaborator

ilyagr commented Dec 3, 2023

(🤔 out loud)

Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also conceptually nice (assuming it's correct). Unfortunately, it only makes the git log --all problem discussed in that PR worse.

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

Update: The content of the commits is not important here, so this is more of a "verbatim rebase" AKA "reparenting" from #1027.
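A minimal sketch of the reparenting step, assuming a simplified model where each operation's merge commit is represented just by its parent list, with parents[0] pointing at the previous operation's merge commit (all names hypothetical):

```rust
/// `op_merges` is the chain of per-operation merge commits (oldest first),
/// each entry being that commit's parent list. Dropping the chain before
/// `cutoff` and detaching the new oldest merge commit from its predecessor
/// makes everything only reachable through older operations collectable
/// by a normal `git gc`.
fn reparent_at_cutoff(op_merges: &mut Vec<Vec<String>>, cutoff: usize) {
    // Discard the chain before the cutoff entirely ...
    op_merges.drain(..cutoff);
    // ... and detach the new oldest merge commit from its predecessor.
    if let Some(first) = op_merges.first_mut() {
        if !first.is_empty() {
            first.remove(0);
        }
    }
}
```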

Similarly, this could look like a rebase in the operation log (create one operation for all the changes up to operation X-1 and rebase operation X onto that). I think operation log ids are randomly generated, so we can consider them as change ids. (We could also have commit ids for them).

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).
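The three-state idea might look like this as a sketch (names illustrative, not jj's API):

```rust
use std::collections::HashSet;

/// `Unavailable` covers commits whose objects were garbage-collected
/// (or never fetched, in a partial/shallow clone); only the id is known.
#[derive(Debug, PartialEq)]
enum CommitState {
    Visible,
    Hidden,
    Unavailable,
}

/// Classify a commit id against the current view's visible set and the
/// set of objects the backend actually still stores.
fn classify(id: &str, visible: &HashSet<&str>, stored: &HashSet<&str>) -> CommitState {
    if visible.contains(id) {
        CommitState::Visible
    } else if stored.contains(id) {
        CommitState::Hidden
    } else {
        CommitState::Unavailable
    }
}
```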

@jonathantanmy
Collaborator

(🤔 out loud)

Also thinking out loud...

Following the discussion in #2659, I thought of an approach to GC that seems a little naive and simplistic, but also seems conceptually nice (assuming it's correct). Unfortunately, it only makes git log --all problem discussed in that PR worse.

For each operation in the op log, we can have a Git merge commit. Its first parent will be the merge commit for the previous operation. The rest are the heads of the view for this operation (i.e., the commits needed to preserve this view from garbage collection). The merge commit for the latest operation gets a jj/keep ref.

Now, garbage collection can be thought of as a rebase: pick one of these merge commits and rebase it so that the merge commit for the previous operation is no longer a parent. Now, "normal" git gc will do more or less the right thing.

This means that when we GC, we cannot go back to all previous operations anymore - don't we want to save some (say, from the past few weeks)? I guess we can solve this by making a commit that refers to the heads of all the operations we want to save (maybe using Martin's giant merge commit idea; if/when we develop a way to simplify a list of commits into only those necessary to refer to all of them, we could use that optimization when creating the giant merge commit). Then, when rebasing the latest merge commit, we would not only remove one parent but also add this newly-made commit as one of its parents.

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states rather than the current 2: visible, hidden, or unavailable (AKA garbage-collected, though it could also be appropriate for partial/shallow clones).

This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Having said that, being able to store metadata about commits seems like a useful idea (e.g., going off the partial/shallow clone idea, we could record that we know about this commit because a child commit was fetched from such-and-such remote on such-and-such day), but currently we don't have a design for how to send/receive this data and, if we do, whether it needs to have a hash (not sure that's needed at such an early stage, though).

@ilyagr
Collaborator

ilyagr commented Dec 12, 2023

This means that when we GC, we cannot go back to all previous operations anymore - don't we want to save some (say, from the past few weeks)?

Yes, but I didn't think we'd want to save all of them. In the model I proposed, choosing the cutoff point is simply choosing which operation's merge commit gets "rebased/reparented" (see also the new note I added to my previous comment about reparenting; I just mean that the contents of the merge commit are irrelevant, only the parents matter).

I'll also mention that there's only a minor difference between Martin's huge merge commits and my tree of merge commits; my version simply creates less humongous (but still large) merge commits and makes some of them children of others to achieve the same effect. OTOH, in Martin's version, only one humongous merge commit is relevant at any one time.

Alternatively, we could even not do any sort of GC on the operation log, but teach jj that a commit can be in 3 states

This sounds like it would solve it from the jj side, but on the Git side, the commits are still referred to and cannot be GCed.

Good point. I guess in my model the GC of the op log has to correspond to the GC of the actual commits in those operations. I think that might be OK, especially in a first version. It could alternatively be worked around: we could remember some (or all) of the older operations without keeping the corresponding "merge commits" around, so the commits for those older operations can be GCed.

yuja added a commit to yuja/jj that referenced this issue Dec 29, 2023
…ns (PoC)

In order to implement GC (martinvonz#12), we'll need to somehow prune old operations.
Perhaps the easiest implementation is to just remove unwanted operation files
and put a tombstone file in their place (like git's shallow file). However, the
removed operations might be referenced by another jj process running in
parallel. Since the parallel operation thinks all the historical head commits
are reachable, the removed operations would have to be resurrected (or the
index data fixed up, etc.) when the op heads get merged.

The idea behind this patch is to split the "op log" GC into two steps:
 1. recreate the operations to be retained and make the old history unreachable,
 2. delete unreachable operations if the head was created e.g. 3 days ago.
The latter will be run by "jj util gc". I don't think GC can be implemented
100% safely against lock-less append-only storage, and we'll probably need some
timestamp-based mechanism to avoid removing objects that might be referenced by
an uncommitted operation.

FWIW, another nice thing about this implementation is that the index is
automatically invalidated as the op id changes. The bad thing is that the
"undo" description would contain an old op id. It seems the performance is
pretty okay.
yuja added a commit to yuja/jj that referenced this issue Jan 2, 2024
(Same description as above, with this addition:)
This patch also includes support for "op abandon root.." because it can be
implemented for free. I'm not sure what should be checked as prerequisite.
I made it require "op restore root && op abandon root.." for now.
yuja added a commit that referenced this issue Jan 4, 2024
(Same description as above.)
yuja added a commit to yuja/jj that referenced this issue Jan 8, 2024
Since new operations and views may be added concurrently by another process,
there's a risk of data corruption. The keep_newer parameter is a mitigation
for this problem. It's set to preserve files modified within the last 2 weeks,
which is the default of "git gc". Still, a concurrent process may replace an
existing view which is about to be deleted by the gc process, and the view
file would be lost.

martinvonz#12
yuja added a commit that referenced this issue Jan 9, 2024
(Same description as above.)
yuja added a commit to yuja/jj that referenced this issue Jan 17, 2024
With my jj repo, the number of jj/keep refs went down from 87887 to 27733.
The .git directory size is halved, but we'll need to clean up extra and index
files to save disk space. "git gc --prune=now && jj debug reindex" passed, so
the repo wouldn't be corrupted.

martinvonz#12
yuja added a commit that referenced this issue Jan 27, 2024
(Same description as above.)