Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revset: add fork_point function #4795

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

bnjmnt4n
Copy link
Collaborator

@bnjmnt4n bnjmnt4n commented Nov 7, 2024

This can be used to find the best common ancestors of a revset with an arbitrary number of commits, which cannot be done currently.

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added tests to cover my changes

@algmyr
Copy link
Collaborator

algmyr commented Nov 8, 2024

I don't think GCA is the term used for this, it's lowest (or sometimes last) common ancestor. I would probably be fine with lca, but maybe others might not be as used to the abbreviations. Expanding it fully feels quite wordy though. Maybe common_ancestors(revset) is precise/descriptive enough?

Copy link
Collaborator

@yuja yuja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, it's the "greatest" in terms of generation numbers? I feel both gca and lca are equally cryptic.

common_ancestor(), best_ancestor(), closest_ancestor()?

lib/src/default_index/revset_engine.rs Show resolved Hide resolved
lib/src/revset.rs Outdated Show resolved Hide resolved
@bnjmnt4n
Copy link
Collaborator Author

bnjmnt4n commented Nov 8, 2024

Git calls it best common ancestor (via git merge-base):

git merge-base finds the best common ancestor(s) between two commits to use in a three-way merge. One common ancestor is better than another common ancestor if the latter is an ancestor of the former. A common ancestor that does not have any better common ancestor is a best common ancestor, i.e. a merge base. Note that there can be more than one merge base for a pair of commits.

@bnjmnt4n bnjmnt4n changed the title revset: add gca function revset: add common_ancestors function Nov 8, 2024
@bnjmnt4n bnjmnt4n force-pushed the bnjmnt4n/push-xykukoyzpukl branch 2 times, most recently from c713fe9 to a1ced5b Compare November 12, 2024 05:17
@bnjmnt4n
Copy link
Collaborator Author

Does anyone else have any thoughts on the naming? Also, should the name be in singular or plural form (common_ancestor vs common_ancestors)? I went with plural because the function can return multiple results if there are multiple common ancestors.

@yuja
Copy link
Collaborator

yuja commented Nov 12, 2024

If we don't say "best", "greatest", etc., a singular form would suggest that it won't include all common ancestors. So I prefer singular common_ancestor().

@bnjmnt4n
Copy link
Collaborator Author

Now that I think about it, I agree that common_ancestors seems to indicate ::a & ::b & ::c & .... What about best_common_ancestors?

@yuja
Copy link
Collaborator

yuja commented Nov 12, 2024

What about best_common_ancestors?

That seems good, but isn't it too wordy? I'm okay with either common_ancestor() (singular), best_common_ancestor[s]() (singular/plural). The return value is usually singular.

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

Would ancestor be too much? I feel the fact that it’s common is implicit to the fact that you’re passing multiple arguments. Also, I might be misreading, but ancestor() should be root() (it looks like it might currently be none()?).

@bnjmnt4n
Copy link
Collaborator Author

bnjmnt4n commented Nov 12, 2024

Would ancestor be too much? I feel the fact that it’s common is implicit to the fact that you’re passing multiple arguments.

ancestor is way too close to ancestors, which obtains the ancestors of all commits in the given revset. Additionally, both functions technically accept only a single revset argument, but the revset can resolve to multiple commits instead.

Also, I might be misreading, but ancestor() should be root() (it looks like it might currently be none()?).

This is an interesting point, I'm not familiar enough with graph theory (?) to say if the best common ancestors of an empty set of nodes is the root of the graph.

@algmyr
Copy link
Collaborator

algmyr commented Nov 12, 2024

I think the best common ancestor of a single commit should be itself, and the best common ancestor of an empty set ought to be none (it doesn't exist)

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

Hmm, you might actually be right. We want the property that ancestor(As | Bs) is ancestor(ancestor(As) | ancestor(Bs)), right? Therefore, yes, ancestor(A) (where A is a single commit) should be A, and ancestor(none()) should be the identity element, and I guess none() is that identity?

@martinvonz
Copy link
Owner

Something like fork_point() might also work. I'm not sure I like it better, but figured I'd share it at least.

@joyously
Copy link

How can a commit be its own ancestor?

@martinvonz
Copy link
Owner

How can a commit be its own ancestor?

We simply define it such that a commits ancestor includes itself. We do that because it's often useful in practice. Mercurial does the same.

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

How can a commit be its own ancestor?

In the same way that you can define the ancestors of a person as:

  1. A person is their own ancestor.
  2. The parents of a person’s ancestor are also their ancestors.

which is consistent but gives more answers than a definition that excludes the first reflexive condition. It’s the same reason that the sum of one number is itself and the product of one number is itself, even though there’s no addition or multiplication going on there; similarly why we define the sum of no numbers to be 0, and the product of no numbers to be 1. (In mathematical terms, we’re trying to make this operation form a monoid, because that results in good properties.)

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

Or, to motivate it directly in terms of a defining property: the greatest common ancestor of a set of commits is the commit that all elements of the set are descendents of, such that there is no descendent of the ancestor that also satisfies that property. For a single commit, that’s the commit itself. (This still leaves the choice of the ancestor of an empty set somewhat arbitrary, though it already makes it clear that there can be no single unique choice for many commit graphs, which motivates none() as a result.)

@joyously
Copy link

I think if it includes itself, it shouldn't be called ancestor. Commits don't appear out of thin air, do they? Wouldn't there at least be root()?

@algmyr
Copy link
Collaborator

algmyr commented Nov 12, 2024

I think if it includes itself, it shouldn't be called ancestor.

In graph theory, when talking about things like lowest common ancestor (LCA) you consider nodes to be descendants of themselves.

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

It’s also the same case as the greatest common divisor of a single number being itself (which Python’s math.gcd implements).

@joyously
Copy link

In graph theory, when talking about things like lowest common ancestor (LCA) you consider nodes to be descendants of themselves.

If this is simply for the name of a function in code, call it whatever and explain with a comment.
If it's something a user will see, don't confuse us with "you are your own ancestor", even if another VCS did it.

@emilazy
Copy link
Collaborator

emilazy commented Nov 12, 2024

It’s the standard graph theory terminology, not a VCS‐specific quirk, and the mathematically obvious definition. People anyway aren’t likely to call ancestor on single commits except as a result of computed revsets, where the single‐commit case will provide correct results that are consistent with its behaviour on revsets of other sizes.

@algmyr
Copy link
Collaborator

algmyr commented Nov 13, 2024

Hmm, you might actually be right. We want the property that ancestor(As | Bs) is ancestor(ancestor(As) | ancestor(Bs)), right? Therefore, yes, ancestor(A) (where A is a single commit) should be A, and ancestor(none()) should be the identity element, and I guess none() is that identity?

This is mostly a tangent, but the empty case is a bit iffy and you end up having to be careful in how you define what a common ancestor is (to avoid issues of vacuous truths).

E.g. for a set of nodes S and we want to define the set of all common ancestors (ignoring the lowest/best part), it's tempting to try

"common_ancestors(S) are nodes with all nodes in S as descendants"

But this behaves poorly if S is empty. It's vacuously true that any node is an ancestor of all nodes in the empty set!

If you make the (sensible) constraint that common ancestors need to be an ancestors of some node in S, then you should end up with nicer behavior that gives you the none result.

@emilazy
Copy link
Collaborator

emilazy commented Nov 13, 2024

Some more motivation of the base cases here:

If we desire that ancestor(as | bs) equals ancestor(ancestor(as) | ancestor(bs)), as it does for the >1 commit case, then we immediately get ancestor(as) = ancestor(as | none()) (because as | none() = as) = ancestor(ancestor(as) | ancestor(none())), which can only (as far as I can see) be equal to ancestor(as | none()) if ancestor(none()) = none() and ancestor(ancestor(as)) = ancestor(as), which gives us both the zero and one commit cases (assuming I understand correctly that ancestor() should never return a revset of more than one commit – although I wonder if that’s actually a desirable property, as I think it differs from heads(::a & ::b)?).

More pragmatically, a common practical use of this operation – Git’s git merge-base – is to answer the question “what’s the most ‘recent’ commit I can base my work on if I need it to be able to cleanly merge into these branches without pulling in additional commits only present in some of them?”. For the single‐commit case, the answer is that you can base that work on the commit itself, since there’s nothing “missing” from the non‐existent other branches. (Although, in that case I think it doesn’t give us a clear answer for ancestor(none()), since I think root() would be a valid answer there; it’s other properties that rule that case out.)

Base cases are hard. It wasn’t that long ago that people found the idea of zero very alien, and rejected formulas like n + 0 = n, n × 0 = 0, or the idea that you could sum up no integers at all and get 0, or take the product of no integers at all and get 1. But once you work out the rules it’s clear that there’s only one correct answer, and that skipping defining it because it seems weird and unintuitive at first reduces the power and convenience of your operations and forces you to introduce special cases elsewhere in the system. (Corollary of base cases being hard: I might have the reasoning wrong here! But I’m pretty confident that returning the input for a single commit is the correct behaviour, that any other concrete choice would make things worse and break things you would reasonably want to do, and that declining to define it would make things more awkward and less useful in general.)

@yuja
Copy link
Collaborator

yuja commented Nov 13, 2024

assuming I understand correctly that ancestor() should never return a revset of more than one commit

There's an edge case that ancestor() is resolved to multiple revisions, so I'm afraid that we cannot define ancestor(*xs) based on ancestor(a | b | ...).
a1ced5b#diff-6bd5d4c0fd459e2c1cb3f1e0d880d6f23c4e3b11863adc76db3931e4b94d194dR2524-R2542

Something like fork_point() might also work.

Sounds also good. It would reflect user's intent.

@martinvonz
Copy link
Owner

martinvonz commented Nov 13, 2024

Something like fork_point() might also work.

Sounds also good. It would reflect user's intent.

Actually, maybe even just forks() since we have merges() (not merge_point()). "Forks" and "merges" seem to work well as analogies to roads forking and merging.

EDIT: But an important difference between them is that merges() doesn't take an input set, so maybe forks() (and fork_points()?) would make it sound like forks() returns all commits with more than one child.

@yuja
Copy link
Collaborator

yuja commented Nov 14, 2024

EDIT: But an important difference between them is that merges() doesn't take an input set, so maybe forks() (and fork_points()?) would make it sound like forks() returns all commits with more than one child.

Yes, maybe we can add merge_point(xs) that finds the best common descendants, but I don't know if this name makes sense.

@martinvonz
Copy link
Owner

There doesn't seem to be a clear consensus and no one seems to feel strongly. To avoid this just getting stalled, I'm fine with @bnjmnt4n making the final decision on the name. Sounds good? Or do other feel like there's a name that's clearly best (and or names that they really don't want)?

@bnjmnt4n bnjmnt4n changed the title revset: add common_ancestors function revset: add fork_point function Nov 15, 2024
@bnjmnt4n
Copy link
Collaborator Author

I like the suggested fork_point, which is relatively short and doesn't have any immediately ambiguous/conflicting meanings. I'm not too sure about the preferred definition of fork_point in docs/revsets.md: would anyone have any suggestions for the definition?

@algmyr
Copy link
Collaborator

algmyr commented Nov 15, 2024

Maybe it makes sense to give the example of the two commit case (which I think is heads(::x & ::y)) and just say fork_points generalizes this to an arbitrary number of commits?

@yuja
Copy link
Collaborator

yuja commented Nov 15, 2024

We have "examples" section in the doc. fork_point() should probably be added there.

This can be used to find the fork point (best common ancestors) of a
revset with an arbitrary number of commits, which cannot be expressed
currently in the revset language.
* `fork_point(D|B)` ⇒ `{A}`
* `fork_point(B|C)` ⇒ `{A}`
* `fork_point(A)` ⇒ `{A}`
* `fork_point()` ⇒ `{}`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe fork_point(none()) so users don't think it can be called with no arguments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants