Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit amount of tokens used to calculate LCS distance #68151

Merged
merged 3 commits into from
May 11, 2023

Conversation

tmat
Copy link
Member

@tmat tmat commented May 10, 2023

Reduces the amount of buffers allocated on LOH during calculation of EnC/Hot Reload delta.

Improves AB#1537348

@tmat tmat requested a review from a team as a code owner May 10, 2023 00:47
@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Interactive untriaged Issues and PRs which have not yet been triaged by a lead labels May 10, 2023
@tmat
Copy link
Member Author

tmat commented May 10, 2023

@davidwengier PTAL

@CyrusNajmabadi
Copy link
Member

Not sure how these ar stored today. IN the future, woudl using SegmentedList be appropriate, if the goal is to avoid the LOH?

@tmat
Copy link
Member Author

tmat commented May 10, 2023

We only need to avoid LOH for segments that are not pooled as those will become garbage once he calculation is finished. The pooled ones will be reused.

Copy link
Contributor

@davidwengier davidwengier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 0 confidence that I understand what is going on here. or how it works. eg. I'm guessing this doesn't mean that we simply throw away tokens/character that are beyond the maximum, but it sure looks that way!

@tmat
Copy link
Member Author

tmat commented May 10, 2023

I'm guessing this doesn't mean that we simply throw away tokens/character that are beyond the maximum, but it sure looks that way!

Yes, we do throw them away but only when we are comparing similarity of sequences of tokens ("weights" or "distances"). Not when calculating actual edits.
E.g. when we have a method body that has multiple if blocks each with 100s of statements (which is real code customers have) the tree diffing algorithm is trying to figure out which if blocks in the old version match if blocks in the new version. For that we calculate a "distance" of sequences of tokens comprising the if blocks using the LCS algorithm. The alg needs O(n^2) memory, which can easily run into 10s of MBs.

This change limits the number of tokens we look at when calculating the distance. If first ~900 tokens are the same we consider the sequence exactly the same.

@CyrusNajmabadi
Copy link
Member

This change limits the number of tokens we look at when calculating the distance. If first ~900 tokens are the same we consider the sequence exactly the same.

What's the fallout of this though. Would that mean we miss an actual change, if not within that token count? Say someone edits the last statement?

(Just trying to figure out the implication of assuming a match). Thanks!

@tmat
Copy link
Member Author

tmat commented May 10, 2023

@CyrusNajmabadi Fallout is decreased precision of matching heuristic. We would still detect any change made anywhere in the tree.

Say you have blocks A, B, C and the user makes a change that reorders B and C, so the new version is A, C, B. If C and B are large and only differ in the last token we would produce matches: A1 ~ A2, B1 ~ C2, C1 ~ B2, instead of more correct one:
A1 ~ A2, B1 ~ B2, C1 ~ C2.

This matters for example, if B contains a lambda, a local variable definition, etc. Then we would try to match the lambda/variable/etc. in B1 to the corresponding lambda in C2 instead of B2.

It seems unlikely though that this would happen often in practice. Reordering large blocks of code is not very common during EnC. Besides the blocks need to have the same 900 leading tokens in order for the matching alg to be affected.

@CyrusNajmabadi
Copy link
Member

@CyrusNajmabadi Fallout is decreased precision of matching heuristic. We would still detect any change made anywhere in the tree.

Cool, that's what i wanted to know. If this is just about a heuristic that says "more likely than not this is a move, given that 900 tokesn match" that seems totally reasonable to me.

@davidwengier
Copy link
Contributor

Thank you for that explanation @tmat, makes sense.

@tmat tmat merged commit b404fbe into dotnet:main May 11, 2023
@tmat tmat deleted the LcsDistance branch May 11, 2023 15:51
@ghost ghost added this to the Next milestone May 11, 2023
@Cosifne Cosifne modified the milestones: Next, 17.7 P2 May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Interactive untriaged Issues and PRs which have not yet been triaged by a lead
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants