Skip to content

Latest commit

 

History

History
89 lines (46 loc) · 2.91 KB

tree-analysis.md

File metadata and controls

89 lines (46 loc) · 2.91 KB

Analysis of GitHub repository tree structure

The analysis of tree structure on real repositories. The description of the concept is in TrigramRevisionTree.md.

In order to determine how the future algorithm will perform, it was decided to perform an analysis of some open-source repositories. Those included xodus, commons-lang, soot, intellij-community.

Analysis of tree structure

For every beforementioned repository we constructed a tree of all the commits, including all branches. To analyze this tree we calculated the degree distribution of vertices of these trees. The results are as follows (degree means the number of children of a commit in the tree):

Xodus

Degree 0: 12

Degree 1: 2924

Degree 2: 5

Degree 3: 3

Commons-lang

Degree 0: 80

Degree 1: 7437

Degree 2: 77

Degree 3: 1

Soot

Degree 0: 42

Degree 1: 7308

Degree 2: 37

Degree 3: 2

IntelliJ-Community

Degree 0: 1571

Degree 1: 477143

Degree 2: 1346

Degree 3: 80

Degree 4: 8

Degree 5: 4

Degree 6: 3

Degree 10: 1

Conclusions

Based on the results, we conclude that most of the vertices have degree equal to one, so our tree looks like a bunch of chains. The developed algorithm should account for this fact.

Analysis of path compression on GitHub repositories

In order to determine whether the algorithm of compressing vertical paths can be applied, it was decided to measure how much performance is gained on real-world open-source repositories. The testing was performed on the xodus repository.

Method

The analysis works as follows. For every length L of compressed path, we pick some number of random vertices in commit tree. For every picked vertex v we measure the ratio P/Q, defined as follows:

P = sum of number of trigram/file entries in all edges from vertex v to it's L'th parent. Q = number of trigram/file entries in a total delta from vertex v to it's L'th parent.

Basically, ratio P denotes how faster it is to use a compressed edge instead of naively processing the whole path.

Then we take average of P/Q over all picked vertices, and consider the resulting value as a compression value for picked L. Here is the plot of P/Q versus L (Lengths up to 30):

Plot

Here is the same plot with all observations in log-scale:

Plot-log

Orange curves in both plots show the best approximation for the given points. We can see that points fill well in the curve of order of cubic root of length, which is a line in log scale.

Conclusions

Based on the results, we conclude that compression of vertical paths leads to a significant optimization.