The analysis of tree structure on real repositories. The description of the concept is in TrigramRevisionTree.md.
In order to determine how the future algorithm will perform, it was decided to perform an analysis of some open-source repositories. Those included xodus, commons-lang, soot, intellij-community.
For every beforementioned repository we constructed a tree of all the commits, including all branches. To analyze this tree we calculated the degree distribution of vertices of these trees. The results are as follows (degree means the number of children of a commit in the tree):
Degree 0: 12
Degree 1: 2924
Degree 2: 5
Degree 3: 3
Degree 0: 80
Degree 1: 7437
Degree 2: 77
Degree 3: 1
Degree 0: 42
Degree 1: 7308
Degree 2: 37
Degree 3: 2
Degree 0: 1571
Degree 1: 477143
Degree 2: 1346
Degree 3: 80
Degree 4: 8
Degree 5: 4
Degree 6: 3
Degree 10: 1
Based on the results, we conclude that most of the vertices have degree equal to one, so our tree looks like a bunch of chains. The developed algorithm should account for this fact.
In order to determine whether the algorithm of compressing vertical paths can be applied, it was decided to measure how much performance is gained on real-world open-source repositories. The testing was performed on the xodus repository.
The analysis works as follows. For every length L of compressed path, we pick some number of random vertices in commit tree. For every picked vertex v we measure the ratio P/Q, defined as follows:
P = sum of number of trigram/file entries in all edges from vertex v to it's L'th parent. Q = number of trigram/file entries in a total delta from vertex v to it's L'th parent.
Basically, ratio P denotes how faster it is to use a compressed edge instead of naively processing the whole path.
Then we take average of P/Q over all picked vertices, and consider the resulting value as a compression value for picked L. Here is the plot of P/Q versus L (Lengths up to 30):
Here is the same plot with all observations in log-scale:
Orange curves in both plots show the best approximation for the given points. We can see that points fill well in the curve of order of cubic root of length, which is a line in log scale.
Based on the results, we conclude that compression of vertical paths leads to a significant optimization.