Faster commit lookup #53

ethantkoenig · 2017-05-23T01:29:13Z

A faster implementation of GetCommitsInfo, addresses go-gitea/gitea#491 and go-gitea/gitea#502.

The previous implementation made a call to git log for each entry, each of which required scanning through the commit history. This new implementation instead makes a single call to git log. This is faster, because it involves scanning the commit history only once.

BENCHMARK RESULTS:
Shoutout to @sapk for the benchmark tests he wrote (#54), which I have stolen here.

Old implementation:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     200	  70670229 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	       1	35907609610 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	      50	 204585469 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      20	 640497304 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 829981474 ns/op

New implementation:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     500	  34284443 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	      30	 424753714 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	     100	 109430547 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      30	 521336982 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 813171326 ns/op

IMPLEMENTATION DETAILS:
Gets the 16 latest commits affecting the relevant entries (git log --name-only -16 HEAD -- <treePath>). The output of this command lists which files were affected by each commit. Scan through this list of commits, and stop once a commit has been found for each entry.

If you go through the first batch of 16 commits without finding a commit for every entry, you then get the next 32 commits (git log --name-only -32 <last-commit-from-first-batch>^ -- entry1 entry2 ...); each time you double the number of commits. This ensures that in the common case you'll only have to read in a small number (16) of commits, but you also won't have to make too many calls to git log if you need to go further into the commit history.

Finally, if you are looking for 32 or fewer entries, manually list out each entry in the git log command (git log --name-only -- entry1 entry2...) to support a more targeted search.

lunny · 2017-05-23T01:49:19Z

It seems there are some trace lines.

ethantkoenig · 2017-05-23T01:51:16Z

Trace lines removed, sorry about that

lunny · 2017-05-23T01:56:41Z

Are there some performance tests?

ethantkoenig · 2017-05-23T01:57:59Z

Other than what I've run manually, no.

sapk · 2017-05-24T09:59:19Z

We could add a Go benchmark, or a simpler benchmark such as a script in a contrib folder?

sapk · 2017-05-24T13:40:17Z

Maybe I have made a mistake in my benchmark (#54), but I don't see any optimization on some big repos.

Before PR :

After PR :

Edit:
For information: those tests were run on a server with a heavy parallel load on the filesystem; maybe that is the cause.

ethantkoenig · 2017-05-24T15:04:14Z

@sapk Thanks for writing the tests, I'll look into it

ethantkoenig · 2017-05-24T15:39:51Z

If you try a repo with many files (e.g. https://github.com/ethantkoenig/manyfiles), you should see a noticeable speed-up (old: 50 seconds, new: 5 seconds on my laptop).

I'll look into how to make my implementation comparable to the old implementation for non-pathological cases.

ethantkoenig · 2017-05-25T15:56:21Z

@sapk I've found a faster implementation that improves all 5 benchmark tests (see PR description). Let me know what you think.

sapk · 2017-05-25T22:15:02Z

tree_entry_test.go

+			panic(err)
+		}
+		entries.Sort()
+		b.Run(benchmark.name, func(b *testing.B) {


This needs Go1.7. I think we support at least Go1.6 for gitea, but I can't find where it is written ^^

I think that this repo already doesn't support Go1.6, since we use the "context" package in command.go.

I was thinking this change wasn't merged because of that.

sapk · 2017-05-25T23:08:24Z

tree_entry_test.go

+		{url: "https://github.com/torvalds/linux.git", name: "linux"},
+	}
+	for _, benchmark := range benchmarks {
+		var commit *Commit


You should still b.StopTimer() ... b.StartTimer() around init.
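As a rough illustration of the suggestion above, here is a minimal sketch of keeping initialization out of the measured time. The `buildInput` and `sum` helpers are hypothetical stand-ins for the repository setup and the code under test; `testing.Benchmark` is used only so the sketch runs outside `go test`.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the measured work away.
var sink int

// buildInput is a hypothetical setup step, standing in for cloning a
// repository and listing its entries.
func buildInput(n int) ([]int, error) {
	if n <= 0 {
		return nil, fmt.Errorf("invalid size %d", n)
	}
	data := make([]int, n)
	for i := range data {
		data[i] = i
	}
	return data, nil
}

// sum is the hypothetical code under test.
func sum(data []int) int {
	total := 0
	for _, v := range data {
		total += v
	}
	return total
}

func benchmarkSum(b *testing.B) {
	b.StopTimer() // exclude initialization from the measurement
	data, err := buildInput(1000)
	if err != nil {
		b.Fatal(err) // report setup failure through the benchmark
	}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		sink = sum(data)
	}
}

func main() {
	result := testing.Benchmark(benchmarkSum)
	fmt.Println(result.String())
}
```

When all setup happens before the loop, calling `b.ResetTimer()` once after it has the same effect on the reported time.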

sapk · 2017-05-25T23:09:16Z

tree_entry_test.go

+		var commit *Commit
+		var entries Entries
+		if repoPath, err := setupGitRepo(benchmark.url, benchmark.name); err != nil {
+			panic(err)


I have found that it is cleaner to use b.Fatal(err).

You must have forgotten to commit? There were no changes made here.

Whoops 🤦‍♂️, see #55

sapk · 2017-05-25T23:11:45Z

tree_entry_test.go

+const benchmarkReposDir = "benchmark_repos/"
+
+func setupGitRepo(url string, name string) (string, error) {
+	repoDir := filepath.Join(benchmarkReposDir, name)


I prefer to ask for a temp dir via ioutil.TempDir, but that might not be everyone's choice.

I didn't want to have to re-clone the repositories each time the tests run; it took me several minutes to clone them all.

OK, so maybe /benchmark/repos and exclude /benchmark, in case we later include other benchmarks with other resources.

sapk · 2017-05-25T23:18:11Z

Overall, I haven't reviewed the implementation (yet); the benchmarks show some small improvements that would be good to have. I made some comments on the benchmark part. I would have preferred that you cherry-pick my commits/PR and make changes afterwards, but it could pass for this PR ^^.

ethantkoenig · 2017-05-26T02:32:02Z

@sapk Rebased to include your commits, sorry about that

sapk · 2017-05-26T08:07:45Z

@ethantkoenig couldn't this use some concurrency, like the old code, to speed things up further? I know that more goroutines are not always the solution, and maybe you have already tested that.

ethantkoenig · 2017-05-26T13:34:30Z

@sapk I don't think this implementation is as amenable to concurrency as the old one. As far as I can tell, the main benefit from using concurrency in the old implementation was to make calls to git log ... in parallel. The new implementation makes far fewer calls to git log ..., and these calls must occur in succession (each one depends on the result of the previous one), so I don't think we'd see much gain from introducing concurrency.

sapk · 2017-05-26T14:27:54Z

@ethantkoenig looking at it, this could have some concurrency, but with a different format/strategy. For example, a controller starting goroutines that execute git log over the history and return the matching paths via a channel. The controller stops starting new goroutines when all paths are completed.

This could be done later. This PR gives an improvement and LGTM (except panic(), which needs to be changed).

This PR could have a test to check for regressions, but I don't have a good idea of how to do that, so maybe another time. ^^
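One possible shape for the controller idea above, sketched over an in-memory history rather than real git log calls. Everything here is an assumption made for illustration: `lookupConcurrent`, `match`, the chunking scheme, and the wave-by-wave scheduling are not part of this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// match reports that the commit at position index (0 = newest) touched path.
type match struct {
	path  string
	index int
}

// lookupConcurrent returns, for each wanted path, the position of the newest
// commit touching it. history[i] lists the paths touched by commit i.
// Chunks of history are scanned by goroutines in waves of `workers`; the
// controller stops launching waves once every path is resolved.
func lookupConcurrent(history [][]string, paths []string, chunkSize, workers int) map[string]int {
	wanted := make(map[string]bool, len(paths))
	for _, p := range paths {
		wanted[p] = true
	}
	found := make(map[string]int)
	for start := 0; start < len(history) && len(found) < len(wanted); start += chunkSize * workers {
		results := make(chan match)
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			lo := start + w*chunkSize
			if lo >= len(history) {
				break
			}
			hi := lo + chunkSize
			if hi > len(history) {
				hi = len(history)
			}
			wg.Add(1)
			go func(lo, hi int) {
				defer wg.Done()
				for i := lo; i < hi; i++ {
					for _, p := range history[i] {
						if wanted[p] { // wanted is only read here
							results <- match{path: p, index: i}
						}
					}
				}
			}(lo, hi)
		}
		go func() {
			wg.Wait()
			close(results)
		}()
		// Only the controller writes to found; keep the newest commit per path.
		for m := range results {
			if old, ok := found[m.path]; !ok || m.index < old {
				found[m.path] = m.index
			}
		}
	}
	return found
}

func main() {
	history := [][]string{{"a"}, {"b", "a"}, {"c"}, {"a"}, {"d"}}
	fmt.Println(lookupConcurrent(history, []string{"a", "b", "d"}, 1, 2))
}
```

Because waves run in history order, a result from an earlier wave is always newer than anything a later wave can report; within a wave the controller keeps the smallest index.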

lunny · 2017-05-26T15:08:13Z

LGTM

lunny added the status/wip label May 23, 2017

tboerger added the lgtm/need 2 label May 23, 2017

lunny removed the status/wip label May 23, 2017

sapk mentioned this pull request May 24, 2017

Add some benchmark #54

Closed

sapk added 4 commits May 24, 2017 15:10

Add bench task

c84e4aa

Create tree_entry_test.go

96af092

Remove init time

eef76cd

Add TODO information

ae8ee36

Add linux repo

dcfda73

ethantkoenig force-pushed the commit_lookup branch from d78c76e to ab9e103 Compare May 25, 2017 15:50

ethantkoenig force-pushed the commit_lookup branch from ab9e103 to f5f6f1b Compare May 25, 2017 16:02

sapk reviewed May 25, 2017


Faster implementation of GetCommitsInfo

76cec74

ethantkoenig force-pushed the commit_lookup branch from f5f6f1b to 76cec74 Compare May 26, 2017 02:31

Start/stop timer

e8bc37c

Use benchmark/ directory for benchmark repos

f22ce90

tboerger added lgtm/need 1 and removed lgtm/need 2 labels May 26, 2017

tboerger added lgtm/done and removed lgtm/need 1 labels May 26, 2017

lunny merged commit ec4446b into go-gitea:master May 26, 2017

ethantkoenig deleted the commit_lookup branch May 26, 2017 15:49

This was referenced May 27, 2017

Update code.gitea.io/git go-gitea/gitea#1824

Merged

Fix bug in GetCommitInfos #57

Merged

ethantkoenig mentioned this pull request Jun 27, 2017

Revert to old implementation of GetCommitsInfo #73

Merged

ethantkoenig mentioned this pull request Nov 19, 2017

Faster commit lookup #91

Merged
