Stats for the measurements? #796

Open
filiph opened this issue Sep 9, 2020 · 3 comments

Comments

@filiph
Contributor

filiph commented Sep 9, 2020

For the benchmark measurements to be useful when comparing two or more versions of some code, we need to know the margin of error (MoE). Otherwise, we can't know whether an optimization is actually, significantly better than the base.

Here's what I mean:

| Commit   | Mean  |
| -------- | ----- |
| e11fe3f0 | 14.91 |
| bab88227 | 14.64 |

Without MoE, this looks good. We made the code almost 2% faster with the second commit, right? No:

| Commit   | Mean  | MoE  |
| -------- | ----- | ---- |
| e11fe3f0 | 14.91 | 0.17 |
| bab88227 | 14.64 | 0.14 |

We actually have no idea whether the new code is faster. But we wouldn't know that without the MoE column, and we might prematurely make the wrong call.
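
(To spell it out, assuming the MoE column is read as a symmetric interval around the mean: 14.91 ± 0.17 spans roughly 14.74 to 15.08, and 14.64 ± 0.14 spans roughly 14.50 to 14.78. The two ranges overlap, so the apparent ~2% improvement may well be noise.)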

Right now benchmark_harness only gives a single number. I often resort to running the benchmark many times, in order to ascertain the variance of measurements. This is slow and wasteful, because it's basically computing a mean of means. A measurement that could last ~2 seconds takes X * ~2 seconds, where X is always >10 and sometimes ~100.

I'm not sure this is in scope of this package, seeing as this one seems to be focused on really tight loops (e.g. forEach vs addAll) and long-term tracking of the SDK itself. Maybe it should be a completely separate package?

I'm proposing something like:

  1. Create a list for the individual measurements (e.g. List.generate(n * batchIterations, (_) => -1))
  2. Warmup
  3. Execute n batches, each running batchIterations iterations of the measured code, and put the measured times into the list.
  4. Tear down
  5. Compute the mean and the margin of error. Optionally, print all the measurements or provide an object with all the statistics. (I'm personally using my own t_stats package, but there are many others on pub, including @kevmoo's stats, plus this is simple enough to implement without any external dependency.) A rough sketch of this flow follows below the list.
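
For illustration only, here is a minimal standalone sketch of that flow in plain Dart. It is not tied to benchmark_harness's current API; measureWithStats, warmup, run, n, and batchIterations are placeholder names, and 1.96 is used as a large-sample approximation of the 95% t-value:

```dart
import 'dart:math' as math;

/// Runs [run] in [n] batches of [batchIterations] iterations each, after
/// calling [warmup] once, and prints the mean time per run together with an
/// approximate 95% margin of error.
void measureWithStats(
  void Function() warmup,
  void Function() run, {
  int n = 100,
  int batchIterations = 1000,
}) {
  warmup();

  // Pre-sized list: one entry per batch, so nothing grows while measuring.
  final batchMicros = List<double>.filled(n, 0.0);
  final watch = Stopwatch();

  for (var b = 0; b < n; b++) {
    watch
      ..reset()
      ..start();
    for (var i = 0; i < batchIterations; i++) {
      run();
    }
    watch.stop();
    // Average time of a single run within this batch, in microseconds.
    batchMicros[b] = watch.elapsedMicroseconds / batchIterations;
  }

  final mean = batchMicros.reduce((a, b) => a + b) / n;
  final variance = batchMicros
          .map((x) => (x - mean) * (x - mean))
          .reduce((a, b) => a + b) /
      (n - 1);
  // 1.96 is a large-sample approximation of the 95% t-value.
  final marginOfError = 1.96 * math.sqrt(variance) / math.sqrt(n);

  print('mean: ${mean.toStringAsFixed(3)} us '
      '+/- ${marginOfError.toStringAsFixed(3)} us (95% CI, n: $n)');
}
```

Comparing two commits would then mean comparing the two reported intervals (or running a proper significance test on the raw batch times) rather than eyeballing two single numbers.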

PROs:

  • Developers can make better-informed decisions about the optimizations they do
  • This works out of the box instead of being an exercise in statistics for each developer or company

CONs:

  • We need many measurements in order to compute the margin of error. That doesn't necessarily mean we need to save the time of every single run (which would add a lot of overhead), but we need at least tens, ideally hundreds, of batches (e.g. measure the entirety of for (int i = 0; i < batchIterations; i++) { run(); } many times).
  • This means we probably need to know the number of batches and runs in advance, so that we don't have to dynamically add to a growable list of measurements during benchmarking.

I know this package is in flux now. Even a simple "no, not here" response is valuable to me.

@kevmoo
Member

kevmoo commented Sep 11, 2020

It's easy enough to copy-paste the needed code here.

@filiph
Contributor Author

filiph commented Sep 11, 2020

I went down a rabbit hole of research on how to best present variance in benchmarks (there's a lot of prior art). I have a lot of notes. The gist is that even with MoE / standard deviation, comparing averages is too crude and leads to confusion. I'll investigate further.

My question above still stands: is this in scope of this package?

@MelbourneDeveloper

The other useful numbers would be standard deviation, median, min and max
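
(As an illustration only, assuming the same list of per-batch times as in the sketch above: median is the only one that needs a helper, min and max fall out of a sorted copy, and standard deviation is the square root of the sample variance computed earlier.)

```dart
/// Median of a list of samples (illustrative helper, not an existing API).
double median(List<double> samples) {
  final sorted = [...samples]..sort();
  final mid = sorted.length ~/ 2;
  return sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
}

// min and max are simply sorted.first and sorted.last of the same copy.
```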

@mosuem mosuem transferred this issue from dart-archive/benchmark_harness Oct 29, 2024