x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

mvdan · 2017-03-21T11:37:44Z

Moving from rsc/benchstat#7, which now appears to be a dead issue tracker.

In short - if you happen to get benchmark time numbers that happen to be the same, benchstat seems to discard them.

old.txt:

BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           114 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op
BenchmarkFloatSub/100-4         20000000           115 ns/op          64 B/op          1 allocs/op

new.txt (note that all the times are the same: 78.8 ns/op):

BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op
BenchmarkFloatSub/100-4         20000000            78.8 ns/op         0 B/op          0 allocs/op

benchstat old.txt new.txt gives:

name            old time/op    new time/op    delta
FloatSub/100-4     115ns ± 0%      79ns ± 0%      ~     (p=0.079 n=4+5)

i.e. reports "no statistically significant improvement", which is clearly wrong.

cc @ALTree @rsc

The text was updated successfully, but these errors were encountered:

mvdan · 2017-03-21T11:38:10Z

Also see Alberto's suggested fix at rsc/benchstat#7 (comment).

mvdan · 2017-08-23T16:47:44Z

Friendly ping @rsc - do you think Alberto's suggested fix above would make sense?

@ALTree have you considered posting your patch to Gerrit? I'd be happy to help if you want, but I assume you'd want to do it yourself to keep authorship.

ALTree · 2017-08-23T17:08:03Z

@mvdan I never sent a patch because I'm not 100% sure the explanation I gave on the old repo's issues tracker is correct; anyway if you are convinced it is correct you can send a change, if you'd like to.

mvdan · 2017-08-23T17:10:28Z

Ah, best to not send it then.

rsc · 2017-08-24T14:17:32Z

/cc @aclements

aclements · 2017-08-24T14:45:00Z

The problem here isn't the significance test, it's the outlier rejection combined with the small sample size.

As an order test, Mann-Whitney has a floor on the p value that depends on the number of samples (not their values). p=0.079 is simply the lowest p-value you can get with n=4, m=5. The significance test isn't failing. It's genuinely saying that with so few samples, the chance of getting that order randomly is 0.079.

If you change the 114 ns/op to 115 ns/op, there's even less variance, but now outlier rejection doesn't kick in, so you get n=5, m=5 and a p-value of 0.008, which is considered significant.

I think the real bug here is that we're doing outlier rejection before computing an order statistic. We probably shouldn't do that. But if we still want to do outlier rejection for computing the mean ± x%, then I'm not sure how to present the sample size. Maybe we shouldn't be doing outlier rejection for that either. Perhaps we should be reporting a trimmed mean and its standard error?

Independently, perhaps benchstat should report when the sample sizes are too small to ever get a significant result.

AlekSi · 2018-01-18T14:22:31Z

I hit the same or similar problem with "macro"-benchmark. My data:

old.txt:

BenchmarkDecode-4             	       5	2203036637 ns/op	1885067576 B/op	13009444 allocs/op
BenchmarkDecodeConcurrent-4   	       5	2258279089 ns/op	1885066763 B/op	13009433 allocs/op

new.txt:

BenchmarkDecode-4             	      10	1543669126 ns/op	1917568089 B/op	13009415 allocs/op
BenchmarkDecodeConcurrent-4   	      10	1561361127 ns/op	1917567531 B/op	13009413 allocs/op

benchcmp works as expected:

-> benchcmp old.txt new.txt
benchmark                       old ns/op      new ns/op      delta
BenchmarkDecode-4               2203036637     1543669126     -29.93%
BenchmarkDecodeConcurrent-4     2258279089     1561361127     -30.86%

benchmark                       old allocs     new allocs     delta
BenchmarkDecode-4               13009444       13009415       -0.00%
BenchmarkDecodeConcurrent-4     13009433       13009413       -0.00%

benchmark                       old bytes      new bytes      delta
BenchmarkDecode-4               1885067576     1917568089     +1.72%
BenchmarkDecodeConcurrent-4     1885066763     1917567531     +1.72%

benchstat doesn't think that -30% is significant:

-> benchstat old.txt new.txt
name                old time/op    new time/op    delta
Decode-4               2.20s ± 0%     1.54s ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4     2.26s ± 0%     1.56s ± 0%   ~     (p=1.000 n=1+1)

name                old alloc/op   new alloc/op   delta
Decode-4              1.89GB ± 0%    1.92GB ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4    1.89GB ± 0%    1.92GB ± 0%   ~     (p=1.000 n=1+1)

name                old allocs/op  new allocs/op  delta
Decode-4               13.0M ± 0%     13.0M ± 0%   ~     (p=1.000 n=1+1)
DecodeConcurrent-4     13.0M ± 0%     13.0M ± 0%   ~     (p=1.000 n=1+1)

mvdan · 2018-01-18T14:31:09Z

@AlekSi that is expected - you'll need to run each benchmark multiple times - that's what the n at the end of the line means, before+after. Usually, you'll need at least 4 or 5 runs to get a p-value low enough. If the numbers vary quite a bit, you might need up to 10 or 20 runs for benchstat to be happy with the p-value.

AlekSi · 2018-01-18T14:45:03Z

Thank you. I wish it were more clear in the command output and documentation, though.

mvdan · 2018-01-18T15:00:06Z

@AlekSi good point - I also struggled to use benchstat at first. I have filed #23471 for better docs.

gopherbot · 2021-11-04T02:24:44Z

Change https://golang.org/cl/309969 mentions this issue: cmd/benchstat: new version of benchstat

This is a complete rewrite of benchstat. Basic usage remains the same, as does the core idea of showing statistical benchmark summaries and A/B comparisons in a table, but there are several major improvements. The statistics is now more robust. Previously, benchstat used IQR-based outlier rejection, showed the mean of the reduced sample, its range, and did a non-parametric difference-of-distribution test on reduced samples. Any form of outlier rejection must start with distributional assumptions, in this case assuming normality, which is generally not sound for benchmark data. Hence, now benchstat does not do any outlier rejection. As a result, it must use robust summary statistics as well, so benchstat now uses the median and confidence interval of the median as summary statistics. Benchstat continues to use the same Mann-Whitney U-test for the delta, but now does it on the full samples since the U-test is already non-parametric, hence increasing the power of this test. As part of these statistical improvements, benchstat now detects and warns about several common mistakes, such as having too few samples for meaningful statistical results, or having incomparable geomeans. The output format is more consistent. Previously, benchstat transformed units like "ns/op" into a metric like "time/op", which it used as a column header; and a numerator like "sec", which it used to label each measurement. This was easy enough for the standard units used by the testing framework, but was basically impossible to generalize to custom units. Now, benchstat does unit scaling, but otherwise leaves units alone. The full (scaled) unit is used as a column header and each measurement is simply a scaled value shown with an SI prefix. This also means that the text and CSV formats can be much more similar while still allowing the CSV format to be usefully machine-readable. Benchstat will also now do A/B comparisons even if there are more than two inputs. It shows a comparison to the base in the second and all subsequent columns. This approach is consistent for any number of inputs. Benchstat now supports the full Go benchmark format, including sophisticated control over exactly how it structures the results into rows, columns, and tables. This makes it easy to do meaningful comparisons across benchmark data that isn't simply structured into two input files, and gives significantly more control over how results are sorted. The default behavior is still to turn each input file into a column and each benchmark into a row. Fixes golang/go#19565 by showing all results, even if the benchmark sets don't match across columns, and warning when geomean sets are incompatible. Fixes golang/go#19634 by no longer doing outlier rejection and clearly reporting when there are not enough samples to do a meaningful difference test. Updates golang/go#23471 by providing more through command documentation. I'm not sure it quite fixes this issue, but it's much better than it was. Fixes golang/go#30368 because benchstat now supports filter expressions, which can also filter down units. Fixes golang/go#33169 because benchstat now always shows file configuration labels. Updates golang/go#43744 by integrating unit metadata to control statistical assumptions into the main tool that implements those assumptions. Fixes golang/go#48380 by introducing a way to override labels from the command line rather than always using file names. Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f

mvdan added this to the Unreleased milestone Mar 21, 2017

gopherbot closed this as completed in golang/perf@02c5517 Jan 13, 2023

golang locked and limited conversation to collaborators Jan 13, 2024

gopherbot added the FrozenDueToAge label Jan 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

mvdan commented Mar 21, 2017

mvdan commented Mar 21, 2017

mvdan commented Aug 23, 2017

ALTree commented Aug 23, 2017 •

edited

Loading

mvdan commented Aug 23, 2017

rsc commented Aug 24, 2017

aclements commented Aug 24, 2017

AlekSi commented Jan 18, 2018 •

edited

Loading

mvdan commented Jan 18, 2018

AlekSi commented Jan 18, 2018

mvdan commented Jan 18, 2018

gopherbot commented Nov 4, 2021

x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

x/perf/benchstat: bogus "no statistical difference" report when times are the same #19634

Comments

mvdan commented Mar 21, 2017

mvdan commented Mar 21, 2017

mvdan commented Aug 23, 2017

ALTree commented Aug 23, 2017 • edited Loading

mvdan commented Aug 23, 2017

rsc commented Aug 24, 2017

aclements commented Aug 24, 2017

AlekSi commented Jan 18, 2018 • edited Loading

mvdan commented Jan 18, 2018

AlekSi commented Jan 18, 2018

mvdan commented Jan 18, 2018

gopherbot commented Nov 4, 2021

ALTree commented Aug 23, 2017 •

edited

Loading

AlekSi commented Jan 18, 2018 •

edited

Loading