
testing: benchmark unit properties #43744

Closed
aclements opened this issue Jan 16, 2021 · 15 comments

@aclements
Member

I propose extending the Go benchmark format with a way to specify properties of units. Especially now that the testing package supports reporting custom metrics, it would be useful for tools that process benchmark results to understand basic properties of custom units in benchmark results. For example, there are two unit properties I would like for benchstat: 1. whether higher or lower values are better, and 2. whether measurements come from some underlying distribution (e.g., performance measurements) or are exact (e.g., file sizes).

I don't have strong opinions about the syntax. I would like the syntax for this to make sense both in the benchmark format itself and on the command line of tools. To have something concrete, I propose the following for the benchmark format:

Unit unit key=value key=value ...

and the following for command-line flags:

-unit "unit key=value key=value ..."

Starting the line with Unit parallels how benchmark result lines start with Benchmark. Using key=value parallels both flag syntax and benchmark name configuration. An obvious alternative would be key:value, but this seems confusingly similar to, without being the same as, file-level benchmark configuration, which requires whitespace after the colon (it's unfortunate that benchmark name configuration uses = instead of :). Another option is -key=value to exactly parallel flag syntax, but that gets awkward when nested on the command line and raises the question of whether we also support -key value.

I propose it is an error to specify conflicting properties about a unit.

I also propose that a unit property applies to all following benchmark results, and that it is unspecified whether it applies to earlier benchmark results. The tools I have built that would benefit from unit properties aggregate all results before presenting them, so unit properties naturally apply to all results. But I don't want to force purely streaming tools to read all results before emitting anything just in case there's a later unit property line. Alternatively, we could say it's an error if a property is supplied for any unit that's already been seen, but this seems overly restrictive for aggregating tools.

For specific properties, I propose

  1. {higher,lower}={better,worse} for indicating direction. Graphs and tables are often labeled "higher is better" or "lower is better", and this is meant to parallel that. I propose supporting all four combinations so people don't have to remember which works.

  2. assume={nothing,exact} for specifying statistical assumptions. nothing means to make no statistical assumptions (use non-parametric methods) and exact means to assume measurements are exact (e.g., repeated measurement is not necessary to increase confidence). The default is nothing. In the future we could also support normal, but that's almost never the right assumption for benchmarks, so I'm omitting it for now.

Some concrete examples:

  • The default units could be specified as Unit ns/op lower=better assume=nothing, Unit MB/s higher=better assume=nothing, etc. These would be built into the tooling.
  • Various compiler benchmarking tools report the size of compiled binaries. This doesn't require repeated measurement, so these units could be specified like Unit text-bytes lower=better assume=exact or on the command-line as -unit "text-bytes lower=better assume=exact".
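
To make this concrete, a results file that combines such a Unit line with ordinary benchmark result lines might look like the following (the benchmark names and values are invented purely for illustration):

    Unit text-bytes lower=better assume=exact
    BenchmarkHelloBinarySize 1 1198080 text-bytes
    BenchmarkGoVetBinarySize 1 5242880 text-bytes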

/cc @rsc @mknyszek @dr2chase @jeremyfaller

@josharian
Contributor

Other useful Unit metadata would be scaling (see benchstat’s newScaler) and metric names (see benchstat’s metricOf).

@rsc changed the title from "proposal: benchmark unit properties" to "proposal: testing: benchmark unit properties" on Oct 6, 2021
@rsc
Contributor

rsc commented Oct 6, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@ChrisHines
Contributor

and the following for command-line flags

-unit "unit key=value key=value ..."

I don't understand how this flag would be used. What tools do you envision supporting this flag?

Should this proposal address how benchmark code can control the properties for the units it uses?

I am wondering how a benchmark that doesn't use the default units would communicate its unit properties under this proposal. Take, for example, BenchmarkParallelTimerLatency in the time package.

go/src/time/sleep_test.go, lines 709 to 711 at be571a3:

b.ReportMetric(0, "ns/op")
b.ReportMetric(total/samples, "avg-late-ns")
b.ReportMetric(float64(max.Nanoseconds()), "max-late-ns")

I am reminded of this conversation: https://go-review.googlesource.com/c/go/+/232298/8..12/src/time/sleep_test.go#b585

@aclements
Member Author

I don't understand how this flag would be used. What tools do you envision supporting this flag?

Mainly benchstat. I don't plan on adding support for this to the current version of benchstat, but I have a complete rewrite almost done that I would like to add support to. We're also working on automated performance monitoring (#48803), which wouldn't use the command-line flag specifically, but which we would like to have unit properties for.

Support for "exact" measurements in particular has been a sore spot in benchstat for benchmarks that record things like "binary size", for example. Since that's not going to change with repeated measurement, we only report one sample, but that means that benchstat will refuse to show you deltas because there isn't enough data for it to statistically significant under general distributional assumptions.

I am wondering how a benchmark that doesn't use the default units would communicate its unit properties under this proposal.

They can actually just fmt.Print a "Unit" line in a sync.Once from a benchmark function.
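
As a rough sketch of that workaround (this is not code from this issue or from the testing package docs; the package, benchmark name, and unit properties are illustrative, using the syntax proposed above):

    // size_test.go: a sketch of printing a Unit line from a benchmark.
    package size

    import (
        "fmt"
        "os"
        "sync"
        "testing"
    )

    // unitOnce ensures the Unit metadata line is printed only once,
    // however many times the benchmark function runs.
    var unitOnce sync.Once

    // BenchmarkBinarySize reports the size of the test binary itself as a
    // custom "text-bytes" metric and emits the proposed unit properties.
    func BenchmarkBinarySize(b *testing.B) {
        unitOnce.Do(func() {
            fmt.Println("Unit text-bytes lower=better assume=exact")
        })
        fi, err := os.Stat(os.Args[0]) // the running test binary
        if err != nil {
            b.Fatal(err)
        }
        b.ReportMetric(float64(fi.Size()), "text-bytes")
        b.ReportMetric(0, "ns/op") // time per iteration is not meaningful here
    }

Running go test -bench BinarySize would then emit the Unit line alongside the usual result line, where downstream tools could pick it up.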

I am reminded of this conversation: https://go-review.googlesource.com/c/go/+/232298/8..12/src/time/sleep_test.go#b585

Heh, indeed. Benchstat v2 will handle this much better, without the need for unit properties. It parses units, normalizes "ns" to "sec", and then applies reasonable scaling to the result.
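
To illustrate the kind of scaling meant here, a rough sketch of the idea (this is not benchstat's actual code or API):

    package main

    import "fmt"

    // scaleSI picks an SI prefix so that, where possible, the scaled value
    // falls in [1, 1000), mimicking the scaling behavior described above.
    func scaleSI(v float64) (float64, string) {
        prefixes := []string{"n", "µ", "m", "", "k", "M", "G"}
        exp := 3 // index of the empty prefix (10^0)
        scaled := v
        for scaled >= 1000 && exp < len(prefixes)-1 {
            scaled /= 1000
            exp++
        }
        for scaled != 0 && scaled < 1 && exp > 0 {
            scaled *= 1000
            exp--
        }
        return scaled, prefixes[exp]
    }

    func main() {
        v, prefix := scaleSI(0.0000215) // 21.5µs expressed in seconds
        fmt.Printf("%.3g %ssec\n", v, prefix) // e.g. "21.5 µsec"
    }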

@ChrisHines
Contributor

The improvements to benchstat sound really nice.

They can actually just fmt.Print a "Unit" line in a sync.Once from a benchmark function.

That seems ... unsatisfying. It also doesn't help discoverability much. Can we do anything more ergonomic?

@rsc
Contributor

rsc commented Oct 13, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@aclements
Member Author

@ChrisHines, I agree it's not particularly ergonomic, though I think it's ultimately a follow-on issue to this proposal. We added format and tooling support for custom metrics long before the testing package had B.ReportMetric (and there was no workaround there!).

I would add the unit properties specification to the Go benchmark format spec and could certainly mention there that testing.B benchmarks can just print these Unit lines. Maybe it's worth mentioning in the doc for ReportMetric?

@ChrisHines
Contributor

@aclements What is the target user base for this feature?

If we want people writing benchmarks for their own code to know about and use this feature, then I think the testing package docs are the place to put the information. That's my reference point when writing benchmarks. If it is documented anywhere else, I suspect most people will never learn about it.

If the target user base is narrower for some reason, perhaps just those contributing to Go itself, then maybe it belongs in a different place.

... I think it's ultimately a follow-on issue to this proposal. We added format and tooling support for custom metrics long before the testing package had B.ReportMetric ...

From that I think I understand that this proposal is limited to the Go benchmark data format and benchstat CLI flags. The proposal title "testing: benchmark unit properties" gave me the impression that the testing package itself was in scope. My apologies if I got too excited and jumped the gun.

@aclements
Member Author

@aclements What is the target user base for this feature?

I intend this for anyone who writes Go benchmarks, generally the more sophisticated ones (certainly if you're not using custom units you don't have much need for this!). Curiously, we don't actually reference the Go benchmark format or benchstat anywhere in the testing package documentation. It's really quite non-prescriptive. I think you're right that documenting this in the testing package is the right thing to do, though I think we'll have to pull in a bit more of the bigger Go benchmarking picture at the same time (which isn't a bad thing).

The proposal title "testing: benchmark unit properties" gave me the impression that the testing package itself was in scope.

That's fair. Maybe I should have titled this "x/perf:" or something. But I'm not sure you need to curb your enthusiasm. These things are all intentionally weakly coupled: as you point out, it won't initially be super ergonomic to access this from the testing package, but you can, and if this proposal is accepted, the broader tooling will make use of this.

@rsc
Contributor

rsc commented Oct 20, 2021

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc changed the title from "proposal: testing: benchmark unit properties" to "testing: benchmark unit properties" on Oct 20, 2021
@rsc modified the milestones: Proposal, Backlog on Oct 20, 2021
@gopherbot
Contributor

Change https://golang.org/cl/357530 mentions this issue: 14313-benchmark-format: add unit metadata

@gopherbot
Contributor

Change https://golang.org/cl/357549 mentions this issue: benchfmt: add support for unit metadata

@gopherbot
Contributor

Change https://golang.org/cl/357590 mentions this issue: testing: document custom units

@gopherbot
Contributor

Change https://golang.org/cl/309969 mentions this issue: cmd/benchstat: new version of benchstat

gopherbot pushed a commit to golang/perf that referenced this issue Mar 17, 2022
This adds support for tracking unit metadata in Result, reading it in
the Reader, and writing it back out in the Writer.

Updates golang/go#43744.

Change-Id: Id3311340b6e6992c149f9bca44c2e1e442b53eb4
Reviewed-on: https://go-review.googlesource.com/c/perf/+/357549
Trust: Austin Clements <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
@rsc moved this to Accepted in Proposals on Aug 10, 2022
@rsc added this to Proposals on Aug 10, 2022
skyzyx pushed a commit to northwood-labs/go-x-perf that referenced this issue Apr 12, 2024
This is a complete rewrite of benchstat. Basic usage remains the same,
as does the core idea of showing statistical benchmark summaries and
A/B comparisons in a table, but there are several major improvements.

The statistics are now more robust. Previously, benchstat used
IQR-based outlier rejection, showed the mean of the reduced sample,
its range, and did a non-parametric difference-of-distribution test on
reduced samples. Any form of outlier rejection must start with
distributional assumptions, in this case assuming normality, which is
generally not sound for benchmark data. Hence, now benchstat does not
do any outlier rejection. As a result, it must use robust summary
statistics as well, so benchstat now uses the median and confidence
interval of the median as summary statistics. Benchstat continues to
use the same Mann-Whitney U-test for the delta, but now does it on the
full samples since the U-test is already non-parametric, hence
increasing the power of this test.

As part of these statistical improvements, benchstat now detects and
warns about several common mistakes, such as having too few samples
for meaningful statistical results, or having incomparable geomeans.

The output format is more consistent. Previously, benchstat
transformed units like "ns/op" into a metric like "time/op", which it
used as a column header; and a numerator like "sec", which it used to
label each measurement. This was easy enough for the standard units
used by the testing framework, but was basically impossible to
generalize to custom units. Now, benchstat does unit scaling, but
otherwise leaves units alone. The full (scaled) unit is used as a
column header and each measurement is simply a scaled value shown with
an SI prefix. This also means that the text and CSV formats can be
much more similar while still allowing the CSV format to be usefully
machine-readable.

Benchstat will also now do A/B comparisons even if there are more than
two inputs. It shows a comparison to the base in the second and all
subsequent columns. This approach is consistent for any number of
inputs.

Benchstat now supports the full Go benchmark format, including
sophisticated control over exactly how it structures the results into
rows, columns, and tables. This makes it easy to do meaningful
comparisons across benchmark data that isn't simply structured into
two input files, and gives significantly more control over how results
are sorted. The default behavior is still to turn each input file into
a column and each benchmark into a row.

Fixes golang/go#19565 by showing all results, even if the benchmark
sets don't match across columns, and warning when geomean sets are
incompatible.

Fixes golang/go#19634 by no longer doing outlier rejection and clearly
reporting when there are not enough samples to do a meaningful
difference test.

Updates golang/go#23471 by providing more thorough command
documentation. I'm not sure it quite fixes this issue, but it's much
better than it was.

Fixes golang/go#30368 because benchstat now supports filter
expressions, which can also filter down units.

Fixes golang/go#33169 because benchstat now always shows file
configuration labels.

Updates golang/go#43744 by integrating unit metadata to control
statistical assumptions into the main tool that implements those
assumptions.

Fixes golang/go#48380 by introducing a way to override labels from the
command line rather than always using file names.

Change-Id: Ie2c5a12024e84b4918e483df2223eb1f10413a4f
Reviewed-on: https://go-review.googlesource.com/c/perf/+/309969
Run-TryBot: Austin Clements <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
gopherbot pushed a commit to golang/proposal that referenced this issue Oct 29, 2024
The predefined keys differ slightly from what I proposed in golang/go#43744. I
tried specifying {higher,lower}={better,worse} like I originally
proposed and it just got really messy. Turning it around to
better={higher,lower} means its somewhat backwards from what you might
expect from English phrasing, but it lets us use a single key because
I don't think anyone is going to accidentally write
worse={higher,lower}, and this avoids any annoying questions about
what happens if a user specifies both "higher" and "lower".

For golang/go#43744.

Change-Id: I895914b179c291003e76f897cabbcbdb2381f163
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/357530
Reviewed-by: Michael Knyszek <[email protected]>
@aclements
Member Author

This got lost in the mists of time, but has now been submitted to the benchmark format spec.
